Towards the Fastest Brainf*** Implementation Ever
22 Oct 2024
In my last post I described how I made a very fast BF interpreter. Well, there’s a lot more speed to be had with an optimizing compiler. This post is a write-up of my assignment for a compilers class, so it’s a little rougher than normal.
You can find the code at the following places:
Code organization #
Files for the compiler:
- interp_threaded_opt.rkt
  - BF parser and all optimizations live here.
- native.rkt
  - Takes the optimized AST from interp_threaded_opt.rkt and emits native ARM instructions for my M1 Pro chip.
interp_threaded_opt.rkt #
The first role of this module is to define an AST for BF. My AST is represented as Racket structs; these are the nodes that I have:
| Node | Fields | Comment | 
|---|---|---|
| loop | body | Loops hold their entire body within them. There is a parsing pass that matches up all [] pairs. | 
| add | amount | Sequences of + or - get smooshed together into single add instructions. | 
| shift | amount | Like add, but for > and < | 
| set-cell | value | Set the current cell to a specific value. Used to optimize [-] and the like. | 
| bf-write | | Write the current cell to output (.). | 
| bf-read | | Read a byte into the current cell (,). | 
| add-cell-0 | dest | Zero the current cell; add its old value to another cell. This is a common loop that I recognized and optimized. | 
| mult-block-0 | body | Result of optimizing simple loops. Body is a map from tape offset → value. | 
| search-0 | stride | Representation of scan loops like [<<<<]. | 
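As a sketch, the struct definitions might look something like this (field names are taken from the table above; #:transparent and the other details are my assumption, not necessarily the real code):

#lang racket
;; Sketch of the AST structs, inferred from the table above; the real
;; definitions in interp_threaded_opt.rkt may differ in details.
(struct loop (body) #:transparent)         ; body: list of child nodes
(struct add (amount) #:transparent)        ; merged run of + / -
(struct shift (amount) #:transparent)      ; merged run of > / <
(struct set-cell (value) #:transparent)    ; e.g. the result of [-]
(struct bf-write () #:transparent)         ; .
(struct bf-read () #:transparent)          ; ,
(struct add-cell-0 (dest) #:transparent)   ; zero current cell, add old value to dest
(struct mult-block-0 (body) #:transparent) ; body: hash of tape offset -> value
(struct search-0 (stride) #:transparent)   ; scan loop like [<<<<]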
The rest of the module is devoted to optimizations performed on the AST. Initially, only loop, add, shift, bf-write, and bf-read nodes are present; the optimization passes introduce the rest. See § Optimizations for details on all the optimization passes.
There is a compile function, but this just compiles the optimized AST down to Racket closures with a threaded interpreter.
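To make that concrete, here is a minimal sketch of compiling nodes to closures, assuming the struct sketch above (the real compile function threads each closure directly into the next and handles every node type; this version maps and folds, and shows only a few nodes):

;; Each node compiles to a procedure from pointer to pointer; running the
;; program is just calling the closure for the top-level sequence.
(define tape (make-bytes 30000 0))

(define (compile-seq nodes)
  (define compiled (map compile-node nodes))
  (lambda (ptr)
    (for/fold ([p ptr]) ([step (in-list compiled)])
      (step p))))

(define (compile-node node)
  (match node
    [(add n)
     (lambda (ptr)
       (bytes-set! tape ptr (modulo (+ (bytes-ref tape ptr) n) 256))
       ptr)]
    [(shift n)
     (lambda (ptr) (+ ptr n))]
    [(loop body)
     (define run-body (compile-seq body))
     (lambda (ptr)
       (let spin ([p ptr])
         (if (zero? (bytes-ref tape p)) p (spin (run-body p)))))]
    [(bf-write)
     (lambda (ptr) (write-byte (bytes-ref tape ptr)) ptr)]))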
native.rkt #
The main entry point is the compile-to-asm function, which emits some fixed prelude code, the body of the program, and then some fixed postlude code.
There are a bunch of functions prefixed with i/; these are to help me generate assembly instructions with a particular format. Mostly just wrappers around calls to format.
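For example, a helper in that spirit might look like this (i/op2 is a hypothetical name, not necessarily one of the real helpers):

;; Hypothetical i/ helper: just format with a fixed layout.
(define (i/op2 op dst src)
  (format "    ~a    ~a, ~a\n" op dst src))

;; (i/op2 "ldrsb" "w11" "[x20, x21]") => "    ldrsb    w11, [x20, x21]\n"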
The emit-asm function does the hard work of walking the AST and generating instructions for each node. There’s a loose set of conventions around various registers; the most important is that x20 always holds the address of the start of the tape, and x21 holds the current pointer. I never use these registers for anything else; you can always access the current value under the program head with [x20, x21].
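As a sketch of what that walk produces for a single node, here is how an add node might be lowered under those register conventions, using the hypothetical i/op2 helper above (the real emit-asm differs):

;; Lower (add n): load the current cell, adjust it, store it back.
(define (emit-add n)
  (define op (if (negative? n) "sub" "add"))
  (string-append
   (i/op2 "ldrsb" "w11" "[x20, x21]")               ; load tape[ptr]
   (format "    ~a     w11, w11, #~a\n" op (abs n)) ; adjust by n
   (i/op2 "strb" "w11" "[x20, x21]")))              ; store tape[ptr]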
There are two special functions for emitting the assembly for search-0 nodes: one for forward scans ([>]) and one for backward scans ([<]). This is just done for organizational convenience.
Optimizations #
The entry point for optimizing the BF AST is the optimize function in interp_threaded_opt.rkt. It chains optimizations together; each optimization function takes a whole-program AST and functionally returns a new whole-program AST. The optimizations, in the order they are performed, are:
- combine-instrs: Takes sequences of instructions like +++, initially represented as (list (add 1) (add 1) (add 1)), and turns them into (add 3). Same thing for pointer-shift instructions. (Sketched after this list.)
- opt/useless: Cleans up instructions like (shift 0) or (add 0) that have no effect.
- opt/add: Recognizes patterns like [>+<-], which add the value of one cell to another. It handles arbitrary shift and add amounts, as long as they’re balanced. Kind of like a baby opt/basic-loop pass.
- opt/zero-out: Handles cases of [-] or [+] by setting the cell to 0.
- opt/basic-loop: Does the basic loop optimization we discussed in class. Easily the most complicated optimization. It abstractly evaluates the body of a loop: it sets the initial pointer to 0 and tracks which additions go to which offsets. Bails out if anything gets too complicated.
- opt/zero-add->set: Optimizes cases of [-]+++ to just set the current cell value to 3.
- opt/0-scan: Scan loops, as discussed in class.
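Here is the sketch promised above: a combine-instrs-style pass over the struct sketch from earlier, plus a generic pass chainer (names and details are my assumptions; the real passes differ):

;; Merge adjacent add/shift nodes; recurse into loop bodies. Re-examining
;; the merged node means a run like +++++ collapses all the way to (add 5).
(define (combine-instrs prog)
  (match prog
    ['() '()]
    [(list-rest (add a) (add b) rest)
     (combine-instrs (cons (add (+ a b)) rest))]
    [(list-rest (shift a) (shift b) rest)
     (combine-instrs (cons (shift (+ a b)) rest))]
    [(cons (loop body) rest)
     (cons (loop (combine-instrs body)) (combine-instrs rest))]
    [(cons node rest)
     (cons node (combine-instrs rest))]))

;; Chaining whole-program passes is then just a fold over the pass list:
(define (run-passes prog passes)
  (for/fold ([p prog]) ([pass (in-list passes)])
    (pass p)))
;; e.g. (run-passes prog (list combine-instrs opt/useless opt/add ...))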
Evaluation #
Extensive tests (bfcheck.pl) #
I precompiled and ran all the files in the provided check/ folder. I noticed that the first run of each resulting binary took a long time, and that subsequent runs finished almost instantly. There was also a bunch of network activity; it seems macOS was being zealous about verifying the signature of each binary the first time it ran. The benchmarks themselves run quickly, so this overhead added up. The numbers reported here are from a second run, after one warm-up run to avoid the network penalty.
| Optimization set | Total time (seconds) | 
|---|---|
| No optimizations | 0.763064 | 
| Basic loops | 0.767731 | 
| Scan loops | 0.790222 | 
| Scan & basic loops | 0.763272 | 
There’s not a whole lot of variance.
Benchmarks (mandel.b, etc.) #
I ran my compiler on some of the benchmarks from the BF benchmarks suite we’ve been using. I also tried out a few with the cgbfi.b BF-interpreter-in-BF.
All times in seconds.
| Benchmark | No opts | Basic loops | Scan loops | Scan & basic loops | 
|---|---|---|---|---|
| hanoi.b | 0.11 | 0.05 | 0.11 | 0.05 | 
| mandel.b | 0.73 | 0.49 | 0.73 | 0.49 | 
Seems that basic loop optimization is most important for the Hanoi and Mandelbrot benchmarks. I also tried the Sierpinski triangle benchmark but it never ran long enough for me to notice any difference.
Contrast with benchmarks running under the cgbfi.b interpreter: scan loops are far and away the most important optimization:
cgbfi.b interpreter. All times in seconds. *Gave up after 2 hours; estimated duration recorded. †No attempt.
| Benchmark | No opts | Basic loops | Scan loops | Scan & basic loops | 
|---|---|---|---|---|
| serptri.b | 0.07 | 0.06 | 0.03 | 0.02 | 
| mandel.b | *44,052.00 | † | † | 7,818.96 | 
Example optimizations #
Basic loops #
This code:
+++[->+>-<<]
Gets optimized to this:
ldrsb   w11, [x20, x21] ; add 3
add     w11, w11, #3
strb    w11, [x20, x21]
add     x22, x20, x21   ; mult-block-0: x22 = &tape[ptr]
ldrsb   x23, [x22]      ; x23 = old value of current cell
add     x24, x22, #1    ; cell at offset +1
ldrsb   x25, [x24]
mov     x11, x23
add     w11, w25, w11   ; tape[ptr+1] += old value
strb    w11, [x24]
add     x24, x22, #2    ; cell at offset +2
ldrsb   x25, [x24]
mov     x11, x23
subs    w11, w25, w11   ; tape[ptr+2] -= old value
strb    w11, [x24]
strb    wzr, [x20, x21] ; zero the current cell
Loops containing I/O or nested loops are not covered by this optimization, even when the nested loop is as simple as [-] to set a cell to 0.
Scan loops #
This code:
[>>>>]
Gets optimized to this:
    adrp      x22, lIDX@PAGE              ; search-0
    ldr       q0, [x22, lIDX@PAGEOFF]     ; v0 = idx vector
    adrp      x22, lSTRIDE4@PAGE
    ldr       q3, [x22, lSTRIDE4@PAGEOFF] ; v3 = stride mask
    movi.2d   v1, #0                      ; v1 = zero vect
    subs      x21, x21, #16              ; pre-decrement; loop adds 16 back
LBB0_0:
    add       x21, x21, #16
    add       x22, x20, x21
    ld1       {v2.16b}, [x22]             ; v2 = tape
    cmeq.16b  v4, v2, v1                  ; v4 = tape == zeros
    and.16b   v4, v4, v3                  ; mask with stride
    orn.16b   v4, v0, v4                  ; idx or !(tape == zeros)
    uminv.16b b5, v4                      ; find smallest idx
    umov      w22, v5.b[0]               ; w22 = index of first zero, or 255
    subs      w11, w22, #255             ; 255 means no zero in this chunk
    beq       LBB0_0                     ; none found: scan the next chunk
    add       x21, x21, x22              ; land the pointer on the zero cell
The first few lines load up the index vector, the stride mask, and a zero vector.
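For reference, here is a sketch of what those constants might contain (the exact data layout in native.rkt is my assumption):

;; lIDX holds the lane indices 0..15; the stride mask marks every s-th lane.
(define lIDX (build-bytes 16 values))
(define (stride-mask s)
  (build-bytes 16 (lambda (i) (if (zero? (modulo i s)) 255 0))))
;; (stride-mask 4) has 255 at lanes 0, 4, 8, 12 and 0 everywhere else.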
The only strides I optimize are 1, -1, 2, -2, 4, -4, 8, and -8. At stride 16 you’d only get one candidate cell per 16-byte vector; might as well just do the basic iteration instead of firing up the vector unit.