Towards the Fastest Brainf*** Implementation Ever

22 Oct 2024

In my last post I described how I made a very fast BF interpreter. Well, there's a lot more speed to be had with an optimizing compiler. This post is a write-up of my assignment for a compilers class, so it's a little rougher than normal.

You can find the code at the following places:

Code organization #

Files for the compiler:

interp_threaded_opt.rkt
BF parser and all optimizations live here.
native.rkt
Takes the optimized AST from interp_threaded_opt.rkt and emits native ARM instructions for my M1 Pro chip.

interp_threaded_opt.rkt #

The first role of this module is to define an AST for BF. My AST is represented as Racket structs; these are the nodes that I have:

| Node | Fields | Comment |
|--------------|--------|---------|
| loop | body | Loops hold their entire body within them. A parsing pass matches up all [ ] pairs. |
| add | amount | Sequences of + or - get smooshed together into single add instructions. |
| shift | amount | Like add, but for > and < |
| set-cell | value | Set the current cell to a specific value. Used to optimize [-] and the like. |
| bf-write | | |
| bf-read | | |
| add-cell-0 | dest | Zero the current cell; add its old value to another cell. This is a common loop that I recognized and optimized. |
| mult-block-0 | body | Result of optimizing simple loops. Body is a map from tape offset → value. |
| search-0 | stride | Representation of scan loops like [<<<<]. |
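If you don't read Racket structs, here's a rough Python equivalent of the node set (the class and field names below are my own stand-ins mirroring the table; the real definitions live in interp_threaded_opt.rkt):

```python
from dataclasses import dataclass

@dataclass
class Loop:           # loop: holds its entire body
    body: list

@dataclass
class Add:            # add: fused runs of + / -
    amount: int

@dataclass
class Shift:          # shift: fused runs of > / <
    amount: int

@dataclass
class SetCell:        # set-cell: e.g. the result of [-]+++
    value: int

@dataclass
class Write:          # bf-write (.)
    pass

@dataclass
class Read:           # bf-read (,)
    pass

@dataclass
class AddCell0:       # add-cell-0: zero current cell, add old value at dest
    dest: int

@dataclass
class MultBlock0:     # mult-block-0: map of tape offset -> value
    body: dict

@dataclass
class Search0:        # search-0: scan loop like [<<<<]
    stride: int
```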

The rest of the module is devoted to optimizations performed on the AST. Initially, only loop, add, shift, bf-write, and bf-read cells are present, but through the optimization passes more get added. See § Optimizations for details on all the optimization passes.

There is a compile function, but this just compiles the optimized AST down to Racket closures with a threaded interpreter.
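The closure trick looks roughly like this, sketched in Python rather than Racket (tuple-shaped nodes and names are my invention; the real compile works over the structs above). All dispatch on node type happens once, at compile time; running the program is just calling pre-built closures:

```python
def compile_node(node):
    """Compile one AST node to a closure of (tape, ptr) -> ptr."""
    kind = node[0]
    if kind == "add":
        n = node[1]
        def run(tape, ptr):
            tape[ptr] = (tape[ptr] + n) % 256  # 8-bit cells wrap
            return ptr
        return run
    if kind == "shift":
        n = node[1]
        return lambda tape, ptr: ptr + n
    if kind == "loop":
        body = compile_seq(node[1])
        def run(tape, ptr):
            while tape[ptr] != 0:
                ptr = body(tape, ptr)
            return ptr
        return run
    raise ValueError(f"unknown node: {kind}")

def compile_seq(nodes):
    """Chain compiled closures; no dispatch happens at run time."""
    fns = [compile_node(n) for n in nodes]
    def run(tape, ptr):
        for f in fns:
            ptr = f(tape, ptr)
        return ptr
    return run
```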

native.rkt #

The main entry point is the compile-to-asm function, which emits some fixed prelude code, the body of the program, and then some fixed postlude code.

There are a bunch of functions prefixed with i/; these help me generate assembly instructions with a particular format. They're mostly just thin wrappers around calls to format.

The emit-asm function does the hard work of walking the AST and generating instructions for each node. There’s a loose set of conventions around various registers; the most important is that x20 always holds the address of the start of the tape, and x21 holds the current pointer. I never use these registers for anything else; you can always access the current value under the program head with [x20, x21].
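In Python terms, the i/ helpers and the per-node emission amount to string templating (these function names are hypothetical stand-ins, not the actual Racket code; the register convention matches the one above):

```python
def i_ldrsb(dst, base, off):
    # Load signed byte; the current cell is always at [x20, x21].
    return f"    ldrsb   {dst}, [{base}, {off}]"

def i_add_imm(dst, src, imm):
    return f"    add     {dst}, {src}, #{imm}"

def i_strb(src, base, off):
    return f"    strb    {src}, [{base}, {off}]"

def emit_add(amount):
    """Emit the load/modify/store sequence for an (add n) node."""
    return [
        i_ldrsb("w11", "x20", "x21"),
        i_add_imm("w11", "w11", amount),
        i_strb("w11", "x20", "x21"),
    ]
```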

There are two special functions for emitting the assembly for search-0 nodes: one for forward scans ([>]) and one for backward scans ([<]). This is just done for organizational convenience.

Optimizations #

The entry point for optimizing the BF AST is the optimize function in interp_threaded_opt.rkt. It chains the optimization passes together; each pass takes a whole-program AST and functionally returns a new whole-program AST. The optimizations, in the order they are performed:

  1. combine-instrs

    This pass takes sequences of instructions like +++, initially represented as (list (add 1) (add 1) (add 1)), and turns them into a single (add 3). Same thing for pointer shift instructions.

  2. opt/useless

    This cleans up instructions like (shift 0) or (add 0) that have no effect.

  3. opt/add

    This recognizes patterns like [>+<-], which adds the value of one cell to another. It handles arbitrary shift and add amounts, as long as they’re balanced. Kind of like a baby opt/basic-loop pass.

  4. opt/zero-out

    Handles cases of [-] or [+] to set the cell to 0.

  5. opt/basic-loop

    Does the basic loop optimization we discussed in class. Easily the most complicated optimization. It abstractly evaluates the body of a loop; it sets the initial pointer to 0 and tracks what additions go to which offsets. Bails out if anything gets too complicated.

  6. opt/zero-add->set

    Optimizes cases of [-]+++ to just set the current cell value to 3.

  7. opt/0-scan

    Scan loops, as discussed in class.
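To make the first pass concrete: combine-instrs is essentially run-length encoding over adjacent add and shift nodes, recursing into loop bodies. A Python sketch over tuple-shaped nodes (not the actual Racket code) — note that it can produce (shift 0) nodes, which is exactly what opt/useless then cleans up:

```python
def combine_instrs(nodes):
    """Fuse runs of adjacent ("add", n) / ("shift", n) nodes."""
    out = []
    for node in nodes:
        if node[0] == "loop":
            node = ("loop", combine_instrs(node[1]))  # recurse into bodies
        if out and out[-1][0] == node[0] and node[0] in ("add", "shift"):
            out[-1] = (node[0], out[-1][1] + node[1])  # fuse with previous
        else:
            out.append(node)
    return out
```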

Evaluation #

Extensive tests (bfcheck.pl) #

I precompiled and ran all the files in the check/ folder provided. I noticed that the first run of the resulting binaries took a long time, and that they all ran almost instantly afterwards. I also noticed a bunch of network activity; it seems macOS was zealously checking the signature of each binary over the network the first time it ran. The benchmarks themselves run quickly, so this added up. The numbers reported here are after running the tests one time to avoid the network penalty.

| Optimization set | Total time (seconds) |
|--------------------|----------|
| No optimizations | 0.763064 |
| Basic loops | 0.767731 |
| Scan loops | 0.790222 |
| Scan & basic loops | 0.763272 |

There’s not a whole lot of variance.

Benchmarks (mandel.b, etc.) #

I ran my compiler on some of the benchmarks from the BF benchmarks suite we’ve been using. I also tried out a few with the cgbfi.b BF-interpreter-in-BF.

Table 1: Execution times for two benchmarks, natively compiled. All times in seconds.
| Benchmark | No opts | Basic loops | Scan loops | Scan & basic loops |
|-----------|---------|-------------|------------|--------------------|
| hanoi.b | 0.11 | 0.05 | 0.11 | 0.05 |
| mandel.b | 0.73 | 0.49 | 0.73 | 0.49 |

Seems that basic loop optimization is most important for the Hanoi and Mandelbrot benchmarks. I also tried the Sierpinski triangle benchmark but it never ran long enough for me to notice any difference.

Contrast with benchmarks running under the cgbfi.b interpreter: scan loops are far and away the most important optimization:

Table 2: Execution times for benchmarks running under the cgbfi.b interpreter. All times in seconds. *Gave up after 2 hours; estimated duration recorded. †No attempt.
| Benchmark | No opts | Basic loops | Scan loops | Scan & basic loops |
|-----------|---------|-------------|------------|--------------------|
| serptri.b | 0.07 | 0.06 | 0.03 | 0.02 |
| mandel.b | *44,052.00 | † | † | 7,818.96 |

Example optimizations #

Basic loops #

This code:

+++[->+>-<<]

Gets optimized to this:

ldrsb   w11, [x20, x21] ; add 3
add     w11, w11, #3
strb    w11, [x20, x21]
add     x22, x20, x21   ; mult-block-0
ldrsb   x23, [x22]
add     x24, x22, #1
ldrsb   x25, [x24]
mov     x11, x23
add     w11, w25, w11
strb    w11, [x24]
add     x24, x22, #2
ldrsb   x25, [x24]
mov     x11, x23
subs    w11, w25, w11
strb    w11, [x24]
strb    wzr, [x20, x21]

Loops containing I/O or nested loops are not covered by this optimization, even when the nested loop is something as simple as setting a cell to 0.
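The abstract evaluation behind opt/basic-loop can be sketched like this (a Python stand-in over tuple-shaped nodes; the real pass lives in interp_threaded_opt.rkt). It walks the body tracking a symbolic pointer and the net addition at each offset, and bails with None on anything it can't handle:

```python
def try_basic_loop(body):
    """Abstractly run a loop body: track net additions per tape offset.
    Returns an offset -> amount map when the loop qualifies, else None."""
    ptr, deltas = 0, {}
    for node in body:
        kind = node[0]
        if kind == "shift":
            ptr += node[1]
        elif kind == "add":
            deltas[ptr] = deltas.get(ptr, 0) + node[1]
        else:
            return None  # I/O, nested loop, etc.: bail out
    # Must end back at offset 0 and decrement the counter by exactly 1.
    if ptr != 0 or deltas.get(0) != -1:
        return None
    del deltas[0]
    return deltas  # becomes the body of a mult-block-0 node
```

For the loop body of [->+>-<<] this yields {1: 1, 2: -1}: add the counter once at offset 1, subtract it once at offset 2.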

Scan loops #

This code:

[>>>>]

Gets optimized to this:

    adrp      x22, lIDX@PAGE              ; search-0
    ldr       q0, [x22, lIDX@PAGEOFF]     ; v0 = idx vector
    adrp      x22, lSTRIDE4@PAGE
    ldr       q3, [x22, lSTRIDE4@PAGEOFF] ; v3 = stride mask
    movi.2d   v1, #0                      ; v1 = zero vect
    subs      x21, x21, #16
LBB0_0:
    add       x21, x21, #16
    add       x22, x20, x21
    ld1       {v2.16b}, [x22]             ; v2 = tape
    cmeq.16b  v4, v2, v1                  ; v4 = tape == zeros
    and.16b   v4, v4, v3                  ; mask with stride
    orn.16b   v4, v0, v4                  ; idx or !(tape == zeros)
    uminv.16b b5, v4                      ; find smallest idx
    umov      w22, v5.b[0]
    subs      w11, w22, #255
    beq       LBB0_0
    add       x21, x21, x22

The first few lines load up the 0 and stride mask vectors.

The only strides I optimize are 1, -1, 2, -2, 4, -4, 8, and -8. At stride 16, you’re only getting one check per loop; might as well just do the basic iteration instead of firing up the vector unit.
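For reference, the scalar semantics the vector code has to match is just this (Python sketch): step by the stride until landing on a zero cell. The SIMD version examines 16 cells per iteration, but only every stride-th lane can actually be a stopping point, which is why a stride of 16 would be back down to one useful check per load:

```python
def search_0(tape, ptr, stride):
    """What [>], [<<<<], etc. mean: step by `stride` until a zero cell."""
    while tape[ptr] != 0:
        ptr += stride
    return ptr
```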
