Towards the Fastest Brainf*** Implementation Ever

22 Oct 2024

In my last post I described how I made a very fast BF interpreter. Well, there's a lot more speed to be had with an optimizing compiler. This post is a write-up of my assignment for a compilers class, so it's a little rougher than normal.

You can find the code at the following places:

Code organization #

Files for the compiler:

interp_threaded_opt.rkt
BF parser and all optimizations live here.
native.rkt
Takes the optimized AST from interp_threaded_opt.rkt and emits native ARM instructions for my M1 Pro chip.

interp_threaded_opt.rkt #

The first role of this module is to define an AST for BF. My AST is represented as Racket structs; these are the nodes that I have:

| Node | Fields | Comment |
|--------------|--------|---------|
| loop | body | Loops hold their entire body within them. A parsing pass matches up all [ ] pairs. |
| add | amount | Sequences of + or - get smooshed together into single add instructions. |
| shift | amount | Like add, but for > and < |
| set-cell | value | Set the current cell to a specific value. Used to optimize [-] and the like. |
| bf-write | | |
| bf-read | | |
| add-cell-0 | dest | Zero the current cell; add its old value to another cell. This is a common loop that I recognized and optimized. |
| mult-block-0 | body | Result of optimizing simple loops. Body is a map from tape offset → value. |
| search-0 | stride | Representation of scan loops like [<<<<]. |
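If you don't read Racket structs, here's a rough Python equivalent of the node set (the class and field names below are my own stand-ins mirroring the table; the real definitions live in interp_threaded_opt.rkt):

```python
from dataclasses import dataclass

@dataclass
class Loop:           # loop: holds its entire body
    body: list

@dataclass
class Add:            # add: fused runs of + / -
    amount: int

@dataclass
class Shift:          # shift: fused runs of > / <
    amount: int

@dataclass
class SetCell:        # set-cell: e.g. the result of [-]+++
    value: int

@dataclass
class Write:          # bf-write (.)
    pass

@dataclass
class Read:           # bf-read (,)
    pass

@dataclass
class AddCell0:       # add-cell-0: zero current cell, add old value at dest
    dest: int

@dataclass
class MultBlock0:     # mult-block-0: map of tape offset -> value
    body: dict

@dataclass
class Search0:        # search-0: scan loop like [<<<<]
    stride: int
```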

The rest of the module is devoted to optimizations performed on the AST. Initially, only loop, add, shift, bf-write, and bf-read cells are present, but through the optimization passes more get added. See § Optimizations for details on all the optimization passes.

There is a compile function, but this just compiles the optimized AST down to Racket closures with a threaded interpreter.
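The closure trick looks roughly like this, sketched in Python rather than Racket (tuple-shaped nodes and names are my invention; the real compile works over the structs above). All dispatch on node type happens once, at compile time; running the program is just calling pre-built closures:

```python
def compile_node(node):
    """Compile one AST node to a closure of (tape, ptr) -> ptr."""
    kind = node[0]
    if kind == "add":
        n = node[1]
        def run(tape, ptr):
            tape[ptr] = (tape[ptr] + n) % 256  # 8-bit cells wrap
            return ptr
        return run
    if kind == "shift":
        n = node[1]
        return lambda tape, ptr: ptr + n
    if kind == "loop":
        body = compile_seq(node[1])
        def run(tape, ptr):
            while tape[ptr] != 0:
                ptr = body(tape, ptr)
            return ptr
        return run
    raise ValueError(f"unknown node: {kind}")

def compile_seq(nodes):
    """Chain compiled closures; no dispatch happens at run time."""
    fns = [compile_node(n) for n in nodes]
    def run(tape, ptr):
        for f in fns:
            ptr = f(tape, ptr)
        return ptr
    return run
```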

native.rkt #

The main entry point is the compile-to-asm function, which emits some fixed prelude code, the body of the program, and then some fixed postlude code.

There are a bunch of functions prefixed with i/; these help me generate assembly instructions with a particular format. They're mostly just thin wrappers around calls to format.

The emit-asm function does the hard work of walking the AST and generating instructions for each node. There’s a loose set of conventions around various registers; the most important is that x20 always holds the address of the start of the tape, and x21 holds the current pointer. I never use these registers for anything else; you can always access the current value under the program head with [x20, x21].
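In Python terms, the i/ helpers and the per-node emission amount to string templating (these function names are hypothetical stand-ins, not the actual Racket code; the register convention matches the one above):

```python
def i_ldrsb(dst, base, off):
    # Load signed byte; the current cell is always at [x20, x21].
    return f"    ldrsb   {dst}, [{base}, {off}]"

def i_add_imm(dst, src, imm):
    return f"    add     {dst}, {src}, #{imm}"

def i_strb(src, base, off):
    return f"    strb    {src}, [{base}, {off}]"

def emit_add(amount):
    """Emit the load/modify/store sequence for an (add n) node."""
    return [
        i_ldrsb("w11", "x20", "x21"),
        i_add_imm("w11", "w11", amount),
        i_strb("w11", "x20", "x21"),
    ]
```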

There are two special functions for emitting the assembly for search-0 nodes: one for forward scans ([>]) and one for backward scans ([<]). This is just done for organizational convenience.

Optimizations #

The entry point for optimizing the BF AST is the optimize function in interp_threaded_opt.rkt. It chains the optimization passes together; each pass takes a whole-program AST and functionally returns a new whole-program AST. The optimizations, in the order they are performed:

  1. combine-instrs

    This pass takes sequences of instructions like +++, initially represented as (list (add 1) (add 1) (add 1)), and turns them into a single (add 3). Same thing for pointer shift instructions.

  2. opt/useless

    This cleans up instructions like (shift 0) or (add 0) that have no effect.

  3. opt/add

    This recognizes patterns like [>+<-], which adds the value of one cell to another. It handles arbitrary shift and add amounts, as long as they’re balanced. Kind of like a baby opt/basic-loop pass.

  4. opt/zero-out

    Handles cases of [-] or [+] to set the cell to 0.

  5. opt/basic-loop

    Does the basic loop optimization we discussed in class. Easily the most complicated optimization. It abstractly evaluates the body of a loop; it sets the initial pointer to 0 and tracks what additions go to which offsets. Bails out if anything gets too complicated.

  6. opt/zero-add->set

    Optimizes cases of [-]+++ to just set the current cell value to 3.

  7. opt/0-scan

    Scan loops, as discussed in class.
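To make the first pass concrete: combine-instrs is essentially run-length encoding over adjacent add and shift nodes, recursing into loop bodies. A Python sketch over tuple-shaped nodes (not the actual Racket code) — note that it can produce (shift 0) nodes, which is exactly what opt/useless then cleans up:

```python
def combine_instrs(nodes):
    """Fuse runs of adjacent ("add", n) / ("shift", n) nodes."""
    out = []
    for node in nodes:
        if node[0] == "loop":
            node = ("loop", combine_instrs(node[1]))  # recurse into bodies
        if out and out[-1][0] == node[0] and node[0] in ("add", "shift"):
            out[-1] = (node[0], out[-1][1] + node[1])  # fuse with previous
        else:
            out.append(node)
    return out
```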

Evaluation #

Extensive tests (bfcheck.pl) #

I precompiled and ran all the files in the check/ folder provided. I noticed that the first run of the resulting binaries took a long time, and that they all ran almost instantly afterwards. I also noticed a bunch of network activity; it seems macOS was zealously checking the signature of each binary over the network the first time it ran. The benchmarks themselves run quickly, so this added up. The numbers reported here are after running the tests one time to avoid the network penalty.

| Optimization set | Total time (seconds) |
|--------------------|----------|
| No optimizations | 0.763064 |
| Basic loops | 0.767731 |
| Scan loops | 0.790222 |
| Scan & basic loops | 0.763272 |

There’s not a whole lot of variance.

Benchmarks (mandel.b, etc.) #

I ran my compiler on some of the benchmarks from the BF benchmarks suite we’ve been using. I also tried out a few with the cgbfi.b BF-interpreter-in-BF.

Table 1: Execution times for two benchmarks, natively compiled. All times in seconds.
| Benchmark | No opts | Basic loops | Scan loops | Scan & basic loops |
|-----------|---------|-------------|------------|--------------------|
| hanoi.b | 0.11 | 0.05 | 0.11 | 0.05 |
| mandel.b | 0.73 | 0.49 | 0.73 | 0.49 |

Seems that basic loop optimization is most important for the Hanoi and Mandelbrot benchmarks. I also tried the Sierpinski triangle benchmark but it never ran long enough for me to notice any difference.

Contrast with benchmarks running under the cgbfi.b interpreter: scan loops are far and away the most important optimization:

Table 2: Execution times for benchmarks running under the cgbfi.b interpreter. All times in seconds. *Gave up after 2 hours; estimated duration recorded. †No attempt.
| Benchmark | No opts | Basic loops | Scan loops | Scan & basic loops |
|-----------|---------|-------------|------------|--------------------|
| serptri.b | 0.07 | 0.06 | 0.03 | 0.02 |
| mandel.b | *44,052.00 | † | † | 7,818.96 |

Example optimizations #

Basic loops #

This code:

+++[->+>-<<]

Gets optimized to this:

ldrsb   w11, [x20, x21] ; add 3
add     w11, w11, #3
strb    w11, [x20, x21]
add     x22, x20, x21   ; mult-block-0
ldrsb   x23, [x22]
add     x24, x22, #1
ldrsb   x25, [x24]
mov     x11, x23
add     w11, w25, w11
strb    w11, [x24]
add     x24, x22, #2
ldrsb   x25, [x24]
mov     x11, x23
subs    w11, w25, w11
strb    w11, [x24]
strb    wzr, [x20, x21]

Loops containing I/O or nested loops are not covered by this optimization, even when the nested loop is something as simple as setting a cell to 0.
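The abstract evaluation behind opt/basic-loop can be sketched like this (a Python stand-in over tuple-shaped nodes; the real pass lives in interp_threaded_opt.rkt). It walks the body tracking a symbolic pointer and the net addition at each offset, and bails with None on anything it can't handle:

```python
def try_basic_loop(body):
    """Abstractly run a loop body: track net additions per tape offset.
    Returns an offset -> amount map when the loop qualifies, else None."""
    ptr, deltas = 0, {}
    for node in body:
        kind = node[0]
        if kind == "shift":
            ptr += node[1]
        elif kind == "add":
            deltas[ptr] = deltas.get(ptr, 0) + node[1]
        else:
            return None  # I/O, nested loop, etc.: bail out
    # Must end back at offset 0 and decrement the counter by exactly 1.
    if ptr != 0 or deltas.get(0) != -1:
        return None
    del deltas[0]
    return deltas  # becomes the body of a mult-block-0 node
```

For the loop body of [->+>-<<] this yields {1: 1, 2: -1}: add the counter once at offset 1, subtract it once at offset 2.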

Scan loops #

This code:

[>>>>]

Gets optimized to this:

    adrp      x22, lIDX@PAGE              ; search-0
    ldr       q0, [x22, lIDX@PAGEOFF]     ; v0 = idx vector
    adrp      x22, lSTRIDE4@PAGE
    ldr       q3, [x22, lSTRIDE4@PAGEOFF] ; v3 = stride mask
    movi.2d   v1, #0                      ; v1 = zero vect
    subs      x21, x21, #16
LBB0_0:
    add       x21, x21, #16
    add       x22, x20, x21
    ld1       {v2.16b}, [x22]             ; v2 = tape
    cmeq.16b  v4, v2, v1                  ; v4 = tape == zeros
    and.16b   v4, v4, v3                  ; mask with stride
    orn.16b   v4, v0, v4                  ; idx or !(tape == zeros)
    uminv.16b b5, v4                      ; find smallest idx
    umov      w22, v5.b[0]
    subs      w11, w22, #255
    beq       LBB0_0
    add       x21, x21, x22

The first few lines load up the 0 and stride mask vectors.

The only strides I optimize are 1, -1, 2, -2, 4, -4, 8, and -8. At stride 16, you’re only getting one check per loop; might as well just do the basic iteration instead of firing up the vector unit.
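For reference, the scalar semantics the vector code has to match is just this (Python sketch): step by the stride until landing on a zero cell. The SIMD version examines 16 cells per iteration, but only every stride-th lane can actually be a stopping point, which is why a stride of 16 would be back down to one useful check per load:

```python
def search_0(tape, ptr, stride):
    """What [>], [<<<<], etc. mean: step by `stride` until a zero cell."""
    while tape[ptr] != 0:
        ptr += stride
    return ptr
```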
