Towards the Fastest Brainf*** Implementation Ever
22 Oct 2024
In my last post I described how I made a very fast BF interpreter. Well, there’s a lot more speed to be had with an optimizing compiler. This post is a write-up of my assignment for a compilers class, so it’s a little rougher than my usual posts.
You can find the code at the following places:
Code organization #
Files for the compiler:

- `interp_threaded_opt.rkt`: BF parser and all optimizations live here.
- `native.rkt`: Takes the optimized AST from `interp_threaded_opt.rkt` and emits native ARM instructions for my M1 Pro chip.
`interp_threaded_opt.rkt` #
The first role of this module is to define an AST for BF. My AST is represented as Racket structs; these are the nodes that I have:
| Node | Fields | Comment |
|---|---|---|
| `loop` | `body` | Loops hold their entire body within them. There is a parsing pass that matches up all `[` `]` pairs. |
| `add` | `amount` | Sequences of `+` or `-` get smooshed together into single add instructions. |
| `shift` | `amount` | Like `add`, but for `>` and `<`. |
| `set-cell` | `value` | Set the current cell to a specific value. Used to optimize `[-]` and the like. |
| `bf-write` | | |
| `bf-read` | | |
| `add-cell-0` | `dest` | Zero the current cell; add its old value to another cell. This is a common loop that I recognized and optimized. |
| `mult-block-0` | `body` | Result of optimizing simple loops. Body is a map from tape offset → value. |
| `search-0` | `stride` | Representation of scan loops like `[<<<<]`. |
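Concretely, the node set might look something like this as Racket structs (a sketch with field names from the table above; the real definitions in `interp_threaded_opt.rkt` may differ):

```racket
;; Sketch of the AST nodes as Racket structs (illustrative only).
(struct loop (body) #:transparent)          ; holds its entire body
(struct add (amount) #:transparent)         ; runs of + / -
(struct shift (amount) #:transparent)       ; runs of > / <
(struct set-cell (value) #:transparent)     ; e.g. [-]+++ becomes (set-cell 3)
(struct bf-write () #:transparent)          ; .
(struct bf-read () #:transparent)           ; ,
(struct add-cell-0 (dest) #:transparent)    ; zero cell, add old value to dest
(struct mult-block-0 (body) #:transparent)  ; body: hash of tape offset -> value
(struct search-0 (stride) #:transparent)    ; scan loops like [<<<<]
```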
The rest of the module is devoted to optimizations performed on the AST. Initially, only `loop`, `add`, `shift`, `bf-write`, and `bf-read` nodes are present, but more get added through the optimization passes. See § Optimizations for details on all the optimization passes.
There is a `compile` function, but this just compiles the optimized AST down to Racket closures with a threaded interpreter.
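As a rough sketch of what that looks like (assuming the structs sketched earlier; the real `compile` handles every node type and does I/O):

```racket
;; Minimal sketch of compiling nodes to chained closures: each closure
;; does its work, then tail-calls the next one with the updated pointer.
;; Assumes the struct definitions sketched above and #lang racket.
(define (compile-seq nodes next)
  (foldr compile-node next nodes))

(define (compile-node node next)
  (match node
    [(add n)
     (λ (tape ptr)
       (bytes-set! tape ptr (modulo (+ (bytes-ref tape ptr) n) 256))
       (next tape ptr))]
    [(shift n)
     (λ (tape ptr) (next tape (+ ptr n)))]
    [(loop body)
     ;; head tests the current cell; run-body executes the loop body
     ;; and jumps back to head.
     (letrec ([head (λ (tape ptr)
                      (if (zero? (bytes-ref tape ptr))
                          (next tape ptr)
                          (run-body tape ptr)))]
              [run-body (compile-seq body head)])
       head)]))
```

Running a program is then just `((compile-seq ast (λ (tape ptr) (void))) (make-bytes 30000) 0)`.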
`native.rkt` #
The main entry point is the `compile-to-asm` function, which emits some fixed prelude code, the body of the program, and then some fixed postlude code.

There are a bunch of functions prefixed with `i/`; these help me generate assembly instructions with a particular format. They’re mostly just wrappers around calls to `format`.
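For example (these particular helpers are my guess at the shape, not necessarily the real names):

```racket
;; Illustrative i/ helpers: thin wrappers that format one instruction.
(define (i/add dst src imm)
  (format "    add ~a, ~a, #~a" dst src imm))

(define (i/strb src addr)
  (format "    strb ~a, ~a" src addr))
```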
The `emit-asm` function does the hard work of walking the AST and generating instructions for each node. There’s a loose set of conventions around various registers; the most important is that `x20` always holds the address of the start of the tape, and `x21` holds the current pointer. I never use these registers for anything else; you can always access the current value under the program head with `[x20, x21]`.
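Under that convention, an `(add n)` node lowers to a load/modify/store of the current cell, along these lines (a sketch for small positive `n`; the instructions mirror the first three lines of the basic-loops example later in the post):

```racket
;; Sketch: lowering (add n) under the x20/x21 convention.
(define (emit-add n)
  (list "    ldrsb w11, [x20, x21]"          ; load current cell
        (format "    add w11, w11, #~a" n)   ; apply the combined amount
        "    strb w11, [x20, x21]"))         ; store it back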
There are two special functions for emitting the assembly for `search-0` nodes: one for forward scans (`[>]`) and one for backward scans (`[<]`). This is just done for organizational convenience.
Optimizations #
The entry point to optimizing the BF AST is the `optimize` function in `interp_threaded_opt.rkt`. This chains optimizations together; each optimization function takes a whole-program AST and functionally returns a whole-program AST (see the sketch after this list). The optimizations are listed in the order they are performed:
- `combine-instrs`: This pass takes sequences of instructions like `+++`, initially represented as `(list (add 1) (add 1) (add 1))`, and turns them into `(add 3)`. Same thing for pointer shift instructions.
- `opt/useless`: This cleans up instructions like `(shift 0)` or `(add 0)` that have no effect.
- `opt/add`: This recognizes patterns like `[>+<-]`, which adds the value of one cell to another. It handles arbitrary shift and add amounts, as long as they’re balanced. Kind of like a baby `opt/basic-loop` pass.
- `opt/zero-out`: Handles cases of `[-]` or `[+]` to set the cell to 0.
- `opt/basic-loop`: Does the basic loop optimization we discussed in class. Easily the most complicated optimization. It abstractly evaluates the body of a loop: it sets the initial pointer to `0` and tracks which additions go to which offsets. Bails out if anything gets too complicated.
- `opt/zero-add->set`: Optimizes cases of `[-]+++` to just set the current cell value to 3.
- `opt/0-scan`: Scan loops, as discussed in class.
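Wired together, the pipeline amounts to something like this (a sketch; the real `optimize` may thread the passes differently):

```racket
;; Each pass maps a whole-program AST to a whole-program AST, so the
;; pipeline is just function composition, applied in the order above.
(define (optimize ast)
  (opt/0-scan
   (opt/zero-add->set
    (opt/basic-loop
     (opt/zero-out
      (opt/add
       (opt/useless
        (combine-instrs ast))))))))
```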
Evaluation #
Extensive tests (`bfcheck.pl`) #
I precompiled and ran all the files in the provided `check/` folder. The first run of each resulting binary took a long time, and they all ran almost instantly afterwards. There was also a bunch of network activity; it seems macOS was being zealous in sending signatures for each of the binaries the first time they were run. The benchmarks run quickly, so this overhead added up. The numbers reported here were taken after running the tests once to avoid the network penalty.
| Optimization set | Total time (seconds) |
|---|---|
| No optimizations | 0.763064 |
| Basic loops | 0.767731 |
| Scan loops | 0.790222 |
| Scan & basic loops | 0.763272 |
There’s not a whole lot of variance.
Benchmarks (`mandel.b`, etc.) #
I ran my compiler on some of the benchmarks from the BF benchmark suite we’ve been using (times in seconds). I also tried out a few with the `cgbfi.b` BF-interpreter-in-BF.
| Benchmark | No opts | Basic loops | Scan loops | Scan & basic loops |
|---|---|---|---|---|
| `hanoi.b` | 0.11 | 0.05 | 0.11 | 0.05 |
| `mandel.b` | 0.73 | 0.49 | 0.73 | 0.49 |
Seems that basic loop optimization is most important for the Hanoi and Mandelbrot benchmarks. I also tried the Sierpinski triangle benchmark but it never ran long enough for me to notice any difference.
Contrast that with the benchmarks running under the `cgbfi.b` interpreter, where scan loops are far and away the most important optimization:
| Benchmark | No opts | Basic loops | Scan loops | Scan & basic loops |
|---|---|---|---|---|
| `serptri.b` | 0.07 | 0.06 | 0.03 | 0.02 |
| `mandel.b` | *44,052.00 | † | † | 7,818.96 |
Example optimizations #
Basic loops #
This code:
```
+++[->+>-<<]
```
Gets optimized to this:
```
ldrsb w11, [x20, x21] ; add 3
add w11, w11, #3
strb w11, [x20, x21]
add x22, x20, x21 ; mult-block-0
ldrsb x23, [x22]
add x24, x22, #1
ldrsb x25, [x24]
mov x11, x23
add w11, w25, w11
strb w11, [x24]
add x24, x22, #2
ldrsb x25, [x24]
mov x11, x23
subs w11, w25, w11
strb w11, [x24]
strb wzr, [x20, x21]
```
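In AST terms, `opt/basic-loop` turns the loop `[->+>-<<]` into something like the following (my guess at the exact representation, keyed by tape offset per the AST table above):

```racket
;; Offset +1 gains the old value once; offset +2 loses it once.
;; The current cell is zeroed afterwards (the strb wzr above).
(mult-block-0 (hash 1 1
                    2 -1))
```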
Any loop with I/O or nested loops is not covered by this, even when the nested loop is something as simple as setting a cell to 0.
Scan loops #
This code:
```
[>>>>]
```
Gets optimized to this:
```
adrp x22, lIDX@PAGE ; search-0
ldr q0, [x22, lIDX@PAGEOFF] ; v0 = idx vector
adrp x22, lSTRIDE4@PAGE
ldr q3, [x22, lSTRIDE4@PAGEOFF] ; v3 = stride mask
movi.2d v1, #0 ; v1 = zero vect
subs x21, x21, #16
LBB0_0:
add x21, x21, #16
add x22, x20, x21
ld1 {v2.16b}, [x22] ; v2 = tape
cmeq.16b v4, v2, v1 ; v4 = tape == zeros
and.16b v4, v4, v3 ; mask with stride
orn.16b v4, v0, v4 ; idx or !(tape == zeros)
uminv.16b b5, v4 ; find smallest idx
umov w22, v5.b[0]
subs w11, w22, #255
beq LBB0_0
add x21, x21, x22
```
The first few lines load up the index, zero, and stride-mask vectors.

The only strides I optimize are 1, -1, 2, -2, 4, -4, 8, and -8. At stride 16, you’re only getting one check per 16-byte chunk; might as well just do the basic iteration instead of firing up the vector unit.
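To make the vector tricks concrete, here is a scalar Racket model of what one iteration of that loop computes, based on my reading of the assembly above (illustrative only; it ignores alignment and only models forward strides):

```racket
;; Model of one 16-byte chunk of the vectorized search-0 loop.
;; Returns the within-chunk index of the first zero cell that lies
;; on the stride, or 255 if there is none (mirroring uminv).
(define (scan-chunk tape base stride)
  (define hits
    (for/list ([i (in-range 16)])
      (if (and (zero? (modulo i stride))            ; on-stride lane (the stride mask)
               (zero? (bytes-ref tape (+ base i)))) ; cell is zero (cmeq)
          i          ; keep the lane's index (orn against the idx vector)
          255)))     ; every other lane collapses to 255
  (apply min hits))  ; uminv: smallest surviving index
```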