Evolving Languages Faster with Type Tailoring
15 Jul 2024
Programming languages are too slow! I’m not talking about execution speed—I’m talking about evolution speed. Programmers are always building new libraries and embedded DSLs, but the host programming language—particularly its type system—doesn’t understand the domain-specific aspects of these things.
Example problem: my type system doesn’t speak PCRE #
Consider regular expressions—most programmers would understand that a regular expression like [a-z]+([0-9][0-9])
, if it matches, will capture two digits in the first capture group. If I try to write this code in Rust or Typed Racket, the type checker complains:
|
|
error:
rustc [E0277]: `Option<regex::Match<'_>>` doesn't implement `std::fmt::Display`
the trait `std::fmt::Display` is not implemented for `Option<regex::Match<'_>>`
|
|
error:
example.rkt:7:22: Type Checker: Polymorphic function `second' could not be applied to arguments:
Types: (List* a r (Listof t)) -> (r : ((! (cadr (0 0)) False) | (: (cadr (0 0)) False)) : (cadr (0 0)))
(Listof a) -> a
Arguments: (Pairof String (Listof (U False String)))
Expected result: String
The (U τ₁ τ₂)
syntax is a type union of types τ₁
and τ₂
, whatever types those are. The equivalent of Option<String>
in Typed Racket is (U String False)
.
The problem is that getting the first capture group (caps.get(1)
in Rust, (second m)
in Typed Racket) returns an optional type (Option<regex::Match>
in Rust, (Listof (U False String))
. The thing is, we know that since the regex match succeeded on line 5, caps.get(1)
(or (second m)
) should definitely succeed because there was one capture group in the regex we matched. Instead, we’re forced to unwrap in Rust:
println!("id number: {}", caps.get(1).unwrap().as_str());
Likewise, in Typed Racket, we have to insert some casts and checks manually to convince the type checker that this code should run.
This is a small example; the key point is this: type systems usually aren’t smart enough to know about the structure of regexes—all the compiler sees are opaque strings. This goes for things beyond regular expressions. Consider SQL queries, which are so often embedded as strings in languages: when a query goes wrong, you usually only find out at runtime.
What would a solution look like? Some people have tried making a new way of writing regexes with composable objects that the type system has a better chance of understanding; that is a neat approach, but that both requires me to rewrite my program and doesn’t solve the issue of indexing into lists of known size (the results of a capture) more efficiently.
Another option would be to make the type system smarter. But that too has its drawbacks: language implementers are often leery of tinkering with the type checker—and rightly so! So much relies on the type checker being correct. Moreover, even if you did make the type checker able to understand regexes, someone’s going to build a new library tomorrow that will stump your type checker just as before.
Here’s a more attractive option: we can use the metaprogramming tools the language provides to teach the type system some new tricks. This is something that end-users of languages can do without waiting for the language designer. We call this type tailoring.
Type Tailoring is the title and subject of a paper I wrote with my advisor Ben Greenman and our coauthors Stephen Chang and Matthias Felleisen. It has been accepted at European Conference on Object-Oriented Programming (ECOOP). Like the ACM conference OOPSLA, ECOOP has in recent years focused on more than object-oriented programming. The name stuck around. ¯\_(ツ)_/¯ You can get a preprint here.
Sketch of a solution #
Here’s a high-level sketch of how we would solve the problem:
- Something would notice that the regex has a single capture group.
- The
re.captures
function would get this information and update the its type. - This information would further by leveraged by the type of
caps
, to indicate thatget(0)
orget(1)
will always succeed.
Some years ago, my advisor made the trivial
library for Typed Racket. It can tailor the following code so that it typechecks and runs efficiently:
The trivial
library is available as a Racket package. If you have Racket installed on your system, run raco pkg add trivial
to install it.
#lang typed/racket
(require trivial trivial/list) ;; add this to tailor program
(define (user-idnum (username : String)) : Number
(define re "[a-z]+([0-9][0-9])")
(define m (regexp-match re username))
(if m
(string->number (second m))
(error "bad username")))
(printf "id number: ~a\n" (user-idnum "dent42"))
Type Checker: Polymorphic function `second' could not be applied to arguments:
Types: (List* a r (Listof t)) -> (r : ((! (cadr (0 0)) False) | (: (cadr (0 0)) False)) : (cadr (0 0)))
(Listof a) -> a
Arguments: (Pairof String (Listof (U False String)))
Expected result: String
id number: 42
For specific details on how the trivial
library works, see § Appendix: How does the trivial library work? at the end of this post.
The problem is that, like Rust, Typed Racket must assign an overly conservative type to the result of matching a regular expression. Consequently, the programmer has to insert casts. The trivial
library can analyze Typed Racket and insert these casts and checks automatically. The end-result for the user is that this code Just Works™ as you would expect.
Notice that the trivial
library is a library—it doesn’t require modifications to the compiler or type checker or anything. This means that normal users of programming languages can create their own tailorings without breaking the compiler or messing with the build pipeline.
Supporting type tailoring #
What do you need to make type tailoring work? Let’s step back a second and look at what we need to do in the first place. Our problem is that the type checker doesn’t know as much about our program as we do. What we can do to teach the type checker is program the elaboration step: surface syntax typically doesn’t have type annotations at every point; elaboration takes the syntax that a programmer writes, and adds types and type holes wherever needed. This elaborated syntax gets sent off to the constraint solver for type checking and inference.
How do we program the elaboration step? Almost all languages that have macros do type checking after macroexpansion. This is crucial for type tailoring. We can write macros that add checks, casts, type annotations, or whatever else we need to make the type checker happy.
Here are the key features that you must have to make type tailoring work:
- Type checking after elaboration
- Type checking must come after elaboration to check the results of tailoring. Without this, it would be too easy to break the type system. Furthermore, if type checking comes after elaboration, we can leverage all the power of the type checker to do the heavy-lifting for us; all a tailoring has to do is give a few hints to the type checker here and there.
- Elaboration-time computation
- Most of the time this means that you need procedural macros. Pattern-based macros (such as
syntax-case
from Scheme ormacro_rules!
from Rust) can only rearrange syntax, in a pattern → pattern transformation, and can’t perform arbitrary rewrites. - AST datatype
- Without an AST datatype, tailorings are limited to using a token stream. Rust’s procedural macros operate on token streams unfortunately. Sometimes it’s possible to convert a token stream to an AST, but you loose metadata and becomes unwieldy quickly.
These are the essential elements, without which tailoring can’t happen. Besides these three things, you also will want some of the following tailoring features:
- Hygienic macros
- Hygienic macros avoid the variable capture problem. For more on the variable capture problem, see my post about fearless macros. In other words, I shouldn’t have to be concerned about the internals of the macros that I use. This also makes it so that I can compose macros with each other.
- Metadata
- Metaprogramming systems that can attach metadata directly to AST nodes can share information between different tailorings easily. (Keeping compile-time state off to the side is an alternative.)
- Controlling the order of expansion
- Tailorings that cooperate often need a way to control the order in which they run: one tailoring might depend on the results of another, and a third tailoring might analyze the output further.
- Accessing external data
- Some of the coolest tailorings reached out to external sources of data to augment type checking. Rust actually has a neat library called SQLx that, at compile time, checks SQL strings against the schema of a database. There are several systems that do something similar.
- Type information
- A few of the systems that we looked at (Idris 1 and Scala 3) could inspect the types of arguments to macros. After expansion, the type checker would run again to check that the transformation’s result was well-typed. Since there were so few examples of this, it’s hard to say just how beneficial this is.
No language supports all of these features—that’s exciting because it means there’s room to explore! In our paper we go into detail about each of these features and the kinds of tailoring they enable. We also have a chart showing how a handful of languages stack up against each other.
You might have invented type tailoring #
We invented the term “type tailoring”, but we didn’t invent the idea—programmers have wanted to teach their type systems how to understand domain-specific concerns for a long time. Here are just a few existing projects we found that were doing type tailoring:
-
Rust’s SQLx library reaches out to the database at compile-time to check if the schema in the code matches how the database is set up. This will warn you at compile-time if your query is malformed.
-
Julia’s StaticArrays package rewrites lists of a static, known size into tuples. This lets the compiler track how long the lists are and automatically eliminates lots of bounds checks—handy when you’re doing lots of numerical work.
-
Elixir’s Phoenix web framework will check routes in your template files against your route handler; if you make a typo or forget to implement a handler for a route, Phoenix will warn you at compile-time. This feature is called verified routes.
Again, that’s just a small sample. Please see our paper for details on how these projects are using type tailoring, as well as for more examples that we found.
Type tailoring: new terms, new libraries, new horizons #
The big contributions of our paper are:
-
We introduce the term type tailoring. The ideas have appeared in many forms across many languages, but there hasn’t been any underlying rationale unifying their efforts. Now that we’ve identified the phenomenon, we can talk about and support it directly and consciously.
-
We identified the main things you need to make tailoring work. Language designers can use this to build in better support for type tailoring in their languages.
-
We show users how tailorings can balance ease-of-use with features typically only found in dependent type systems.
Furthermore, we built two libraries: trivial
for Racket—which tailors things like vectors, regular expressions, etc., and Dyn
for Rhombus—which turns Rhombus into a gradually-typed language through a tailoring. We expect more will be built in the future.
Again, please see our paper for all the details. Our paper comes with an artifact that contains all the code in the paper. You can simply download a Docker container to run the code and verify all our claims. Yay for reproducible research!
If you have any questions, feel free to email me. (Email in the paper, as well as here on my blog.) If you’re going to ECOOP in Vienna this year, let me know and we can talk in person there!
Appendix: How does the trivial library work? #
Here is the example with the trivial
library that we saw in § Sketch of a solution:
#lang typed/racket
(require trivial trivial/list) ;; add this to tailor program
(define (user-idnum (username : String)) : Number
(define re "[a-z]+([0-9][0-9])")
(define m (regexp-match re username))
(if m
(string->number (second m))
(error "bad username")))
(printf "id number: ~a\n" (user-idnum "dent42"))
Normally, Typed Racket rejects the code because the type of m
is too general to be applied to second
. Here is how the trivial
library tailors the example to make it work:
-
First, it overrides the implicit
#%datum
form that wraps literal values like the string"[a-z]+([0-9][0-9])"
; this lets it read the string and collect any interesting information about it at compile time. -
The library sees that the string has one set of matched parentheses; moreover, the pattern inside the parentheses consists entirely of digits. It attaches this information as a syntax property to the syntax object for that string.
-
This information gets propagated to all occurrences of the identifier
re
. -
The library also overrides
regexp-match
, so that it looks at the syntax properties on its first argument. In this case, it sees thatre
is a string with one capture group. The library updates the return type ofm
from(Pairof String (Listof (U False String)))
to(U (List String String) False)
. -
In the true branch of the
if
statement, Typed Racket is automatically able to refine the type ofm
to(List String String)
. -
The
trivial
library overridessecond
to check the type of its argument; it sees that(List String String)
is long enough for this call to succeed, so it tailors this to a faster, unsafe lookup function. -
string->number
also gets overridden to look at the information about the match. Since step 2 was able to see that the match consists only of digits, it updates its type from returning(U Complex False)
toNumber
.
That’s a lot going on! The neat thing is that trivial
is able to do all this in a fairly generalized way: one component works with strings, another works with regular expressions, and another works with lists of known size. They’re able to share all this information through syntax properties which respect the scoping rules of Typed Racket. It also plays nicely with other metaprogrammings; we could have written a macro that e.g., turns if
into not-if
and flips the branches, but the information we needed about the m
variable still would have gotten to the right place.
Unfortunately, there’s not a way right now that we could make this example work for Rust—at least, not in its current form. That’s because different languages have different support for different kinds of tailoring. In our paper, we explore all the different dimensions for how languages can support tailorings.
Appendix: Building a tiny type tailoring #
Here is how you might build a little tailoring. In this tailoring, we’d like to turn array lookups that are known to be in-bounds into versions that use a fast, unsafe lookup. In a dependently typed language, we would be able to do this by looking at index and the type of the vector, since the type of the vector would include its length. In Racket, we have no such power. However, we can get a little bit of that power through a tailoring.
This section is taken almost verbatim from §3.8 from our paper. This code actually runs!
Here’s an example of what we would like to tailor: if we know the length of a vector and the index, then we can either tailor to unsafe-vector-ref
, which skips the bounds check, or tailor to an error if the index is out-of-bounds. Otherwise, if we don’t know either the length of the vector or the index, we stick with the safe vector-ref
function which does run the bounds check:
(vector-ref (vector 5 2 8) 1) ; tailors to → (unsafe-vector-ref (vector 5 2 8) 1)
(vector-ref (vector 4 9 1) 4) ; tailors to → error: out of bounds
(vector-ref (read-vec) 9) ; tailors to → (vector-ref (read-vec) 9)
Here’s how we do it:
|
|
The (provide (rename-out …))
bit says that, when this module gets imported, export the function tailored-vector-ref
under the name vector-ref
. This means, whenever this module gets imported, all calls to vector-ref
automatically use tailored-vector-ref
.
|
|
Now we import some helpers: racket/unsafe/ops
gives us the unsafe-vector-ref
function, which is what we tailor to. This function can segfault if misused, so we have to be careful. We pull in syntax/parse
to get the excellent syntax-parse
macro facility. We also pull in four symbols from tailoring-api.rkt
, which deserve a mention:
⇝
- This is a syntax class that triggers macro expansion on a subexpression, which allows this tailoring to discover static information about its different components.
φ
- This is a function that uses Racket’s syntax properties to store and retrieve static information.
V
,I
- These are unique keys (created with
gensym
) that we use for looking up vector length and integer value information. Since they’re unique, other tailorings won’t accidentally collide with this information.
The file tailoring-api.rkt
is small and really just provides these convenience functions; you can find it in our artifact.
Now comes the meat of the tailoring: the tailoring is a macro so that it can statically rewrite source code. First it parses its input syntax object stx
to extract and expand two subexpressions e1
and e2
:
|
|
Now the tailoring checks whether these expanded expressions have the static information needed; specifically, it looks for a vector length (key: V
) and an integer value (key: I
):
|
|
If the information is present, we expand to unsafe-vector-ref
if safe to do so; otherwise we expand to code that raises an error. If the information is not present, fall back to plain-old vector-ref
:
|
|
That’s it! A Racket program can import this tailoring to make this code run faster:
#lang racket
(require "mini-tailoring.rkt")
(vector-ref (vector 5 2 8) 1) ; gets tailored to (unsafe-vector-ref (vector 5 2 8) 1)
Save that program to a file and run raco expand <filename>
to see the expanded version; you should see unsafe-vector-ref
in the expanded code.