All Aboard the Bootstrap Transpiler Express
The Hardly Original Compiler Compiler (`hocc`) is now fully functional! Jump aboard for some compelling late-stage developments, and stick around to read what comes next for Hemlock development.
hocc developments
`hocc` development started in early 2022, and it turned out to be a much more challenging project than intended due to inclusion of the IELR(1) algorithm. See these previous blog posts for information about this and other challenges.
- Hardly Original Compiler Compiler
- Hardly Hazardous Hoccing Gleans Great Grammars
- Parsing Retrospective
- More Precise Documentation
- Constructive Precedence Relationships
- State of the hocc Grammar
- Recognizing Follow Sets
- Art of the State
- Resolution Convolution
- Now Back to Our Regularly Scheduled Programming Language
Bootstrapping
`hocc` is a parser generator, and as the “compiler compiler” moniker suggests, parser generators are a curious part of compiler bootstrapping, because they themselves are compilers which need bootstrapping. I manually wrote a recursive descent parser for `hocc`, and until quite recently my intention was to keep it indefinitely in order to simplify Hemlock bootstrapping. However, it became clear while implementing the `hocc` code generator (aka the compiler back end) that there were some creaky parts of the grammar that would require non-trivial refactoring. Anyone who tells you that recursive descent parsers are easy to work with is unlikely to have experienced the productivity gains afforded by a parser generator relative to the soul-crushing, error-prone slog of manual parser refactoring. Consider the following function from pre-bootstrapped `hocc`, which will serve to illustrate this and other later points.
```ocaml
and code_tl spine ctx =
  let spine = "code_tl" :: spine in
  let ctx', tok = next ~all:true spine ctx in
  match tok with
  | HmcToken {atok=Hmc.Scan.AbstractToken.(
      Tok_indent _|Tok_lparen|Tok_lcapture|Tok_lbrack|Tok_larray
      |Tok_lcurly); _} ->
    mapr ~child:delimited ~f:(fun spine ctx' delimited ->
      map ~child:code_tl ~f:(fun code_tl ->
        CodeTlDelimited {delimited; code_tl}
      ) ~fmt_child:fmt_code_tl spine ctx'
    ) spine ctx
  | HmcToken {atok=Hmc.Scan.AbstractToken.(
      Tok_dedent _|Tok_rparen|Tok_rcapture|Tok_rbrack
      |Tok_rarray|Tok_rcurly|Tok_line_delim|Tok_semi|Tok_bar); _} ->
    reduce spine ctx fmt_code_tl CodeTlEpsilon
  | HmcToken _ as token_ ->
    map ~child:code_tl ~f:(fun code_tl ->
      CodeTlToken {token_; code_tl}
    ) ~fmt_child:fmt_code_tl spine ctx'
  | _ -> reduce spine ctx fmt_code_tl CodeTlEpsilon
```
There’s some hair in this code (`spine` and `fmt_*`) related to parser tracing, a tax that all parsing functions had to pay, and some mildly mind-warping tail-recursive dispatch in `mapr` that I won’t get into. Now, take a look at the pattern with `Tok_dedent` and friends in it. That’s a follow set, i.e. the token types that can follow the `CodeTlEpsilon` grammar rule; they are excluded from `CodeTlToken`. There are footguns here.
- The follow set is explicitly enumerated rather than calculated/encapsulated, here and elsewhere in the parser. But if we don’t do it this way, then we give up on pattern matching.
- Manually listing all the valid token types for the `CodeTlToken` rule would be onerous, and in fact the parser has several latent bugs as a result of using catchall patterns. But exhaustively listing the token types would have an even worse copypasta effect than for the follow set pattern.
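To make the first footgun concrete, here is a minimal OCaml sketch (not `hocc`’s actual implementation; only the token names come from the function above) of the alternative the bullet alludes to: stating the follow set once as data and checking membership. The tradeoff is immediately visible: a set lookup replaces the match pattern, so the compiler can no longer verify exhaustiveness for us.

```ocaml
(* Hypothetical sketch: encapsulate the follow set as a computed set
   rather than re-enumerating token types in each match pattern. *)
module TokSet = Set.Make (String)

(* Follow set of CodeTlEpsilon, stated once instead of per pattern. *)
let code_tl_epsilon_follow = TokSet.of_list
  ["Tok_dedent"; "Tok_rparen"; "Tok_rcapture"; "Tok_rbrack"; "Tok_rarray";
   "Tok_rcurly"; "Tok_line_delim"; "Tok_semi"; "Tok_bar"]

(* Membership test replaces pattern matching; no exhaustiveness checking. *)
let in_follow tok = TokSet.mem tok code_tl_epsilon_follow

let () =
  assert (in_follow "Tok_semi");
  assert (not (in_follow "Tok_lparen"))
```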
Contrast that with the equivalent `hocc` code.
```hocc
nonterm CodeTl of nonterm_code_tl ::=
  | delimited:Delimited code_tl:CodeTl ->
    CodeTlDelimited {delimited; code_tl}
  | code_token:CodeToken code_tl:CodeTl ->
    CodeTlCodeToken {code_token; code_tl}
  | epsilon prec pCodeTl -> CodeTlEpsilon
```
The addition of the `CodeToken` grammar rule was a direct result of `hocc` pointing out a grammar ambiguity that the recursive descent parser accidentally dodged via context-sensitive token classification. The bootstrapped `hocc` parser is more reliable than its predecessor, in addition to being easier to work with.
Follow-on creature comforts
The impetus for bootstrapping the `hocc` parser was a minor flaw in type syntax, but the ease of refactoring also made two enhancements feasible:

- In several places it was necessary to do secondary pattern binding inside reduction callback code blocks. This required semi-duplicated callback code blocks with a bunch of boilerplate. For example:
```hocc
nonterm CodeToken of X.nonterm_code_token ::=
  # [...]
  | token_:UIDENT -> let X.(UIDENT {token_}) = token_ in CodeToken {token_}
  | token_:CIDENT -> let X.(CIDENT {token_}) = token_ in CodeToken {token_}
  | token_:"_" -> let X.(USCORE {token_}) = token_ in CodeToken {token_}
  | token_:ISTRING -> let X.(ISTRING {token_}) = token_ in CodeToken {token_}
  # [...]
```
With richer binding patterns this becomes:
```hocc
nonterm CodeToken of nonterm_code_token ::=
  # [...]
  | (UIDENT {token}):UIDENT
  | (CIDENT {token}):CIDENT
  | (USCORE {token}):"_"
  | (ISTRING {token}):ISTRING
  # [...]
  -> CodeToken {token}
```
- You may have noticed the use of `token_` as an identifier above. This was due to it being a `hocc` keyword. But it turns out that the keywords can be made contextual, such that with the exception of some restrictions on the `hocc` keyword, all the keywords can be used in embedded code without issue, as in the `CodeToken` patterns and reduction callback just above. This would have been really hard to get right in a recursive descent parser, but `hocc` enabled quick and safe refactoring (followed by quite a bit of misery manually removing all the unnecessary `_` identifier suffix mangling).
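One way to picture contextual keywords is a classifier that consults the surrounding context before deciding whether a lexeme is a keyword. The following OCaml sketch is purely illustrative (the context and token types are hypothetical, not `hocc`’s internals; the keyword names are taken from the grammar excerpts above):

```ocaml
(* Hypothetical model of contextual keywords: the same lexeme is a
   keyword at the grammar level but a plain identifier in embedded code. *)
type context = Grammar | EmbeddedCode

type token = Keyword of string | Ident of string

let keywords = ["hocc"; "nonterm"; "token"; "prec"; "epsilon"]

let classify ctx lexeme =
  match ctx with
  | Grammar when List.mem lexeme keywords -> Keyword lexeme
  | _ -> Ident lexeme  (* inside embedded code, keywords are ordinary *)

let () =
  assert (classify Grammar "nonterm" = Keyword "nonterm");
  assert (classify EmbeddedCode "nonterm" = Ident "nonterm")
```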
Bootstrap transpilers
Hemlock development thus far has laid the groundwork for writing the initial Hemlock compiler in OCaml. That currently amounts to ~70,000 lines of OCaml code, not counting ~24,000 lines of `hocc`-generated code. That’s a lot of groundwork, considering that OCaml is already a good compiler implementation tool. The catch is that we wrote an entire standard library in OCaml called `Basis` that closely matches the intended Hemlock `Basis`. The idea was that even though we’d be faced with eventually rewriting the OCaml-based compiler in Hemlock, the rewrite would be largely mechanical due to similar standard libraries, which is a lot easier than a compiler redesign. Sometime while implementing both Hemlock and OCaml back ends for `hocc` I realized just how mind-numbingly mechanical this rewrite would be. Sounds like a job for a transpiler; better yet, two transpilers!
We can use a pair of transpilers, which for now I’ll call `mlc` for OCaml->Hemlock, and `hmc` for Hemlock->OCaml. They can share an intermediate representation (IR) that starts off quite simple and grows in functionality as we implement aspects of semantic analysis. The journey from today to a self-hosting Hemlock system requires roughly the following steps.
- Implement an OCaml scanner/parser and a Hemlock parser (the scanner is already implemented), both of which target the same IR, barely removed from the respective abstract syntax trees (ASTs).
- As soon as the current OCaml code can survive OCaml->Hemlock->OCaml transpilation without loss of functionality, do a final OCaml->Hemlock translation, and throw away both the OCaml code and `mlc`.
- Incrementally develop sufficient semantic analysis to support a native back end.
- Implement a native back end (x64 and/or arm64), as well as a native runtime.
- As soon as the native back end and runtime can generate an `hmc` binary that can bootstrap Hemlock, discard the OCaml back end? Alternatively, keep Hemlock language requirements in `hmc` simple enough that `hmc` can continue to be bootstrapped by the OCaml compiler.
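The round-trip gate in the second step can be sketched with a toy shared IR. This is a hypothetical illustration of the idea, not actual `mlc`/`hmc` code; the IR here holds a single binding, and the parsers/emitters are stand-ins for the real front and back ends:

```ocaml
(* Toy shared IR: one binding, e.g. `let x = 42`. The real IR would start
   barely removed from the ASTs and grow with semantic analysis. *)
type ir = Let of string * int

(* Stand-ins for the transpiler back ends. *)
let emit_ocaml (Let (name, v)) = Printf.sprintf "let %s = %d" name v
let emit_hemlock (Let (name, v)) = Printf.sprintf "%s = %d" name v

(* Stand-in for the Hemlock front end: recover the IR from emitted source. *)
let parse_hemlock s = Scanf.sscanf s "%s = %d" (fun name v -> Let (name, v))

let () =
  let prog = Let ("x", 42) in
  (* The bootstrap gate: transpiling out and back must be lossless. *)
  assert (parse_hemlock (emit_hemlock prog) = prog);
  assert (emit_ocaml prog = "let x = 42")
```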
This approach has the obvious advantage of replacing manual OCaml->Hemlock rewriting with the effort of writing `mlc`. Even better, we can start running non-trivial Hemlock programs much sooner than if the native runtime were a hard dependency. There’s likely to be at least one round of Advent of Code based on the OCaml back end, perhaps even this decade!