BranchTaken

Hemlock language insights

All Aboard the Bootstrap Transpiler Express

Locomotive

The Hardly Original Compiler Compiler (hocc) is now fully functional! Jump aboard for some compelling late-stage developments, and stick around to read what comes next for Hemlock development.

hocc developments

hocc development started in early 2022, and it turned out to be a much more challenging project than intended due to inclusion of the IELR(1) algorithm. See these previous blog posts for information about this and other challenges.

Bootstrapping

hocc is a parser generator, and as the “compiler compiler” moniker suggests, parser generators are a curious part of compiler bootstrapping, because they themselves are compilers which need bootstrapping. I manually wrote a recursive descent parser for hocc, and until quite recently my intention was to keep it indefinitely in order to simplify Hemlock bootstrapping. However, it became clear while implementing the hocc code generator (aka the compiler back end) that there were some creaky parts of the grammar that would require non-trivial refactoring. Anyone who tells you that recursive descent parsers are easy to work with is unlikely to have experienced the productivity gains afforded by a parser generator relative to the soul-crushing, error-prone slog of manual parser refactoring. Consider the following function from pre-bootstrapped hocc, which will serve to illustrate this and other later points.

and code_tl spine ctx =
  let spine = "code_tl" :: spine in
  let ctx', tok = next ~all:true spine ctx in
  match tok with
  | HmcToken {atok=Hmc.Scan.AbstractToken.(
    Tok_indent _|Tok_lparen|Tok_lcapture|Tok_lbrack|Tok_larray
    |Tok_lcurly); _} ->
    mapr ~child:delimited ~f:(fun spine ctx' delimited ->
      map ~child:code_tl ~f:(fun code_tl ->
        CodeTlDelimited {delimited; code_tl}
      ) ~fmt_child:fmt_code_tl spine ctx'
    ) spine ctx
  | HmcToken {atok=Hmc.Scan.AbstractToken.(
    Tok_dedent _|Tok_rparen|Tok_rcapture|Tok_rbrack
    |Tok_rarray|Tok_rcurly|Tok_line_delim|Tok_semi|Tok_bar); _} ->
    reduce spine ctx fmt_code_tl CodeTlEpsilon
  | HmcToken _ as token_ ->
    map ~child:code_tl ~f:(fun code_tl ->
      CodeTlToken {token_; code_tl}
    ) ~fmt_child:fmt_code_tl spine ctx'
  | _ -> reduce spine ctx fmt_code_tl CodeTlEpsilon

There’s some hair in this code (spine and fmt_*) related to parser tracing, a tax that all parsing functions had to pay. And some mildly mind-warping tail-recursive dispatch in mapr that I won’t get into. Now, take a look at the pattern with Tok_dedent and friends in it. That’s a follow set, i.e. token types that follow the CodeTlEpsilon grammar rule; they are excluded from CodeTlToken. There are footguns here.

Contrast that with the equivalent hocc code.

    nonterm CodeTl of nonterm_code_tl ::=
      | delimited:Delimited code_tl:CodeTl ->
        CodeTlDelimited {delimited; code_tl}
      | code_token:CodeToken code_tl:CodeTl ->
        CodeTlCodeToken {code_token; code_tl}
      | epsilon prec pCodeTl -> CodeTlEpsilon

The addition of the CodeToken grammar rule was a direct result of hocc pointing out a grammar ambiguity that the recursive descent parser accidentally dodged via context-sensitive token classification. The bootstrapped hocc parser is more reliable than its predecessor, in addition to being easier to work with.

Follow-on creature comforts

The impetus for bootstrapping the hocc parser was a minor flaw in type syntax, but the ease of refactoring also made two enhancements feasible:

Bootstrap transpilers

Hemlock development thus far has laid the groundwork for writing the initial Hemlock compiler in OCaml. That currently amounts to ~70,000 lines of OCaml code, not counting ~24,000 lines of hocc-generated code. That’s a lot of groundwork, considering that OCaml is already a good compiler implementation tool. The catch is that we wrote an entire standard library in OCaml called Basis that closely matches the intended Hemlock Basis. The idea was that even though we’d be faced with eventually rewriting the OCaml-based compiler in Hemlock, the rewrite would be largely mechanical due to similar standard libraries, which is a lot easier than a compiler redesign. Sometime while implementing both Hemlock and OCaml back ends for hocc I realized just how mind-numbingly mechanical this rewrite would be. Sounds like a job for a transpiler, better yet two transpilers!

We can use a pair of transpilers, which for now I’ll call mlc for OCaml->Hemlock, and hmc for Hemlock->OCaml. They can share an intermediate representation (IR) that starts off quite simple and grows in functionality as we implement aspects of semantic analysis. The journey from today to a self-hosting Hemlock system requires roughly the following steps.

This approach has the obvious advantage of replacing manual OCaml->Hemlock rewriting with the effort of writing mlc. Even better, we can start running non-trivial Hemlock programs much sooner than if the native runtime were a hard dependency. There’s likely to be at least one round of Advent of Code based on the OCaml back end, perhaps even this decade!