BranchTaken

Hemlock language insights

Practical LR(1) Parsing

Back in the 1980s parsers were predominantly hand-written recursive descent or machine-generated LALR(1). When ANTLR came along, machine-generated LL(k) parsing gained significant mindshare among programmers who found it easier to reason about parsing top-down than bottom-up. While LL(k) parsing is theoretically more limited than LR(k), both approaches can work well in practice, without allowing grammar ambiguity into the mix. Unfortunately the zeitgeist has since shifted toward beautiful abstractions of recursive descent parsing, namely PEGs and parser combinators. And somehow grammar ambiguity has been heavily discounted as a practical problem, to the great peril of the current generation of inexperienced programmers surveying the field.

Laurence Tratt wrote an in-depth blog article about his journey to preferring LR(1) parsing. It’s a valuable read, especially if you currently think that recursive descent parsing should be good enough for any project. His observations on recursive descent, LALR(1), GLR, and LR(1) closely match my own, so for now I won’t tell my own story that differs only in the details.

Full disclosure: even recursive descent parsing has reasonable applications, e.g. for simple data format parsers. GLR can do a passable job in static analysis tools for established languages like C++. But for projects like Hemlock, it’s a terrible risk to incrementally develop a grammar without the guard rails provided by a parser generator. Opportunities for introducing unintended ambiguities abound, so we are using LR(1) parser generation.

LR(1) error reporting/recovery and (projected) solutions

LR(1) parsing is unambiguous and fast, but even once over the hurdles of shift-reduce conflicts and precedence overrides, there’s no denying that there are some challenges to making it work well, particularly with regard to syntax error reporting and recovery.

Our initial plan was to report only the first error during manually triggered compilation. But then the increasingly broad integration of LSP support into open source text editors made it possible for Hemlock to provide a powerful development experience, and error recovery became a requirement. The OCaml-specific Merlin had already been a boon for us during early Hemlock development, so I read an authors’ experience report [1] to learn more about how it works. That paper motivated us to work hard to avoid problems, rather than endlessly working to solve hard problems. For example, macro preprocessing is on Hemlock’s anti-feature list now, but it also motivated Hemlock’s rigid indentation style, viz:

At any point, the recoverer has the choice between synthesizing one more value or consuming a token. Systematically trying both would lead to unacceptable performance. The main heuristic that drives Merlin’s implementation is indentation, on the assumption that lexical indentation follows nesting of constructions in the grammar.

[..]

While the indentation assumption is not true in practice for all indentation styles, this heuristic gives a satisfying behavior most of the time and should work for many languages.

We found the idea behind this heuristic quite compelling, but only much later did we decide to double down on the general idea and make Hemlock’s code indentation semantically meaningful, so that indentation style variation cannot thwart this recovery mechanism. Hemlock requires four-space block indentation, and two-space continuation indentation. (Single-line and \␤ escape hatches exist, but they are only compelling for very limited use cases.) The following Array.filter code serves as a representative indentation example.

let filter ~f t =
    let _, t' = ArrayFoldiMap.to_accum_array (ArrayFoldiMap.init t ~f:(fun _ i _ ->
        let rec fn i elm =
            let i' = succ i
            match f elm with
            | true -> i', elm
            | false -> fn i' (get i' t)
        fn i (get i t)
      ) (count ~f t)) ~init:0
    t'

The Merlin experience report led us to a paper on LR(1) error reporting [2] in the context of Menhir, which is the LR(1) parser generator that Merlin and OCaml use. The paper presents what I would describe as a brute force approach, and while quite effective, it appears to require quite a lot of manual effort that would be an ongoing burden for a rapidly evolving grammar (a near certainty for any immature programming language). Nonetheless, this was our best hope for Hemlock until we encountered a recovery approach based on minimum-cost repair sequences [3, 4]. We are excited to implement this approach because we think it will improve both error reporting and recovery.

Citations/footnotes

  1. Frédéric Bour, Thomas Refis, and Gabriel Scherer, “Merlin: A Language Server for OCaml (Experience Report),” Proc. ACM Program. Lang. Vol. 2. ICFP, Article 103, September 2018. 

  2. François Pottier, “Reachability and error diagnosis in LR(1) parsers,” International Conference on Compiler Construction (CC), pages 88—98, March 2016. 

  3. Lukas Diekmann and Laurence Tratt, “Don’t Panic! Better, Fewer, Syntax Errors for LR Parsers,” 34th European Conference on Object-Oriented Programming (ECOOP 2020), Article No. 6, pages 6:1–6:32. 

  4. Hey, that’s the second reference to Laurence Tratt in this blog post! I didn’t notice the overlap until entering the citations.