Now Back to Our Regularly Scheduled Programming Language

Work on the hocc parser generator began in early 2022. Here we are over two years later and hocc is only now nearing completion. This is mostly due to an inadvertent nerd snipe, credit due to Laurence Tratt [¹]. I took a long detour to the land of IELR(1) [²] and back, and hocc eventually gained a practical IELR(1) implementation. What started several months ago as a cautionary document about algorithmic complexity congealed into a technical report targeted at other aspiring IELR(1) implementers. The report only hints at how extremely challenging this project was; this blog post describes the difficulties I encountered to blunt their impact on others who may choose to implement IELR(1).

IELR(1) rocks

I previously wrote about the problem IELR(1) solves relative to simpler algorithms, especially LALR(1) [³] and the Pager algorithm [⁴], referred to hereafter as PGM(1) (Practical General Method). In brief, these simpler algorithms can combine state machine states in ways that introduce parser flaws; in contrast IELR(1) preserves parser behavior while removing (almost) all state redundancy. The “almost” qualifier turns out to be more relevant than I’d expected; more on that in a bit. But IELR(1) really does eliminate a great deal of state redundancy without ever introducing parser flaws.

IELR(1) is subtle

Okay, IELR(1) actually works, so how bad can it be? Well, it is one of the most algorithmically subtle projects I’ve ever undertaken, and I got things wrong along the way in so many different ways that I only arrived at a working implementation through sheer stubbornness. In broad strokes, here is how IELR(1) works.

Generate LALR(1) parser states for reference throughout later phases. LALR(1) ignores lookahead symbols and merges all “isocoric” states, whereas LR(1) merges only identical kernels.
Trace “lanes” backward from conflicted states to determine which states may contribute to those conflicts. Then perform various graph analyses to eliminate some of those contributions from further consideration.
Generate IELR(1) parser states, referencing the previously computed metadata to split isocores only when merging would risk introducing parser flaws.

These are the main challenges to overcome:

The relevant lanes are conceptually straightforward, in that each is a path back from a conflict state to a start state. But there can be cycles (an infinitude of lanes), and similarly there can be forks and joins (a combinatorial explosion of lanes). Furthermore, there can be multiple relevant conflict contexts at intermediate states within lanes. Getting the data structures and control flow right to recursively trace these conflict lanes is crucial. And I got it wrong in several major ways, and several dozens of minor ways.
What static information is pertinent to the IELR(1) state split/merge logic? And what dynamic information must be tracked as IELR(1) states are generated? The static/dynamic metadata distinction is surprisingly hard to get right, and I misplaced metadata in both directions, sometimes repeating mistakes as I worked through flawed metadata interactions.

Grokking IELR(1) is hard

I found the details of the IELR(1) paper overwhelmingly difficult to understand; I reached adequate understanding only via substantial reinvention. The paper is ~37 pages long, and it is dense and intricate. Reading it straight through is an exhausting full day of work, and I did so several times. I also read major swaths of the paper dozens of times, often dedicating several hours at a time to dig into details. Before getting into what made the paper so difficult, first let me mention that I never found any overt flaws beyond missing space between some pairs of words, presumably due to a mistake late in the publisher’s typesetting process. This attention to detail astounds me. Now, on to some of the problems:

The paper felt as though it back-filled formalisms to match a very specific implementation strategy, namely that of integration into bison. This put an LALR(1) (a la DeRemer [³]) slant on both the terminology and exposition, with significant attention paid to memoization optimizations. I had a hard time figuring out which details to ignore as incidental.
The paper provided both formal and informal terminology for many concepts. This would have been a kindness, were it possible to ignore one or the other, but instead I had to grok both, and many times I had a hard time keeping enough terminology in my head.
The paper provided concrete examples to demonstrate various definitions and algorithms. At first blush this seems a good thing, but I came to see it as mitigation of the paper’s close ties to a specific implementation. I had to work through these examples, reverse-engineer intent, then apply that to my implementation. Of course I erred in myriad ways.

IELR(1) validation is enigmatic

As the saying goes, “Perfect is the enemy of the good”, but in point of fact, IELR(1) is merely almost perfect, i.e. it does not guarantee minimal state machines. This left room for uncertainty regarding the correctness of results, and the lack of a relevant test suite confounded efforts to bridge the gap between simple examples and full-fledged grammars like those of gpic and gawk. Even my early implementation attempts generated valid automata for all the example grammars in the IELR(1) paper, but only in retrospect can I see how hopelessly flawed those attempts were.

The concept of remergeable states is introduced in Definition 3.49 of the IELR(1) paper. In a nutshell, it is possible for state splitting to be overly aggressive (better than being too lax, which could introduce parser flaws). The details would require excessive background to make sense of here, but the gist is that a state can be the result of multiple merges, and interactions between merges and conflict resolution can propagate to successor states in ways that cause needless state splitting.

It would be possible for such split states to be remerged during an additional pass. The paper makes this sound inconsequential, to wit: “[…] we have not attempted to formulate the details of such an algorithm as we have not discovered a practical [grammar] that would benefit significantly.” That does not match my observation. Specifically, I reproduced the gpic and gawk test cases from the paper ⁵, and remergeable states were an ongoing challenge to reproducing results. I eventually arrived at a correct IELR(1) implementation which generated remergeable states, and I manually audited those states to verify that their creation was algorithmically valid. In order to achieve identical results between bison and hocc I had to implement state remerging.

Why doesn’t bison need state remerging for these grammars? My presumption is that bison and hocc are generating the state graph in different orders, and there are starting point dependencies which incidentally manifest as remergeable states only for hocc. That said, it is conceivable that bison is too permissive in its state compatibility testing during merging. I have no concrete evidence of this, but I observed results from hocc that were nearly indistinguishable from those of bison before I fixed compatibility testing bugs in hocc. Furthermore, I discovered two bison behaviors that to my mind are at least design flaws, if not outright bugs:

Disabling conflict resolution should cause preservation of LR(1) conflicts, but bison instead appears to preserve LALR(1) conflicts. This in effect disables IELR(1) rather than generating an LR(1)-equivalent parser with conflicts left unresolved. In contrast, hocc punts on inadequacy elimination only where the equivalent LR(1) parser would also have unresolved conflicts.
Productions are assigned implicit precedence based on declaration order, but it is possible to use %prec clauses to override/disable production precedence. bison ignores this possibility (i.e. unresolved conflicts can persist) when resolving conflicts in the IELR(1) implementation. In contrast, hocc supports only explicit precedences, which exposes unintentional grammar ambiguities.

As troublesome as remergeable states proved to be, being forced to implement state remerging (and unreachable state removal) was serendipitous. These postprocessing passes turn out to be practical for PGM(1) and LR(1) as well as IELR(1) — read the report for details.

Next steps

The IELR(1) detour alone took about 900 hours of active work, and hocc isn’t quite done. In particular:

There is no code generation yet, and we need both OCaml and Hemlock code generation until Hemlock is bootstrapped.
We designed Hemlock syntax to simplify syntax error recovery [⁶], but we still need to actually implement it.

That said, the way forward is clear, at least with regard to hocc, and with that the count of IELR(1) implementations has risen to two! I sincerely hope that there will be many more to come. For information on hocc, see the following resources:

Now back to our regularly scheduled programming language.

Citations/footnotes

Laurence Tratt, “Which Parsing Approach?”, https://tratt.net/laurie/blog/2020/which_parsing_approach.html, footnote 21. ↩
Joel E. Denny and Brian A. Malloy, “The IELR(1) algorithm for generating minimal LR(1) parser tables for non-LR(1) grammars with conflict resolution”, Science of Computer Programming, 75(11):943-979, 2010. ↩
Frank DeRemer, “Practical Translators for LR(k) languages”, Ph.D Dissertation, Department of Electrical Engineering, Massachusetts Institute of Technology, Cambridge, 1969. ↩ ↩²
David Pager, “A Practical General Method for Constructing LR(k) Parsers”, Acta Informatica 7:249-268, 1977. ↩
hocc lacks two relevant bison features: intermediate reductions and nonassoc associativity. Intermediate reductions were simulated via equivalent grammar modifications. nonassoc associativity is in my opinion a misfeature, so rather than implementing it in hocc I removed all uses from the input grammars and verified that doing so had no relevant impact on bison’s parser generation. ↩
Lukas Diekmann and Laurence Tratt, “Don’t Panic! Better, Fewer, Syntax Errors for LR Parsers,” 34th European Conference on Object-Oriented Programming (ECOOP 2020), Article No. 6, pages 6:1–6:32. ↩

BranchTaken

Hemlock language insights