BranchTaken

Hemlock language insights

Now Back to Our Regularly Scheduled Programming Language

Crocodile

Work on the hocc parser generator began in early 2022. Here we are over two years later and hocc is only now nearing completion. This is mostly due to an inadvertent nerd snipe, credit due to Laurence Tratt [1]. I took a long detour to the land of IELR(1) [2] and back, and hocc eventually gained a practical IELR(1) implementation. What started several months ago as a cautionary document about algorithmic complexity congealed into a technical report targeted at other aspiring IELR(1) implementers. The report only hints at how extremely challenging this project was; this blog post describes the difficulties I encountered to blunt their impact on others who may choose to implement IELR(1).

IELR(1) rocks

I previously wrote about the problem IELR(1) solves relative to simpler algorithms, especially LALR(1) [3] and the Pager algorithm [4], referred to hereafter as PGM(1) (Pager General Method). In brief, these simpler algorithms can combine state machine states in ways that introduce parser flaws; in contrast IELR(1) preserves parser behavior while removing (almost) all state redundancy. The “almost” qualifier turns out to be more relevant than I’d expected; more on that in a bit. But IELR(1) really does eliminate a great deal of state redundancy without ever introducing parser flaws.

IELR(1) is subtle

Okay, IELR(1) actually works, so how bad can it be? Well, it is one of the most algorithmically subtle projects I’ve ever undertaken, and I got things wrong along the way in so many different ways that I only arrived at a working implementation through sheer stubbornness. In broad strokes, here is how IELR(1) works.

  1. Generate LALR(1) parser states for reference throughout later phases. LALR(1) ignores lookahead symbols and merges all “isocoric” states, whereas LR(1) merges only identical kernels.
  2. Trace “lanes” backward from conflicted states to determine which states may contribute to those conflicts. Then perform various graph analyses to eliminate some of those contributions from further consideration.
  3. Generate IELR(1) parser states, referencing the previously computed metadata to split isocores only when merging would risk introducing parser flaws.

These are the main challenges to overcome:

Grokking IELR(1) is hard

I found the details of the IELR(1) paper overwhelmingly difficult to understand; I reached adequate understanding only via substantial reinvention. The paper is ~37 pages long, and it is dense and intricate. Reading it straight through is an exhausting full day of work, and I did so several times. I also read major swaths of the paper dozens of times, often dedicating several hours at a time to dig into details. Before getting into what made the paper so difficult, first let me mention that I never found any overt flaws beyond missing space between some pairs of words, presumably due to a mistake late in the publisher’s typesetting process. This attention to detail astounds me. Now, on to some of the problems:

IELR(1) validation is enigmatic

As the saying goes, “Perfect is the enemy of the good”, but in point of fact, IELR(1) is merely almost perfect, i.e. it does not guarantee minimal state machines. This left room for uncertainty regarding the correctness of results, and the lack of a relevant test suite confounded efforts to bridge the gap between simple examples and full-fledged grammars like those of gpic and gawk. Even my early implementation attempts generated valid automata for all the example grammars in the IELR(1) paper, but only in retrospect can I see how hopelessly flawed those attempts were.

The concept of remergeable states is introduced in Definition 3.49 of the IELR(1) paper. In a nutshell, it is possible for state splitting to be overly aggressive (better than being too lax, which could introduce parser flaws). The details would require excessive background to make sense of here, but the gist is that a state can be the result of multiple merges, and interactions between merges and conflict resolution can propagate to successor states in ways that cause needless state splitting.

It would be possible for such split states to be remerged during an additional pass. The paper makes this sound inconsequential, to wit: “[…] we have not attempted to formulate the details of such an algorithm as we have not discovered a practical [grammar] that would benefit significantly.” That does not match my observation. Specifically, I reproduced the gpic and gawk test cases from the paper 5, and remergeable states were an ongoing challenge to reproducing results. I eventually arrived at a correct IELR(1) implementation which generated remergeable states, and I manually audited those states to verify that their creation was algorithmically valid. In order to achieve identical results between bison and hocc I had to implement state remerging.

Why doesn’t bison need state remerging for these grammars? My presumption is that bison and hocc are generating the state graph in different orders, and there are starting point dependencies which incidentally manifest as remergeable states only for hocc. That said, it is conceivable that bison is too permissive in its state compatibility testing during merging. I have no concrete evidence of this, but I observed results from hocc that were nearly indistinguishable from those of bison before I fixed compatibility testing bugs in hocc. Furthermore, I discovered two bison behaviors that to my mind are at least design flaws, if not outright bugs:

As troublesome as remergeable states proved to be, being forced to implement state remerging (and unreachable state removal) was serendipitous. These postprocessing passes turn out to be practical for PGM(1) and LR(1) as well as IELR(1) — read the report for details.

Next steps

The IELR(1) detour alone took about 900 hours of active work, and hocc isn’t quite done. In particular:

That said, the way forward is clear, at least with regard to hocc, and with that the count of IELR(1) implementations has risen to two! I sincerely hope that there will be many more to come. For information on hocc, see the following resources:

Now back to our regularly scheduled programming language.

Citations/footnotes

  1. Laurence Tratt, “Which Parsing Approach?”, https://tratt.net/laurie/blog/2020/which_parsing_approach.html, footnote 21. 

  2. Joel E. Denny and Brian A. Malloy, “The IELR(1) algorithm for generating minimal LR(1) parser tables for non-LR(1) grammars with conflict resolution”, Science of Computer Programming, 75(11):943-979, 2010. 

  3. Frank DeRemer, “Practical Translators for LR(k) languages”, Ph.D Dissertation, Department of Electrical Engineering, Massachusetts Institute of Technology, Cambridge, 1969.  2

  4. David Pager, “A Practical General Method for Constructing LR(k) Parsers”, Acta Informatica 7:249-268, 1977. 

  5. hocc lacks two relevant bison features: intermediate reductions and nonassoc associativity. Intermediate reductions were simulated via equivalent grammar modifications. nonassoc associativity is in my opinion a misfeature, so rather than implementing it in hocc I removed all uses from the input grammars and verified that doing so had no relevant impact on bison’s parser generation. 

  6. Lukas Diekmann and Laurence Tratt, “Don’t Panic! Better, Fewer, Syntax Errors for LR Parsers,” 34th European Conference on Object-Oriented Programming (ECOOP 2020), Article No. 6, pages 6:1–6:32.