BranchTaken

Hemlock language insights

Resolution Convolution

Horseshoe Bend

The hocc parser generator is incrementally converging on a working IELR(1) [1] implementation, but it is a convoluted process, putting things mildly. I’m pretty sure this showed up on my screen late one night:

YOU ARE IN A MAZE OF TWISTY LITTLE STATES, ALL ALIKE. [2]

Read on for the latest battle report. May the human win in the end!

Dependencies, schmependencies

The promise of IELR(1) is to provide the full power of LR(1) [3], but without the typical order-of-magnitude difference in parser state machine size relative to LALR(1) [4]. The core idea [5] is to start with LALR(1) parser states, analyze their inadequacies relative to LR(1), and split just enough states to recover the full power of LR(1) where it was lost during blindly optimistic LALR(1) state merging. What does an inadequacy look like? Well, it is a conflict that wouldn’t exist in an LR(1) parser, e.g. the dreaded LALR(1)-induced reduce-reduce conflict that parser developers have been suffering from for decades. Okay, inadequacies are pretty easy to spot, but how does IELR(1) figure out which states to split? I will spare you the gory details (never mind that I’m still trying to grok some of those details), but the basic idea is to create a dependency graph of states that contribute to inadequacies.

Given the right dependency graph (a non-trivial ask) and sufficiently advanced technology (how hard can it be?), IELR(1) splits a very limited set of states in practice. What I’ve managed so far is to build a slightly conservative dependency graph, and given that graph split nearly all potentially contributing states, regardless of whether splitting is truly necessary. Surprisingly, that’s good enough to replicate the results for all the example grammars in the IELR(1) paper.

But it’s nowhere near good enough for more involved grammars. I continue to test hocc with the Lyken [6] grammar, which I developed using the PGM(1) [7] algorithm. That grammar has 88 precedences, 101 tokens, 129 non-terminals, and 692 productions, which results in 1,375 parser states with no unresolved conflicts. Interestingly, the LALR(1) algorithm generates an identical parser, which implies that IELR(1) should also generate an identical parser. But the best hocc’s “IELR(1)” implementation has done so far is 9,473 states. Clearly there’s room for improvement, though to be fair, 9,473 is quite a bit less than the 19,590 states generated by the canonical LR(1) algorithm.

A new (ir)resolution

One of IELR(1)’s interesting nuances is that it only splits states if a conflict wouldn’t otherwise have been generated. That’s a good thing, because splitting conflicted states inherent to the grammar could cause conflict duplication, thus making diagnosis even harder than usual. But things get interesting when you consider that grammar specifications commonly include precedence and/or associativity annotations that allow the parser generator to resolve what would otherwise be conflicts. What if we disable conflict resolution? That could mean conflicts, in states that IELR(1) didn’t even need to create in the conflict-resolved parser. In LR(1) parser tables we would see the same multitude of states regardless; the only difference would be where the conflicts are.

I’ve repeatedly found myself creating two versions of hocc grammars, one with precedence/associativity specifications, and the other without, in order to analyze their effects. This has been such a common need that I finally added the -resolve command line parameter to hocc, so that conflict resolution can be disabled without having to maintain two variants of the grammar. For example, invoking hocc on the grammar from Figure 1 of the IELR(1) paper [8] generates different output if -resolve no is specified.

> hocc -v -src IelrFig1
hocc: Parsing "./IelrFig1.hmh"
hocc: Generating IELR(1) specification
hocc: 1 precedence, 4 tokens, 3 non-terminals, 5 productions
hocc: Generating LALR(1) specification as IELR(1) prerequisite
hocc: LR(1) item set compatibility: lalr1
hocc: Generating LR(1) item set closures+++++.+++++
hocc: Generating 11 LR(1) states
hocc: 1 conflict (0 ⊥, 1 shift-reduce, 0 reduce-reduce) (conflict resolution disabled)
hocc: 1 conflict-contributing state
hocc: LR(1) item set compatibility: ielr1
hocc: Generating LR(1) item set closures+++++*++++.+
hocc: Generating 12 LR(1) states
hocc: 0 unresolvable conflicts
hocc: Searching for unused precedences/tokens/non-terminals/productions

> hocc -v -src IelrFig1 -resolve no
hocc: Parsing "./IelrFig1.hmh"
hocc: Generating IELR(1) specification
hocc: 1 precedence, 4 tokens, 3 non-terminals, 5 productions
hocc: Generating LALR(1) specification as IELR(1) prerequisite
hocc: LR(1) item set compatibility: lalr1
hocc: Generating LR(1) item set closures+++++.+++++
hocc: Generating 11 LR(1) states
hocc: 1 conflict (0 ⊥, 1 shift-reduce, 0 reduce-reduce) (conflict resolution disabled)
hocc: 1 conflict-contributing state
hocc: LR(1) item set compatibility: ielr1
hocc: Generating LR(1) item set closures+++++*++++.+
hocc: Generating 12 LR(1) states
hocc: 1 conflict (0 ⊥, 1 shift-reduce, 0 reduce-reduce) (conflict resolution disabled)
hocc: Searching for unused precedences/tokens/non-terminals/productions

Citations/footnotes

  1. Joel E. Denny and Brian A. Malloy, “The IELR(1) algorithm for generating minimal LR(1) parser tables for non-LR(1) grammars with conflict resolution”, Science of Computer Programming, 75(11):943-979, 2010. 

  2. Original: YOU ARE IN A MAZE OF TWISTY LITTLE PASSAGES, ALL ALIKE. 

  3. Donald E. Knuth, “On the Translation of Languages from Left to Right”, Information and Control 8:607-639, 1965. 

  4. Frank DeRemer, “Practical Translators for LR(k) languages”, Ph.D Dissertation, Department of Electrical Engineering, Massachusetts Institute of Technology, Cambridge, 1969. 

  5. Core idea: Not to be confused with the LR(0) core of an LR(1) item set, nor the even more intimidating isocore

  6. Lyken was an uncompleted programming language that incubated Parsing

  7. David Pager, “A Practical General Method for Constructing LR(k) Parsers”, Acta Informatica 7:249-268, 1977. 

  8. hocc
        left p
    
        token Ta prec p
        token Tb
    
        start S ::=
          | Ta A Ta prec p
          | Tb A Tb
    
        nonterm A prec p ::=
          | Ta
          | Ta Ta