Resolution Convolution
The hocc
parser generator is
incrementally converging on a working IELR(1) [1] implementation, but it is a convoluted
process, putting things mildly. I’m pretty sure this showed up on my screen late one night:
YOU ARE IN A MAZE OF TWISTY LITTLE STATES, ALL ALIKE. [2]
Read on for the latest battle report. May the human win in the end!
Dependencies, schmependencies
The promise of IELR(1) is to provide the full power of LR(1) [3], but without the typical order-of-magnitude difference in parser state machine size relative to LALR(1) [4]. The core idea [5] is to start with LALR(1) parser states, analyze their inadequacies relative to LR(1), and split just enough states to recover the full power of LR(1) where it was lost during blindly optimistic LALR(1) state merging. What does an inadequacy look like? Well, it is a conflict that wouldn’t exist in an LR(1) parser, e.g. the dreaded LALR(1)-induced reduce-reduce conflict that parser developers have been suffering from for decades. Okay, inadequacies are pretty easy to spot, but how does IELR(1) figure out which states to split? I will spare you the gory details (never mind that I’m still trying to grok some of those details), but the basic idea is to create a dependency graph of states that contribute to inadequacies.
Given the right dependency graph (a non-trivial ask) and sufficiently advanced technology (how hard can it be?), IELR(1) splits a very limited set of states in practice. What I’ve managed so far is to build a slightly conservative dependency graph, and given that graph split nearly all potentially contributing states, regardless of whether splitting is truly necessary. Surprisingly, that’s good enough to replicate the results for all the example grammars in the IELR(1) paper.
But it’s nowhere near good enough for more involved grammars. I continue to test hocc
with the
Lyken [6] grammar, which I developed using the PGM(1) [7] algorithm. That grammar
has 88 precedences, 101 tokens, 129 non-terminals, and 692 productions, which results in 1,375
parser states with no unresolved conflicts. Interestingly, the LALR(1) algorithm generates an
identical parser, which implies that IELR(1) should also generate an identical parser. But the best
hocc
’s “IELR(1)” implementation has done so far is 9,473 states. Clearly there’s room for
improvement, though to be fair, 9,473 is quite a bit less than the 19,590 states generated by the
canonical LR(1) algorithm.
A new (ir)resolution
One of IELR(1)’s interesting nuances is that it only splits states if a conflict wouldn’t otherwise have been generated. That’s a good thing, because splitting conflicted states inherent to the grammar could cause conflict duplication, thus making diagnosis even harder than usual. But things get interesting when you consider that grammar specifications commonly include precedence and/or associativity annotations that allow the parser generator to resolve what would otherwise be conflicts. What if we disable conflict resolution? That could mean conflicts, in states that IELR(1) didn’t even need to create in the conflict-resolved parser. In LR(1) parser tables we would see the same multitude of states regardless; the only difference would be where the conflicts are.
I’ve repeatedly found myself creating two versions of hocc
grammars, one with
precedence/associativity specifications, and the other without, in order to analyze their effects.
This has been such a common need that I finally added the -resolve
command line parameter to
hocc
, so that conflict resolution can be disabled without having to maintain two variants of the
grammar. For example, invoking hocc
on the grammar from Figure 1 of the IELR(1) paper
[8] generates different output if -resolve no
is specified.
> hocc -v -src IelrFig1
hocc: Parsing "./IelrFig1.hmh"
hocc: Generating IELR(1) specification
hocc: 1 precedence, 4 tokens, 3 non-terminals, 5 productions
hocc: Generating LALR(1) specification as IELR(1) prerequisite
hocc: LR(1) item set compatibility: lalr1
hocc: Generating LR(1) item set closures+++++.+++++
hocc: Generating 11 LR(1) states
hocc: 1 conflict (0 ⊥, 1 shift-reduce, 0 reduce-reduce) (conflict resolution disabled)
hocc: 1 conflict-contributing state
hocc: LR(1) item set compatibility: ielr1
hocc: Generating LR(1) item set closures+++++*++++.+
hocc: Generating 12 LR(1) states
hocc: 0 unresolvable conflicts
hocc: Searching for unused precedences/tokens/non-terminals/productions
> hocc -v -src IelrFig1 -resolve no
hocc: Parsing "./IelrFig1.hmh"
hocc: Generating IELR(1) specification
hocc: 1 precedence, 4 tokens, 3 non-terminals, 5 productions
hocc: Generating LALR(1) specification as IELR(1) prerequisite
hocc: LR(1) item set compatibility: lalr1
hocc: Generating LR(1) item set closures+++++.+++++
hocc: Generating 11 LR(1) states
hocc: 1 conflict (0 ⊥, 1 shift-reduce, 0 reduce-reduce) (conflict resolution disabled)
hocc: 1 conflict-contributing state
hocc: LR(1) item set compatibility: ielr1
hocc: Generating LR(1) item set closures+++++*++++.+
hocc: Generating 12 LR(1) states
hocc: 1 conflict (0 ⊥, 1 shift-reduce, 0 reduce-reduce) (conflict resolution disabled)
hocc: Searching for unused precedences/tokens/non-terminals/productions
Citations/footnotes
-
Joel E. Denny and Brian A. Malloy, “The IELR(1) algorithm for generating minimal LR(1) parser tables for non-LR(1) grammars with conflict resolution”, Science of Computer Programming, 75(11):943-979, 2010. ↩
-
Original: YOU ARE IN A MAZE OF TWISTY LITTLE PASSAGES, ALL ALIKE. ↩
-
Donald E. Knuth, “On the Translation of Languages from Left to Right”, Information and Control 8:607-639, 1965. ↩
-
Frank DeRemer, “Practical Translators for LR(k) languages”, Ph.D Dissertation, Department of Electrical Engineering, Massachusetts Institute of Technology, Cambridge, 1969. ↩
-
Core idea: Not to be confused with the LR(0) core of an LR(1) item set, nor the even more intimidating isocore. ↩
-
Lyken was an uncompleted programming language that incubated Parsing. ↩
-
David Pager, “A Practical General Method for Constructing LR(k) Parsers”, Acta Informatica 7:249-268, 1977. ↩
-
hocc left p token Ta prec p token Tb start S ::= | Ta A Ta prec p | Tb A Tb nonterm A prec p ::= | Ta | Ta Ta