Recognizing Follow Sets
The hocc
parser generator
implementation continues to progress, and very nearly all the pieces are in place for computing what
the literature commonly calls the “canonical collection of LR(1) items” or 𝐶𝐶 for short
[1]. 𝐶𝐶 is more accurately described as a set closure of LR(1) item set closures, but
there are sets in the sets in the sets, so 𝐶𝐶 seems a reasonable compromise between descriptiveness
and precision. Nonetheless, I discovered an implementation optimization for hocc
that escaped me
for years in the context of the Python-based Parsing
parser
generator, and it was due at least in part to
imprecision in the Dragon Book’s [2] handling of lookahead sets that I didn’t fully grok
the algorithms.
In the recent Parsing Retrospective blog post I joked
that it is easier to copy my own work than that of others, but in the case of Parsing
this has
proven less true than I had expected. That code shows deep evidence of how my understanding of
parser generation expanded during implementation. Concepts and data structures are muddled, e.g.
there is a single Item
class which functions as an LR(0) item and/or an LR(1) item, where these
really should be distinct. Perhaps this explains how there ended up being a String
class as the
representation of an LR(1) item suffix that was in turn used as the key for a first set cache. A
String
should have simply been an LR(1) item. And if I had gotten that right, perhaps I would have
recognized that what the literature calls a “lookahead symbol” is merely an element of the “follow
set”. Without loss of generality we can represent all LR(1) items based on a specific LR(0) item by
creating a union of their lookahead symbols, i.e. a follow set. And then when we compute the first
sets of LR(1) items, we can do so for an entire follow set at once, rather than iteratively for each
lookahead symbol in the follow set.
This relationship between lookahead and follow sets is easy to miss in all the the descriptions I have read, both printed [1, 2] and online [3]. Following is an illustrative quote from page 233 of the Dragon Book [2]:
[…] we use the notation [C 🠂 ·cC, c/d] as a shorthand for the two items [C 🠂 ·cC, c] and [C 🠂 ·cC, d].
Let’s back up just a tiny bit and pick apart the distinction between LR(0) items and LR(1) items, using the quote as an example. [C 🠂 ·cC] is an LR(0) item, where C is a non-terminal symbol, c is a terminal symbol (i.e. token) and the dot represents how much of the righthand side of the production C 🠂 cC has been matched thus far. In this case the dot indicates that no matching has yet occurred. The LR(1) items [C 🠂 ·cC, c] and [C 🠂 ·cC, d] augment the LR(0) item with a lookahead symbol, c or d in this example. And I would write the convenient notation [C 🠂 ·cC, c/d] as [C ::= ·cC, {c,d}], but the idea is clear in either notation that a set of LR(1) items with identical LR(0) items can be combined by treating the union of their lookahead symbols as a follow set.
That’s the key insight, and with that it suddenly becomes clear that the general algorithm for computing follow sets applies. Why does this matter? 𝐶𝐶 computation is expensive, in part because the naïve implementation is a quintuply-nested loop. But the innermost loop computes first sets for LR(1) items with identical LR(0) items! We can reduce 𝐶𝐶 computation to a quadruply-nested loop.
I modified my divergent implementation of Parsing
to verify that this approach is valid, and it
sped up end-to-end parser generation time by ~1.2X. The performance increase would be even larger
absent numerous other optimizations I won’t get into the details of, but it’s worth emphasizing that
Parsing
performance was an ongoing problem, and I tried really hard to optimize it in every way
I could imagine, yet I missed this.
How does the literature consistently end up obscuring such a basic concept? My hypothesis is that when explaining LR(1), it is compelling to first introduce LR(k) in broad strokes, then treat LR(1) as a special case of LR(k). In the context of LR(1) it’s easy for us to see that the union of lookahead symbols c and d is {c,d}, but suppose we’re dealing with LR(3), in which case there’s the potential for a combinatorial explosion in the lookahead, e.g. {ccc,ccd,cdc,…}. Such follow sets complicate the general exposition, so the focus is on particular lookaheads rather than follow sets.