BranchTaken

Hemlock language insights

Recognizing Follow Sets

First set

The hocc parser generator implementation continues to progress, and very nearly all the pieces are in place for computing what the literature commonly calls the “canonical collection of LR(1) items” or 𝐶𝐶 for short [1]. 𝐶𝐶 is more accurately described as a set closure of LR(1) item set closures, but there are sets in the sets in the sets, so 𝐶𝐶 seems a reasonable compromise between descriptiveness and precision. Nonetheless, I discovered an implementation optimization for hocc that escaped me for years in the context of the Python-based Parsing parser generator, and it was due at least in part to imprecision in the Dragon Book’s [2] handling of lookahead sets that I didn’t fully grok the algorithms.

In the recent Parsing Retrospective blog post I joked that it is easier to copy my own work than that of others, but in the case of Parsing this has proven less true than I had expected. That code shows deep evidence of how my understanding of parser generation expanded during implementation. Concepts and data structures are muddled, e.g. there is a single Item class which functions as an LR(0) item and/or an LR(1) item, where these really should be distinct. Perhaps this explains how there ended up being a String class as the representation of an LR(1) item suffix that was in turn used as the key for a first set cache. A String should have simply been an LR(1) item. And if I had gotten that right, perhaps I would have recognized that what the literature calls a “lookahead symbol” is merely an element of the “follow set”. Without loss of generality we can represent all LR(1) items based on a specific LR(0) item by creating a union of their lookahead symbols, i.e. a follow set. And then when we compute the first sets of LR(1) items, we can do so for an entire follow set at once, rather than iteratively for each lookahead symbol in the follow set.

This relationship between lookahead and follow sets is easy to miss in all the the descriptions I have read, both printed [1, 2] and online [3]. Following is an illustrative quote from page 233 of the Dragon Book [2]:

[…] we use the notation [C 🠂 ·cC, c/d] as a shorthand for the two items [C 🠂 ·cC, c] and [C 🠂 ·cC, d].

Let’s back up just a tiny bit and pick apart the distinction between LR(0) items and LR(1) items, using the quote as an example. [C 🠂 ·cC] is an LR(0) item, where C is a non-terminal symbol, c is a terminal symbol (i.e. token) and the dot represents how much of the righthand side of the production C 🠂 cC has been matched thus far. In this case the dot indicates that no matching has yet occurred. The LR(1) items [C 🠂 ·cC, c] and [C 🠂 ·cC, d] augment the LR(0) item with a lookahead symbol, c or d in this example. And I would write the convenient notation [C 🠂 ·cC, c/d] as [C ::= ·cC, {c,d}], but the idea is clear in either notation that a set of LR(1) items with identical LR(0) items can be combined by treating the union of their lookahead symbols as a follow set.

That’s the key insight, and with that it suddenly becomes clear that the general algorithm for computing follow sets applies. Why does this matter? 𝐶𝐶 computation is expensive, in part because the naïve implementation is a quintuply-nested loop. But the innermost loop computes first sets for LR(1) items with identical LR(0) items! We can reduce 𝐶𝐶 computation to a quadruply-nested loop.

I modified my divergent implementation of Parsing to verify that this approach is valid, and it sped up end-to-end parser generation time by ~1.2X. The performance increase would be even larger absent numerous other optimizations I won’t get into the details of, but it’s worth emphasizing that Parsing performance was an ongoing problem, and I tried really hard to optimize it in every way I could imagine, yet I missed this.

How does the literature consistently end up obscuring such a basic concept? My hypothesis is that when explaining LR(1), it is compelling to first introduce LR(k) in broad strokes, then treat LR(1) as a special case of LR(k). In the context of LR(1) it’s easy for us to see that the union of lookahead symbols c and d is {c,d}, but suppose we’re dealing with LR(3), in which case there’s the potential for a combinatorial explosion in the lookahead, e.g. {ccc,ccd,cdc,…}. Such follow sets complicate the general exposition, so the focus is on particular lookaheads rather than follow sets.

Citations/footnotes

  1. Keith D. Cooper and Linda Torczon, “Engineering a Compiler, Second Edition”, Morgan Kaufmann, 2012.  2

  2. Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman, “Compilers: Principles, Techniques, and Tools” aka the “Dragon Book”, Addison-Wesley, 1986.  2 3

  3. Canonical LR Parser (retrieved version), Wikipedia