Parsing Retrospective
Hemlock needs a parser, and since it will use LR(1) parsing, that means using a parser generator. Hemlock will be self-hosting, which precludes using an existing parser generator, ergo the creation of hocc. Fortunately I've already reinvented the wheel once (that's the way I roll) in the form of the Python-based Parsing module, and it's a lot easier to copy my own work than that of others. Joking aside, the Parsing module synthesizes the state of the art in parser generation as of fifteen years ago, and it has one still-distinctive feature that makes developing reliable parsers much more tractable. This post is a Parsing retrospective that serves as background information for hocc development.
Before Parsing
A long series of projects led to Parsing, and in retrospect some of those projects were overly ambitious. But hey, I learned a lot and some of the software ended up being broadly useful.
- As an undergraduate computer science student I was assigned a senior group project that required generating PostScript, a postfix printer language which is the spiritual ancestor of the Portable Document Format (PDF). I had already become enamored with postfix languages thanks to my trusty HP 48GX calculator. Add to that a growing fascination for embeddable languages like Tcl, and a few years later I developed the Onyx programming language.
- As a graduate student developing the Crux bioinformatics software package, I needed flexible software for gluing together experimental pipelines. Onyx was the obvious choice because hey, I wrote it! But if you’ve ever tried to write large systems in a stack-based language, you probably know what went wrong — stack management bugs caused non-local failures. Enter Python. All was well… for a while.
- Adding a feature to the Python version of Crux was a day of bliss, followed by a week or two of soul-crushing conversion to a natively compiled module written in C. The problem I kept running into was that although Python was elegant for my purposes, it was way too slow for real-world dataset sizes, like, two orders of magnitude too slow.
- After the umpteenth rewrite of a Python module in C, I snapped and decided to write a gradually typed language called Lyken. I never completed Lyken; however, some pieces of Lyken technology survive.
  - jemalloc started out as Lyken's memory allocator. It's funny how poor a fit the allocator was for Lyken's actual needs, but those ~400 hours of potentially wasted effort were salvaged thanks to several thousand more hours of work, initially by myself, and over time by many others.
  - Parsing, the subject of this post.
Enter Parsing
Let’s back up just a tiny bit. I didn’t just start writing a parser generator, because that’s capital-H Hard (TM). I used the most excellent Lemon parser generator. I remember Lemon fondly for a few reasons:
- Production parameters are named rather than numbered. This seems like a small detail, but it's infuriating to track down a bug that was simply due to slightly changing a production and failing to manually renumber every parameter reference within the reduction code (see the sketch after this list).
- The diagnostic output is excellent, a huge help when diagnosing grammar flaws. This strongly influenced the level of detail in the log files generated by Parsing.
- Parsers were reentrant by default, and integrating with a hand-written scanner was straightforward.
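To make the named-parameter point concrete: in Lemon one writes something like expr(A) ::= expr(B) PLUS term(C). { A = B + C; }, and the names stay attached to their symbols however the production is edited, unlike yacc's positional $1/$2/$3. Parsing carried the same idea over to Python by binding right-hand-side symbols to named method parameters. What follows is a rough sketch from memory of Parsing's docstring-driven API, with illustrative class names, not a verbatim excerpt:

```python
import Parsing

class TokenPlus(Parsing.Token):
    "%token plus"

class TokenId(Parsing.Token):
    "%token id"

class Term(Parsing.Nonterm):
    "%nonterm"
    def reduceId(self, id_):
        "%reduce id"

class Expr(Parsing.Nonterm):
    "%start"
    # The docstring declares the production; the method parameters line up
    # with the right-hand-side symbols and are referenced by name, so
    # editing the production forces the signature (and thus the reduction
    # code) to be revisited instead of silently misnumbering references.
    def reduceAdd(self, expr, plus, term):
        "%reduce Expr plus Term"

    def reduceTerm(self, term):
        "%reduce Term"
```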
Why did I leave Lemon behind? Well, the bootstrap Lyken implementation was primarily written in Python, and marshaling the C-based parsing results into Python data structures was both brittle and slow. Once the Lyken grammar reached a few hundred productions the overhead became untenable. Next steps:
- Dust off the (red) Dragon Book.
- ???
- Profit!
That was the idea, anyway. I wrote several blog posts during this adventure, so I’ll avoid repeating myself (much) and just add some retrospective commentary here.
- Deprogramming the LALR(1) bias: I went through the process of implementing SLR(1) and then canonical LR(1) parser generation, and I couldn’t figure out why the Dragon Book promoted LALR(1). The reader comments were absolute gold, and they incidentally saved me from failure [1]. This is how I learned about Menhir and the then-obscure Pager algorithm [2].
- Parser generation algorithms of a bygone era: In addition to summarizing parser generator algorithms, this post discussed some of the GLR parsing algorithms I was working with. That proved to be a dead end, a negative result to perhaps expound on another time.
- More forgotten parser generation algorithms: Modern parser generators use precedence to resolve ambiguities, but in my considered opinion, only Parsing provides the precision and power one should expect. I have a lot more to say on this topic, which will have to wait for another time, though the sketch after this list gives a taste.
- Parsing.py parser generator is now available
- A simple Parsing-based parser example
- Why the overwhelming silence?: At the time I wrote Parsing there weren't many options for robust parsing in Python, and I naïvely imagined that it would be broadly adopted. Alas, it was not to be. The reader comments were a good lesson on marketing.
Current Parsing
Parsing lives on in EdgeDB, and the developers continue to maintain it as a discrete repository that is available as a PyPI package.
It bears mentioning only for completeness that I continued to develop Parsing for my own needs in Crux, and thereafter in Lyken (unreleased). Parser generation time was an ongoing issue for Lyken, and I eventually resorted to using Cython to gain a ~10X speedup. However, Cython continued to evolve, and version 0.11.3 (yes, all the way back in 2009) is the last that works with this fork of Parsing.
Citations/footnotes
1. In the comments you will see that prior to implementing the Pager algorithm I was terminating multi-day parser generation attempts. Nearly two years after I posted, Paulo Barreto came along and determined that I must have done something wrong. He was right. Prior to adopting Pager's algorithm I was generating states in depth-first order, without any memoization mechanism that would have avoided regenerating identical states. This allowed an unnecessary combinatorial explosion of states; a sketch of the memoized approach appears after these footnotes.

   How did I miss such a basic problem? Well, this was over 15 years ago, so I'm not sure, though I do recall rationalizing the terrible performance as an outcome of exponential time and space complexity. I suspect I didn't have a clear understanding of the tractable algorithmic complexity of LR(1), versus the potentially intractable algorithmic complexity of LR(k).
   Let's put some numbers to how things might have gone absent my canonical LR(1) implementation mistake. The last time I worked on Lyken was in 2013, but with a bit of effort I managed to get its pieces compiling. The last version of its grammar comprised 98 tokens, 128 non-terminals, and 692 productions. The Parsing module generates a parser with 1374 states, and with minor modification it generates a canonical LR(1) parser with 19589 states. Ouch, that's over 14 times as many states, and it takes ~10 times as long to generate, but I probably would have just dealt with that, perhaps spent some time optimizing the code to make it fast enough to be tolerable. Oh well, it was a rough ride, but in retrospect I'm glad to have taken this twisty path to enlightenment. ↩
2. David Pager, "A practical general method for constructing LR(k) parsers", Acta Informatica, 7:249–268, 1977. ↩
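As promised in footnote [1], here is roughly how small the missing piece was. This is a minimal, self-contained sketch of worklist-based LR state generation with memoization, not Parsing's actual internals; the grammar is a toy, the helper names are mine, and LR(0) items are used for brevity (the same membership test merges identical LR(1) item sets):

```python
from collections import deque

# Toy grammar: S' -> E ; E -> E + T | T ; T -> id
GRAMMAR = {
    "S'": [("E",)],
    "E": [("E", "+", "T"), ("T",)],
    "T": [("id",)],
}

def closure(items):
    """Expand an item set: wherever the dot precedes a nonterminal, add
    that nonterminal's productions with the dot at position 0."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for lhs, rhs, dot in list(items):
            if dot < len(rhs) and rhs[dot] in GRAMMAR:
                for prod in GRAMMAR[rhs[dot]]:
                    item = (rhs[dot], prod, 0)
                    if item not in items:
                        items.add(item)
                        changed = True
    return frozenset(items)

def goto(items, symbol):
    """Advance the dot over `symbol` in every item that permits it."""
    moved = {(lhs, rhs, dot + 1)
             for lhs, rhs, dot in items
             if dot < len(rhs) and rhs[dot] == symbol}
    return closure(moved) if moved else frozenset()

def generate_states():
    start = closure({("S'", ("E",), 0)})
    states = {start: 0}  # the memo: item set -> state index
    edges = {}
    work = deque([start])
    while work:
        src = work.popleft()
        for sym in {rhs[dot] for _, rhs, dot in src if dot < len(rhs)}:
            dst = goto(src, sym)
            if dst not in states:
                # A genuinely new state. Without this membership test,
                # identical states are regenerated again and again, and
                # the state count explodes combinatorially.
                states[dst] = len(states)
                work.append(dst)
            edges[(states[src], sym)] = states[dst]
    return states, edges

states, edges = generate_states()
print(len(states), "states")  # a handful of states, duplicates merged
```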