BranchTaken

Hemlock language insights

Parsing Retrospective


Hemlock needs a parser, and since it will use LR(1) parsing, that means using a parser generator. Hemlock will be self-hosting, which precludes using an existing parser generator, ergo the creation of hocc. Fortunately I’ve already reinvented the wheel once (that’s the way I roll) in the form of the Python-based Parsing module, and it’s a lot easier to copy my own work than that of others. Joking aside, the Parsing module synthesizes the state of the art in parser generation as of fifteen years ago, and it has one still-distinctive feature that makes developing reliable parsers much more tractable. This post is a Parsing retrospective that serves as background information for hocc development.

Before Parsing

A long series of projects led to Parsing, and in retrospect some of those projects were overly ambitious. But hey, I learned a lot and some of the software ended up being broadly useful.

  1. As an undergraduate computer science student I was assigned a senior group project that required generating PostScript, a postfix printer language which is the spiritual ancestor of the Portable Document Format (PDF). I had already become enamored with postfix languages thanks to my trusty HP 48GX calculator. Add to that a growing fascination for embeddable languages like Tcl, and a few years later I developed the Onyx programming language.
  2. As a graduate student developing the Crux bioinformatics software package, I needed flexible software for gluing together experimental pipelines. Onyx was the obvious choice because hey, I wrote it! But if you’ve ever tried to write large systems in a stack-based language, you probably know what went wrong — stack management bugs caused non-local failures. Enter Python. All was well… for a while.
  3. Adding a feature to the Python version of Crux was a day of bliss, followed by a week or two of soul-crushing conversion to a natively compiled module written in C. The problem I kept running into was that although Python was elegant for my purposes, it was way too slow for real-world dataset sizes, like, two orders of magnitude too slow.
  4. After the umpteenth rewrite of a Python module in C, I snapped and decided to write a gradually typed language called Lyken. I never completed Lyken, but some pieces of Lyken technology survive.

    • jemalloc started out as Lyken’s memory allocator. It’s funny how poor a fit the allocator was for Lyken’s actual needs, but those ~400 hours of potentially wasted effort were salvaged thanks to several thousand more hours of work, initially by myself, and over time by many others.

    • Parsing, the subject of this post.

Enter Parsing

Let’s back up just a tiny bit. I didn’t just start writing a parser generator, because that’s capital-H Hard (TM). I used the most excellent Lemon parser generator, and I remember it fondly.

Why did I leave Lemon behind? Well, the bootstrap Lyken implementation was primarily written in Python, and marshaling the C-based parsing results into Python data structures was both brittle and slow. Once the Lyken grammar reached a few hundred productions the overhead became untenable. Next steps:

  1. Dust off the (red) Dragon Book.
  2. ???
  3. Profit!

That was the idea, anyway. I wrote several blog posts during this adventure, so I’ll avoid repeating myself (much) and just add some retrospective commentary here.

Current Parsing

Parsing lives on in EdgeDB, and the developers continue to maintain it as a discrete repository that is available as a PyPI package.

It bears mentioning only for completeness that I continued to develop Parsing for my own needs in Crux, and thereafter in Lyken (unreleased). Parser generation time was an ongoing issue for Lyken, and I eventually resorted to using Cython to gain a ~10X speedup. However, Cython continued to evolve, and version 0.11.3 (yes, all the way back in 2009) is the last that works with this fork of Parsing.

Citations/footnotes

  1. In the comments you will see that prior to implementing the Pager algorithm I was terminating multi-day parser generation attempts. Nearly two years after I posted, Paulo Barreto came along and determined that I must have done something wrong. He was right. Prior to adopting Pager’s algorithm I was generating states in depth-first order, without any memoization that would have collapsed identical states. This caused an unnecessary combinatorial explosion of states.

    How did I miss such a basic problem? Well, this was over 15 years ago, so I’m not sure, though I do recall rationalizing the terrible performance as an outcome of exponential time and space complexity. I suspect I didn’t have a clear understanding of the tractable algorithmic complexity of LR(1), versus the potentially intractable algorithmic complexity of LR(k).

    Let’s put some numbers to how things might have gone absent my canonical LR(1) implementation mistake. The last time I worked on Lyken was in 2013, but with a bit of effort I managed to get its pieces compiling. The last version of its grammar comprised 98 tokens, 128 non-terminals, and 692 productions. The Parsing module generates a parser with 1374 states, and with minor modification it generates a canonical LR(1) parser with 19589 states. Ouch, that’s over 14 times as many states, and it takes ~10 times as long to generate, but I probably would have just dealt with that, perhaps spending some time optimizing the code to make it fast enough to be tolerable. Oh well, it was a rough ride, but in retrospect I’m glad to have taken this twisty path to enlightenment.

  2. David Pager, “A practical general method for constructing LR(k) parsers”, Acta Informatica, 7:249–268, 1977.
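The state-explosion bug described in footnote 1 comes down to missing memoization during state construction. Below is a minimal, hypothetical sketch — not the Parsing module’s actual code — of worklist-based item-set construction over a toy grammar, using LR(0) items rather than LR(1) items for brevity. The `states` dict is the memo table that collapses identical item sets into a single state; that lookup is exactly what a naive depth-first generator without memoization lacks.

```python
from collections import deque

# Hypothetical toy grammar: production index -> (lhs, rhs tuple).
GRAMMAR = {
    0: ("S'", ("S",)),   # augmented start production
    1: ("S", ("a", "S")),
    2: ("S", ("b",)),
}
NONTERMS = {"S'", "S"}

def closure(items):
    """Expand a set of (production, dot) items with all implied items."""
    items = set(items)
    work = deque(items)
    while work:
        prod, dot = work.popleft()
        rhs = GRAMMAR[prod][1]
        if dot < len(rhs) and rhs[dot] in NONTERMS:
            for p, (lhs, _) in GRAMMAR.items():
                if lhs == rhs[dot] and (p, 0) not in items:
                    items.add((p, 0))
                    work.append((p, 0))
    return frozenset(items)

def goto(state, sym):
    """Advance the dot over sym in every item that expects sym."""
    moved = {(p, d + 1) for p, d in state
             if d < len(GRAMMAR[p][1]) and GRAMMAR[p][1][d] == sym}
    return closure(moved) if moved else None

def build_states():
    start = closure({(0, 0)})
    states = {start: 0}      # memo table: item set -> state number
    work = deque([start])
    while work:
        state = work.popleft()
        syms = {GRAMMAR[p][1][d] for p, d in state if d < len(GRAMMAR[p][1])}
        for sym in syms:
            nxt = goto(state, sym)
            if nxt is not None and nxt not in states:
                states[nxt] = len(states)   # new state: number it once
                work.append(nxt)            # ...and expand it once
    return states
```

Without the `nxt not in states` check, every transition would spawn a fresh copy of an already-seen state and re-expand it, and for a nontrivial grammar the state count blows up combinatorially — which is consistent with the multi-day generation runs described above.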