More Precise Documentation

In a recent post I wrote about precise documentation leading to Hemlock language syntax refinements that had previously been overlooked. That same process has played out during hocc parser generator design, and in one case it led me to independently discover a flaw in the Parsing module that serves as the inspiration for hocc. Settle in for a couple of satisfying design stories.

Non-associative precedence

Parser generators going back at least as far as yacc have included mechanisms for conflict resolution. A conflict occurs due to grammar specification ambiguity that permits more than one syntax tree for a given input. In the most degenerate cases conflict resolution can be as arbitrary as preferring the grammar rule which was specified first [¹], but parser generators also support associative operator precedence specifications.

%left: Prefer e.g. (2 + 3) + 4.
%right: Prefer e.g. 2 + (3 + 4).
%nonassoc: Remove all rules that would allow multiple uses of an operator.

Any student of Sesame Street will recognize that one of these things is not like the others. Some modern parser generators make it possible to assign precedence without specifying associativity, e.g. bison’s %predecence and the Parsing module’s %fail, but I had a hard time figuring out why anyone would ever want %nonassoc, even absent %precedence/%fail. Apparently we can blame the Fortran language, viz:

[…] the keyword %nonassoc is used to describe operators, like the operator .LT. in Fortran, that may not associate with themselves; thus,
A  .LT.  B  .LT.  C
is illegal in Fortran, and such an operator would be described with the keyword %nonassoc in Yacc.

Unfair jabs at Fortran aside, I think the real problem was that there was no other way to specify absence of associativity, and turning this into a parse-time error dodged the dangers of arbitrary conflict resolution.

It turns out that even though the Parsing module implements %nonassoc [²], I always used %fail. So then, should hocc even support %nonassoc semantics? No.

Implicit ⟂ for start symbols

Parser generators need to know which non-terminal symbols can be used as start symbols, that is, as syntax tree roots. There’s a slightly weird edge case with regard to LR(1) parsers that comes into play when completing a parse. The 1 in LR(1) means “one symbol of lookahead”, and at the end of input there are no more symbols, hence no lookahead. The common way to deal with this is to append a synthetic ⟂ (end of input) symbol to each production for the start symbol.

In the Parsing module there is a special eoi() method that the application calls to indicate the end of input, but for hocc this approach wouldn’t have worked well. hocc generates a Token.t variant type that would have to include a constructor for EOI. For example, the Token module corresponding to the example in this earlier blog post would look like:

Token = {
    type t: t =
      | PLUS of unit
      | MINUS of unit
      | STAR of unit
      | SLASH of unit
      | INT of Int.t
      | EOI of unit
      | EPSILON of unit

    pp >e: t -> Fmt.Formatter e >e-> Fmt.Formatter e
  }

What if the scanner were to generate an EOI token with non-unit payload? This is an actual issue for Hemlock’s scanner as currently written, and neither of the solutions I was contemplating appealed:

Hard-code EOI of unit, and require the scanner to conform, via manual conversion if necessary.
Allow the parser specification to explicitly define the special EOI constructor with a non-unit payload.

When faced with messy API choices, the best option is often to omit the API entirely. There turns out to be an elegant solution related to how Menhir handles the end of input. From Section 6.4 — End-of-stream conflicts, where # is equivalent to ⟂:

Technically, Menhir proceeds as follows. A # symbol is introduced. It is, however, only a pseudo-token: it is never produced by the lexical analyzer. For each start symbol S in the original grammar, a new start symbol S’ is defined, together with the production S′→ S. The corresponding start state of the LR(1) automaton is composed of the LR(1) item S′ → . S [# ]. That is, the pseudo-token # initially appears in the lookahead set, indicating that we expect to be done after recognizing an S-sentence.

Menhir generates parsers that preemptively reduce start productions as soon as possible. Not only does this avoid the need for feeding a special EOI token to parsers, it makes parsers more versatile. Menhir is full of gems.

Footnotes

Although parser generators commonly implement (semi-arbitrary) conflict resolution as a last resort for any and all conflicts, it’s a terrible approach that can, and does, lead to flawed parsers. ↩
The Parsing module had a long-undiscovered %nonassoc bug, fixed in 2019. There is probably still the potential for incorrect conflict resolution if %nonassoc and %fail are used together. ↩

BranchTaken

Hemlock language insights

More Precise Documentation

Non-associative precedence

Implicit ⟂ for start symbols

Footnotes