More Precise Documentation
In a recent post I wrote about precise
documentation leading to Hemlock language syntax refinements that had previously been overlooked.
That same process has played out during hocc
parser
generator design, and in one case it led
me to independently discover a flaw in the Parsing
module that serves as the inspiration for hocc
.
Settle in for a couple of satisfying design stories.
Non-associative precedence
Parser generators going back at least as far as yacc
have
included mechanisms for conflict resolution. A conflict occurs due to grammar specification
ambiguity that permits more than one syntax tree for a given input. In the most degenerate cases
conflict resolution can be as arbitrary as preferring the grammar rule which was specified first
[1], but parser generators also support associative operator precedence specifications.
%left
: Prefer e.g.(2 + 3) + 4
.%right
: Prefer e.g.2 + (3 + 4)
.%nonassoc
: Remove all rules that would allow multiple uses of an operator.
Any student of Sesame Street will recognize that one
of these things is not like the others. Some modern
parser generators make it possible to assign precedence without specifying associativity, e.g.
bison’s %predecence
and the Parsing
module’s %fail
, but
I had a hard time figuring out why anyone would ever want %nonassoc
, even absent
%precedence
/%fail
. Apparently we can blame the Fortran
language, viz:
[…] the keyword
%nonassoc
is used to describe operators, like the operator.LT.
in Fortran, that may not associate with themselves; thus,A .LT. B .LT. C
is illegal in Fortran, and such an operator would be described with the keyword
%nonassoc
in Yacc.
Unfair jabs at Fortran aside, I think the real problem was that there was no other way to specify absence of associativity, and turning this into a parse-time error dodged the dangers of arbitrary conflict resolution.
It turns out that even though the Parsing
module implements %nonassoc
[2],
I always used %fail
. So then, should hocc
even support %nonassoc
semantics? No.
Implicit ⟂ for start symbols
Parser generators need to know which non-terminal symbols can be used as start symbols, that is, as syntax tree roots. There’s a slightly weird edge case with regard to LR(1) parsers that comes into play when completing a parse. The 1 in LR(1) means “one symbol of lookahead”, and at the end of input there are no more symbols, hence no lookahead. The common way to deal with this is to append a synthetic ⟂ (end of input) symbol to each production for the start symbol.
In the Parsing
module there is a special eoi()
method that the application calls to indicate the
end of input, but for hocc
this approach wouldn’t have worked well. hocc
generates a Token.t
variant type that would have to include a constructor for EOI
. For example, the Token
module
corresponding to the example in this earlier blog
post would look like:
Token = {
type t: t =
| PLUS of unit
| MINUS of unit
| STAR of unit
| SLASH of unit
| INT of Int.t
| EOI of unit
| EPSILON of unit
pp >e: t -> Fmt.Formatter e >e-> Fmt.Formatter e
}
What if the scanner were to generate an EOI
token with non-unit
payload? This is an actual issue
for Hemlock’s scanner as currently written, and neither of the solutions I was contemplating
appealed:
- Hard-code
EOI of unit
, and require the scanner to conform, via manual conversion if necessary. - Allow the parser specification to explicitly define the special
EOI
constructor with a non-unit
payload.
When faced with messy API choices, the best option is often to omit the API entirely. There turns out to be an elegant solution related to how Menhir handles the end of input. From Section 6.4 — End-of-stream conflicts, where # is equivalent to ⟂:
Technically, Menhir proceeds as follows. A # symbol is introduced. It is, however, only a pseudo-token: it is never produced by the lexical analyzer. For each start symbol S in the original grammar, a new start symbol S’ is defined, together with the production S′→ S. The corresponding start state of the LR(1) automaton is composed of the LR(1) item S′ → . S [# ]. That is, the pseudo-token # initially appears in the lookahead set, indicating that we expect to be done after recognizing an S-sentence.
Menhir generates parsers that preemptively reduce start productions as soon as possible. Not only
does this avoid the need for feeding a special EOI
token to parsers, it makes parsers more
versatile. Menhir is full of gems.
Footnotes
-
Although parser generators commonly implement (semi-arbitrary) conflict resolution as a last resort for any and all conflicts, it’s a terrible approach that can, and does, lead to flawed parsers. ↩
-
The
Parsing
module had a long-undiscovered%nonassoc
bug, fixed in 2019. There is probably still the potential for incorrect conflict resolution if%nonassoc
and%fail
are used together. ↩