Hardly Original Compiler Compiler
hocc
is an LR(1) parser generator. Its name carries on a long tradition, to wit:
yacc
stands for “Yet Another Compiler Compiler”. Clearly the name derives from “yack”, as in, “Chuck’s dinner didn’t sit well and he yacked it.”hocc
stands for “Hardly Original Compiler Compiler”. The name derives from “hock”, as in, “Hank hocked a loogie.”Both programs interpret high-level human-written parser descriptions and produce output unfit for human consumption.
So begins the documentation I have been writing for the hocc
parser generator, which will provide
the foundation for Hemlock’s parser. It’s important to be clear about the imagery associated with
names, lest people be confused and imagine that, say, hocc
alludes to the
hawk. Indeed, yacc
to this day is confusingly associated
with the yak, which inevitably leads to uncomfortable yak
shaving conversations.
No, hocc
is not a cute and cuddly parser generator, rather it’s something of an environmentally
sustainable nuclear option [1]. For reasons described in an earlier blog
post, the Hemlock compiler will utilize
LR(1) parsing. This is a work in progress, so
for now I just want to give an overview of how hocc
will syntactically integrate with Hemlock.
Consider the following example that specifies a simple expression evaluator for expressions like
2 + 3 * 4
:
open import Basis
include hocc
left add
left mul > add
token PLUS prec add
token MINUS prec add
token STAR prec mul
token SLASH prec mul
token INT of Int.t
token EOI # End of input.
nonterm AddOp of Token.t ::=
| _:PLUS -> PLUS
| _:MINUS -> MINUS
nonterm MulOp of Token.t ::=
| _:STAR -> STAR
| _:SLASH -> SLASH
nonterm Expr of Int.t ::=
| x:INT -> x
| e0:Expr op:AddOp e1:Expr prec add ->
match op with
| AddOp PLUS -> Int.(e0 + e1)
| AddOp MINUS -> Int.(e0 - e1)
| _ -> not_reached ()
| e0:Expr op:MulOp e1:Expr prec mul ->
match op with
| MulOp STAR -> Int.(e0 * e1)
| MulOp SLASH -> Int.(e0 / e1)
| _ -> not_reached ()
start Answer of Int.t ::=
| e:Expr _:EOI -> e
The example includes some intentional design quirks in order to succinctly demonstrate features. For
example, it uses precedence-based conflict resolution to assure that 2 + 3 * 4
is parsed as
2 + (3 * 4)
rather than (2 + 3) * 4
.
Correct Incorrect
+ *
/ \ / \
2 * + 4
/ \ / \
3 4 2 3
This is all basic stuff for parser generators. For now the thing to note is how much like Hemlock
code the parser specification looks. This turns out to be surprisingly easy to implement using a
thin wrapper around Hemlock’s lexical scanner.
hocc
adds a handful of new keywords:
- Parser:
hocc
- Precedence:
prec
,left
,right
- Token:
token
- Non-terminal:
nonterm
,start
- Empty production:
epsilon
The hocc
parser of course needs to parse statements which begin with these keywords, but the
embedded Hemlock code blocks can essentially be treated as black boxes, thanks to the
indentation-defined code structure. The result is a
BNF-inspired domain-specific
language that is so tightly integrated with
Hemlock syntax that the seams barely show.