Hardly Hazardous Hoccing Gleans Great Grammars
Hemlock’s parser will be implemented via the Hardly Original Compiler Compiler
(hocc
). hocc
is still a work in
progress, yet here I am, already verbing validation viz “hoccing”, availing abundant alliteration.
There is a maddening challenge in bootstrapping hocc
, which itself is a tool for bootstrapping the
Hemlock compiler, hmc
. In broad strokes, hocc
fits into the Hemlock development project as
follows.
- Manually implement a recursive descent
parser for
hocc
’s syntax. This bootstrapping version ofhocc
is written in OCaml rather than Hemlock, sincehmc
doesn’t exist yet. - Implement the bootstrap version of
hmc
usinghocc
. - Rewrite
hmc
in Hemlock. - Rewrite
hocc
in Hemlock.
It would be tempting to add a (1a) step and reimplement hocc
’s parser using hocc
itself, but the
bootstrap hocc
will eventually be discarded, so there’s little point unless doing so catches flaws
in the hocc
grammar. I’d rather wait until step (4) to implement hocc
’s parser using hocc
, but
it’s risky to assume hocc
’s grammar is sound. Of course, there’s a great tool for grammar
validation — hocc
! Guy Steele once
spoke about how he designs language grammars:
[…] be sure that your language will parse. It seems stupid […] to sit down and start designing constructs and [not] worry about how they fit together. You can get a language that’s difficult, if not impossible, to parse, not only for a computer but for a person. I use yacc constantly as a check of all my language designs, but I very seldom use yacc in the implementation. I use it as a tester to be sure that it’s LR(1), not for the sake of the computer, but because if a language is LR(1) it’s more likely that a person can deal with it.
Recently I started manually translating various test grammars to hocc
syntax, and noticed that
with a couple minor changes, hocc
grammars can be written in a style that is strikingly close to
Backus-Naur Form (BNF). For example, in
the original hocc
syntax, a non-terminal that looked like:
nonterm PrecsListBodyTl of Unit.t ::=
| _:SEMI _:UIDENT _:PrecsListBodyTl
| epsilon
-> ()
can now be simplified to:
nonterm PrecsListBodyTl ::=
| SEMI UIDENT PrecsListBodyTl
| epsilon
I’ll close by including the current (possibly flawed) hocc
grammar. Yes, hocc
can parse this
grammar specification. Soon hocc
will also report any grammar ambiguities.
hocc
nonterm Ident ::= UIDENT | CIDENT | USCORE
nonterm PrecsListBodyTl ::=
| SEMI UIDENT PrecsListBodyTl
| epsilon
nonterm PrecsListBody ::= UIDENT PrecsListTl
nonterm Precs ::=
| LBRACK PrecsListBody RBRACK
| UIDENT
nonterm Rel ::= LT | EQ | GT
nonterm PrecRel ::= Rel Precs
nonterm PrecRels ::=
| PrecRel PrecRels
| epsilon
nonterm PrecType ::= PREC | LEFT | RIGHT
nonterm Prec ::= PrecType UIDENT PrecRels
nonterm OfType ::= OF CIDENT DOT UIDENT
nonterm OfType0 ::=
| OfType
| epsilon
nonterm PrecRef ::=
| PREC UIDENT
| epsilon
nonterm Token ::= TOKEN CIDENT OfType0 PrecRef
nonterm Sep ::= LINE_DELIM | SEMI | BAR
nonterm CodesTl ::=
| Sep Code CodesTl
| epsilon
nonterm Codes ::= Code CodesTl
nonterm Codes0 ::=
| Codes
| epsilon
nonterm Delimited ::=
| INDENT Codes DEDENT
| LPAREN Codes0 RPAREN
| LCAPTURE Codes0 RCAPTURE
| LBRACK Codes0 RBRACK
| LARRAY Codes0 RARRAY
| LCURLY Codes0 RCURLY
nonterm CodeTl ::=
| Delimited CodeTl
| CODE_TOKEN CodeTl
| epsilon
nonterm Code ::=
| Delimited CodeTl
| CODE_TOKEN CodeTl
nonterm ProdParamType ::=
| UIDENT
| CIDENT
nonterm ProdParamIdent ::=
| IDENT COLON
| epsilon
nonterm ProdParam ::= ProdParamIdent ProdParamType
nonterm ProdParamsTl ::=
| ProdParam ProdParamsTl
| epsilon
nonterm ProdParams ::= ProdParam ProdParamsTl
nonterm ProdPattern ::=
| ProdParams
| EPSILON
nonterm Prod ::= ProdPattern PrecRef
nonterm ProdsTl ::=
| BAR Prod ProdsTl
| epsilon
nonterm Prods ::=
| BAR Prod ProdsTl
| Prod ProdsTl
nonterm Reduction ::= Prods ARROW Code
nonterm ReductionsTl ::=
| BAR Reduction ReductionsTl
| epsilon
nonterm Reductions ::=
| BAR Reduction ReductionsTl
| Reduction ReductionsTl
nonterm NontermType ::= NONTERM | START
nonterm Nonterm ::=
| NontermType CIDENT OfType PrecRef COLON_COLON_EQ Reductions
| NontermType CIDENT PrecRef COLON_COLON_EQ Prods
nonterm Stmt ::=
| Prec
| Token
| Nonterm
| Code
nonterm StmtsTl ::=
| LINE_DELIM Stmt StmtsTl
| epsilon
nonterm Stmts ::= Stmt StmtsTl
nonterm Hocc ::= HOCC INDENT Stmts DEDENT
nonterm Matter ::=
| CODE_TOKEN Matter
| epsilon
start Hmh ::= Matter Hocc Matter EOI
start Hmhi ::= Matter HOCC Matter EOI
# hocc-specific keywords
token HOCC # hocc
token NONTERM # nonterm
token EPSILON # epsilon
token START # start
token PREC # prec
token LEFT # left
token RIGHT # right
# Identifiers
token UIDENT # Uncapitalized
token CIDENT # Capitalized
token USCORE # _
# Relationship operators
token LT # <
token EQ # =
token GT # >
# Punctuation/separators
token COLON_COLON_EQ# ::=
token OF # of
token DOT # .
token ARROW # ->
token BAR # |
token SEMI # ;
token LINE_DELIM # Line delimiter
# Left-right paired delimiters
token INDENT # Indent
token DEDENT # Dedent
token LPAREN # (
token RPAREN # )
token LCAPTURE # (|
token RCAPTURE # |)
token LBRACK # [
token RBRACK # ]
token LARRAY # [|
token RARRAY # |]
token LCURLY # {
token RCURLY # }
# Miscellaneous Hemlock token in embedded code
token CODE_TOKEN
# End of input, used to terminate start symbols
token EOI