BranchTaken

Hemlock language insights

Hardly Hazardous Hoccing Gleans Great Grammars

Radiation hazard

Hemlock’s parser will be implemented via the Hardly Original Compiler Compiler (hocc). hocc is still a work in progress, yet here I am, already verbing validation viz “hoccing”, availing abundant alliteration. There is a maddening challenge in bootstrapping hocc, which itself is a tool for bootstrapping the Hemlock compiler, hmc. In broad strokes, hocc fits into the Hemlock development project as follows.

  1. Manually implement a recursive descent parser for hocc’s syntax. This bootstrapping version of hocc is written in OCaml rather than Hemlock, since hmc doesn’t exist yet.
  2. Implement the bootstrap version of hmc using hocc.
  3. Rewrite hmc in Hemlock.
  4. Rewrite hocc in Hemlock.

It would be tempting to add a (1a) step and reimplement hocc’s parser using hocc itself, but the bootstrap hocc will eventually be discarded, so there’s little point unless doing so catches flaws in the hocc grammar. I’d rather wait until step (4) to implement hocc’s parser using hocc, but it’s risky to assume hocc’s grammar is sound. Of course, there’s a great tool for grammar validation — hocc! Guy Steele once spoke about how he designs language grammars:

[…] be sure that your language will parse. It seems stupid […] to sit down and start designing constructs and [not] worry about how they fit together. You can get a language that’s difficult, if not impossible, to parse, not only for a computer but for a person. I use yacc constantly as a check of all my language designs, but I very seldom use yacc in the implementation. I use it as a tester to be sure that it’s LR(1), not for the sake of the computer, but because if a language is LR(1) it’s more likely that a person can deal with it.

Recently I started manually translating various test grammars to hocc syntax, and noticed that with a couple minor changes, hocc grammars can be written in a style that is strikingly close to Backus-Naur Form (BNF). For example, in the original hocc syntax, a non-terminal that looked like:

    nonterm PrecsListBodyTl of Unit.t ::=
      | _:SEMI _:UIDENT _:PrecsListBodyTl
      | epsilon
      -> ()

can now be simplified to:

    nonterm PrecsListBodyTl ::=
      | SEMI UIDENT PrecsListBodyTl
      | epsilon

I’ll close by including the current (possibly flawed) hocc grammar. Yes, hocc can parse this grammar specification. Soon hocc will also report any grammar ambiguities.

hocc
    nonterm Ident ::= UIDENT | CIDENT | USCORE

    nonterm PrecsListBodyTl ::=
      | SEMI UIDENT PrecsListBodyTl
      | epsilon

    nonterm PrecsListBody ::= UIDENT PrecsListTl

    nonterm Precs ::=
      | LBRACK PrecsListBody RBRACK
      | UIDENT

    nonterm Rel ::= LT | EQ | GT

    nonterm PrecRel ::= Rel Precs

    nonterm PrecRels ::=
      | PrecRel PrecRels
      | epsilon

    nonterm PrecType ::= PREC | LEFT | RIGHT

    nonterm Prec ::= PrecType UIDENT PrecRels

    nonterm OfType ::= OF CIDENT DOT UIDENT

    nonterm OfType0 ::=
      | OfType
      | epsilon

    nonterm PrecRef ::=
      | PREC UIDENT
      | epsilon

    nonterm Token ::= TOKEN CIDENT OfType0 PrecRef

    nonterm Sep ::= LINE_DELIM | SEMI | BAR

    nonterm CodesTl ::=
      | Sep Code CodesTl
      | epsilon

    nonterm Codes ::= Code CodesTl

    nonterm Codes0 ::=
      | Codes
      | epsilon

    nonterm Delimited ::=
      | INDENT Codes DEDENT
      | LPAREN Codes0 RPAREN
      | LCAPTURE Codes0 RCAPTURE
      | LBRACK Codes0 RBRACK
      | LARRAY Codes0 RARRAY
      | LCURLY Codes0 RCURLY

    nonterm CodeTl ::=
      | Delimited CodeTl
      | CODE_TOKEN CodeTl
      | epsilon

    nonterm Code ::=
      | Delimited CodeTl
      | CODE_TOKEN CodeTl

    nonterm ProdParamType ::=
      | UIDENT
      | CIDENT

    nonterm ProdParamIdent ::=
      | IDENT COLON
      | epsilon

    nonterm ProdParam ::= ProdParamIdent ProdParamType

    nonterm ProdParamsTl ::=
      | ProdParam ProdParamsTl
      | epsilon

    nonterm ProdParams ::= ProdParam ProdParamsTl

    nonterm ProdPattern ::=
      | ProdParams
      | EPSILON

    nonterm Prod ::= ProdPattern PrecRef

    nonterm ProdsTl ::=
      | BAR Prod ProdsTl
      | epsilon

    nonterm Prods ::=
      | BAR Prod ProdsTl
      | Prod ProdsTl

    nonterm Reduction ::= Prods ARROW Code

    nonterm ReductionsTl ::=
      | BAR Reduction ReductionsTl
      | epsilon

    nonterm Reductions ::=
      | BAR Reduction ReductionsTl
      | Reduction ReductionsTl

    nonterm NontermType ::= NONTERM | START

    nonterm Nonterm ::=
      | NontermType CIDENT OfType PrecRef COLON_COLON_EQ Reductions
      | NontermType CIDENT PrecRef COLON_COLON_EQ Prods

    nonterm Stmt ::=
      | Prec
      | Token
      | Nonterm
      | Code

    nonterm StmtsTl ::=
      | LINE_DELIM Stmt StmtsTl
      | epsilon

    nonterm Stmts ::= Stmt StmtsTl

    nonterm Hocc ::= HOCC INDENT Stmts DEDENT

    nonterm Matter ::=
      | CODE_TOKEN Matter
      | epsilon

    start Hmh ::= Matter Hocc Matter EOI

    start Hmhi ::= Matter HOCC Matter EOI

    # hocc-specific keywords
    token HOCC          # hocc
    token NONTERM       # nonterm
    token EPSILON       # epsilon
    token START         # start
    token PREC          # prec
    token LEFT          # left
    token RIGHT         # right

    # Identifiers
    token UIDENT        # Uncapitalized
    token CIDENT        # Capitalized
    token USCORE        # _

    # Relationship operators
    token LT            # <
    token EQ            # =
    token GT            # >

    # Punctuation/separators
    token COLON_COLON_EQ# ::=
    token OF            # of
    token DOT           # .
    token ARROW         # ->
    token BAR           # |
    token SEMI          # ;
    token LINE_DELIM    # Line delimiter

    # Left-right paired delimiters
    token INDENT        # Indent
    token DEDENT        # Dedent
    token LPAREN        # (
    token RPAREN        # )
    token LCAPTURE      # (|
    token RCAPTURE      # |)
    token LBRACK        # [
    token RBRACK        # ]
    token LARRAY        # [|
    token RARRAY        # |]
    token LCURLY        # {
    token RCURLY        # }

    # Miscellaneous Hemlock token in embedded code
    token CODE_TOKEN

    # End of input, used to terminate start symbols
    token EOI