State of the hocc Grammar

The hocc parser generator implementation is coming along nicely. The parser is stable, the first pass of semantic analysis is implemented, and no known hocc grammar issues remain. This is a good time to draw attention to a minor syntax refinement that was prompted by unnecessarily awkward code comments in the previously posted hocc grammar. Note the comments trailing each token statement, e.g.

hocc
    # ...

    # Left-right paired delimiters
    token INDENT        # Indent
    token DEDENT        # Dedent
    token LPAREN        # (
    token RPAREN        # )
    token LCAPTURE      # (|
    token RCAPTURE      # |)
    # ...

It would be nicer to reference tokens with invariant representations literally rather than by name. The yacc parser generator supports literal tokens delimited by single quotes, e.g. '(' for LPAREN. However, such yacc tokens have only a literal representation, whereas hocc needs every token to have an actual name, because token names correspond to token variant constructor labels in the generated code. The solution in hocc is to support token aliases: an optional string that follows the token name in the token statement, e.g.

    # ...

    # Left-right paired delimiters
    token INDENT
    token DEDENT
    token LPAREN "("
    token RPAREN ")"
    token LCAPTURE "(|"
    token RCAPTURE "|)"
    # ...

Aliases can then be used in place of token names in production patterns, e.g.

hocc
    # ...

    nonterm Delimited ::=
      | INDENT Codes DEDENT
      | "(" Codes0 ")"
      | "(|" Codes0 "|)"
      | "[" Codes0 "]"
      | "[|" Codes0 "|]"
      | "{" Codes0 "}"

    # ...
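
To make the role of aliases concrete, here is a minimal Python sketch of how a parser generator might map the symbols in a production pattern back to token names. It is not hocc's implementation. The `TokenTable` class and its `declare` and `resolve` methods are hypothetical names invented for illustration, and rejecting duplicate aliases is an assumption. The only behavior taken from the examples above is that every token has a name, may have a string alias, and may be referenced by either in a pattern.

```python
class TokenTable:
    """Maps optional string aliases to token names (hypothetical helper)."""

    def __init__(self):
        self._names = set()
        self._aliases = {}  # alias string -> token name

    def declare(self, name, alias=None):
        """Record `token NAME` or `token NAME "alias"`."""
        if name in self._names:
            raise ValueError(f"duplicate token name: {name}")
        # Assumption: an alias must identify exactly one token.
        if alias is not None and alias in self._aliases:
            raise ValueError(f"duplicate token alias: {alias!r}")
        self._names.add(name)
        if alias is not None:
            self._aliases[alias] = name

    def resolve(self, symbol):
        """Return the name that a pattern symbol refers to.

        A symbol wrapped in double quotes is looked up as an alias. Anything
        else is returned unchanged, since it is already a name (of a token or
        of a nonterminal).
        """
        if len(symbol) >= 2 and symbol[0] == '"' and symbol[-1] == '"':
            alias = symbol[1:-1]
            if alias not in self._aliases:
                raise KeyError(f"unknown token alias: {symbol}")
            return self._aliases[alias]
        return symbol


tokens = TokenTable()
tokens.declare("INDENT")
tokens.declare("DEDENT")
tokens.declare("LPAREN", "(")
tokens.declare("RPAREN", ")")
tokens.declare("LCAPTURE", "(|")
tokens.declare("RCAPTURE", "|)")

# One right-hand side of Delimited, written with aliases.
pattern = ['"(|"', "Codes0", '"|)"']
resolved = [tokens.resolve(symbol) for symbol in pattern]
print(resolved)  # ['LCAPTURE', 'Codes0', 'RCAPTURE']
```

The point of the sketch is that an alias is purely a second spelling: after resolution, the pattern is expressed entirely in names, so generated code can still use those names as constructor labels.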

Following is the current hocc grammar, which of course takes advantage of token aliases. There are also a few minor fixes that were prompted by grammar validation in hocc’s first pass of semantic analysis.

hocc
    # hocc-specific keywords
    token HOCC "hocc"
    token NONTERM "nonterm"
    token EPSILON "epsilon"
    token START "start"
    token TOKEN "token"
    token PREC "prec"
    token LEFT "left"
    token RIGHT "right"

    # Identifiers
    token UIDENT # Uncapitalized
    token CIDENT # Capitalized
    token USCORE "_"

    # Token alias
    token STRING

    # Punctuation/separators
    token COLON_COLON_EQ "::="
    token OF "of"
    token COLON ":"
    token DOT "."
    token ARROW "->"
    token BAR "|"
    token LT "<"
    token COMMA ","
    token SEMI ";"
    token LINE_DELIM

    # Left-right paired delimiters
    token INDENT
    token DEDENT
    token LPAREN "("
    token RPAREN ")"
    token LCAPTURE "(|"
    token RCAPTURE "|)"
    token LBRACK "["
    token RBRACK "]"
    token LARRAY "[|"
    token RARRAY "|]"
    token LCURLY "{"
    token RCURLY "}"

    # Miscellaneous Hemlock token in embedded code
    token CODE_TOKEN

    # End of input, used to terminate start symbols
    token EOI

    nonterm Ident ::= UIDENT | CIDENT | "_"

    nonterm PrecsTl ::=
      | "," UIDENT PrecsTl
      | epsilon

    nonterm Precs ::= UIDENT PrecsTl

    nonterm PrecRels ::=
      | "<" Precs
      | epsilon

    nonterm PrecType ::= "prec" | "left" | "right"

    nonterm Prec ::= PrecType UIDENT PrecRels

    nonterm OfType ::= "of" CIDENT "." UIDENT

    nonterm OfType0 ::=
      | OfType
      | epsilon

    nonterm PrecRef ::=
      | "prec" UIDENT
      | epsilon

    nonterm TokenAlias ::=
      | STRING
      | epsilon

    nonterm Token ::= "token" CIDENT TokenAlias OfType0 PrecRef

    nonterm Sep ::= LINE_DELIM | ";" | "|"

    nonterm CodesTl ::=
      | Sep Code CodesTl
      | epsilon

    nonterm Codes ::= Code CodesTl

    nonterm Codes0 ::=
      | Codes
      | epsilon

    nonterm Delimited ::=
      | INDENT Codes DEDENT
      | "(" Codes0 ")"
      | "(|" Codes0 "|)"
      | "[" Codes0 "]"
      | "[|" Codes0 "|]"
      | "{" Codes0 "}"

    nonterm CodeTl ::=
      | Delimited CodeTl
      | CODE_TOKEN CodeTl
      | epsilon

    nonterm Code ::=
      | Delimited CodeTl
      | CODE_TOKEN CodeTl

    nonterm ProdParamType ::=
      | CIDENT
      | STRING

    nonterm ProdParamIdent ::=
      | Ident ":"
      | epsilon

    nonterm ProdParam ::= ProdParamIdent ProdParamType

    nonterm ProdParamsTl ::=
      | ProdParam ProdParamsTl
      | epsilon

    nonterm ProdParams ::= ProdParam ProdParamsTl

    nonterm ProdPattern ::=
      | ProdParams
      | "epsilon"

    nonterm Prod ::= ProdPattern PrecRef

    nonterm ProdsTl ::=
      | "|" Prod ProdsTl
      | epsilon

    nonterm Prods ::=
      | "|" Prod ProdsTl
      | Prod ProdsTl

    nonterm Reduction ::= Prods "->" Code

    nonterm ReductionsTl ::=
      | "|" Reduction ReductionsTl
      | epsilon

    nonterm Reductions ::=
      | "|" Reduction ReductionsTl
      | Reduction ReductionsTl

    nonterm NontermType ::= "nonterm" | "start"

    nonterm Nonterm ::=
      | NontermType CIDENT PrecRef "::=" Prods
      | NontermType CIDENT OfType PrecRef "::=" Reductions

    nonterm Stmt ::=
      | Prec
      | Token
      | Nonterm
      | Code

    nonterm StmtsTl ::=
      | LINE_DELIM Stmt StmtsTl
      | epsilon

    nonterm Stmts ::= Stmt StmtsTl

    nonterm Hocc ::= "hocc" INDENT Stmts DEDENT

    nonterm Matter ::=
      | CODE_TOKEN Matter
      | epsilon

    start Hmh ::= Matter Hocc Matter EOI

    start Hmhi ::= Matter "hocc" Matter EOI
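
As a reading aid for the Token production above, the following Python sketch recognizes a single token statement with one regular expression, which is possible because that production is not recursive. It is an approximation under stated assumptions rather than hocc's lexer or parser: the identifier and string patterns are simplified, string escapes are not handled, and `parse_token_stmt` is a hypothetical helper name. The `of` and `prec` clauses in the usage example are made up to exercise the optional parts of the production.

```python
import re

# Mirrors: Token ::= "token" CIDENT TokenAlias OfType0 PrecRef
# Lexical classes are simplified assumptions, not hocc's real definitions.
TOKEN_STMT = re.compile(
    r'''
    token \s+ (?P<name>[A-Z]\w*)                    # "token" CIDENT
    (?: \s+ "(?P<alias>[^"]*)" )?                   # TokenAlias: STRING or epsilon
    (?: \s+ of \s+ (?P<type>[A-Z]\w*\.[a-z]\w*) )?  # OfType0: "of" CIDENT "." UIDENT
    (?: \s+ prec \s+ (?P<prec>[a-z]\w*) )?          # PrecRef: "prec" UIDENT
    \s* $
    ''',
    re.VERBOSE,
)


def parse_token_stmt(line):
    """Split one token statement into its parts, or raise ValueError."""
    match = TOKEN_STMT.match(line.strip())
    if match is None:
        raise ValueError(f"not a token statement: {line!r}")
    return match.groupdict()


# Statements taken from the grammar above.
plain = parse_token_stmt("token INDENT")
aliased = parse_token_stmt('token LPAREN "("')
print(plain["name"], plain["alias"])      # INDENT None
print(aliased["name"], aliased["alias"])  # LPAREN (

# Hypothetical statement exercising the optional type and precedence clauses.
full = parse_token_stmt("token NUM of Zint.t prec pNum")
print(full["type"], full["prec"])         # Zint.t pNum
```

Note that the alias sits between the name and the optional `of` clause, matching the order of TokenAlias and OfType0 in the production.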
