BranchTaken

Hemlock language insights

Dentation Matters

Extreme dentation

We typically speak of “indentation” in the context of programming languages, but without dedentation code gets marginal pretty quickly. Hemlock eschews what’s in; chews dentition. Er, choose dentation. Onward before the puns overwhelm.

I have been reimplementing Hemlock’s scanner for the past several weeks to convert it to a deterministic finite automaton (DFA). The main purpose of this rewrite is to ease the implementation of interpolated string formatting, but it has forced more rigor throughout the scanner.

As mentioned in the context of parsing error recovery, dentation serves as a strong heuristic indicator of code structure even for languages which don’t codify dentation, which led us to make dentation semantically meaningful in Hemlock. But we chose much stricter rules.

Mistakes were made

During reimplementation I discovered two serious flaws, as well as a missed optimization. First the flaws:

The missed optimization has to do with leading (* ... *) comments. We allow arbitrary indentation of comment-only lines because there are valid reasons to be creative/sloppy with comments.

let f x =
    match foo x with
      | A -> ...

        # Some strangely indented comment about B.
      | B ->
        let y = ...
(* Disable C for the moment.
      | C -> ...
*)
      | _ -> ...

This presents a challenge though, because (* ... *) comments can be followed by code on the same line, and in those cases indentation matters.

let lst = [
  (* Detritus. *)
    0
    (* Ouch. *) 1
    (* More ouch. *) (*
     * Srsly? *) 2
  (* This one's an error! *) 3
  ]

In the original implementation I used arbitrary lookahead to determine whether the line contained anything besides whitespace and comments. But that approach is a serious pain in the context of a DFA. It was such a messy corner case that I considered prohibiting code following comments, but then I realized that it’s fine to reorder comment tokens and synthesized indent/dedent tokens. The scanner records the line’s indentation as established by the leading comment, but lazily utilizes that information only if non-whitespace/comment tokens follow.

In practice the arbitrary lookahead didn’t incur a high cost because programmers rarely write code following comments on the same line, and it was possible to scan the line only once in the common case. But what we have now is both faster and simpler, my favorite kind of code.