Dentation Matters
We typically speak of “indentation” in the context of programming languages, but without dedentation code gets marginal pretty quickly. Hemlock eschews what’s in; chews dentition. Er, choose dentation. Onward before the puns overwhelm.
I have been reimplementing Hemlock’s scanner for the past several weeks to convert it to a deterministic finite automaton (DFA). The main purpose of this rewrite is to ease the implementation of interpolated string formatting, but it has forced more rigor throughout the scanner.
As mentioned in the context of parsing error recovery, dentation serves as a strong heuristic indicator of code structure even for languages which don’t codify dentation, which led us to make dentation semantically meaningful in Hemlock. But we chose much stricter rules.
- Tabs are forbidden in whitespace. (Tabs can be embedded only in string literals and comments.)
- The first non-space codepoint establishes line indentation.
- Block indentation is four columns.
- Expression continuation indentation is two columns. One-column and three-column indentation are always invalid, which eliminates undetected off-by-one errors.
- The raw
\n
at the end of a line is treated as non-breaking whitespace if immediately preceded by a\
. This acts as an escape hatch which allows arbitrary continuation indentation. - Lines comprising only comments and/or whitespace are ignored.
- Consecutive expressions at the same indentation are distinct expressions.
Mistakes were made
During reimplementation I discovered two serious flaws, as well as a missed optimization. First the flaws:
- Line separator tokens were being omitted for some code.
a b c d
This should be equivalent to
a (b; c); d
, but the scanner was emitting tokens that treated it asa (b; c) d
. I surely would have tripped on this during parser implementation, but it’s nice to already have a fix. \
provides an escape hatch for indentation, in that the following line can have arbitrary leading whitespace. However, the backslash should establish line indentation if it is the first non-space codepoint.a \ b c d
This was being scanned as if equivalent to
a; b c; d
instead ofa (b; c); d
. Chances are this bug would have taken a long time to discover organically.
The missed optimization has to do with leading (* ... *)
comments. We allow arbitrary indentation
of comment-only lines because there are valid reasons to be creative/sloppy with comments.
let f x =
match foo x with
| A -> ...
# Some strangely indented comment about B.
| B ->
let y = ...
(* Disable C for the moment.
| C -> ...
*)
| _ -> ...
This presents a challenge though, because (* ... *)
comments can be followed by code on the same
line, and in those cases indentation matters.
let lst = [
(* Detritus. *)
0
(* Ouch. *) 1
(* More ouch. *) (*
* Srsly? *) 2
(* This one's an error! *) 3
]
In the original implementation I used arbitrary lookahead to determine whether the line contained anything besides whitespace and comments. But that approach is a serious pain in the context of a DFA. It was such a messy corner case that I considered prohibiting code following comments, but then I realized that it’s fine to reorder comment tokens and synthesized indent/dedent tokens. The scanner records the line’s indentation as established by the leading comment, but lazily utilizes that information only if non-whitespace/comment tokens follow.
In practice the arbitrary lookahead didn’t incur a high cost because programmers rarely write code following comments on the same line, and it was possible to scan the line only once in the common case. But what we have now is both faster and simpler, my favorite kind of code.