End of the #line
Hemlock is a systems programming language, but it occupies a slightly different place in the language spectrum than its peers. This post focuses on how that affects metaprogramming, i.e. treating programs as input to code generation.
Some languages are intended for use even in bare-metal applications like operating system kernels, and macros are an important feature at that level. C/C++ use the C preprocessor (cpp) as a rather blunt tool; languages like Rust instead use hygienic macros to improve integration and safety. Higher-level languages tend to eschew macros in favor of AST transformations, whether at compile time as for Java annotation processors and OCaml ppx rewriters, or as dynamic reflection logic based on run-time type information (RTTI) in languages like Python. Some languages provide multiple metaprogramming facilities, but with Hemlock we are attempting to minimize language facility redundancy. RTTI is right out since Hemlock semantically erases types (which is a lie the garbage collector needs to be in on). Macros are also out because they incur a great deal of tooling complexity. But we need to support metaprogramming somehow!
Hemlock doesn’t directly support macros, but it provides two other choices for metaprogramming:
- Pure computations can, at least in principle, be computed at compile time and their results propagated as constants. Some languages like D and Zig push this approach pretty far, but there are theoretical and technical challenges I won’t get into here.
- Programs can ingest domain-specific language (DSL) input and generate Hemlock code. A prime example of this is parser generators à la yacc, bison, and Menhir. Such DSLs commonly require the programmer to write snippets of the target language that are copied verbatim into the result, Hemlock code in our case.
The DSL approach can provide extremely high-level interfaces that only require the programmer to learn a modicum of syntax in addition to domain-specific knowledge and the ability to program in Hemlock. But what if the programmer introduces a bug in the Hemlock code that is embedded in the DSL? Productivity suffers greatly if the programmer has to reverse-engineer the generated code and discover the correspondence to the DSL. Most languages have solved this problem via line directives like those supported by C/C++. And here we hit a Hemlock-specific challenge: semantically significant code indentation makes #line unworkable!
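To make the role of line directives concrete, here is a minimal Python sketch of a generator that copies embedded snippets verbatim and precedes each one with a C-style #line directive. The Snippet type, the emit function, and the file name grammar.y are hypothetical names chosen for illustration; they are not part of Hemlock or of any real parser generator.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    """A fragment of target-language code embedded in a DSL file (hypothetical type)."""
    dsl_line: int  # 1-based line in the DSL file where the fragment starts
    text: str      # fragment text, copied verbatim into the output


def emit(snippets, dsl_path):
    """Return generated code in which each snippet follows a #line directive."""
    out = []
    for s in snippets:
        # `#line N "file"` declares that the next line is line N of `file`.
        out.append(f'#line {s.dsl_line} "{dsl_path}"')
        out.extend(s.text.splitlines())
    return "\n".join(out) + "\n"


generated = emit(
    [Snippet(12, "x = y + 1;"), Snippet(40, "if (x) {\n  f(x);\n}")],
    "grammar.y",
)
print(generated, end="")
```

With directives like these in place, a compiler error inside a snippet is reported against the DSL file rather than against the generated file.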
Source directives, not line directives
Consider a contrived simple example wherein we inline the body of square everywhere it is called. (Pretend the leading line/column numbering doesn’t exist in the actual source file.)
              1   1
  0___4___8___2___6_
1|# Inline.
2|square x =
3|    x * x
4|
5|square_square x =
6|    (square x)
7|      * (square x)
Barring reindentation, which would break the direct correspondence with the original, the following is the best we can do with the #line-based approach. Note that the x * x lines are under-indented.
#line 5
square_square x =
    (
#line 3
    x * x
#line 6
    )
      * (
#line 3
    x * x
#line 7
    )
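The bookkeeping a compiler performs on this output can be sketched in Python. The resolve_origins helper is hypothetical and handles only the bare "#line N" form used above; it assumes C's rule that a directive sets the number of the line that follows it. The indentation in the sample input is illustrative.

```python
def resolve_origins(generated):
    """Map each non-directive line of `generated` to its original line number."""
    origins = []
    current = 1  # line number assigned to the next non-directive line
    for line in generated.splitlines():
        if line.startswith("#line "):
            # A directive renumbers the following line and is not itself counted.
            current = int(line.split()[1])
            continue
        origins.append((current, line))
        current += 1
    return origins


generated = """\
#line 5
square_square x =
    (
#line 3
    x * x
#line 6
    )
      * (
#line 3
    x * x
#line 7
    )
"""

origins = resolve_origins(generated)
for number, text in origins:
    print(f"{number}: {text}")
```

The line numbers come back intact, but a #line directive carries no indentation information, so the parser still judges x * x by its indentation in the generated file.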
One solution might be to reindent the inlined code, but doing so introduces some gnarly edge cases. In this case there would be phantom leading spaces with no origin. And supposing we wanted to reindent but attribute the leading spaces to the output source, we’d need sub-line directives. Even sub-line directives aren’t enough to solve all the problems. We explored a bunch of possible solutions, and eventually decided that reindenting causes extra problems that should just be avoided.
Our solution is source directives, which encode the indentation and the number of omitted non-indentation columns. Here’s what the example output might look like.
[:5]square_square x =
    ([:3:4+0]x * x[:])
      * ([:3:4+0]x * x[:])
The [:3:4+0] directive includes code starting at the beginning of line 3, which has 4 spaces of indentation (0 leading non-space codepoints are omitted), and [:] resets location tracking to the primary source file. There’s a hygiene problem with our example though: the generated code cannot embed the whitespace which defines the block comprising square’s body. Here’s a revised input that will help resolve that problem.
              1   1
  0___4___8___2___6_
1|# Inline.
2|square x = (
3|    x * x
4|  )
5|
6|square_square x =
7|    (square x)
8|      * (square x)
The parentheses which enclose square’s body now make it possible to embed the entire body.
[:6]square_square x =
    ([:2:0+12](
    x * x
  )[:])
      * ([:2:0+12](
    x * x
  )[:])
Note that source directives are intentionally non-hygienic, because otherwise it would be impossible to use them for balanced prelude/postlude code. Yes, source directives are foot guns, but fortunately they are not intended for human use.
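As a rough illustration of what a generator has to compute, the following Python sketch builds a directive of the form used in the first example from a source line and the column at which copying starts. The source_directive function is hypothetical, and the field layout [:line:indent+omit] is read off this post's examples rather than taken from a Hemlock specification.

```python
def source_directive(line_number, line_text, start_col):
    """Build `[:line:indent+omit]` for text copied from `line_text[start_col:]`.

    indent: number of leading spaces on the source line
    omit:   number of codepoints skipped after the indentation
    (Assumed meanings, inferred from the examples in this post.)
    """
    indent = len(line_text) - len(line_text.lstrip(" "))
    if start_col < indent:
        raise ValueError("copy must start at or after the indentation")
    omit = start_col - indent
    return f"[:{line_number}:{indent}+{omit}]"


body = "    x * x"  # line 3 of the first example
inlined = source_directive(3, body, 4) + body[4:] + "[:]"
print(inlined)  # [:3:4+0]x * x[:]
```

Recording the indentation alongside the line number is what lets the copied text keep its original columns without the generator having to reindent anything.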
Incidental complexity
Source directives solve a problem that we introduced by requiring strict adherence to indentation rules. Laxer parsing would avoid the problem, but at the cost of extra complexity in common use cases. Source directives are, as near as I can tell, unique in the programming language universe. That causes me some concern, because we had to develop original technology for something that is not central to Hemlock. On the other hand, we did this because prior art demonstrates that semantic indentation aids error recovery, and we’re going all-in on that approach. Hemlock may be trying to solve too many language design problems at once, but fortunately superficial syntax matters tend not to interact deeply with language semantics, so hopefully we’ll get away with it in this case.