BranchTaken

Hemlock language insights

End of the #line


Hemlock is a systems programming language, but it occupies a slightly different place in the language spectrum than other systems languages. This post focuses on how that position interacts with metaprogramming, i.e. treating programs as input to code generation.

Some languages are intended for use even in bare-metal applications like operating system kernels, and macros are an important feature at that level. C/C++ use the C preprocessor (cpp) as a rather blunt tool; languages like Rust instead use hygienic macros to improve integration and safety. Higher-level languages tend to eschew macros in favor of AST transformations, whether at compile time, as with Java annotation processors and OCaml ppx rewriters, or as dynamic reflection logic based on run-time type information (RTTI) in languages like Python. Some languages provide multiple metaprogramming facilities, but with Hemlock we are attempting to minimize language facility redundancy. RTTI is right out since Hemlock semantically erases types (which is a lie the garbage collector needs to be in on). Macros are also out because they incur a great deal of tooling complexity. But we need to support metaprogramming somehow!

Hemlock doesn’t directly support macros, but it provides two other choices for metaprogramming:

The DSL approach can provide extremely high-level interfaces that only require the programmer to learn a modicum of syntax in addition to domain-specific knowledge and the ability to program in Hemlock. But what if the programmer introduces a bug in the Hemlock code that is embedded in the DSL? Productivity suffers greatly if the programmer has to reverse-engineer the generated code and discover the correspondence to the DSL. Most languages have solved this problem via line directives like those supported by C/C++. And here we hit a Hemlock-specific challenge: semantically significant code indentation makes #line unworkable!

Source directives, not line directives

Consider a contrived simple example wherein we inline the body of square everywhere it is called. (Pretend the leading line/column numbering doesn’t exist in the actual source file.)

              1   1
  0___4___8___2___6_
1|# Inline.
2|square x =
3|    x * x
4|
5|square_square x =
6|    (square x)
7|      * (square x)

Barring reindentation, which would break the direct correspondence with the original, the following is the best we can do with the #line-based approach. Note that the x * x lines are under-indented.

#line 5
square_square x =
    (
#line 3
    x * x
#line 6
      )
      * (
#line 3
    x * x
#line 7
      )
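To make the under-indentation concrete, here is a small Python sketch that strips the #line directives from the listing above and reports the leading spaces of each remaining line. It is an illustration only: the helper name indentation_report is made up for this example, and the sketch measures indentation without implementing Hemlock's actual indentation rules.

```python
# The #line-based output from the listing above, copied verbatim.
OUTPUT = """\
#line 5
square_square x =
    (
#line 3
    x * x
#line 6
      )
      * (
#line 3
    x * x
#line 7
      )
"""

def indentation_report(text: str) -> list[tuple[int, str]]:
    """Return (leading spaces, stripped text) for each non-directive line."""
    report = []
    for line in text.splitlines():
        if line.startswith("#line"):
            continue  # directives carry no indentation of their own
        report.append((len(line) - len(line.lstrip(" ")), line.strip()))
    return report

for indent, code in indentation_report(OUTPUT):
    print(f"{indent:2d}  {code}")
```

Both x * x lines sit at 4 spaces: no deeper than the ( at 4 spaces that opens the first expression, and shallower than the * ( at 6 spaces that opens the second.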

One solution might be to re-indent the inlined code, but doing so introduces some gnarly edge cases. In this case there would be phantom leading spaces with no origin. And supposing we wanted to re-indent but attribute the leading spaces to the output source, we’d need sub-line directives. Even sub-line directives aren’t enough to solve all the problems. We explored a bunch of possible solutions, and eventually decided that re-indenting causes extra problems that should just be avoided. Our solution is source directives, which encode the indentation and the number of omitted non-indentation columns. Here’s what the example output might look like.

[:5]square_square x =
    ([:3:4+0]x * x[:])
      * ([:3:4+0]x * x[:])

The [:3:4+0] directive includes code starting at the beginning of line 3, which has 4 spaces of indentation (0 leading non-space codepoints are omitted), and [:] resets location tracking to the primary source file. There’s a hygiene problem with our example, though: the generated code cannot embed the whitespace that defines the block comprising square’s body. Here’s a revised input that will help resolve that problem.

              1   1
  0___4___8___2___6_
1|# Inline.
2|square x = (
3|    x * x
4|  )
5|
6|square_square x =
7|    (square x)
8|      * (square x)

The parentheses which enclose square’s body now make it possible to embed the entire body.

[:6]square_square x =
    ([:2:0+11](
    x * x
  )[:])
      * ([:2:0+11](
    x * x
  )[:])

Note that source directives are intentionally non-hygienic, because otherwise it would be impossible to use them for balanced prelude/postlude code. Yes, source directives are foot guns, but fortunately they are not intended for human use.
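As a rough illustration of what a tool might do with these directives, the Python sketch below parses the three shapes that appear in this post and builds a directive for a given source position. The grammar is inferred only from the examples above, and the real Hemlock syntax may carry more fields. The names Directive, parse_directive, and make_directive are hypothetical and not part of any Hemlock tooling.

```python
import re
from typing import NamedTuple, Optional

# Assumed directive shapes, inferred only from the examples in this post:
#   [:]                  reset to the primary source
#   [:LINE]              continue at the start of LINE
#   [:LINE:INDENT+OMIT]  continue on LINE, past INDENT spaces of indentation
#                        and OMIT further codepoints
DIRECTIVE = re.compile(r"\[:(?:(\d+)(?::(\d+)\+(\d+))?)?\]")

class Directive(NamedTuple):
    line: Optional[int]  # None means "reset"
    indent: int
    omit: int

    @property
    def column(self) -> int:
        # 0-based column in the original line where the included text starts.
        return self.indent + self.omit

def parse_directive(text: str) -> Directive:
    m = DIRECTIVE.fullmatch(text)
    if m is None:
        raise ValueError(f"not a source directive: {text!r}")
    line, indent, omit = m.groups()
    if line is None:
        return Directive(None, 0, 0)
    return Directive(int(line), int(indent or 0), int(omit or 0))

def make_directive(source: list[str], line: int, column: int) -> str:
    """Build a directive for text at 0-based `column` of 1-based `line`."""
    text = source[line - 1]
    indent = len(text) - len(text.lstrip(" "))
    if column < indent:
        raise ValueError("span starts inside the indentation")
    return f"[:{line}:{indent}+{column - indent}]"

# The first example input: the body of square starts at column 4 of line 3.
source = ["# Inline.", "square x =", "    x * x"]
print(make_directive(source, 3, 4))        # [:3:4+0]
print(parse_directive("[:3:4+0]").column)  # 4
```

Keeping indentation and omitted columns as separate fields, rather than a single column number, mirrors the directive format described above: a consumer can tell how much of the offset is indentation in the original source and how much is skipped text.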

Incidental complexity

Source directives solve a problem that we introduced by requiring strict adherence to indentation rules. Laxer parsing would avoid the problem, but at the cost of extra complexity in common use cases. Source directives are, as near as I can tell, unique in the programming language universe. That causes me some concern, because we had to develop original technology for something that is not central to Hemlock. On the other hand, we did this because prior art demonstrates that semantic indentation aids error recovery, and we’re going all-in on that approach. Hemlock may be trying to solve too many language design problems at once, but fortunately superficial syntax matters tend not to interact deeply with language semantics, so hopefully we’ll get away with it in this case.