Simply Precise Documentation

Nearly a year ago I finished implementing Hemlock’s scanner, or so I thought. For the past two months I have been methodically reimplementing the entire scanner to accommodate the additional complexity of format strings, and that work is finally complete! The two implementation efforts were far enough apart that I had to carefully reread the design documentation to refresh my memory on the details. Along the way I encountered various complexities that were finicky to implement. My initial reluctance to reimplement these quirks was due to the implementation effort required, but I increasingly found myself balking at the complicated descriptions themselves. Yet I wanted the documentation to become more precise, not less. The only way to reduce the documentation complexity was to simplify the design, and in many cases I found simplifications that improved Hemlock. After a few successes I adopted a default attitude:

If it’s hard to explain, it’s hard to use.

The rest of this blog post shows several examples of how this played out.

Backslash continuation

Hemlock syntax started out very OCaml-like, but we decided to improve error recovery by making indentation semantically meaningful. At the time we worried that requiring absolute adherence to the indentation rules would be problematic in some edge cases, so we provided an escape hatch: backslash continuation. A backslash immediately preceding a newline caused the newline to be ignored. This enabled free-form code alignment.

# Ridiculously formatted addition.
x = \
  1 + \
    2 + \
      3 + \
        4 + \
      3 + \
    2 + \
  1
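The rule itself is tiny, which a short Python sketch can make concrete. This is not Hemlock's scanner: the function name `join_continuations` is hypothetical, and the sketch deliberately ignores strings and comments, which a real scanner would have to treat specially.

```python
def join_continuations(source: str) -> str:
    """Delete each backslash that immediately precedes a newline, together
    with that newline, so the two physical lines become one logical line."""
    return source.replace("\\\n", "")


# The indentation of each continued line survives; only the line breaks vanish.
example = "x = \\\n  1 + \\\n    2\n"
print(repr(join_continuations(example)))
```

Even in this toy form the awkward part is visible: the joined line keeps whatever indentation the continued lines had, so the result no longer says anything about block structure.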

Here we are a couple years later, with several thousand lines of prototype Hemlock code accumulated, and not a single compelling use for backslash continuation has arisen. As the rough edges were knocked off the formatting guidelines, this escape hatch increasingly stuck out as a poorly justified special case. “Use sparingly” became “use sparingly, if at all”, and finally it was obvious the escape hatch shouldn’t exist at all.

Raw string newline stripping

Suppose you are embedding a program usage string that will be printed if the program is invoked as lorem -h. It’s a safe bet that the terminal will be at least 80 columns wide, and you’d like to format the usage string to use that full width. But what if the string is embedded in an indented code context? Your best bet is to format the code something like:

    usage formatter =
        formatter |> Fmt.fmt ``
lorem usage:
    lorem -h
    lorem ipsum

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor
incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis
nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia
deserunt mollit anim id est laborum.

``

Note that the string contains leading and trailing newlines ('\n'). In this particular case the leading newline is undesirable, though one trailing newline is needed. We reasoned that raw strings would be more useful if one leading and one trailing newline were automatically stripped. This is what the documentation said:

If the raw string begins and/or ends with a \n, that codepoint is omitted. This allows raw string delimiters to be on separate lines from the string contents without changing the string.
``Single-line raw string``

``
Single-line raw string
``

``

Three-line raw string

``
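The documented rule can be restated as a few lines of Python, which helps show what the three examples above evaluate to. This is only a sketch of the rule as quoted; the function name `strip_delimiter_newlines` is hypothetical and the code is not taken from Hemlock.

```python
def strip_delimiter_newlines(raw: str) -> str:
    """Drop at most one leading and at most one trailing newline from the
    text found between the raw string delimiters."""
    if raw.startswith("\n"):
        raw = raw[1:]
    if raw.endswith("\n"):
        raw = raw[:-1]
    return raw


# The text between the delimiters in each of the three examples above.
same_line = "Single-line raw string"
own_lines = "\nSingle-line raw string\n"
three_line = "\n\nThree-line raw string\n\n"

for body in (same_line, own_lines, three_line):
    print(repr(strip_delimiter_newlines(body)))
```

The first two bodies produce the same string, and the third keeps one blank line on each side of its text. Having to count newlines to predict the result is a hint of the confusion the rule caused.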

In retrospect we over-fit the design to an imagined use case that wasn’t generally representative. The newline stripping is confusing when you don’t want it, which turns out to be more common than imagined. Raw strings are now truly raw. The usage example would now use an explicit String.lstrip call to strip the leading newline:

    usage formatter =
        formatter |> Fmt.fmt
            String.lstrip ``
lorem usage:
    lorem -h
    lorem ipsum

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor
incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis
nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia
deserunt mollit anim id est laborum.
``

Documentation: strings → comments → attributes

For the past couple decades most programming languages have provided mechanisms for embedding API documentation in source code comments such that it can be automatically extracted, e.g. via a third-party tool like doxygen, or via language-specific tooling for doc comments or docstrings. We planned to use doc strings in Hemlock, but this proved tricky to integrate cleanly once the nuances of semantically significant indentation were all accounted for. For reference here’s a small example of what doc string syntax looked like (in a happy case):

val hash_fold 'a: (a -> Hash.State.t -> Hash.State.t) -> _&t a -> Hash.State.t -> Hash.State.t
  `|`hash_fold hash_fold_a t state` incorporates the hash of `t` into `state` and returns the
   |resulting state. Array elements are sequentially hash-folded into the resulting state via
   |`hash_fold_a`.
  `

When the doc string syntax started breaking down I decided to just copy what OCaml does with doc comments. This made some sense because Hemlock leaves comments out of its strict indentation regime. However, there are some unfortunate oddities with regard to how OCaml’s doc comments attach to preceding and/or following interfaces. After a bunch of prototyping I came up with an unambiguous doc comment syntax, but it definitely felt stilted in a couple places (note the doc comment placement for variant constructors, Hi/Hello/Bye).

(** Re: file-level module. *)

(** Unattached doc comment. *)

(** Re: `type t`. *)
type t: t = uns

(** Unattached doc comment. *)

(** Re: `type v`. *)
type v: v =
  | (** Re: `Hi`. *)
  Hi
  | (** Re: `Hello`. *)
  Hello of string
  | (** Re: `Bye`. *)
  Bye

(** Re: `type r`. *)
type r: r = {
    (** Re: `x`. *)
    x: uns
    (** Re: `s`. *)
    s: string
  }

(** Re: `T`. *)
T = {
    (** Re: `x`. *)
    x = 42
    (** Unattached doc comment. *)

    (** Unattached doc comment. *)
    (** Re: `f`. *)
    f x =
        # ...
    (** Re: `U`. *)
    U = {
        (* ... *)
      }
    (** Unattached comment. *)
  }

This seemed adequate, but it bothered me that comments were being used as an escape from indentation requirements, so I kept looking for cleaner solutions. I noted while reading about OCaml’s doc comments that they are transformed to attributes. Well, attributes do fit cleanly in Hemlock’s syntax, and it makes a lot of sense for Hemlock to directly use attributes for embedded documentation. I’m curious why OCaml added both doc comments and attributes in the 4.03 release rather than just attributes, but those decisions slightly predate my use of OCaml and spelunking the revision history didn’t turn up any answers. Anyway, here’s what Hemlock’s doc attributes look like:

[@@@doc "Re: file-level module."]

[@@@doc "Unattached doc comment."]

type t: t = uns
  [@@doc "Re: `type t`."]

[@@@doc "Unattached doc comment."]

type v: v =
  | Hi              [@doc "Re: `Hi`."]
  | Hello of string [@doc "Re: `Hello`."]
  | Bye             [@doc "Re: `Bye`."]
  [@@doc "Re: `type v`."]

type r: r = {
    x: uns    [@@doc "Re: `x`."]
    s: string [@@doc "Re: `s`."]
  }
  [@@doc "Re: `type r`."]

T = {
    x = 42
      [@@doc "Re: `x`."]
    [@@@doc "Unattached doc comment."]

    [@@@doc "Unattached doc comment."]
    f x =
        # ...
      [@@doc "Re: `f`."]
    U = {
        (* ... *)
      }
      [@@doc "Re: `U`."]
    [@@@doc "Unattached comment."]
  }
  [@@doc "Re: `T`."]

Bar-margin strings

Hemlock’s documentation extractor assumes the embedded documentation is Markdown-formatted. We chose Markdown partly because doing so vastly reduces design, implementation, and maintenance costs. But we also chose Markdown because it is currently ubiquitous and programmers don’t have to learn a special tool to write documentation.

Markdown has the nice property that outside of quoted blocks single line breaks do not affect layout of the rendered output. However, indentation does matter. We reasoned that it would be difficult to correctly manage indentation without a clearly defined left margin, so we came up with bar-margin strings.

    s = `|This is a bar-margin string,
         |where the leading `|` serves
         |as a visible left margin.
         |
         |- Bullet
         |- Bullet
         |
         |    Indented
         |    block
        `
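To show how the margin was meant to be read, here is a Python sketch that recovers the text from a bar-margin string body. The decoding rule is an assumption inferred from the example above, not Hemlock's documented behavior, and the name `decode_bar_margin` is hypothetical.

```python
def decode_bar_margin(body: str) -> str:
    """Recover the text of a bar-margin string body (the text between the
    backtick delimiters).

    Assumed rule: on each line, discard the leading whitespace and the first
    '|'. A whitespace-only line without a bar, such as the indentation before
    the closing delimiter, contributes nothing.
    """
    lines = []
    for line in body.split("\n"):
        margin, bar, text = line.partition("|")
        if bar and margin.strip() == "":
            lines.append(text)
        elif line.strip() == "":
            continue  # e.g. indentation before the closing delimiter
        else:
            raise ValueError(f"line lacks a bar margin: {line!r}")
    return "\n".join(lines)


body = "|First `|` line\n     |\n     |- Bullet\n     |    Indented\n    "
print(decode_bar_margin(body))
```

Only the first bar on each line is treated as the margin, so a `|` inside the text is preserved.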

In retrospect there were some really unfortunate aspects to this design.

- I had to write a non-trivial program to automatically rewrap such strings, because the bar margin directly abuts the string contents. This would be a challenge for every editor/IDE.
- If the code preceding the string delimiter changes length, the entire string’s indentation must change, which potentially means rewrapping. This behavior accidentally mimics what I consider a shortcoming in Haskell’s indentation rules.

The best “fix” was to simply remove this syntax. This places a bit of responsibility on the documentation generator to normalize leading whitespace, but the normalization rules are simple to explain and work with. This generally allows the programmer to block-indent the embedded documentation as if it were a nested block within the code, or for simple single-paragraph documentation, just wrap it.
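The post does not spell out the normalization rules, so the following Python sketch shows only one plausible reading: remove the indentation shared by every non-blank line and trim surrounding blank lines. It relies on the standard library's `textwrap.dedent`; the function name `normalize_doc` is hypothetical, and Hemlock's documentation generator may well differ in the details.

```python
import textwrap


def normalize_doc(doc: str) -> str:
    """Remove the indentation common to all non-blank lines, then trim
    leading and trailing blank lines. Relative indentation is preserved,
    so Markdown lists and indented blocks keep their meaning."""
    return textwrap.dedent(doc).strip("\n")


# Documentation block-indented as if it were a nested block within the code.
doc = """
    Summary line.

    - Bullet

        Indented
        block
"""
print(normalize_doc(doc))
```

Because only the shared indentation is removed, changing the nesting depth of the surrounding code means shifting the whole block left or right, with no rewrapping against a margin character.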

BranchTaken

Hemlock language insights
