BranchTaken

Hemlock language insights

Complex Path Complex

Paths

The hocc parser generator was but an elaborate design until recently, but actual code is starting to accumulate. Paraphrasing Moltke the Elder, no design survives with certainty an earnest attempt at implementation. Nothing has gone wrong, really, but I found myself repeatedly writing brittle code to manipulate filesystem paths, and a few rounds of that compelled me to step back and implement a Path module. This was a surprisingly difficult programming problem, thanks to interactions among various edge cases, one of which was new to me despite having been around the block a few times.

Hemlock only deals with Unix-like filesystem path semantics, which means that the operating system unifies all devices as a single logical tree rooted at /, aka the root directory. Path segments are separated by /, and each directory contains:

A path like a/b/c has three path “segments”, where a and b are directories (or symbolic links, a consideration which I’ll gloss over), and c could be a directory or a file. In many cases a path can be normalized to a simpler form, e.g. a//b/./c/../d/ normalizes to a/b/d.

With that minimal primer out of the way, let’s dive into two issues in the complex path complex [2].

UTF-8 encoding (or not)

Clearly segments cannot have / in their names, and the only other exclusion is nul (\0). In this modern Unicode world that’s a blessing in that it’s easy to use UTF-8 in paths, but a curse in that segment names can be invalid UTF-8. Hemlock thoroughly embraces UTF-8, but the Path module has to accommodate arbitrary bytes in segment names. Oh well, at least the mixed encoding wart doesn’t interact with the other complications.

Special URI-like prefix

While reading Python’s documentation for its os.path.normpath function I learned that pathname resolution grants special meaning to //! Mind you, /, ///, ////, etc. all mean the same thing as /, but // is different.

A pathname consisting of a single <slash> shall resolve to the root directory of the process. A null pathname shall not be successfully resolved. If a pathname begins with two successive <slash> characters, the first component following the leading <slash> characters may be interpreted in an implementation-defined manner, although more than two leading <slash> characters shall be treated as a single <slash> character.

This adds all sorts of fun edge cases. For example, // and ///a both contain three leading empty segments, yet only the latter can be simplified (to /a).

I’m not sure when filesystems started treating the leading // specially, but I’m guessing it was somehow inspired by the authority prefix in uniform resource identifiers (URIs), some time after I was taught the fundamentals of filesystems. There’s always something new to learn!

Footnotes

  1. Only as I wrote about the root directory being its own parent did I realize that the Path module wasn’t simplifying e.g. /../a to /a as it should. Clearly the rubber duck deserved a bigger role in this endeavor. 

  2. Not to be confused with the fear of over-engineered buildings, aka the complex complex complex.