Complex Path Complex
The hocc parser generator was but an elaborate design until recently, but actual code is starting to
accumulate. Paraphrasing Moltke the Elder, no design survives with certainty an earnest attempt at
implementation. Nothing has gone wrong, really, but I found myself repeatedly writing brittle code
to manipulate filesystem paths, and a few rounds of that compelled me to step back and implement a
Path module. This was a surprisingly difficult programming problem, thanks to interactions among
various edge cases, one of which was new to me despite having been around the block a few times.
Hemlock only deals with Unix-like filesystem path semantics, which means that the operating system
unifies all devices as a single logical tree rooted at /, aka the root directory. Path segments are
separated by /, and each directory contains:
- .: A link to the current directory.
- ..: A link to the parent directory (/ is its own parent) [1].
- Optional: Links to other directories and/or files.
A path like a/b/c has three path “segments”, where a and b are directories (or symbolic links, a
consideration which I’ll gloss over), and c could be a directory or a file. In many cases a path
can be normalized to a simpler form, e.g. a//b/./c/../d/ normalizes to a/b/d.
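As a quick illustration (Python rather than Hemlock, and not the Path module's own code), the
standard library's posixpath.normpath performs this kind of purely lexical normalization. Like the
primer above, it glosses over symbolic links.

```python
import posixpath

# posixpath is the Unix flavour of os.path, so the result does not depend on the host OS.
segments = "a/b/c".split("/")
print(segments)  # ['a', 'b', 'c']

# Empty segments and "." are dropped, and "c/.." cancels out. This is lexical only:
# if c were a symbolic link, the simplified path could name a different file.
normalized = posixpath.normpath("a//b/./c/../d/")
print(normalized)  # a/b/d
```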
With that minimal primer out of the way, let’s dive into two issues in the complex path complex [2].
UTF-8 encoding (or not)
Clearly segments cannot have / in their names, and the only other exclusion is nul (\0). In this
modern Unicode world that’s a blessing in that it’s easy to use UTF-8 in paths, but a curse in that
segment names can be invalid UTF-8. Hemlock thoroughly embraces UTF-8, but the Path module has to
accommodate arbitrary bytes in segment names. Oh well, at least the mixed encoding wart doesn’t
interact with the other complications.
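To make the wart concrete, here is a small Python sketch (not the Path module's approach, which
isn't shown here) that treats a path as raw bytes and checks each segment for UTF-8 validity. The
helper names split_segments and is_utf8 are made up for illustration.

```python
def split_segments(path: bytes) -> list[bytes]:
    """Split a raw path into its non-empty segments (hypothetical helper)."""
    return [seg for seg in path.split(b"/") if seg]


def is_utf8(segment: bytes) -> bool:
    """Report whether a segment decodes as valid UTF-8 (hypothetical helper)."""
    try:
        segment.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The second segment is "café" in UTF-8; the third is "café" in Latin-1, which is a
# perfectly legal file name on a Unix-like system but is not valid UTF-8.
raw = b"/tmp/caf\xc3\xa9/caf\xe9"
report = [(seg, is_utf8(seg)) for seg in split_segments(raw)]
print(report)

# Python's surrogateescape error handler allows a lossless bytes -> str -> bytes round trip.
text = raw.decode("utf-8", errors="surrogateescape")
round_trip = text.encode("utf-8", errors="surrogateescape")
print(round_trip == raw)  # True
```

Keeping segments as bytes and decoding only on demand is one way to avoid rejecting such names
outright.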
Special URI-like prefix
While reading Python’s documentation for its os.path.normpath function I learned that pathname
resolution grants special meaning to //! Mind you, /, ///, ////, etc. all mean the same thing as /,
but // is different.
A pathname consisting of a single <slash> shall resolve to the root directory of the process. A null pathname shall not be successfully resolved. If a pathname begins with two successive <slash> characters, the first component following the leading <slash> characters may be interpreted in an implementation-defined manner, although more than two leading <slash> characters shall be treated as a single <slash> character.
This adds all sorts of fun edge cases. For example, // and ///a both contain three leading empty
segments, yet only the latter can be simplified (to /a).
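A minimal Python sketch of a lexical normalizer that honors the leading-slash rule might look like
the following. It is not the Path module's implementation; the function name normalize is
hypothetical, and treating .. directly under // the same as under / is an assumption, since that
case is implementation-defined.

```python
import posixpath


def normalize(path: str) -> str:
    """Lexically normalize a Unix-like path (illustrative sketch only)."""
    # Exactly two leading slashes are preserved; one, or three or more, collapse to "/".
    n_leading = len(path) - len(path.lstrip("/"))
    if n_leading == 0:
        prefix = ""
    elif n_leading == 2:
        prefix = "//"
    else:
        prefix = "/"

    kept = []
    for seg in path.split("/"):
        if seg in ("", "."):
            continue  # Empty segments and "." contribute nothing.
        if seg == "..":
            if kept and kept[-1] != "..":
                kept.pop()  # "x/.." cancels out.
            elif not prefix:
                kept.append("..")  # A relative path may climb above its starting point.
            # Otherwise we are at the root, which is its own parent (assumed for "//" too).
            continue
        kept.append(seg)

    result = prefix + "/".join(kept)
    # Follow posixpath.normpath's convention of returning "." for an empty result.
    return result if result else "."


cases = ["//", "///a", "//a", "/../a", "a//b/./c/../d/", "../../a", "////"]
for case in cases:
    print(repr(case), "->", repr(normalize(case)))
```

On these inputs the sketch agrees with posixpath.normpath, which applies the same rule: // stays as
is, while ///a becomes /a.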
I’m not sure when filesystems started treating the leading // specially, but I’m guessing it was
somehow inspired by the authority prefix in uniform resource identifiers (URIs), some time after I
was taught the fundamentals of filesystems. There’s always something new to learn!
Footnotes
[1] Only as I wrote about the root directory being its own parent did I realize that the Path
module wasn’t simplifying e.g. /../a to /a as it should. Clearly the rubber duck deserved a bigger
role in this endeavor.
[2] Not to be confused with the fear of over-engineered buildings, aka the complex complex complex.