Hemlock is (aspirationally) a general-purpose programming language, a category already abundantly populated with options sufficient for almost any task. How can an upstart language project out-general the generals? And can the attempt even be justified? I’m here to argue that these existential questions have good answers. Read on for some history and an elevator pitch perhaps best suited to dallying elevators.
Have you ever faced an embarrassingly parallel programming problem, only to write a straight-line solution because the tools at hand make parallel execution too much trouble to accommodate? Earlier today I gave up on getting MP3 files to play instead of screeching on my newly installed Kubuntu 22.04 system, and cobbled together the commands to convert the files to FLAC format. Converting the ~1200 files took long enough for at least one tea break… using a single CPU core of the 128 cores in this machine. With some additional effort GNU parallel could have cut the wall time to under a minute, but that additional effort wasn’t quite justified.
While developing tools such as Crux, a tool for phylogenetic inference, I settled on the following rule of thumb:
When utilizing a computationally intensive software pipeline, don’t seriously entertain retrofitting the pipeline to a Beowulf cluster unless the latency is >100X beyond comfort.
This stemmed from two observations:
- Time spent retrofitting is time not spent optimizing the core code. Even algorithmically sound research software can often be sped up by ~10X on a single CPU just by paying close attention to memory locality and addressing bottlenecks indicated by execution profiles.
- The Beowulf clusters available to me had at most 512 CPUs each, so the maximum theoretical speedup on offer was less than 1000X. That left a narrow range of performance problems for which parallel computation was the easy answer.
This sparked in me a desire for some technology which would offer such a huge latency advantage that doing the extra parallelization work would routinely pay off. For nearly a decade I hoped that would take the form of high-level languages effectively targeting field-programmable gate arrays (FPGAs). Alas, this idea was met with skepticism every time I suggested it to a colleague with some knowledge of FPGAs. Nonetheless I nursed the idea along and started learning enough about FPGAs to ascertain the scope of developing a prototype implementation.
This early FPGA exploration took two surprising turns. First, it became abundantly clear that we don’t even have robust software systems for writing high-level parallelizable code, and that’s a prerequisite for effectively targeting FPGAs. So I narrowed my focus to that prerequisite technology and started filling in programming language knowledge gaps to inform the solution. The second surprising turn came as I finally grokked the fundamentals of functional programming via OCaml. Up to that point I had thought of parallelizing optimizations as being enabled via opt-in regimes (e.g. denoting functions as pure, or data as constant or read-only). But purely functional programming offers the possibility of an opt-out regime, and this is a surprisingly practical option that, when pursued to its logical conclusion, results in a system that incidentally preserves a great deal of inherent parallelism. This is just what the automatic parallelism doctor ordered!
Unfortunately, OCaml didn’t come close to being pure enough for my purposes (an obvious example being its lack of immutable arrays), and the more I looked at how to retrofit OCaml, the more intractable the prospect appeared. Thus Hemlock was born.
Time for the elevator pitch!
Hemlock is a general-purpose programming language that executes programs in parallel when possible. And it’s often possible. Dependency-free Hemlock code is usually easier to write than code with false dependencies. Coherent language design choices enable a powerful symbiotic whole:
- Data are immutable by default, and functions are effectless by default. Thus large portions of a program’s call graph are typically pure.
- The type system provides a comprehensive effects/mutability model such that mutable data and effectful functions have well specified impact on the program as a whole. Neither the programmer nor the compiler has to rely on global paranoia for correct results.
- Data are reclaimed via automatic garbage collection (GC). Most data are immutable, and immutable data are acyclic by construction.
- The global heap is immutable, and its GC overhead is directly proportional to allocation volume, thus scaling linearly with overall computation.
- Actors are the foundation for all parallel execution. Actors communicate exclusively via asynchronous message passing of global heap values.
- The runtime automatically executes programs in parallel when compute resources are otherwise idle and data-independent portions of the program execution exist.
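To make the pitch concrete, here is a rough analogue of those ideas sketched in Rust rather than Hemlock (the structure and names are illustrative, not Hemlock syntax): bindings are immutable unless explicitly opted in, and the “actors” are share-nothing threads that communicate only by sending owned values over a channel. The key difference is that Hemlock’s runtime would discover the data independence itself, whereas here it is spelled out by hand.

```rust
use std::sync::mpsc;
use std::thread;

// A pure function: no mutation visible to callers, no effects.
// Calls with independent inputs are trivially safe to run in parallel.
fn fib(n: u64) -> u64 {
    (0..n).fold((0u64, 1u64), |(a, b), _| (b, a + b)).0
}

fn main() {
    let (tx, rx) = mpsc::channel();
    // Two data-independent computations, each in its own "actor".
    // They share nothing; results travel back only as messages.
    for n in [40u64, 41] {
        let tx = tx.clone();
        thread::spawn(move || tx.send((n, fib(n))).unwrap());
    }
    drop(tx); // close the original sender so the receive loop ends
    for (n, f) in rx {
        println!("fib({n}) = {f}");
    }
}
```

The message-passing discipline is what makes the parallelism safe by construction: since only owned, effect-free values cross the channel, neither thread can observe the other mid-computation, which is the same property Hemlock gets from its immutable global heap.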
Oh, here’s your floor, and there wasn’t even time to mention several tidbits that could be blog posts all by themselves, let alone wax poetic about the fundamentals I did mention. Catch you on the way down?