Work Stealing - Determinism
This is the final post in the “Work Stealing” series. In the Work Stealing - Type System post, we went over language semantics that provide our compiler all the information needed to safely perform work stealing. In this post, we’ll cover a few interesting aspects of determinism with work stealing from a systems perspective.
Little Constraints, Big Freedoms
Automatic parallelization is an optimization that is enabled by a small, simple set of language constraints. Mainstream languages can’t pull off work stealing because they lack the constraints or semantics that would let their compilers know certain optimizations are valid. An optimization that is usually valid, but not always valid, is broken and unusable. I can point to a few common language features that make work stealing and other smaller optimizations impossible.
- C has mutable global data. The compiler must invalidate registers referring to global data when returning from a function call, since the callee may have modified it. Global data may be touched by almost any function or thread (see the C sketch after this list).
- Go has no `const` keyword. You simply cannot guarantee that a callee is not modifying its parameters, so two back-to-back function calls cannot be reordered nor parallelized.
- No language that I know of keeps track of whether a function will make a system call. Even a function that uses something as simple as `time(2)` can change its output.

```python
from time import time

def foo():
    t0, t1 = time(), time()
    assert t0 <= t1
```

Reordering the two `time` calls would be disastrous. Nothing in their signatures warns the compiler about that, so it simply has to assume the same hazard for all sequential function calls.
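To make the first point concrete, here is a minimal C sketch (with illustrative names, not taken from any particular codebase) of why mutable global data gets in the way: after the call to `bump`, the compiler must assume `counter` changed and re-read it, and it cannot reorder or parallelize calls like these without proof that they do not touch shared state.

```c
#include <stdio.h>

/* Mutable global data: reachable from almost any function or thread. */
int counter = 0;

/* A caller cannot, in general, assume a callee leaves globals untouched. */
void bump(void) {
    counter++;
}

int main(void) {
    int before = counter;  /* Value of counter before the call. */
    bump();
    int after = counter;   /* Must re-read counter: bump() may have written it,
                              so a cached copy in a register is invalid. */
    printf("%d -> %d\n", before, after);  /* Prints "0 -> 1". */
    return 0;
}
```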
Hemlock, however, has all of the necessary restrictions to make work stealing possible.
- Global data is immutable. It is not possible for one function or thread to modify global data in a way that undermines the assumptions of another function or thread; updated global state may only be propagated between actors via message passing. This means Hemlock does not need to perform the same pessimistic register invalidation as C upon returning from a function call. Just as importantly for work stealing, it means the rug does not get pulled out from under actors’ feet at unexpected times (e.g. by global data mutation).
- A function is required to specify whether it mutates its inputs and when it will do so.
```
val foo: &bytes -> (unit >!&bytes-> unit)

let closure = foo bytes in (* Does not modify bytes *)
let _ = closure () (* Modifies bytes *)
```
The above example takes in mutable data, but doesn’t mutate it directly; executing the closure it returns does mutate the data. The compiler needs this knowledge to decide whether it is valid to reorder or parallelize such calls.
- A function’s observable side effects and its dependencies on anything other than its inputs are tracked by the type system. A Hemlock function that makes a system call has an `>os->` effect. It’s pretty simple: two functions that have such effects cannot be reordered nor parallelized via work stealing, because we cannot guarantee that doing so is valid.

```
val get_ns_since_epoch: unit >os-> u64

match (get_ns_since_epoch ()) <= (get_ns_since_epoch ()) with
| true -> ()
| false -> halt "Time does not flow backwards."
```
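For a bit more intuition about why effect ordering matters, here is a small C analogue (illustrative only, not Hemlock): both calls have an observable effect on the same output stream, so reordering or overlapping them would change what the program observably does.

```c
#include <stdio.h>

int main(void) {
    /* Two calls with observable effects on the same stream. */
    fputs("first\n", stdout);
    fputs("second\n", stdout);
    /* Swapping these calls, or running them concurrently, would change or
       interleave the output. Without an effect system proving independence,
       a compiler must preserve their order. */
    return 0;
}
```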
Effects on System Resources
We made the assertion that it’s safe to automatically parallelize pure computation. That’s true. Usually, the only observable effect is that the pure computation finishes faster. There are, however, effects on other system resources to consider:
- I/O (disk or network)
- Memory
The intent with Hemlock is to efficiently use as much CPU as possible. Aggressively parallelizing pure code does a great job of that. While only pure code is parallelized, the results of pure code may be written to disk or the network, so the increased speed provided by work stealing can cause a proportional increase in I/O utilization. What is great about both CPU and I/O is that hitting 100% utilization simply means we are working as fast as we can and can’t go any faster. The negative consequence is resource starvation. It’s a real problem, but easily the lesser evil compared to poor resource utilization (and the countless wasted programmer hours spent manually expressing every detail of parallelism to utilize resources better).
Unfortunately, memory utilization may also increase proportionally. Hitting 100% memory utilization means we get allocation failures if we try to use any more. The negative consequence is process failure. There are technically-valid mechanisms to mitigate the out-of-memory failure case. For instance, it is valid to reap delegates that are not yet finished with their stolen work simply to free up system resources; it would then be up to those delegates’ associated victims to perform the work, with the delegates’ partial results lost. It would be an unfortunate sacrifice to make, and ultimately counterproductive.
While considering mitigation strategies, we realized that we should avoid performing such heroics within the Hemlock runtime. For a short time, our perspective regarding automatic parallelization and program correctness was flawed. The natural paradigm shift that Hemlock provides is that automatic parallelization is the baseline correct behavior and need not be considered magic. In Hemlock, if a program runs out of memory, it is a problem with the program that should be fixed, not an instability to blame on work stealing.
Hemlock will have sane mechanisms to avoid out-of-memory failures due to hugely parallelized pure computation. Supervisor startup strategies will enable programmers to set more targeted memory and concurrency limits on delegates. With that, it should be straightforward for programmers to tame work stealing so that it does not run rampant and take down the system.