BranchTaken

Hemlock language insights

Work Stealing - Actors

In this post, we’ll show how we support work stealing from within Hemlock actors’ native stacks. The examples here are directly based on the sum function example in the Work Stealing - Intro post. In summary, it’s a recursive divide-and-conquer implementation that sums sub-sections of an array uns together. We’ll go with an example where the input is very large.

let arr = gen_random_vals 1_000_000_000
let s = sum arr

Given that this is a recursive divide-and-conquer approach, each level of recursion accounts for a logarithmically decreasing slice of the overall work to do. The top few levels of the execution stack alongside the post-order work tree would look as follows early on in the execution of sum.

# Ranges (e.g. 0..1B) represent the indices in `arr` to sum.
+-------------+
| 0..1B       |                   ____________0..1B............
+-------------+                  /                             '
| 0..500M     |            ___0..500M...                   500M..1B
+-------------+           /             '
| 0..250M     |       0..250M       250M..500M
+-------------+      /       '
| 0..125M     |  0..125M   125M..250M
+-------------+
|repeat...    |
:             :

On a single processor, the entire work tree would indeed be executed post-order by a single actor. However, we already established that each subtree is pure work, so each sub-tree is computable in parallel and that results from the left and right subtrees can then be merged. In a stack like this, stealing the pure right sub-trees near the top of the stack and calculating them in parallel would net us a huge performance win. By itself, the right sub-tree of the top frame accounts for half of the overall work. In this example, that’s a lot.

Left continuation/right continuation/quantum generation

For each level down the work tree, there’s a progressively smaller amount of work to steal. At some point, the right continuation (RC) has so little work to do that it’s faster to not steal it since the overhead in stealing an RC outweighs potential gains. Some languages like Deterministic Parallel Java allow the programmer to hint to the runtime how to determine whether there’s enough work to steal. In Hemlock, determining whether work is worth stealing is automatic via a simple heuristic. An RC that has already survived on the stack for some amount of time is likely worth stealing. The language makes heavy use of map and reduce algorithms, and this rule of thumb works especially well in these cases. If the left sub-tree (left continuation (LC)) takes a long time, the RC should take a similarly long time. If the LC takes very little time, the actor simply executes the RC before it can be stolen, just like you’d expect of back-to-back function calls in C.

The actor tracks how long an LC has been executing by embedding a marker in the execution stack, what we call a quantum generation (QG). The QG allows the executor to track how many quanta an RC survives on an actor’s stack. The executor is able to initialize the value and later inspect each QG on an actor’s stack. If the RC survives long enough (e.g. some number of quanta as indicated by the value of QG), it’s a candidate for work stealing. RCs lower in the stack have survived longer and (especially due to the nature of divide-and-conquer algorithms) are considered better work-stealing candidates than RCs higher in the stack.

Caller/callee-saved registers

QG isn’t the only bit of information we need to embed in the execution stack to make work-stealing RCs possible. In normal exeuction when work isn’t stolen, an actor returns from a function call (the LC) and calls the next function (the RC). There is no data dependency between the LC and RC, so the return value of the LC isn’t needed for the RC. However, register state between the LC and RC may be important. Normal register spill and fill mechanisms that we’re familiar with in C work likewise in Hemlock to ensure correctness when returning from the LC and calling into the RC. To steal the RC though, we need to keep enough information around to bootstrap the stealing actor with the correct register state as if it had just returned from the LC.

Some register state is already stored on the stack. In normal calling conventions, there are some registers that are callee-saved and some that are caller-saved. Like C, Hemlock saves the caller-saved registers on the stack before making any function call. However, it’s also normal for functions to use callee-saved registers as long as it restores that state before returning. The only way for Hemlock to have that information at the time it steals an RC is to also save any important callee-saved registers to the stack along with the caller-saved registers. With that, we’ve put enough on the stack to find the best work-stealing candidate RCs and actually start up the actor that executes the stolen work.

Merging results

The last thing to handle for work stealing is to change the behavior of an actor upon returning from an LC and finding that the RC has been stolen. A stolen RC always returns before the LC. That means there are only two cases when returning from an LC. Either the RC wasn’t stolen or the RC was stolen and the result is already waiting. Fortunately, it’s pretty easy to change the behavior for the actor returning from the LC. When stealing the RC, Hemlock also rewrites the program counter (PC) associated with the victim’s stack frame. Instead of a PC that directs the actor to immediately execute the RC, it has a rewritten PC that directs the actor to merge the already-completed RC result and put the stack into the correct state as if it had just completed the RC.