Scheduling Multithreaded Programs on Bandwidth-Limited Networks

by Bradley C. Kuszmaul

I show how to run multithreaded programs on a DRAM (Distributed Random Access Memory) parallel computer and demonstrate that such programs can run efficiently on a collection of machines distributed across thousands of miles over the internet. Suppose we have a fully strict multithreaded program has work $T_1$ and critical-path length $T_\infty$, and we have a $P$ processor DRAM machine with $\lambda$ an upper bound to the cost of routing any permutation. I present a deterministic conservative DRAM scheduling algorithm that runs in time $O((\lambda+\log P)(T_1/P+T_\infty))$ and a randomized conservative DRAM scheduling algorithm that runs in time $T_1/P+O((\lambda{}+\log P)T_\infty)$. Mike Bernstein and I have modified the Cilk multithreaded runtime system to use our randomized conservative DRAM scheduler. Surprisingly the modified system, called TreeCilk, often achieves a performance improvement when one 2000-mile-away machine is added to a tightly-bound cluster of machines. Also, surprisingly, the bandwidth required across the long-haul internet links is only a few hundred bytes per second for the knary benchmark.