Sunday, September 23, 2012

ScalPL for exaFLOpS

I stumbled across this interesting video, "Jack Dongarra: On the Future of High Performance Computing", from the recent SPEEDUP Workshop at ETH Zurich.  I listen to Jack at almost every chance I get, and if you want to know where the very highest-performing machines are going in the next 20 years, he tells you here.  I highly recommend watching the whole talk.

On slide 15 (starting at about 30:35 in), he lays out some of the critical issues he sees for peta- and exascale computing.  At the risk of copyright infringement, I'm going to list his six bullets here:

  • Synchronization-reducing algorithms
  • Communication-reducing algorithms
  • Mixed Precision methods
  • Autotuning
  • Fault resilient algorithms
  • Reproducibility of results

ScalPL (Scalable Planning Language, explained in this blog and the new book) addresses five of these six points.  (The one it doesn't address is "Mixed Precision methods": after hearing him speak on that topic last year at SC11, it looks too closely tied to the algorithm itself and to numerical analysis/methods work to benefit much from within the runtime system.)

To be fair, it appears that Dr. Dongarra is often referring to algorithm development to address many of these issues, but tools and runtimes can offer significant leverage.  For example, virtually all of ScalPL is centered on expressing the algorithm in a platform-independent form so that synchronization and communication can be dynamically optimized and reduced to the minimum required by the algorithm and platform themselves.  That addresses the first two points.  For the autotuning point, (1) "actons" (processes, threads) within ScalPL are functional, and can therefore be independently profiled and modeled to predict their behavior for optimal scheduling, and (2) the scheduling of those actons in a "strategy" (network) can be traced/instrumented efficiently (after static analysis) to help such analyses.
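
To make that autotuning point a bit more concrete, here is a toy Python sketch (my own illustration, not ScalPL code): because a functional task has no side effects, its candidate implementations can be timed in isolation and the measurements used to pick one for the real run.  The dot_zip/dot_index variants are hypothetical stand-ins for acton implementations.

```python
import time

# Hypothetical kernel variants that compute the same result; a real
# system would profile actons and feed the measurements to its
# scheduler rather than hand-rolling this.
def dot_zip(a, b):
    return sum(x * y for x, y in zip(a, b))

def dot_index(a, b):
    total = 0.0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total

def timed_run(fn, args):
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start

def autotune(variants, sample_args, repeats=3):
    """Because each variant is functional (no side effects), it can be
    timed in isolation and the fastest one chosen for the real run."""
    return min(variants,
               key=lambda fn: min(timed_run(fn, sample_args)
                                  for _ in range(repeats)))

vec = [float(i) for i in range(100_000)]
best = autotune([dot_zip, dot_index], (vec, vec))
print("fastest variant:", best.__name__)
```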

ScalPL really kicks in for the last two points.  It can theoretically help with fault detection (e.g., by comparing results of duplicate executions), but that aspect will likely be more effectively addressed in hardware.  However, when faults are detected, ScalPL provides (through a technique the book calls supplemental re-execution) a means of preserving the work already done, with no global checkpoint/restart and only limited data redundancy to emulate safe storage for resource contents.  And as for reproducibility of results, the book devotes an entire chapter to the implications of determinism and ways to guarantee it, even categorizing the types of nondeterminism.
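
To illustrate the recovery idea in miniature (this is my own toy Python, not the book's supplemental re-execution mechanism): as long as a failed task's inputs are still held somewhere safe, a fault can be handled by re-running just that task, rather than rolling the whole computation back to a global checkpoint.  The flaky_scale task and its simulated failure are invented for the example.

```python
import random

def run_with_reexecution(task, inputs, max_attempts=3):
    """Re-run a failed functional task from its retained inputs
    instead of restarting the whole computation."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(*inputs)
        except RuntimeError as fault:
            print(f"attempt {attempt} failed: {fault}; re-executing")
    raise RuntimeError("task failed on every attempt")

def flaky_scale(vector, factor):
    # Simulated transient fault: fails roughly half the time.
    if random.random() < 0.5:
        raise RuntimeError("simulated node failure")
    return [factor * x for x in vector]

print(run_with_reexecution(flaky_scale, ([1.0, 2.0, 3.0], 10.0)))
```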

Later in this same talk, at about 35:30, Dongarra talks about using a DAG (directed acyclic graph) of dependences between tasks to dynamically schedule the larger problems.  This is precisely what ScalPL is made for.  (The book generally just refers to these DAGs as "computations".)  In fact, I mentioned in a recent blog post that even homogeneous machines can benefit from dynamic dataflow scheduling instead of static, synchronized "loop at a time" scheduling.  (I was going to call it "fork-join parallelism" in that post, as Jack does in his talk, but web references to that term often confuse it with a form of recursion.)  Dongarra's talk here illustrates exactly what I was talking about.
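
For readers who want the flavor of that kind of scheduling in code, here is a minimal Python sketch (again my own illustration, not ScalPL, and the task names and DAG are made up): each task fires as soon as the tasks it depends on have produced their results, with no global barrier between "loops".

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

# A toy task DAG: each task lists the tasks whose outputs it needs.
DAG = {
    "load_a":   ([], lambda: 2.0),
    "load_b":   ([], lambda: 3.0),
    "square_a": (["load_a"], lambda a: a * a),
    "square_b": (["load_b"], lambda b: b * b),
    "combine":  (["square_a", "square_b"], lambda x, y: x + y),
}

def run_dag(dag):
    """Fire each task as soon as its dependences are satisfied,
    instead of synchronizing everything between phases."""
    results, pending = {}, {}
    with ThreadPoolExecutor() as pool:
        while len(results) < len(dag):
            for name, (deps, fn) in dag.items():
                ready = (name not in results and name not in pending
                         and all(d in results for d in deps))
                if ready:
                    args = [results[d] for d in deps]
                    pending[pool.submit(fn, *args)] = name
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                results[pending.pop(fut)] = fut.result()
    return results

print(run_dag(DAG)["combine"])   # 13.0
```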

(I might mention that Dongarra and I have been on parallel paths on these issues for quite a while.  Back in the late '80s, he and Danny Sorensen were working on a tool called SCHEDULE to facilitate such DAG-related scheduling, while Robbie Babb and I were working on LGDF/Large Grain Data Flow to help automatically generate such DAGs in an architecture-independent fashion.  Both his work in this video and ScalPL in my book seem to be natural progressions of those earlier efforts.)

I guess the point I'm hoping to get across here is that ScalPL's best days are ahead of it.  It is here to address the issues of the coming decades of computing.
