Elemental Parallelism: ScalPL As Straw Man (or Channeling Bill Dally)

Recent interaction with a former (and future?) colleague prompts me to clarify some positions expressed here. For example, some might interpret my frequent mention of ScalPL (a language/technique which I developed) as a sweeping proclamation of its superiority over other approaches. Rather, my intent is to highlight a number of issues which must be addressed for truly high-performance computing (e.g. exascale), and then to use ScalPL as one potential way to address them. It is, in a sense, a straw man: If you have another approach which handles these issues better, I'd be very interested in hearing about it. Even if you don't know of a better way, I will not be surprised to see someone (even possibly me) recommending improvements to ScalPL, perhaps relegating ScalPL in its current form to a stepping stone to something better, or just sort of an architecture-independent intermediate language.

What frustrates me is to instead hear "I agree that we need to address those issues, and I don't know of any other way to address all of them, but don't bother me with yet another programming approach (or at least, not one so different from existing approaches)." Or, simply, "I don't think we'll need to address those issues", with no further support for that assertion.

So, what are the issues that I claim must be addressed? I hit them one at a time in the book, but I am just as happy to defer here instead to Bill Dally. Many posts ago, I mentioned one of his talks that I saw at SC12. More recently, I found a video made by him at about the same time on the same topics, as well as another of a much more recent talk. Obviously, I didn't design ScalPL based on these talks, but for the sake of argument, let's say I did.

I may not be the most objective listener, but what I hear Dr. Dally saying here is that:

Exaflops machines cannot afford to spend energy (and expend heat) for OoO logic to address latency issues (on the fly) that could instead be handled statically (in advance) by programming, preprocessing, and optimization -- i.e. by scheduling the data which will be needed close to the computation before the computation needs it.
Significant concurrency and locality information will need to be extracted from programs to achieve exaflop performance goals.
Programs should not be expressed in a way that would require them to be recoded for new architectures. Instead, the program should code the desire algorithm, optimization tools should then sensibly embed that algorithm on a particular platform, and (if necessary) a human tuning expert can further optimize those embeddings and/or scheduling by hand, still (optimally) without modifying the initial coding.

The last two points (at least) are not so different from what many of us have been saying for a very long time. As for the second point, regardless of whether the programmer expresses the concurrency and locality explicitly, or a subsequent tool or language processor extracts them from a more obfuscated representation in the program, at some point, there should be a representation which embodies these factors.

I admit, I have taken some of these a step further than Dr. Dally does in these talks. For example, if we're not asking processors to stall or context-switch (for any significant period) to compensate for latency, should we ask them to do so for inter-process communication? I think not. Fortunately, they're related: If you know what data a computation (i.e. thread snippet or "threadlet") will need before it needs it, including that protected by critical sections, you might just as well wait until those sections are also available before starting execution. That is also a step toward making each of the snippets/threadlets "two-phase" (accessing all inputs before producing any outputs), and therefore sequentially consistent, easing subsequent reasoning about their combined execution.

Then there's the form of communication between these snippets/threadlets. Message passing exacts overhead, unneeded if the snippets/threadlets end up close to each other, and especially in cases where information could just be shared (without any copying). In a sense, it cements in the granularity of these snippets: They can never again be recombined into larger grains to make that overhead disappear. Some sort of PGAS model makes more sense, given the previous discussion (e.g. with data controlled by critical sections), especially if it can be made to act like message queuing where useful, to use queuing and copying if and only if it compensates for latency or increases concurrency. (ScalPL gets the best of both worlds through careful formal semantics that can be implemented in different ways while having the same logical outcome.)

There are a few other more tangential issues, like also avoiding overhead related to checkpoint/restart and/or deadlock prevention/detection. But those sort of fall out automatically, once you handle the other issues cleanly. That is, if the computation is broken into deterministic snippets/threadlets, and they're two-phase, then as long as we keep old (replicated) input data around sufficiently long, we can recreate new (unreplicated) data through re-execution until it is adequately replicated, without time consuming snapshots and back-outs. And if we know all of the data needed by a snippet before it executes, then deadlock can be completely avoided by atomically acquiring it all as part of initiating execution.

A rather subtle point of all of this is that these preconditions to execution of a snippet/threadlet are (practically) sufficient in themselves: By making our critical sections (locks) just a little more flexible by adding more states if necessary (other than just "locked" and "available"), we can toss control state completely -- or, put another way, these new lock states become the control state in a more distributed fashion, in deciding when a particular snippet/threadlet should execute.

That's the brunt of ScalPL. The graphical syntax is just a natural representation, once it is determined that the execution of any specific snippet/threadlet ("tactic" or "acton" in ScalPL) is determined completely by the states of the data items ("resources" in ScalPL) that it accesses. Obviously, there are other ways to address the main goals than the ways ScalPL has chosen, but I will be surprised to see another so inclusive, elegant, and flexible. Other constructs in ScalPL are mostly there for software engineering purposes (modularity, OO). I would really like to see other people propose other ways to address all of these issues, or to suggest improvements to ScalPL without violating them -- or, heck, to explain why it's unnecessary to address them.

Elemental Parallelism

Thursday, May 01, 2014

ScalPL As Straw Man (or Channeling Bill Dally)

No comments: