Wednesday, March 20, 2013

How ex-treme is exa-scale?

[I drafted this some months ago, but for some reason didn't post it. It still looks good to me, so here goes.]

At the SC12 conference in November, I attended a couple of BoFs (Birds of a Feather meetings) regarding exascale. The first, an Open Community Runtime (OCR) for Exascale, was helpful in providing pointers, such as to this website, which points to slides, which in turn led me to sites for Intel's Concurrent Collections (CnC) (including UTK) and the University of Delaware Codelets/Runnemede work. But I found myself wincing at some of the things said during the BoF, and after looking at these sites and documents, I admit to being disappointed and, frankly, frustrated. That feeling became even more pronounced in the second BoF, Resilience for Extreme-scale High-performance Computing. (In hindsight, there are others I wish I had caught.)

First, I would contend that even if we were to come up with a design for The Correct Programming Model(tm) and The Correct Runtime Support(tm) today, it could still realistically take until 2020* or so to develop the appropriate software, support, migration paths, and software engineering techniques (debugging, tracing, patterns, methodologies, libraries, etc.) to make them useful. So there's no time to lose. And from what I was seeing, these groups are not only a long way from having The Correct design; in some ways they seem to be moving backward. More specifics about that in an upcoming blog entry.

(*As I write this, I find that the 2020 goal may be more like 2022, but my argument still holds.)

Second, in order to come up with The Correct design, we must understand the critical constraints, not saddle ourselves with non-critical ones and unwarranted assumptions. This is where people like Seymour Cray (starting with "a clean sheet of paper") and Steve Jobs (whether considered a visionary, big thinker, or "tweaker") excelled. Instead, in these meetings, I was hearing "requirements" such as the ability to run MPI programs effectively, or the desirability (necessity?) of using checkpoint/restart to achieve resilience. Even if backward compatibility is desirable in some cases, and existing practice is a useful jumping-off point, we will get nowhere by using them to guide the design.
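(For context, the checkpoint/restart practice being treated as a requirement looks roughly like the following minimal sketch in Python. The save_state/load_state names, the pickle-to-disk format, and the checkpoint interval are my own illustrative assumptions, not any particular library's API; real HPC checkpointing is far more elaborate, but the shape is the same: periodically dump state to stable storage, and on restart resume from the last dump.)

    import os, pickle

    CHECKPOINT = "state.ckpt"      # hypothetical checkpoint file name
    INTERVAL = 100                 # steps between checkpoints (assumed)

    def save_state(step, state):
        # Dump the whole application state to stable storage.
        with open(CHECKPOINT + ".tmp", "wb") as f:
            pickle.dump((step, state), f)
        os.replace(CHECKPOINT + ".tmp", CHECKPOINT)   # atomic rename (Python 3.3+)

    def load_state():
        # On restart, resume from the last checkpoint if one exists.
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT, "rb") as f:
                return pickle.load(f)
        return 0, {"x": 0}         # otherwise start fresh

    step, state = load_state()
    while step < 10000:
        state["x"] += 1            # stand-in for the real computation
        step += 1
        if step % INTERVAL == 0:
            save_state(step, state)

The point is not that this pattern is wrong, but that writing it into the requirements fixes the fault-tolerance mechanism before the exascale runtime design has even begun.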


Third, but strongly related to the above, it struck me that I was seeing lots of familiar faces in these BoFs, mostly from my days (a decade-plus ago) on the MPI-2 Forum. And while reconnecting with former colleagues is part of what the SC conferences are all about, a standards-based approach is almost certainly not the best way to make significant advances in technology. New approaches will be required to achieve the goals, and standards for those can be established only after the goals are attained. (I could further argue that existing standards have artificially interfered with our progress in this field in other ways, but I'll save that rant for another day, except to say that standards committees are constantly reminded that their job is to standardize existing practice, often while maintaining compatibility with previous versions of the standard. That's not a recipe for quantum leaps!)

So what is the most effective way to achieve the exascale software goals? One might take a hint from Seymour Cray, in the link above: "Shunning committees, he felt that the best computers were the ones where a single architect offered a unified vision. After the machine had been delivered, it was then appropriate, Mr. Cray felt, to listen to feedback from customers and, if necessary, start over from 'a clean sheet of paper.'" I claim that there is no reason to assume that the same wouldn't also be true for software.

In other words, try some big ideas, in multiple (at least two) stages; assume that you may not succeed in the early (e.g., first) attempts, but that you will learn a lot nonetheless; and integrate what you learn into later stages. That is, in fact, how we fashioned the strategy for NASA's Information Power Grid (back when I had some influence there)... before that plan was apparently revised (and discarded?).

Of course, it could be argued that this "think big with a master architect" approach is precisely the one applied in DARPA's HPCS program, which funded individual companies and allowed them full control over their product/proposal. It could also be argued that that program has had limited success (though I would hardly call it a failure). And to those arguments, I would counter that the goals that were not conservative were underspecified. There was little advantage for the participants in adding objectives beyond those the sponsors provided, and in fact a disadvantage: their competitors for the funds would then be working toward simpler goals.

If failure is not an option, then the goals are too conservative. Consider the manned lunar program: HUGE goals, and a really concrete way to measure whether we'd achieved them. We were proud to meet them, but success was by no means assured. The exascale project has set some very high hurdles regarding power consumption/energy and reliability/resilience, which seem to fit this model, but, again, the goals I've seen on programmability are either conservative or vague or both. Of course, in a political climate, failure to meet a goal can be painful. And where there is public money, there is (and should be) politics.


And, as with the moon landing, the trade-off between the expense and the benefits of attempting to meet (or actually meeting) the goals is a separate issue from whether the goals can be met at all, and the balance is not necessarily clear-cut. If the US government is the customer, it must decide how much this is worth, and other participants must decide whether it fits their business (and/or research) model. The manned space program was run from within the government; such a model seems unlikely today.
