Friday, September 07, 2012

So why parallel (instead of concurrent)?

My last blog entry focused on the ongoing confusion between parallelism and concurrency.  At the very least, I hope I convinced you that parallelism is a subset of concurrency -- i.e. that all parallel programming is also concurrent programming.  So, what, specifically, are parallel programmers trying to achieve beyond concurrency, and why?

A little history is in order.  Parallel programming emerged in an age where the number of practical architectures/platforms for supporting concurrency was limited, as was the technology for programming them.  By and large, these platforms consisted of a handful to hundreds of identical (homogeneous) sequential processors tied together with a uniform communication network, to support either a shared memory or message passing paradigm.   Typically, the entire platform (or some well-defined portion of it, in an approach called space-sharing) was devoted to running one program at a time.

To make the most of these exclusively-allocated processors while reining in complexity, approaches like SPMD (Single Program, Multiple Data) or SIMD (Single Instruction, Multiple Data) were used to program these machines, where the same instructions (or at least identical programs) ran on several processors at the same time, each processor operating on a different part of a larger data set (or parameter space).  Programs were often written using an approach called loop-level (or data) parallelism, where all processors started on some part of a task (e.g. processing some data structure) at about the same time, and generally delivered their contributions to some specific processor when finished.  This was often facilitated by assigning different parts of large data structures (typically arrays) to different processors, thereby determining where operations on those structures' elements would be performed.  Communication was optimized for these so-called "collective" cases so that information could be efficiently distributed to and/or collected from all the processors at once.  A program consisted of one or more of these collective distribute-compute-collect steps.
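As a concrete (if anachronistic) sketch, here is that distribute-compute-collect step in modern Python.  The function names are mine, and Python threads stand in for the dedicated processors of the era; because of Python's GIL, this illustrates the structure of the paradigm rather than any actual speedup:

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # Identical code runs on every "processor" (here, a thread),
    # each on its own slice of the data -- the SPMD idea.
    return sum(x * x for x in chunk)

def distribute_compute_collect(data, n_workers=4):
    # Distribute: assign a contiguous slice of the array to each worker.
    size = (len(data) + n_workers - 1) // n_workers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # Compute: all workers run the same code concurrently on their chunks.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    # Collect: one place combines the partial results.
    return sum(partials)

print(distribute_compute_collect(list(range(8))))  # prints 140, i.e. 0+1+4+...+49
```

An entire program in this style is just a sequence of such steps, with a collective communication (the `pool.map` and the final `sum` here) marking the boundary of each.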

Software tools and languages facilitated this approach, either via SIMD languages (like Thinking Machines' C*) on specialized architectures, or, more commonly (to whatever extent feasible), by executing different iterations of a normally-sequential loop on different processors concurrently, instead of in their default sequential order, using new languages (or extensions to traditional sequential ones) like HPF and OpenMP.  Tools such as The Force, BSP (Bulk Synchronous Parallel), and the collective operations in MPI also supported this distribute-compute-collect paradigm.  All in all, this assumption that a program had an entire, well-defined, homogeneous machine to itself, and that the programmer was largely responsible for exploiting that machine (in part by optimizing the scheduling of operations on its different processors) to speed up the result, came to be known as "parallel programming".

Fast forward to the present (and, if you like, the future).  Processors are ubiquitous.  Communication is ubiquitous.  PCs, tablets, and even phones have multicore chips (i.e. multiple processors).  GPUs (Graphics Processing Units), originally intended for, well, graphics, are being retargeted for more general computation where there is a good application/architecture match.  Our music, pictures, documents, etc. are often stored and processed in "the cloud", along with web searches and apps, on perhaps thousands of individual (generally PC-sized) machines handling perhaps millions of requests from different users.  The number of processors in these larger platforms (e.g. in the cloud), together with the rate of technological advancement, makes the economics of homogeneity restrictive:  rather than replacing the entire platform at once when parts fail or become outdated, it is often more economical to replace them piecemeal, with the technology of the day, leading to a hodgepodge of different technologies and characteristics.  Virtually all the assumptions of traditional parallel computing are out the window, and with them, the effectiveness of traditional parallel programming approaches.

But it's worse than that.  Even if platforms were still largely uniform collections of homogeneous processors with a uniform topology and roughly uniform latency (memory access time) -- and some still are -- those traditional parallel programming approaches would still be suboptimal, as they always were.  First, the goal of high-level programming (and algorithms) from its very inception has been to divorce the correctness/viability of a program or algorithm from the architecture used to implement it, and these parallel programming approaches fail that test, often depending on both the number of processors and the latency and semantic characteristics of their interconnection (e.g. message passing, NUMA or UMA shared memory).  Second, limiting concurrent expression to fork-join or one-loop-at-a-time execution often forgoes the opportunity to fully exploit the available processors, such as when multiple loops, or a loop and other less-parallel sections, could execute concurrently:  these approaches were largely chosen as a quick-and-dirty way of reining in complexity, and are relatively unsuitable for a large class of programs (e.g. those employing software pipelining or other kinds of functional decomposition).  Concurrency can scale with data without being expressed as "non-looping loops".  And third, the time for any program to achieve a solution includes the time it must wait for a platform to become available before it can even start running, which goes way up if an entire machine must be dedicated to one application at a time (or even space-shared between them).
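To make the second point concrete, here is a minimal sketch (names are mine) of a two-stage software pipeline in Python.  While the second stage handles item i, the first stage is already working on item i+1 -- concurrency that a strict one-loop-at-a-time, fork-join structure would serialize:

```python
import queue
import threading

def pipeline(items):
    # Stage 1 and stage 2 run concurrently, connected by a queue:
    # a form of functional decomposition that fork-join loop
    # parallelism does not express.
    q = queue.Queue()
    results = []

    def stage1():
        for x in items:
            q.put(x + 1)           # first transformation
        q.put(None)                # sentinel: no more work coming

    def stage2():
        while True:
            x = q.get()
            if x is None:
                break
            results.append(x * 2)  # second transformation

    t1 = threading.Thread(target=stage1)
    t2 = threading.Thread(target=stage2)
    t1.start(); t2.start()
    t1.join(); t2.join()
    return results

print(pipeline([1, 2, 3]))  # prints [4, 6, 8]
```

Nothing in this structure depends on the number of processors or their interconnection; the available concurrency is simply there for the platform to exploit (or not).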

So then, what, really, is the motivation for using these limited parallel programming approaches over concurrent ones?  I would claim: Primarily laziness.  Laziness in developing truly flexible expressive ways of representing concurrency algorithmically, independent of platform, that encompass loop-level parallelism as well as other kinds of concurrency, and in developing the necessary systems software (or middleware) and methodologies to have the resulting programs run (and share platforms) efficiently.

A programmer should reveal the available concurrency in their program (i.e. their algorithm), and the platform (hardware, system software, and middleware) should exploit it in the context of current conditions, including the availability of processors and the other load on those processors.  If a human needs to inject some guidance, some high-level knowledge, to facilitate that algorithm-to-platform mapping, then fine, but let it be in a separate step, rather than within the expression of the algorithm itself.  We have been fully accustomed to exactly that algorithm/platform independence in sequential programming for decades.  It's time to get real in the concurrent world -- and I would argue that approaches like Scalable Planning Language (ScalPL) allow us to do just that.
