Sunday, September 23, 2012

ScalPL for exaFLOpS

I stumbled into this interesting video, "Jack Dongarra: On the Future of High Performance Computing", from the recent SPEEDUP Workshop at ETH Zurich.  I listen to Jack at almost every chance I get, and if you want to know where the very highest performing machines are going in the next 20 years, he tells you here.  I highly recommend watching the whole talk.

On slide 15 (starting at about 30:35 in), he lays out some of the critical issues he sees for peta and exascale computing.  At risk of copyright infringement, I'm going to list his six bullets here:

  • Synchronization-reducing algorithms
  • Communication-reducing algorithms
  • Mixed Precision methods
  • Autotuning
  • Fault resilient algorithms
  • Reproducibility of results

ScalPL (Scalable Planning Language, explained in this blog and the new book) addresses five of these six points.  (The one it doesn't address: "Mixed Precision methods".  After hearing him speak on that topic last year at SC11, it looks too closely aligned with the algorithm itself and numerical analysis/methods work to benefit much from within the runtime system.)

To be fair, it appears that Dr. Dongarra is often referring to algorithm development to address many of these issues, but tools and runtime can offer significant leverage. For example, virtually all of ScalPL is centered around expressing the algorithm in a platform-independent format so that synchronization and communication can be dynamically optimized and reduced to the minimum required by the algorithm and platform themselves.  That addresses the first two points.  For the autotuning point, (1) "actons" (processes, threads) within ScalPL are functional, and can therefore be independently profiled and modeled to predict their behavior for optimal scheduling, and (2) the scheduling of those actons in a "strategy" (network) can be traced/instrumented efficiently (after static analysis) to help such analyses.

ScalPL really kicks in for the last two points.  It can theoretically help with fault detection (e.g. by comparing results of duplicate executions), but that aspect will likely be more effectively addressed via hardware.   However, when faults are detected, ScalPL provides (through a technique called supplemental re-execution in the book) a means of preserving the work that has been done, with no global checkpoint/restart, and limited data redundancy to emulate safe storage (for resource contents).  And as for reproducibility of results, the book contains an entire chapter examining the implications of determinism and ways to guarantee it, even categorizing types of nondeterminism.

Later in this same talk, at about 35:30, Dongarra talks some about using a DAG (directed acyclic graph) of dependences between tasks to dynamically schedule the larger problems.  This is specifically what ScalPL is made for.  (The book generally just refers to the DAGs as "computations".)  In fact, I mentioned in a recent blog post that even homogeneous machines can benefit from dynamic dataflow scheduling instead of static synchronized "loop at a time" scheduling.  (I was going to call it "fork-join parallelism" in that post, as Jack does in his talk, but web references to that term often confuse it with a form of recursion.)  Dongarra's talk here illustrates exactly what I was talking about.
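To make the contrast concrete, here's a toy sketch (plain Python, with invented names -- nothing to do with ScalPL's actual notation or Dongarra's software) of dynamic DAG scheduling: each task is dispatched as soon as the tasks it depends on have finished, with no global barrier between "loops":

```python
from collections import deque

def dag_schedule(tasks, deps):
    """tasks: {name: callable}; deps: {name: set of prerequisite names}.
    Runs each task once its prerequisites are done; returns dispatch order."""
    remaining = {t: set(deps.get(t, ())) for t in tasks}
    dependents = {}
    for t, d in remaining.items():
        for p in d:
            dependents.setdefault(p, set()).add(t)
    ready = deque(t for t, d in remaining.items() if not d)
    order = []
    while ready:
        t = ready.popleft()        # a real runtime would hand this to any idle processor
        tasks[t]()
        order.append(t)
        for s in dependents.get(t, ()):
            remaining[s].discard(t)
            if not remaining[s]:   # last prerequisite just finished
                ready.append(s)
    return order
```

In a real dataflow runtime the ready tasks would run on different processors simultaneously; the point is that readiness is driven by individual dependences, not by a loop-wide synchronization.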

(I might mention that Dongarra and I have been on parallel paths on these issues for quite a while.  Back in the late 80s, he and Danny Sorensen were working on a tool called SCHEDULE to facilitate such DAG-related scheduling, while Robbie Babb and I were working on LGDF/Large Grain Data Flow to help automatically generate such DAGs in an architecture-independent fashion.  Both his work in this video, and ScalPL in my book, seem to be natural progressions of each of those.)

I guess the point I'm hoping to get across here is that ScalPL's best days are ahead of it.  It is here to address the issues of the coming decades of computing.

Tuesday, September 11, 2012

Scalable Planning Language (ScalPL) = F-Nets + ...

I recently described F-Nets, as well as some reasons I preferred that model to serve as the basis for concurrent/parallel programming.  But I obviously didn't think it was sufficient on its own, or I would have stopped there. So, just what is ScalPL, and how is it different from (and superior to) F-Nets for programming?  It starts with F-Nets, and adds...

Modularity
Instead of activating just tactics within a context, in ScalPL an entire F-Net fragment (i.e. a network of tactics and resources) called a strategy can activate within a context.  In other words, a strategy combines other constructs into a single entity/module, and ScalPL defines how the activity (resulting from its activation) interfaces to its context's surroundings (via the role bindings on that context). Among other things, this means that we can create a module (strategy) to behave any way we like with regards to patterns of accepting inputs or producing outputs, or maintaining internal state, or acting (non)atomically or (non)deterministically, rather than being tied to only atomic deterministic modules (i.e. tactics).

First Class Modules
Together, tactics and strategies are called plans.  And since plans (before they are activated) are just a form of static data/information, they can be stored on resources.  Instead of labeling a context with the plan that is to be activated within it, a strategy can be asked to activate the plan that is present on one of its resources.

Clean Termination, Atomic Activation
Since strategies can activate, they should also be able to terminate -- so that they can be reactivated (cleanly) or something else can activate in the same context.  Also, it should be easy to control the activation of a strategy the same way as for a tactic, and in particular, to make it easy to activate a strategy atomically.  ScalPL addresses both of these in simple, logical ways.

Object Orientation (Class Based)
Since a strategy is a first-class entity encapsulating both behavior (i.e. contexts containing plans) and state (i.e. resources), and having a well-defined interface, it is almost an object.  However, to truly be an object, it needs to be able to carry its state with it as it is activated in different contexts, so ScalPL introduces instantiation levels (and instantiation) for resources.  With these, and a few methodologies to deal with them wisely, ScalPL becomes an object-oriented language, with meta-classes.

Arrays
Inherent in the F-Net model is an assumption that each context is statically bound to a fixed set of resources, and while that may be theoretically sufficient, it isn't practically.  At the same time, allowing the role bindings to be determined after activation re-introduces all sorts of complications (with deadlock, etc.) that the F-Nets execution model cleared up.  ScalPL bridges the gap on this by introducing array resources (i.e. sets of resources, each an element of the array, addressable via indices) and a method of specifying, as part of the activation itself, which resource elements a context is to bind to.

Data Concurrency/Parallelism
Having arrays isn't enough for data concurrency:  For that, one needs a way to scale the amount of concurrency with the number of elements being processed.  ScalPL provides ways to duplicate/clone contexts, and to specify how many plans they should activate collectively (e.g. one among them all, or one each).

Dynamic Resource Allocation
Since there is no specified spatial relationship between different elements of a resource array (since they could even end up on completely different processors), and the initial control and content state of each resource element is known before/until it is accessed, resource arrays can be (and are, by default) infinite in size, and only those elements actually accessed (and modified) are tracked.  This means that elements can be allocated only as they are accessed (i.e. bound to).
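For illustration, here's a rough sketch (hypothetical names, plain Python -- not ScalPL) of such a lazily-allocated resource array, where only the elements actually bound to are ever materialized:

```python
class ResourceArray:
    """A conceptually infinite array of resources; an element comes into
    existence (with default control and content state) only on first access."""

    def __init__(self, default_content=None, default_control="green"):
        self._cells = {}                       # only touched elements are tracked
        self._default = (default_control, default_content)

    def bind(self, index):
        # allocate on first access; subsequent binds see the same cell
        return self._cells.setdefault(index, list(self._default))

    def allocated(self):
        return sorted(self._cells)
```

A runtime could use any sparse mapping here; the point is that binding to element (5, 1000000) costs no more than binding to element (0, 0).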

Each of these properties corresponds roughly to a chapter in the Scalable Planning book.

Of course, many sequential OO programming languages have the above constructs and/or properties, and a typical approach to constructing a concurrent language is to start there and augment it with concurrent constructs.  That doesn't usually work very well.  Here, we've instead started with a simple computational model (F-Nets) that properly handles the concurrency issues, and have then built it into a complete language.

Monday, September 10, 2012

Why F-Nets

Now that I've provided a short and sweet description of the F-Nets computational model, here are some reasons I think the F-Net model serves as an excellent starting point for concurrent programming (and specifically why I used it as the basis of ScalPL).



Architecture-Independent Communication
The memory model in F-Nets (expressed in the semantics of its resources) does not restrict or favor how data gets moved from one place to another, such as message passing or shared memory (or something in between).  There is no assumption that data will be copied (with the commensurate overhead) as a byproduct of communication, as with message passing, nor is there any need to mess with (and wait for) locks, as with shared memory.  Runtime systems can often dynamically trade off copying with blocking to optimize performance.

Variable Granularity
In large part because of the flexible and efficient communication model, there is virtually no penalty for using the same memory model for static data.  As a result, if communicating entities are colocated, from a granularity perspective they essentially merge into one entity.  This allows the granularity of the application to naturally increase (i.e. concurrency to shrink) to match the available concurrency in the platform.

Latency Tolerance
As in data flow models, the processing model together with the memory model does not dictate when data must be somewhere, or even specifically where it should be.  It only states (or strongly suggests) what data and code must be together in one place before computation can occur.  Data movement can all take place between tactic activations, rather than during them, so latency can be made to overlap with other computation within the runtime/scheduling system, and computation in progress never needs to stall waiting for data to arrive, consuming resources and perhaps interfering with other progress.

Language Independence
The processing and communication models ask nothing of the underlying computer language except to do what virtually all already do -- i.e. express a deterministic mapping (e.g. subroutine) from the initial values within some set of data structures (generally passed as arguments) to the final values within some (perhaps overlapping, perhaps identical) set of data structures (also generally arguments).  Well, there is one other minor requirement:  That the subroutine also assign a transition (color, control state, or name) to each of those data structures/arguments, which is easily done by adding a simple statement for that purpose to almost any imperative language.  In fact, this makes F-Nets/ScalPL a fine mechanism for combining different languages in the same application.
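To make that one extra requirement concrete, here's a hedged sketch (illustrative Python, not ScalPL syntax -- the names and the return convention are invented) of a tactic as an ordinary deterministic subroutine that also assigns a transition to each of its arguments:

```python
def accumulate(total, item):
    """A tactic over two roles: 'total' (update) and 'item' (observe).
    It maps initial contents to final contents, and additionally names
    a transition (new control state) for each role."""
    new_total = total + item
    return {"total": (new_total, "ready"),     # final content, chosen transition
            "item":  (item,      "consumed")}  # content untouched, but still recolored
```

Any imperative language could express the same thing with one added statement per role (e.g. `transition(item, "consumed")`); nothing else about the host language needs to change.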

Formal Semantics
Because of the simplicity of the memory model, and the fact that tactics are deterministic and each operate on a fixed set of resources, the model is easily described in terms of functions and sets, both axiomatically and operationally, without delving into the internal semantics of the language used to implement tactics.  (Denotational semantics of the language would be used to define the precise function associated with each tactic.)

Independence of Specification & Implementation
And because any implementation of a tactic which expresses a desired function is equivalent in terms of the semantics, the precise way it does it, and specifically the order or timing in which any implementation accesses (or declares a new control state for) different resources, is immaterial: Implementors can code to a high-level specification, with significant latitude. That's very different from other approaches to concurrency, where the ordering of operations can mean the difference between (say) deadlock and none.

Structured Concurrency
Sequential computer languages are generally called structured if the physical structure of the program (e.g. its indentation) immediately suggests (or restricts the possibilities for) its execution.  More specifically, the program fairly obviously represents a folded up form of its possible executions.  F-Nets are the extension of these same principles to the concurrent world.

Tamed Nondeterminism
Some concurrent languages (e.g. functional) don't permit the expression of nondeterminism.  Others do, but in such an undisciplined way that it can easily "creep" into programs unintentionally, and regardless of how it gets there, it's hard to tell how it will affect the overall execution.  There are clear ways to recognize possible nondeterminism within F-Nets, not only to ensure that the programmer has an opportunity to exclude it if desired, but also to optimize tracing to minimize overhead for so-called "instant replay" debugging.  And even when there is nondeterminism, the semantics still allow for the analysis of the execution as a collection of functional evaluations.


I would challenge the reader to find another basis for concurrency which has all of these advantages.

Sunday, September 09, 2012

F-Nets, the essence of ScalPL

I've been wanting to explain here why I get excited about ScalPL (Scalable Planning Language), to compare it with other approaches, etc.  But to do that, I first need to assume that the reader knows what ScalPL is, even those who haven't read the book.  Fortunately, ScalPL can be separated into its foundational (fairly simple) computational model, called F-Nets (short for function networks), and then the other stuff built on top of that (object-oriented model, arrays, data concurrency, etc.), and I really only need to explain F-Nets to convey many of ScalPL's merits.  I'd like to just convey the raw "what" behind F-Nets in one post (here), leaving the "why"s to become clearer in later posts.  Although I presented most of the facets of F-Nets a month ago from a different angle, I'll subsequently use this post as the definitive "quick and complete description" of F-Nets for future reference.  So without further ado, in a dozen paragraphs...


An F-Net consists of two main kinds of entities: Resources and tactics.  A resource, represented graphically by a rectangle (sometimes with other decorations), is a container -- it is capable of holding something static, which is (unsurprisingly) called its content, or more technically, its content state.  For computers, you can think of a resource roughly like a variable, and its content as its current value.  So, each has an initial content state (either declared or default), and a content domain (like a data type) which describes the kinds of content it can hold.

A tactic is a description of what to do to some resources, so it can be considered as a program (subprogram, function, method, etc.).  And like a program, which can be run (or executed) to become a process (an active entity "doing its thing"), a tactic can be activated (executed) to become an acton (also an active entity "doing its thing").

A construct called a context, represented graphically by a circle, signifies an association between a tactic and a set of resources, by labeling the circle with the name of the tactic, and drawing lines from the circle to the rectangles representing those resources.  Multiple contexts in the F-Net can be labeled with the same tactic name.  The connecting lines, called role bindings, are uniquely labeled (explicitly or implicitly, by labeling the resource instead), and these labels/names are used by any acton in that context (i.e. resulting from activating the associated tactic) to designate the resources.  The behavior of an acton -- i.e. the alterations it makes to the content of its resources via these role bindings -- is completely determined by the tactic from which the acton springs (i.e. the tactic's description of what to do) and the original contents of those resources when the tactic activates into an acton. In other words, a tactic can be regarded as a function which translates/maps the contents of its resources (as identified by those role binding names) when it activates, to the final contents of those same resources when the activated tactic (i.e. acton) finishes.  And since a tactic can be regarded as a function, then a context can be regarded as specifying an application of that function relative to a specific set of resources. And an acton can be regarded as an evaluation of that application of that function.

Technically, it's possible to get by without specifying much more, just letting the tactics activate atomically (e.g. one at a time) randomly or continuously.  In fact, what's been described here so far is already very similar to a computational model developed years ago by K. M. Chandy and J. Misra called UNITY (or similarly Dijkstra's Guarded Commands) where the guards would be built into the tactics.  However, restricting/specifying the circumstances under which each tactic can activate is generally really useful for both understanding and efficiently implementing the F-Net, so there's another piece of the puzzle:  In addition to content state, each resource also has another changeable attribute, called its control state, represented as a color, initially green for each resource.  And, each of the role bindings (lines) connecting a context to a resource also has an associated set of (static) colors, represented as colored dots (called access dots) shown between the end of the role binding and the resource rectangle.  Now the activation rule is:  A tactic can activate in a context if and only if all of the context's resources have a control state matching one of the access dots on the role binding connecting that resource with the context.  And, when a tactic does activate, in addition to accessing (e.g. observing, updating, or replacing) the content of its resources, the acton also assigns a new control state (color) to each resource, thereby (perhaps) allowing other tactics (e.g. associated with other contexts) to activate.
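A toy sketch (plain Python, with invented names -- not ScalPL notation) may help make the activation rule concrete:

```python
def can_activate(context, control_states):
    """context: {resource_name: set of access-dot colors on that role binding}.
    The tactic may activate iff every bound resource's current control
    state matches one of the access dots on its role binding."""
    return all(control_states[r] in dots for r, dots in context.items())

# Two contexts sharing resource x, plus a second resource y:
control = {"x": "green", "y": "red"}
producer = {"x": {"green"}}               # fires while x is green
consumer = {"x": {"red"}, "y": {"red"}}   # needs x and y both red
```

With x initially green, only the producer can fire; once an acton recolors x red, the consumer becomes eligible and the producer does not.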

That covers when a tactic can activate into an acton, but must it ever do so?  The liveness rule answers that question:  If a tactic can activate, then it must, within a finite amount of time -- or, another tactic using at least one of the same resources (and meeting the normal activation criteria) must.  In other words, if something can happen, it will/must, as long as something else doesn't interfere.

That is the essence of F-Nets, but there are a few other notations and minor rules which make them more portable, easily understood, and efficient.
  • First, we put arrowheads (called permissions) on all role bindings to designate whether an acton in that context might observe/read the contents of that resource (arrowhead on the circle/context end) and/or will always replace/rewrite the contents of that resource (arrowhead on the resource end).  A role binding with no arrowheads means that the acton won't touch the content state (though it will still assign it a new control state), and arrowheads on both ends signify the other options -- i.e. that the acton might only change parts of the content state, or might only sometimes replace the entire content state. The four combinations of arrowheads are called observe, replace, nodata, and update permission, respectively (i.e. context end, resource end, neither end, both ends).
  • Then, just as arrowheads help to represent what the actons might do to the resources' content states, we use colored dots inside of the resource rectangles to represent what the actons might do to their control states.  That is, near where each role binding connects to the rectangle (with access dots), we put one or more colored dots (called, unsurprisingly, new control state dots) inside the rectangle to represent the new control states that the corresponding acton might assign to that resource.  It is also common to add a legend inside the left end of the resource rectangle showing each color that the resource might ever assume (i.e. its control domain) and what that color symbolizes.
So now, just by looking at the F-Net, you can tell, to a large extent, how each acton might affect, or be affected by, the content and control states of the resources to which it is bound (attached) by role bindings.  The control states make an F-Net act very much like a Petri Net, in that regard.

But there is one more loose end.  If tactics can be considered as programs or functions (specifically what theorists call partial recursive functions), complexity theory tells us that we may not be able to tell, with any amount of analysis, whether an acton (once it has been activated from a tactic) will ever finish accessing the content of any particular resource, or assign it a new control state.  (This is true even if the acton has already been cranking away for an arbitrary amount of time.)  This messes up the idea that we can activate just one tactic at a time:  If one gets stuck, we'd never get around to activating any other tactics, but we might also never be able to know if we can just give up on it, because it might eventually do more if we just give it more time.

The essence of getting around this issue is by simply considering a resource which an acton never finishes with as having a special control state (called bottom or ⊥) represented by no color at all (often shown as black).  Because it can be so hard (or impossible) to figure out whether any particular acton will eventually finish accessing any particular one of its resources, it is also generally assumed that any resource could end up with this bottom control state, so it's as though there is an implicit bottom (black) dot added to every set of new control state dots.  However, technically speaking, whether an acton will or will not finish up with any particular resource is still completely determined by its tactic and the initial content of its resources, so even with this so-called augmented control domain, we can still at least abstractly consider each tactic as a (total) functional mapping.  And we can freely allow one tactic to activate even before others have finished, as long as we don't let a subsequent tactic activation use a resource that a previous one (i.e. acton) hasn't finished with yet.

And that brings us to one final rule, called the predictability rule, which greatly increases the efficiency and portability of F-Nets in common cases.  It says simply:  If all of the new control state dots for a role binding have the same color, and that role binding has either observe or nodata permission (i.e. there is no arrowhead on the resource end of the role binding), then we call the role binding predictable, which we take to mean that there is not an implied new control state dot of bottom.  That is, predictable role bindings always finish up.  And because they always finish up, and they don't alter (update or replace) their content state, and they have only one choice for new control state, we know the final content and control state of the associated resource the moment the tactic is activated, even before it has finished up.
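As a quick illustration (hypothetical Python, not part of ScalPL), the predictability test itself is trivial to state:

```python
def predictable(new_control_dots, permission):
    """A role binding is predictable iff it offers exactly one new control
    state and has no arrowhead on the resource end -- i.e. observe or
    nodata permission, so the acton never alters the content."""
    return len(set(new_control_dots)) == 1 and permission in ("observe", "nodata")
```

When this test passes, a scheduler can treat the resource's final control and content state as known the instant the tactic activates, without waiting for the acton to finish.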


OK, that's F-Nets in a nutshell.  There are other things I could get into, like just how to express tactics in a traditional programming language, or how to represent F-Nets textually.  There are also other constructs and abstractions: For example, role bindings really represent (surprise!) bindings for things called roles, and there are ways to make role bindings branch, etc.  But relatively speaking, those extras are all just window dressing.  F-Nets serve as only the most basic foundation of ScalPL, which altogether has much more functionality, power, and expressiveness than F-Nets, but F-Nets alone are plenty enough to explain many of the advantages of ScalPL over many other approaches.  And now onto the beauty of this approach...

Friday, September 07, 2012

So why parallel (instead of concurrent)?

My last blog entry focused on the ongoing confusion between parallelism and concurrency.  At the very least, I hope I convinced you that parallelism is a subset of concurrency -- i.e. that all parallel programming is also concurrent programming.  So, what, specifically, are parallel programmers trying to achieve beyond concurrency, and why?

A little history is in order.  Parallel programming emerged in an age where the number of practical architectures/platforms for supporting concurrency was limited, as was the technology for programming them.  By and large, these platforms consisted of a handful to hundreds of identical (homogeneous) sequential processors tied together with a uniform communication network, to support either a shared memory or message passing paradigm.   Typically, the entire platform (or some well-defined portion of it, in an approach called space-sharing) was devoted to running one program at a time.

To make the most of these exclusively-allocated processors while reining in complexity, approaches like SPMD (Single Program Multiple Data) or SIMD (Single Instruction Multiple Data) were used to program these machines, where the same instructions (or at least identical programs) ran on several processors at the same time, but with each processor operating on a different part of a larger data set (or parameter space).  Programs were often written using an approach called loop-level (or data) parallelism, where all processors were started on some part of a task (e.g. to process some data structure) at about the same time, generally to deliver their contributions to some specific processor at one time when finished.  This was often at least partially facilitated by assigning different parts of large data structures (typically arrays) to different processors, signifying where operations to those structure elements would be performed. Communication was optimized for these so-called "collective" cases so that information could be efficiently distributed to and/or collected from all the processors at once.  A program consisted of one or more of these collective distribute-compute-collect steps.

Software tools and languages facilitated this approach, using either SIMD languages (like Thinking Machines' C*) on specialized architectures, or more commonly (to whatever extent feasible), executing different iterations of a normally-sequential loop on different processors concurrently instead of in their default/native sequential order, using new languages (or extensions to traditional sequential ones) like HPF and OpenMP.  Tools such as The Force, BSP (Bulk Synchronous Parallel), and collective operations in MPI also supported this collective-execute-and-collect paradigm.  All-in-all, this assumption that a program had an entire, well-defined, homogeneous machine all to itself, and that the programmer was largely responsible for exploiting that machine (in part by optimizing the scheduling of operations on its different processors) to speed up the result, came to be considered as "parallel programming".

Fast forward to the present (and, if you like, future).  Processors are ubiquitous.  Communication is ubiquitous.  PCs, tablets, and even phones have multicore chips (i.e. multiple processors). GPUs (Graphics Processor Units) that were originally intended for, well, graphics are being retargeted for more general computation in cases where there is a good application/architecture match.  Our music, pictures, documents, etc. are often stored and processed in "the cloud", along with web searches and apps, on perhaps thousands of individual (generally PC-sized) machines handling perhaps millions of requests from different users. The number of processors in these larger platforms (e.g. in the cloud) together with the rate of technology advancement make the economics of homogeneity restrictive:  Rather than replacing the entire platform at one time when parts go bad or are outdated, it's often more economically feasible to replace them piecemeal, with the technology of the day, but this leads to a hodgepodge of different technologies and characteristics.  Virtually all the assumptions of traditional parallel computing are out the window, and with them, the effectiveness of traditional parallel programming approaches.

But it's worse than that.  Even if platforms were still largely uniform collections of homogeneous processors in a uniform topology and roughly uniform latency (memory access time) -- and some still are -- those traditional parallel programming approaches would still be suboptimal, as they always were.  First, the goal of high-level programming (and algorithms) from their very inception has been to divorce the correctness/viability of the program/algorithm from the architecture used to implement it, and these parallel programming approaches fail that, often being dependent both on the number of processors and the latency and semantic characteristics of their interconnection (e.g. message passing, NUMA or UMA shared memory).  Second, limiting concurrent expression to fork-join or one-loop-at-a-time execution often forgoes the opportunity to fully exploit the available processors, such as when multiple loops or a loop and other less parallel sections could execute concurrently:  These approaches were largely chosen as a quick-and-dirty way of reining in complexity, and do not fit all computational needs, being relatively unsuitable for a large class of programs (e.g. employing software pipelining or other kinds of functional decomposition).  Concurrency can scale with data without expressing it as "non-looping loops".  And third, the time for any program to achieve a solution includes the time it must wait around for a platform to become available for the program to even start running, which goes way up if an entire machine must be dedicated to one application at a time (or even space-shared between them).

So then, what, really, is the motivation for using these limited parallel programming approaches over concurrent ones?  I would claim: Primarily laziness.  Laziness in developing truly flexible expressive ways of representing concurrency algorithmically, independent of platform, that encompass loop-level parallelism as well as other kinds of concurrency, and in developing the necessary systems software (or middleware) and methodologies to have the resulting programs run (and share platforms) efficiently.

A programmer should be revealing the available concurrency in their program (i.e. their algorithm), and the platform (hardware, system software, and middleware) should exploit that, in the context of the current conditions, including the availability of processors and other load on those processors.  If a human needs to inject some guidance, some high level knowledge, to facilitate that algorithm-to-platform mapping, then fine, but let it be in a separate step, rather than within the expression of the algorithm itself.  We've been fully accustomed to having that algorithm/platform independence in sequential programming for decades.  It's time to get real in the concurrent world -- and I would argue that approaches like Scalable Planning Language (ScalPL) allow us to do just that.

Thursday, September 06, 2012

Parallelism "vs" concurrency

My expertise/experience is in parallel programming and parallel tools, and my company name, "elepar", is a shortening of the coined term "elemental parallelism".  Yet my recent book barely mentions parallelism, favoring instead the term "concurrency".  (The working title was in fact "Elemental Concurrency".)  Why the terminology switch from "parallelism" to "concurrency"?

Over the years, there have been many attempts to contrast parallelism and concurrency, as they relate to computation.  Two fairly recent examples (from a quick google search) are:
The upshot of these is that concurrency refers to cases where different activities are vying for shared resources, or where outside forces prevent control over whether they will execute simultaneously, and parallelism refers to the other cases, where (say) you want activities to act simultaneously, and not to compete for resources, usually because you want to get as much done as possible in minimum time.

Is there justification for this stance?  Oxford American Dictionary says parallel (computing) means "involving the simultaneous performance of operations", and concurrent means "existing, happening, or done at the same time".  With such similar definitions, it's no wonder that these terms get confused, but while I can understand attempting to clarify their differences, they are not mutually exclusive.  In fact, if operations are parallel (simultaneous), they are also necessarily concurrent (happening at the same time).

It's the other cases that apparently confuse, where operations are concurrent but not parallel.  Simply put, if many activities overlap in time, at some time scale, they are concurrent; but if they cannot all act at precisely the same time because (say) some or all of them are competing for resources, then they can't all work simultaneously (they take turns accessing the resource), and are thereby not (technically) parallel.  The clearest and most common case of this is when there is but one processor or person (considered a resource, if you like) available to carry out a set of concurrent activities.  That processor or person must then advance one activity at a time, or perhaps time-slice or multitask to move them all forward in stages.  Some people consider Petri Nets to embody this concept of concurrency, because one person is often employed to play the token game, even if multiple transitions could technically fire simultaneously.  Also, the transitions in Petri Nets fire atomically, which can be interpreted to mean that each one takes no time, thereby precluding multiple transitions from firing at any one time (to some level of precision).
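A toy sketch of concurrency without parallelism: one "processor" (a single Python thread) time-slices several activities, each modeled as a generator, so all of them make progress in stages yet no two ever act simultaneously.  The activity names and step counts are illustrative only:

```python
def counter(name, n):
    # Each yield is one "step" of this activity.
    for i in range(n):
        yield f"{name}:{i}"

activities = [counter("A", 2), counter("B", 2), counter("C", 2)]
trace = []
while activities:
    gen = activities.pop(0)          # round-robin: take the next activity
    try:
        trace.append(next(gen))      # advance it by exactly one step
        activities.append(gen)       # then put it back in the queue
    except StopIteration:
        pass                         # this activity has finished
```

The resulting trace interleaves A, B, and C in stages, which is exactly the one-processor time-slicing described above: concurrent, but never parallel.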

In addition to the computing sense, the term "parallel" also has a geometrical definition: "side by side and having the same distance continuously between them".  If one draws a space-time diagram of a set of activities, and they form parallel (adjacent) line segments in the geometrical sense, then those activities are parallel in the computing sense.  It implies that, at some points in time at least, the activities are all active at the same time, though at different places (points in space).  This highlights an advantage of parallelism over mere concurrency: keeping each individual activity going, to get as much done in as short a time as possible.  And it illustrates how to accomplish that: by placing the different activities at different places, and having them interact as little (i.e. be as independent) as possible.

But now to get serious.  Since concurrency is a precondition to parallelism, parallel programming is necessarily concurrent programming.  And in common usage, although atomic actions (e.g. transactions) can be considered abstractly to take no time, in reality they do take time and can thereby be parallel.  Plus, regardless of how one implements parallelism, virtually all parallel activities share some resources in a mutually-exclusive fashion, so using that as a basis for differentiating concurrency and parallelism is not as definitive as it might at first seem. (Parallel activities which don't share resources are rare enough that they merit their own terminology, like "embarrassingly parallel".)  In fact, it's worse than that: Occurrences at different places (e.g. on different processors) are (relativistically speaking) operating relative to different clocks, so simultaneity is not clearly and objectively defined.

But even assuming that one comes up with objective definitions of parallelism and simultaneity, etc., to delineate some concurrent programs as parallel ones, what features should a parallel programming methodology have over a concurrent programming methodology?  And moreover, what would the advantages and disadvantages of building parallel programs be, as opposed to concurrent ones?  I'll address that in my next blog entry.