Wednesday, March 20, 2013

How about data movement? (For real, this time.)

In the previous post, I addressed the kinds of inherent constraints that an F-Net (and/or ScalPL) plan imposes (and does not impose) on data movement.  The question then becomes how to turn those constraints into an implementation:  sooner or later, something must decide whether and when to move each data item, and those decisions must satisfy the constraints.

The most basic way to meet the constraints is to assign all the computations (actons) to one processor, put all the data (resources) into its local memory, and execute the whole thing sequentially.  Unlike some other models (like most message passing) whose semantics intrinsically impose additional copying and/or data movement, or which require context switching between heavyweight address spaces, this sequential execution of F-Nets requires little if any overhead over a native sequential expression of the same computation, either for data movement/copying or for context switching. In F-Nets, if a subcomputation/acton is going to complete at all, it can always be executed to completion before moving on to any other subcomputation/acton, and in that case there is no need to save transient/intermediate state.  So the only need for more heavyweight context switching is when an acton diverges ("infinitely loops") or appears it may do so. (Most message passing requires data copying or movement even in a sequential environment, because both the sender and receiver are assumed to have their own mutable copy of the message after the communication, and both may be required to make progress for the overall computation to progress.)
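To make the run-to-completion point concrete, here is a minimal Python sketch of such a sequential executor, using a toy representation of my own devising (the Acton class, the control-state strings, and run_sequentially are illustrative inventions, not ScalPL constructs).  Each acton fires only when its resources are in the control states it requires, runs to completion on the single in-memory copy of the data, and nothing is ever copied, moved, or checkpointed:

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Acton:
    # Control states required on each bound resource before firing.
    needs: Dict[str, str]
    # The body runs to completion, mutating resources in place, and
    # returns the new control state of each resource it touched.
    body: Callable[[Dict[str, object]], Dict[str, str]]

def run_sequentially(actons, resources, control):
    """Fire any enabled acton, always running it to completion before
    picking the next one.  No data is copied or moved: every acton
    operates on the single in-memory copy of each resource."""
    progress = True
    while progress:
        progress = False
        for a in actons:
            if all(control[r] == s for r, s in a.needs.items()):
                control.update(a.body(resources))  # run to completion
                progress = True

# Example: two actons chained through the single resource "x".
def produce(res):
    res["x"] = 21
    return {"x": "half"}

def finish(res):
    res["x"] *= 2
    return {"x": "done"}

resources = {"x": 0}
control = {"x": "empty"}
run_sequentially([Acton({"x": "empty"}, produce),
                  Acton({"x": "half"}, finish)],
                 resources, control)
print(resources["x"])  # 42
```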

But in the more common case, where we want an F-Net/ScalPL plan to exploit the parallelism of a platform, how best to decide how (and when) to embed the plan (program or strategy) onto that platform?  In the good old days, parallel programmers made several assumptions about both the platform and the algorithm to simplify this embedding process:  We assumed that the platform consisted of a fixed number of homogeneous, dedicated, fault-free general-purpose processors connected with a predictable, regular communication topology.  We also often modeled the application/plan as a set (and usually a sequence) of parallelized steps ("epochs", phases) such as array operations, where each potentially-concurrent operation in a step took about as long as any other operation in that step to complete.  Based on such assumptions, it was sometimes viable to statically map the computations and their associated data onto the processors in the platform.
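As a hypothetical illustration of how simple such static mapping could be under those assumptions (block_owner and its parameters are my own invention, not taken from any particular system), a block distribution of n equal-cost element operations over p identical processors is pure arithmetic, computed once before execution begins:

```python
def block_owner(i, n, p):
    """Processor owning element operation i of n, on p identical,
    dedicated, fault-free processors (the classic assumptions)."""
    block = (n + p - 1) // p   # ceiling division: elements per block
    return i // block

N, P = 1000, 8
placement = {i: block_owner(i, N, P) for i in range(N)}
print(placement[0], placement[999])  # 0 7
```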

Under those same assumptions, an F-Net or ScalPL plan can also be statically embedded on a platform in similar ways.  Rather than assigning resources to processors (as some languages might assign array elements under an "owner computes"-like rule), it makes far more sense to assign the computations themselves -- i.e. actons -- to processors, by annotating their contexts.  Another approach is to assign each control state of each resource to a processor.  In either case, the constraint (implied in the F-Net itself) that the content state of a resource must be colocated with the acton during execution is assumed, and annotations that conflict with it can be resolved in any of several sensible ways (e.g. majority rules).  In ScalPL, by allowing these annotations to include the index values supplied by so-called "dup bindings", such embeddings can be a function of the array elements processed, so they are no less expressive than those in HPF, etc.
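Here is a sketch of what such annotations might look like, assuming a hypothetical notation in which each context carries a placement rule that may be a function of a dup-binding index (none of the names below are actual ScalPL syntax):

```python
P = 16  # processors in the platform (illustrative)

# Each context carries a placement rule, here a function of the
# index supplied by a dup binding, so placement can depend on
# which array slice an activation processes.
annotations = {
    "row_update": lambda i: i % P,          # cyclic over row indices
    "col_reduce": lambda j: (j * P) // 64,  # block over 64 columns
}

def processor_for(context, index):
    return annotations[context](index)

print(processor_for("row_update", 19))  # -> 3
print(processor_for("col_reduce", 40))  # -> 10
```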

Even so, just as issues can crop up with deep subroutine calls in HPF, the hierarchical and dynamic nature of ScalPL means that different (perhaps recursive) activations of the same plan (strategy, program) may be best assigned to different parts of the platform, yet preferably related to the embedding of their parent activation in sensible ways -- so static labeling of contexts isn't necessarily sufficient.  This can be at least partially addressed by a simplified form of parameter passing -- i.e. providing special (integer) variables in child plans which are set equal to the annotation of the context in the parent, and which the child can then use (e.g. in mathematical expressions) when annotating its own contexts.
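A minimal sketch of that parameter-passing idea, assuming (purely for illustration) that each child activation receives its parent's integer annotation and derives the placements of its own contexts from it arithmetically:

```python
P = 16  # processors (illustrative)

def child_placement(parent_proc, local_index, fanout=4):
    """Derive a child context's processor from the annotation its
    parent activation passed down, spreading the child's contexts
    over a small window of processors starting at the parent's."""
    return (parent_proc + local_index % fanout) % P

# A (possibly recursive) activation placed at processor 5 annotates
# its own contexts relative to that inherited value:
print([child_placement(5, k) for k in range(6)])  # [5, 6, 7, 8, 5, 6]
```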

Even better is to represent the overall embedding as a two-step process: the first step maps the computations to an abstract platform, which is considered to have easily-understood internode properties, and the second maps that abstract platform to a concrete/actual one, hopefully preserving as many of those properties as possible.  My work has focused on an abstract platform (AP) representation of nodes addressed by finite sequences of non-negative integers, allowing the AP topology to be interpreted in a number of ways -- e.g. as a sequence/cycle (if just single integers), a rectangular grid (if tuples of integers), or a tree (with the zero-length sequence representing the root, and any other sequence representing a descendant of all nodes addressed by its prefix sequences).  The annotation/parameter passing described in the previous paragraph can now carry these sequences, instead of integers, with the child able to append new suffixes to the sequence it was passed.

Mapping of the AP to the actual platform can be manual or automated, based on various assumptions.  A common shorthand is to assign all nodes having the same prefix to a particular actual processor, which can be interpreted as assigning an entire row or column (if the AP is considered a grid) or subtree (if the AP is considered a tree) to that processor.  (People accustomed to HPF and similar embedding languages may feel this approach lacks constructs for transpose, etc., but (a) these would be handled in embedding to the AP, not in mapping the AP to the actual architecture, and (b) such a transpose, if necessary at all, is imposed by mapping the pre-transpose computations and post-transpose computations to appropriate nodes, since mappings effectively change between steps rather than within them.)
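Here is a small sketch of both steps under those conventions, assuming AP addresses are Python tuples (child_address and to_processor are illustrative names, not part of any ScalPL implementation): a child extends the address it was passed with a new suffix, and the AP-to-platform map assigns every address sharing a given prefix -- a whole subtree or grid row -- to one concrete processor:

```python
def child_address(parent_addr, k):
    """AP address for a child activation's k-th context: append a
    suffix to the sequence the parent passed down."""
    return parent_addr + (k,)

def to_processor(ap_addr, num_procs):
    """Map by prefix: the first coordinate picks the processor, so an
    entire subtree (or grid row) of the AP lands on one node."""
    return ap_addr[0] % num_procs if ap_addr else 0

root = ()                    # the AP root: the zero-length sequence
a = child_address(root, 3)   # (3,)
b = child_address(a, 7)      # (3, 7), a descendant of (3,)
print(to_processor(a, 8), to_processor(b, 8))  # 3 3: same subtree
```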

But that leaves two important points.  The first is that the traditional simplifying assumptions for static parallel embedding (summarized above) are virtually all unrealistic these days.  Applications for concurrency are no longer confined to scientific and/or matrix operations, so their concurrency is no longer regular in shape or uniform in subcomputation duration.  This irregularity, together with increased concurrency, makes it less practical to devote entire machines (or portions thereof) to specific applications, so node sharing (yielding variable loads) must also be accommodated.  And as platforms grow ever larger, it becomes less practical to keep them homogeneous and fault-free. The bottom line is that static embedding becomes all but impossible:  these embedding decisions must be made at runtime, in real time.  Using the two-step process described above, only the mapping of the AP to the concrete platform must be dynamically determined.
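To illustrate just that dynamic second step (the DynamicMapper below is a made-up example, not the mechanism described in the book), AP prefixes might be assigned lazily at runtime, say to whichever processor currently reports the least load, while still keeping whole subtrees together:

```python
import heapq

class DynamicMapper:
    """Assign AP prefixes to processors on first use, least-loaded
    first.  Loads would in reality change with sharing and faults;
    here they simply accumulate per assignment."""
    def __init__(self, num_procs):
        self.heap = [(0, p) for p in range(num_procs)]  # (load, proc)
        self.assigned = {}  # AP prefix -> processor

    def processor_for(self, ap_addr, cost=1):
        prefix = ap_addr[:1]  # keep whole subtrees together
        if prefix not in self.assigned:
            load, p = heapq.heappop(self.heap)
            heapq.heappush(self.heap, (load + cost, p))
            self.assigned[prefix] = p
        return self.assigned[prefix]

m = DynamicMapper(4)
print(m.processor_for((3, 7)), m.processor_for((3, 1)))  # same proc
```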

And the second point, related to the first, is that making such dynamic embedding decisions on a fine-grained basis is also impractical.  This is another reason for facilitating non-trivial, large-granularity subcomputations as the basis for embedding.  That, in turn, requires that these subcomputations be more flexible and expressive than traditional data flow -- e.g. allowed to produce multiple results (each potentially directed to a different downstream subcomputation), including results which affect only parts of larger data structures.  This is a prime motivation behind the structural properties endowed upon actons (the subcomputations) in F-Nets/ScalPL.  Because these actons are typically entire subcomputations/subprograms in themselves, the embedding decisions within an acton onto a single processor's resources (memory, registers, functional units, etc.) can be performed statically and locally, such as by the compiler, or if necessary, by out-of-order (OoO) scheduling within the processor. These local, lightweight decisions are, by now, well understood.
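As a sketch of that extra expressiveness, reusing the toy acton representation from the earlier example (illustrative names throughout): a single coarse-grained acton can update only parts of a larger resource and emit several results at once, each steering a different downstream acton:

```python
def partition_step(res):
    """One coarse-grained acton: splits a work list in place and
    emits two results, each enabling a different successor."""
    data = res["work"]
    mid = len(data) // 2
    res["low"], res["high"] = data[:mid], data[mid:]  # parts of a whole
    return {"low": "ready_for_sorter_a",    # routed to one consumer
            "high": "ready_for_sorter_b"}   # routed to another

res = {"work": [5, 2, 9, 1], "low": None, "high": None}
print(partition_step(res), res["low"], res["high"])
```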

So that is the answer to how data movement is determined within an F-Net/ScalPL plan (i.e. a strategy, potentially consisting of sub-strategies, ultimately down to actons):  a balance between explicit parameterized embedding of the strategy/plan onto an Abstract Platform, dynamic (or, in restricted cases, static) embedding of the AP onto a concrete platform, and traditional compiler/processor scheduling of processor resources for each acton within the strategy/plan.  The book goes into some depth in describing an entity known as The Director, and how it can efficiently determine when to move data. There is also significant discussion of how to optimize the movement of data depending on architecture type (e.g. shared vs. distributed memory) and the form of the strategy (e.g. with multiple readers of a single resource, readers and writers, etc.).
