Tuesday, April 15, 2014

Comparing ScalPL and Concurrent Collections (CnC)

The previous post here, showing a Cholesky decomposition in ScalPL, together with previous posts here and in the book (e.g. showing map-reduce and sorting in ScalPL), should now provide a substantial basis for readers to compare ScalPL with other extant methods addressing similar goals.  In this post, I'll focus on Concurrent Collections (CnC).  In some sense, it's an apples-to-oranges comparison.  The investment and number of people involved in CnC have been markedly greater than for ScalPL, apparently leading to systems that can actually be benchmarked. Though early prototypes of ScalPL (under names like LGDF, CDS, and F-Nets) have been implemented on a number of machines, comparable benchmark numbers are not available.  So the comparisons here will largely be based on the two specifications.  Even so, there are many reasons to suspect that ScalPL could be implemented at least as efficiently as CnC, and that, in fact, runtime tools developed for CnC could, if necessary, be leveraged in large part to facilitate ScalPL execution.

CnC Cholesky diagram (See Knobe video)
Concurrent Collections (CnC) and ScalPL have several similarities in form, ancestry, and capabilities.  Both are based on large-grain dataflow principles; their actions ("steps" in CnC, "tactics" in ScalPL) are each implemented in some traditional base language (C, Java, Fortran) and are deterministic, stateless, and two-phase (consuming all inputs, then producing all outputs); and both approaches have some form of visual syntax.  In both, active steps are represented as circles/ovals connected to rectangles representing data (or data collections), with directed arcs showing the flow of data.  For the Cholesky example, this explains why the CnC and ScalPL diagrams are significantly similar: a rectangle representing tiles, three circles for Cholesky, Trisolve, and Updates, and some additional notation above the circles suggesting further constraints.

Brief description of CnC

ScalPL Cholesky strategy (see previous post)
Before pointing out the significant differences, however, a brief summary of CnC is in order.  It supports three important kinds of collections, or sets:  tags, steps (activities), and items (data). Each step collection is associated with a tag collection in a "prescribes" relation, such that putting a tag into the tag collection prescribes a step (to be activated) with that tag in the step collection.  Steps may get items from and put items into item collections, and may put tags into tag collections. Items in an item collection have unique tags, or keys.  CnC describes "exactly 2 ordering constraints: (1) Producer must execute before consumer, and (2) controller must execute before controlee".  A producer-consumer relationship is one where a producer step "put"s an item with a particular tag into an item collection, and a consumer step "get"s the item with that tag.  The controller-controlee relationship means that the environment or a step (the controller) "put"s a tag into a tag collection, allowing a step (the controlee) with that tag to be initiated in the prescribed step collection.
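These mechanics can be sketched in a few lines. This is a toy Python simulation of the semantics just described, not CnC's actual API (the real implementations are C++/Java/Haskell libraries); the class and method names here are mine:

```python
# Toy sketch of CnC's collection kinds (NOT the real CnC API):
# putting a tag into a tag collection "prescribes" a step instance,
# and steps get/put items keyed by tag.

class ItemCollection:
    def __init__(self):
        self._items = {}

    def put(self, tag, value):
        # Single assignment: at most one item per tag for the
        # lifetime of the collection.
        if tag in self._items:
            raise ValueError(f"item with tag {tag!r} already exists")
        self._items[tag] = value

    def get(self, tag):
        return self._items[tag]

class TagCollection:
    def __init__(self):
        self._prescribed = []   # step collections this tag collection prescribes

    def prescribes(self, step_fn):
        self._prescribed.append(step_fn)

    def put(self, tag):
        # Controller-controlee: putting a tag activates one step
        # instance per prescribed step collection, tag as argument.
        for step in self._prescribed:
            step(tag)
```

In a real runtime the prescribed step would of course be scheduled asynchronously rather than called inline; the sketch only captures the prescribes/get/put relationships.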

From this and further reading, it can be inferred that (in normal operation) only one item with a given tag ever exists in a given CnC item collection, usually by keeping each item around (after it is put) for the life of the collection, never allowing it to disappear, change value, or be superseded or duplicated by one with the same tag/key.  Instead, to effectively change the value of an item, you must create a new item with a new tag.  This explains why the CnC Cholesky example includes the iteration number as part of the tag.  It also means that even a simple program like their Fibonacci example keeps all of the values of the entire sequence, even though only the final one is returned.  There is a tuning feature which allows the specification of a "get-count" for each item, such that the item will disappear after that number of gets has been performed on it.  It is not clear why this is not considered a property of the algorithm, but if it allows a put of another item with that tag after the get-count expires, it would surely violate the single-assignment property, and thereby the deterministic behavior, that the model advertises.
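The Fibonacci point is easy to illustrate. Below is a sketch (my own, not CnC's published code) of a CnC-style Fibonacci in which items are write-once and keyed by tag, so every intermediate value is retained for the life of the collection:

```python
# Sketch of why a CnC-style Fibonacci retains the whole sequence:
# items are single-assignment and keyed by tag (here, the index i),
# so every intermediate value stays in the item collection.

def fib_cnc_style(n):
    items = {}                    # item collection: tag -> value, write-once
    def put(tag, value):
        assert tag not in items, "single assignment violated"
        items[tag] = value
    put(0, 0)
    put(1, 1)
    for i in range(2, n + 1):     # each tag put prescribes one step instance
        put(i, items[i - 1] + items[i - 2])   # step: get two items, put one
    return items[n], len(items)   # result, plus count of retained items

assert fib_cnc_style(10) == (55, 11)   # answer 55, but 11 items still live
```

An in-place formulation needs only two live values at any moment; here the memory footprint grows linearly with n.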

I quibble with the CnC characterization of the controller-controlee relationship, or even the use of the word "control" for this in general, since it's no more control than any other communication.  That is, if there were a way (as in Linda, say) to just get (and remove) any arbitrary item from a set and access its contents, then the prescription semantics would apparently be identical to simply considering all steps as starting with such a "get any" from the prescribing tag collection, instead of (as currently) having the scheduler/system get the tag, initiate the step, and pass the tag as an argument.  In either case, the step can't complete until both the prescribed tag and all of the other data items are available.  In fact, while the CnC example for Cholesky is described as having a controller-controlee relationship between the Cholesky step and the TriSolve steps (for example), it is not at all clear to me why an application programmer (or discipline specialist) would see it this way.  ScalPL just sees this as a standard data dependence (and, when necessary, anti-dependence).
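The equivalence argued here can be made concrete with a small sketch (helper names are hypothetical, mine): a scheduler that initiates one step per tag behaves the same as a step that begins by removing an arbitrary tag from the tag set itself.

```python
# Two views of prescription, argued above to be equivalent:
# (a) the scheduler gets each tag and initiates the step with it;
# (b) the step itself starts with a "get any" from the tag set.
import collections

def run_prescribed(tags, step):
    # (a) CnC view: system-driven; tag passed as an argument.
    for tag in tags:
        step(tag)

def run_get_any(tag_set, step_body):
    # (b) Linda-like view: step removes an arbitrary (unspecified) tag.
    pending = collections.deque(tag_set)
    while pending:
        tag = pending.popleft()   # "get any": removal order unspecified
        step_body(tag)

seen_a, seen_b = [], []
run_prescribed({1, 2, 3}, seen_a.append)
run_get_any({1, 2, 3}, seen_b.append)
assert sorted(seen_a) == sorted(seen_b) == [1, 2, 3]
```

Either way, each tag causes exactly one step instance, which is the point: the "control" is just another datum being consumed.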

Some Comparisons with ScalPL

Regarding efficiency and scalability:
  • CnC isn't expressive enough (without tuning hints like "get_count") to express when data space can be reused/collected, and even these tuning hints are fraught with potential error (e.g. why would one expect item lifetime to be based on a simple count of accesses?).  In ScalPL, the potential observation or updating of a data item is considered a natural part of an algorithm expression (i.e. not platform dependent), and potential reclamation/reuse of space is a natural side-effect of this expression.
  • By default, a step in CnC may begin consuming execution resources as soon as the prescribing tag exists, perhaps long before the data items it is to consume are available, leading to subsequent stalling or abort/restart, because in general, CnC isn't expressive enough to declare which items must be available before step initiation (unless tuning "depends" hints are added). ScalPL is carefully designed so that all resources ("items") needed by a tactic ("step") are declared (or can otherwise be determined) before execution is initiated by a scheduler, even if some resources are dependent upon the content of others, thus allowing a tactic to be scheduled for execution only when it can proceed to completion without blocking.
  • CnC does not allow update-in-place of items.  This can add significant overhead on (shared) memory based systems (e.g. to allocate new space and copy unchanged data to effect updates), and alters natural algorithmic approaches. ScalPL allows natural update-in-place of its resources.
  • The above factors give CnC relatively static granularity, in that communication between steps can have significantly higher overhead than that within a step.  Since ScalPL does not require extra copying to effect communication between actons/tactics ("steps"), several actons can be effectively merged into a single grain (at compile/tuning time) when higher granularity is beneficial for a platform.  In other words, in ScalPL, breaking a computation down into multiple tactics does not inherently introduce significantly higher overhead overall than expressing it as a single one, it just exposes more potential parallelism for cases where it can usefully be exploited.
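The update-in-place contrast in the list above can be sketched on a small "tile" of data (a toy illustration, not either system's actual runtime code):

```python
# Contrast of the two update disciplines on a tile of data:
# single assignment (CnC-style) forces a fresh item per logical
# update, while exclusive access (ScalPL-style) permits mutation
# of the same storage.

def update_by_copy(versions, k, delta):
    # Single assignment: version k+1 is a new item; version k survives.
    new = [x + delta for x in versions[k]]
    versions[k + 1] = new          # old tile still occupies memory

def update_in_place(tile, delta):
    # Exclusive access to the resource: mutate the same storage.
    for i in range(len(tile)):
        tile[i] += delta

versions = {0: [1, 2, 3]}
update_by_copy(versions, 0, 10)
assert versions == {0: [1, 2, 3], 1: [11, 12, 13]}

tile = [1, 2, 3]
update_in_place(tile, 10)
assert tile == [11, 12, 13]
```

On a shared-memory machine, the copy-based discipline pays for an allocation plus a full tile copy per update even when only a few elements change, which is the overhead the bullet above refers to.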

From a software engineering standpoint:
  • CnC steps are inherently non-reusable, in that they directly name the item sets to which they refer.  In ScalPL, all modules (both tactics and strategies) are fully parameterized (using transitions and roles) to be reusable in a number of different contexts by just altering transition and role bindings to resources and their control states.
  • CnC has no executable modules other than steps, which are restricted in many ways (e.g. to be stateless, two-phase, deterministic).  Although ScalPL tactics have similar restrictions, ScalPL allows higher-level modules (strategies) to be constructed from tactics to have any sorts of properties (and combinations thereof) desired.
  • CnC programs are always deterministic, which means that CnC cannot express some otherwise useful programs.  ScalPL programs/plans, and strategies (subprograms), can be easily constructed to be (provably) deterministic when that is convenient, but nondeterminism can also be introduced when desired to allow full expressiveness of other approaches (like locks or messages) and potentially faster execution than fully deterministic approaches.

So is ScalPL less expressive than CnC in any way?  Apparently not.  ScalPL's resource arrays, together with role binding modifiers, would seem to handle anything that CnC's collections can, and more, and even without using ScalPL's strategies, instantiation levels, or any number of other constructs, its tactics alone would seem to cover the functionality of CnC steps.

A stated goal of CnC is to completely separate the specification of the program (and its dependencies) from the tuning of the program.  However, some of the factors which CnC considers "tuning", such as get_counts and depends clauses, are not actually dependent on the target platform so much as they are parts of the algorithm which are evidently considered too complex or troublesome for the discipline scientist to deal with.  ScalPL makes no bones about being targeted at essentially the "C-language" level for parallel and concurrent programming, and as such, it isn't intended for a scientist to express their algorithms without knowing something about its intricacies.  The only goal is to express an algorithm in a way that makes it efficiently executable on a wide variety of architectures while still having desirable analytic properties.  However, ScalPL does still expect a separate step for tuning to a particular platform, in which a human tuner, familiar with the algorithm and the platform, might suggest placement of computations and/or increased granularity by collecting resource elements and/or tactic executions together.  This would not change, but just restrict, the possible behavior of the program.


Is it surprising that the designer/developer of ScalPL would come to such conclusions?  More interesting would be to see my findings challenged, such as in the comments section, with an ensuing discussion.  No doubt, my intro can be restated back to me: that CnC is real in some sense, and ScalPL isn't (in some sense).  But that would dismiss a great deal of analysis and design on ScalPL's part a little too glibly.  If there are claims that ScalPL's design is flawed, such as that it cannot be efficiently implemented or simply doesn't make physical or mathematical sense, I would be as interested as anyone to understand how, but after so many years of development and consideration, it would take some convincing for me to believe this.
