Sunday, June 29, 2014

Exciting new directions for ScalPL: Splatr

I haven't posted much over the last few months, in part because I have been working hard in an exciting new direction, and figured it would make more sense to organize and firm up some of those ideas before posting them here.  In fact, I have been using a certain workshop deadline (originally June 20, then changed to June 27) to motivate myself to finally get these concepts down formally in writing, for submission.  Alas, I missed the deadline, though only just barely, partly because the ideas have kept coming at such a rapid pace, and partly because the conference paper was understandably restricted to 15 rather small pages.  With reasons like those, I don't see the missed deadline as a failure.  While I expect it to be possible, at some point, to express the important concepts in such a concise form, for now, it is more important for me to ensure the concepts are fully consistent and complete before I begin to abbreviate.

So what is this direction? A few years ago, as I whiled away some hours at a supercomputing conference, I began to wonder how difficult it would be to make ScalPL (Scalable Planning Language, described on this blog and elsewhere) easier to learn and use, and to that end, probably more like some existing parallel computer languages.  While I firmly believe that ScalPL is more flexible and better designed in many ways than most of those existing languages, there is also no arguing that it takes longer to explain and to learn the constructs and how they fit together.  Why?
  • Many of these existing parallel languages start with a sequential language and then add a few constructs here and there, subtly introducing and altering semantics.  This is interpreted as lowering the learning curve... at least until the programmer understands some of the inherent restrictions, and how those "little" syntactic and semantic changes can have far-flung implications.
  • Instead, ScalPL is just glue between modules ("tactics") implemented in traditional sequential languages as-is, with virtually no syntactic or semantic changes to those languages (except a few restrictions).  While this allows ScalPL to stay completely "language agnostic", not caring what language is (or languages are) used for tactics, it also means (for example) that ScalPL has its own type system separate from the tactic languages, and associations between them must be made at each tactic interface.
  • The individual modularization of each tactic is not only clumsy and verbose, it can also affect the understandability of the program, and the ability to propagate type information in an obvious, sensible way.  Everything must be spelled out; there is little type inference or lexical scoping.
  • ScalPL is a graphical language (it uses constructs like circles, boxes, and lines), which is very different from what most programmers (and their tools) are accustomed to.  This is another hike up the learning curve.

Fine, so how many of these drawbacks can be overcome?  There may be no overcoming the first without total redesign, and if the advantages of doing so are skin-deep rather than substantive, it may not pay off anyway.  As for the last, I have already introduced a textual form of ScalPL, called SPLAT (Scalable Planning Language as text), but in doing so, made no effort to overcome the other obstacles listed: SPLAT is primarily an archival form of ScalPL that can be understood by humans and manipulated by more traditional tools. So what could be done to address the above in a text/lexical form by restricting our attention to one tactic language, and by borrowing that language's type system and permitting some amount of lexical scope?

My answer to that ("what could be done") is a new textual concurrent language called Splatr (SPLAT Refined, pronounced "splatter").  As I say, I'm excited, and even though its constructs are based entirely on ScalPL, I think I have looked at some old problems in some novel ways.  Also, thanks to this blog and the book, I have some pretty good sample programs to try it out on.  I'll probably also put together a working parser before publishing the syntax, but I might expose some of the basic forms before that.

If all works out, people will grasp it fairly quickly, and see some vague similarities to languages they've seen before (Occam, UNITY), while preserving the theoretical and practical advantages of ScalPL.

Thursday, May 01, 2014

ScalPL As Straw Man (or Channeling Bill Dally)

Recent interaction with a former (and future?) colleague prompts me to clarify some positions expressed here.  For example, some might interpret my frequent mention of ScalPL (a language/technique which I developed) as a sweeping proclamation of its superiority over other approaches.  Rather, my intent is to highlight a number of issues which must be addressed for truly high-performance computing (e.g. exascale), and then to use ScalPL as one potential way to address them.  It is, in a sense, a straw man:  If you have another approach which handles these issues better, I'd be very interested in hearing about it.  Even if you don't know of a better way, I will not be surprised to see someone (even possibly me) recommending improvements to ScalPL, perhaps relegating ScalPL in its current form to a stepping stone to something better, or just sort of an architecture-independent intermediate language.

What frustrates me is to instead hear "I agree that we need to address those issues, and I don't know of any other way to address all of them, but don't bother me with yet another programming approach (or at least, not one so different from existing approaches)."  Or, simply, "I don't think we'll need to address those issues", with no further support for that assertion.

So, what are the issues that I claim must be addressed?  I hit them one at a time in the book, but I am just as happy to defer here instead to Bill Dally.  Many posts ago, I mentioned one of his talks that I saw at SC12.  More recently, I found a video made by him at about the same time on the same topics, as well as another of a much more recent talk.  Obviously, I didn't design ScalPL based on these talks, but for the sake of argument, let's say I did.

I may not be the most objective listener, but what I hear Dr. Dally saying here is that:
  • Exaflops machines cannot afford to spend energy (and dissipate heat) on out-of-order (OoO) logic to address latency issues (on the fly) that could instead be handled statically (in advance) by programming, preprocessing, and optimization -- i.e. by scheduling the data which will be needed close to the computation before the computation needs it.
  • Significant concurrency and locality information will need to be extracted from programs to achieve exaflop performance goals.
  • Programs should not be expressed in a way that would require them to be recoded for new architectures.  Instead, the program should encode the desired algorithm, optimization tools should then sensibly embed that algorithm on a particular platform, and (if necessary) a human tuning expert can further optimize those embeddings and/or scheduling by hand, still (optimally) without modifying the initial coding.
The last two points (at least) are not so different from what many of us have been saying for a very long time.  As for the second point, regardless of whether the programmer expresses the concurrency and locality explicitly, or a subsequent tool or language processor extracts them from a more obfuscated representation in the program, at some point, there should be a representation which embodies these factors.
I admit, I have taken some of these a step further than Dr. Dally does in these talks.  For example, if we're not asking processors to stall or context-switch (for any significant period) to compensate for latency, should we ask them to do so for inter-process communication?  I think not.  Fortunately, they're related:  If you know what data a computation (i.e. thread snippet or "threadlet") will need before it needs it, including that protected by critical sections, you might just as well wait until those sections are also available before starting execution.  That is also a step toward making each of the snippets/threadlets "two-phase" (accessing all inputs before producing any outputs), and therefore sequentially consistent, easing subsequent reasoning about their combined execution.
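
To make the "two-phase" notion concrete, here is a minimal sketch in plain C (not ScalPL itself, and the names and types are purely hypothetical): the snippet observes everything it needs before producing anything, so its outputs are a function of exactly the values it saw at entry.

    /* Hedged sketch of the two-phase discipline: read all inputs first,
       then compute, then write all outputs.  All names are hypothetical. */
    typedef struct { double a, b; } inputs_t;
    typedef struct { double sum, prod; } outputs_t;

    static void two_phase_snippet(const inputs_t *in, outputs_t *out)
    {
        /* Phase 1: observe every input (copy to locals). */
        double a = in->a;
        double b = in->b;

        /* Compute using only the observed values. */
        double sum  = a + b;
        double prod = a * b;

        /* Phase 2: produce every output. */
        out->sum  = sum;
        out->prod = prod;
    }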

Then there's the form of communication between these snippets/threadlets.  Message passing exacts overhead that is unneeded if the snippets/threadlets end up close to each other, especially in cases where information could simply be shared (without any copying).  In a sense, it cements in the granularity of these snippets:  They can never again be recombined into larger grains to make that overhead disappear.  Some sort of PGAS model makes more sense, given the previous discussion (e.g. with data controlled by critical sections), especially if it can be made to act like message queuing where useful, to use queuing and copying if and only if it compensates for latency or increases concurrency.  (ScalPL gets the best of both worlds through careful formal semantics that can be implemented in different ways while having the same logical outcome.)

There are a few other more tangential issues, like also avoiding overhead related to checkpoint/restart and/or deadlock prevention/detection.  But those more or less fall out automatically, once you handle the other issues cleanly.  That is, if the computation is broken into deterministic snippets/threadlets, and they're two-phase, then as long as we keep old (replicated) input data around sufficiently long, we can recreate new (unreplicated) data through re-execution until it is adequately replicated, without time-consuming snapshots and back-outs.  And if we know all of the data needed by a snippet before it executes, then deadlock can be completely avoided by atomically acquiring it all as part of initiating execution.
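
For the deadlock point in particular, a conventional way to approximate "atomically acquiring it all" with ordinary locks is the all-or-nothing pattern sketched below in C (a hedged illustration only, not the ScalPL mechanism; try_acquire_all and run_when_ready are names I've made up): since a snippet either holds everything it will need or holds nothing, no cycle of partial acquisitions -- and hence no deadlock -- can form.

    #include <pthread.h>
    #include <sched.h>

    /* Try to take every mutex the snippet will need; on any failure,
       release whatever was taken so that nothing is held partially. */
    static int try_acquire_all(pthread_mutex_t **locks, int n)
    {
        for (int i = 0; i < n; i++) {
            if (pthread_mutex_trylock(locks[i]) != 0) {
                while (--i >= 0)
                    pthread_mutex_unlock(locks[i]);
                return 0;                /* holds nothing */
            }
        }
        return 1;                        /* holds everything it needs */
    }

    /* Run the snippet only once all of its data is held; it can then
       proceed to completion without ever blocking mid-execution. */
    static void run_when_ready(pthread_mutex_t **locks, int n,
                               void (*snippet)(void *), void *arg)
    {
        while (!try_acquire_all(locks, n))
            sched_yield();               /* a real scheduler would requeue, not spin */
        snippet(arg);
        for (int i = 0; i < n; i++)
            pthread_mutex_unlock(locks[i]);
    }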

A rather subtle point of all of this is that these preconditions to execution of a snippet/threadlet are (practically) sufficient in themselves:  By making our critical sections (locks) just a little more flexible by adding more states if necessary (other than just "locked" and "available"), we can toss control state completely -- or, put another way, these new lock states become the control state in a more distributed fashion, in deciding when a particular snippet/threadlet should execute.

That's the crux of ScalPL.  The graphical syntax is just a natural representation, once one accepts that the execution of any specific snippet/threadlet ("tactic" or "acton" in ScalPL) is governed completely by the states of the data items ("resources" in ScalPL) that it accesses.  Obviously, there are other ways to address the main goals than the ways ScalPL has chosen, but I will be surprised to see another so inclusive, elegant, and flexible.  Other constructs in ScalPL are mostly there for software engineering purposes (modularity, OO).  I would really like to see other people propose other ways to address all of these issues, or to suggest improvements to ScalPL without violating them -- or, heck, to explain why it's unnecessary to address them.




Tuesday, April 15, 2014

Comparing ScalPL and Concurrent Collections (CnC)

The previous post here, showing a Cholesky decomposition in ScalPL, together with previous posts here and in the book (e.g. showing map-reduce and sorting in ScalPL), should now provide some substantial basis for readers to compare ScalPL with other extant methods for addressing similar goals.  In this post, I'll focus on Concurrent Collections (CnC).  In some sense, it's an apples-to-oranges comparison.  The investment and number of people involved in CnC have been markedly greater than for ScalPL, apparently leading to systems that can actually be benchmarked. Though early prototypes of ScalPL (under names like LGDF, CDS, and F-Nets) have been implemented on a number of machines, comparable benchmark numbers are not available.  So the comparisons here will largely be at the level of their specifications.  Even so, there are many reasons to suspect that ScalPL could be implemented at least as efficiently as CnC, and that, in fact, runtime tools developed for CnC could be leveraged, in large part, for facilitating ScalPL execution, if necessary.

CnC Cholesky diagram (See Knobe video)
Concurrent Collections (CnC) and ScalPL have several similarities in form, ancestry, and capabilities.  Both are based on large-grain data flow principles, their actions ("steps" in CnC, "tactics" in ScalPL) are each implemented in some traditional base language (C, Java, Fortran) and are deterministic, stateless, and two-phase (consuming all inputs then producing all outputs), and both approaches have some form of visual syntax:  In both, active steps are represented as circles/ovals, connected to rectangles representing data (or data collections) by directed arcs showing the flow of data.  For the Cholesky example, this explains why the CnC and ScalPL diagrams are significantly similar: a rectangle representing tiles, 3 circles for Cholesky, Trisolve, and Updates, and some additional notation above the circles suggesting further constraints.

Brief description of CnC

ScalPL Cholesky strategy (see previous post)
Before pointing out the significant differences, however, a brief summary of CnC is in order.  It supports three important kinds of collections, or sets:  tags, steps (activities), and items (data). Each step collection is associated with a tag collection in a "prescribes" relation, such that putting a tag into the tag collection prescribes a step (to be activated) with that tag in the step collection.  Steps may get and put items into an item collection, and may put tags into a tag collection. Items in an item collection have unique tags, or keys.  CnC describes "exactly 2 ordering constraints: (1) Producer must execute before consumer, and (2) controller must execute before controlee".  A producer-consumer relationship is one where a producer step "put"s an item with a particular tag into an item collection, and a consumer step "get"s something with that tag.  The controller-controlee relationship means that the environment or a step (the controller) "put"s a tag into a tag set, allowing a step (the controlee) with that tag to be initiated in the prescribed set.

From this and further reading, it can be inferred that (in normal operation) only one item with a given tag ever exists in a given CnC item collection, usually by keeping each item around (after it is put) for the life of the collection, never allowing it to disappear, change value, or be superseded or duplicated by another with the same tag/key.  Instead, to effectively change the value of an item, you need to create a new item with a new tag.  This explains why the CnC Cholesky example includes the iteration number as part of the tag.  It also means that even a simple program like their Fibonacci example keeps all of the values of the entire sequence, even though only the final one is returned.  There is a tuning feature which allows the specification of a "get-count" for each item, such that the item will disappear after that number of gets has been performed on it:  It is not clear why this is not considered a property of the algorithm, but if it allows a put for another item with that tag after the get-count expires, it would surely violate the single-assignment property, and thereby the deterministic behavior, that the model advertises.

I quibble with the CnC characterization of the controller-controlee relationship, or even the use of the word "control" for this in general, since it's no more control than any other communication.  That is, if there was a way (as in Linda, say) to just get (and remove) any (unspecified) item from a set and access its contents, then the prescription semantics would apparently be identical to simply considering all steps as starting with such a "get any" from the prescribed tag set, instead of (as currently) having the scheduler/system get the tag and initiate the step, and then pass the tag as an argument:  In either case, the step can't complete until both the prescribed tag and all of the other data items are available.  In fact, while the CnC example for Cholesky is described as having a controller-controlee relationship between the Cholesky step and the TriSolve steps (for example), it is not at all clear to me why an application programmer (or discipline specialist) would see it this way.  ScalPL just sees this as a standard data dependence (and, when necessary, anti-dependence).

Some Comparisons with ScalPL

Regarding efficiency and scalability:
  • CnC isn't expressive enough (without tuning hints like "get_count") to express when data space can be reused/collected, and even these tuning hints are fraught with potential error (e.g. why would one expect item lifetime to be based on a simple count of accesses?).  In ScalPL, the potential observation or updating of a data item is considered a natural part of an algorithm expression (i.e. not platform dependent), and potential reclamation/reuse of space is a natural side-effect of this expression.
  • By default, a step in CnC may begin consuming execution resources as soon as the prescribing tag exists, perhaps long before the data items it is to consume are available, leading to subsequent stalling or abort/restart, because in general, CnC isn't expressive enough to declare which items must be available before step initiation (unless tuning "depends" hints are added). ScalPL is carefully designed so that all resources ("items") needed by a tactic ("step") are declared (or can otherwise be determined) before execution is initiated by a scheduler, even if some resources are dependent upon the content of others, thus allowing a tactic to be scheduled for execution only when it can proceed to completion without blocking.
  • CnC does not allow update-in-place of items.  This can add significant overhead on (shared) memory based systems (e.g. to allocate new space and copy unchanged data to effect updates), and alters natural algorithmic approaches. ScalPL allows natural update-in-place of its resources.
  • The above factors give CnC relatively static granularity, in that communication between steps can have significantly higher overhead than that within a step.  Since ScalPL does not require extra copying to effect communication between actons/tactics ("steps"), several actons can be effectively merged into a single grain (at compile/tuning time) when higher granularity is beneficial for a platform.  In other words, in ScalPL, breaking a computation down into multiple tactics does not inherently introduce significantly higher overhead overall than expressing it as a single one, it just exposes more potential parallelism for cases where it can usefully be exploited.

From a software engineering standpoint:
  • CnC steps are inherently non-reusable, in that they directly name the item sets to which they refer.  In ScalPL, all modules (both tactics and strategies) are fully parameterized (using transitions and roles) to be reusable in a number of different contexts by just altering transition and role bindings to resources and their control states.
  • CnC has no executable modules other than steps, which are restricted in many ways (e.g. to be stateless, two-phase, deterministic).  Although ScalPL tactics have similar restrictions, ScalPL allows higher-level modules (strategies) to be constructed from tactics to have any sorts of properties (and combinations thereof) desired.
  • CnC programs are always deterministic, which means that CnC cannot express some otherwise useful programs.  ScalPL programs/plans, and strategies (subprograms), can be easily constructed to be (provably) deterministic when that is convenient, but nondeterminism can also be introduced when desired to allow full expressiveness of other approaches (like locks or messages) and potentially faster execution than fully deterministic approaches.

So is ScalPL less expressive than CnC in any way?  Apparently not.  ScalPL's resource arrays, together with role binding modifiers, would seem to handle anything that CnC's collections can, and more, and even without using ScalPL's strategies, instantiation levels, or any number of other constructs, its tactics alone would seem to cover the functionality of CnC steps.

A stated goal of CnC is to completely separate the specification of the program (and its dependencies) from the tuning of the program.  However, some of the factors which CnC considers as "tuning", such as get_counts and depends clauses, are actually not dependent on target platform or tuning so much as being parts of the algorithm which are evidently considered too complex or troublesome for the discipline scientist to deal with.  ScalPL makes no bones about the fact that it is essentially targeted at the "C-language" level for parallel and concurrent programming, and as such, isn't intended for a scientist to express their algorithms without knowing something about its intricacies.  The only goal is to express an algorithm in a way that makes it efficiently executable on a wide variety of architectures while still having desirable analytic properties.  However, ScalPL does still expect a separate step for tuning for a particular platform, in which a human tuner, familiar with the algorithm and the platform, might suggest placement of computations and/or increasing granularity by collecting resource elements and/or tactic executions together.  This would not change, but just restrict, the possible behavior of the program.

Conclusion

Is it surprising that the designer/developer of ScalPL would come to such conclusions?  More interesting would be to see my findings challenged, such as in the comments section, and an ensuing discussion.  No doubt, my intro can be restated back to me, that CnC is real in some sense, and ScalPL isn't (in some sense), but that would dismiss a great deal of analysis and design on ScalPL's part a little too glibly.  If there are claims that ScalPL's design is flawed, such as that it cannot be efficiently implemented or simply doesn't make physical or mathematical sense, I would be as interested as others to understand how, but after so many years of development and consideration, it would take some convincing for me to believe this.

Saturday, March 22, 2014

Cholesky Factorization in ScalPL

(This replaces two former posts with this approximate title, which have been merged and cleaned up here.) [Errata: In first diagram (Cholesky strategy), the dot next to "Altered" in lower left corner should be yellow instead of red.  And in third diagram (UpdateRow strategy), it should have been mentioned that the binding modifier rules were omitted because they are all the same: "1" (i.e. reduction in the first dimension).]

In this post, I'll address an implementation of tiled Cholesky factorization using ScalPL.  I've chosen this problem specifically because it is often used as a "real world" example when demonstrating various parallel programming models or languages.  In that vein, two references which seem especially clear (and which may serve for future comparisons) include a Concurrent Collections (CnC) implementation (Knobe) and a NUMA implementation (Jeannot).  It should be noted that the latter effectively computes the result in place, while the former apparently doesn't, but since data can be assumed to move between processors in both cases, the notion of being "in place" is not as clear as one might expect.  Other useful references include wikipedia and Matlab descriptions.

First, the general algorithm in English as I understand it. (If I make mistakes here, I expect them to be minor enough not to lose the overall point, and I trust commenters to straighten me out: any such mistake is surely due to errors in my translation, and not in these original references.)  It iterates over the lower triangle of a symmetric square matrix -- i.e. the elements (i,j) where j (the column) is less than or equal to i (the row), with the assumption that the elements above the diagonal are equal to their "reflected" counterparts below.  The matrix is then assumed to be tiled, so that each element of the matrix (array) operated upon by the algorithm actually contains a fixed-size square tile of the original, larger matrix, but this aspect is below the level of detail addressed here.

The algorithm:  Each iteration first computes the element (i.e. tile) in the upper left corner (1,1) with a routine (or "tactic", in ScalPL) called cholesky here (DPOTRF in the Jeannot paper, which uses the LAPACK kernels).  This computation relies only on the previous value of this element to produce the new value.  The algorithm then computes each of the other elements in the first column (i.e. directly below that top corner) with a routine called trisolve here (DTRSM in the Jeannot paper), updating each element using the new value in (1,1) as well as its own former value.  And then, whenever the appropriate inputs are available, the algorithm computes all the other elements (i.e. those not in the first column), where element (i,j) is updated using its previous value and elements (i,1) and (j,1) from the first column.  For non-diagonal elements, routine update here (DGEMM in the Jeannot paper) is used for this, and for diagonal elements (for which only one value from the first column is used, since i=j), routine updateDiag (DSYRK in Jeannot) is used.  When enough data has been computed from this iteration, the next iteration (or in our case, recursion) can begin, which works identically except that the whole computation is shifted diagonally down one, so that element (2,2) from one iteration is used as the new top corner of the matrix in the next iteration, so the matrix it operates on is smaller by one in both dimensions.  That means that no element of the first column of any iteration (computed by cholesky and trisolve) will ever be accessed by subsequent iterations, so those results can theoretically be used by (passed back to) the caller ASAP, such as if there is more potential concurrency there.  Iterating/recursing continues until an iteration operates on a 1x1 matrix, at which point cholesky is performed on that element and we're done.
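
For reference, here is a hedged sequential C sketch of that same tile-level loop structure; the *_tile routines are hypothetical wrappers around the kernels named above (DPOTRF, DTRSM, DGEMM, DSYRK), and the iteration/recursion is written as an outer loop over k, whose column k plays the part of "the first column" above.  The ScalPL strategy below expresses the same dependences without serializing them like this.

    /* Hedged sketch of the sequential tile-level algorithm, 1-based tiles.
       T is an (N+1)x(N+1) array of pointers to tiles; only the lower
       triangle (j <= i) is used.  The *_tile calls are hypothetical
       wrappers around DPOTRF, DTRSM, DGEMM, and DSYRK respectively. */
    void cholesky_tile(double *tkk);
    void trisolve_tile(double *tik, const double *tkk);
    void update_tile(double *tij, const double *tik, const double *tjk);
    void update_diag_tile(double *tii, const double *tik);

    void tiled_cholesky(int N, double *T[N + 1][N + 1])
    {
        for (int k = 1; k <= N; k++) {              /* one "iteration"       */
            cholesky_tile(T[k][k]);                 /* corner tile           */
            for (int i = k + 1; i <= N; i++)
                trisolve_tile(T[i][k], T[k][k]);    /* rest of first column  */
            for (int i = k + 1; i <= N; i++) {
                for (int j = k + 1; j < i; j++)     /* off-diagonal elements */
                    update_tile(T[i][j], T[i][k], T[j][k]);
                update_diag_tile(T[i][i], T[i][k]); /* diagonal element      */
            }
        }
    }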

A top-level ScalPL strategy (i.e. plan or program) to accomplish this is shown below. The matrix of tiles is represented by the resource rectangle at the bottom (labeled Tiles), with two hash marks on the left to represent the two dimensions, and the invocations of cholesky, trisolve, and Updates are represented by the three circles with those labels.  (We'll use the convention that capitalized names like Update represent strategies, uncapitalized represent tactics.)  The circle at the right (which is unlabeled due to having a role bound to, and thus named, $plan) represents the recursive invocation of the next iteration.  The circle at the left (truncateRange) simply initializes resource TwoToN with the range of indices of the matrix it is operating upon except for the first (i.e. 2 to N, where N is the matrix size), to provide indices for trisolve and Updates.  The remaining circle (containing the $null tactic to the right of cholesky) is to "short circuit" and return the cholesky result in the final 1x1 case, since trisolve and Updates won't be invoked in that case.  The control state of TwoToN signifies whether this is that final case (in which case it is Only1Row/red) or the normal case (where it is AtLeast2Rows/yellow).

Cholesky strategy (click to enlarge)
As is standard, the binding modifiers (arrows within the circles) carry indices (from their tail, or primary role) to the array being indexed (at their head, or secondary role), with the binding rules (within the dashed boxes) denoting which dimensions are being indexed.  A new construction of parentheses within a binding rule is being used here (part of "ScalPL 1.1"), which means that a single source index is used for multiple dimensions, so, for example, cholesky takes the constant 1 on its primary as both the first and second dimension of its secondary, thus binding that secondary role (Tile) to resource element Tiles(1,1).  Everything else is "by the book".  The "dupall" asterisk on trisolve's top role (to TwoToN) replicates that context for each integer in that range 2 to N, then uses each integer in that range as an index (specified by the green binding modifier) to the first dimension of Tiles for its Lower role (green arrow/rule), and the constant one for its second dimension (red arrow/rule), thus binding each replicant to a different element of the first column.  Similarly, the Column1 role of Updates is bound to the first column of Tiles (rows TwoToN in green, column 1 in red), and the Rest role is bound to everything else (rows and columns TwoToN in black).    The binding rule on the rightmost circle/context (i.e. the recursive call to Cholesky) represents that the first and second dimensions of Tiles are effectively shifted/translated by 1 (the constant on the primary) so that the next iteration can continue to address the "new" first column as 1, etc.

The main difficulty to be addressed in any solution is in obtaining maximum concurrency while still ensuring that elements are not being accessed/read before or during the time they're updated/written.  Specifically, that includes not letting any trisolves in an iteration access element (1,1) before it is established by cholesky, or any of the update routines in an iteration access elements of the first column before they are established by trisolve, or letting subsequent iterations access elements still being established by this iteration, all while also allowing elements of the first column to be modified by the caller (by virtue of having been returned early) before the update routines have finished accessing them.  Control states in ScalPL are purpose-built for handling this sort of issue, and in this case, those on Tiles do a good job:  LastIter (green) for elements not yet updated, Altered (yellow) for those updated but perhaps still needing to be accessed, and returned (red) for passing them back to the parent/activator when finished.

Updates strategy
As indicated by the bindings in the Cholesky strategy above, the Updates strategy (at right) gets elements 2 to N of the first column of the Tiles matrix (which it will only observe/read) as Column1, and the rest of the matrix to its right (some of which it will update) as Rest, as well as the range of indices being represented as TwoToN.  The dupall on the RowNum role replicates the context/circle activating the UpdateRow strategy once for every row 2 to N, supplying the unique row number to each replicant on RowNum, and binding the Row role for each replicant to that row.  The Column1 roles of all replicants are bound to the same first column.

UpdateRow strategy
So each replicant of the UpdateRow strategy (at left) gets the first column (as Column1), one row of the rest (as Row), and the row number represented as RowNum.  Tactic internalIndices initializes TwoToRowNumMinus1 with (yes) the range 2 to RowNum-1.  The left context/circle activates one replicant of update for each integer in that range -- i.e. for each element in the row except the last (which is on the diagonal) -- binding role ToUpdate to that row element, role Upper to the corresponding element of Column1, and role Lower to the element of Column1 corresponding to the row (RowNum).  The right context/circle activates updateDiag just once for the last element of the row (on the diagonal), binding FromColumn to the corresponding element of Column1 and ToUpdate to that element of Row.

These diagrams basically are the program.  All that's left is the coding of the tactics (cholesky, trisolve, update, updateDiag, truncateRange, and internalIndices) in a base language like C or Fortran.  The first four are likely simple calls to standard kernels as in the Jeannot paper (e.g. DPOTRF, DTRSM, DGEMM, and DSYRK, respectively).  The last two are a few lines each, e.g. something like:
internalIndices {TwoToRowNumMinus1->beg = 2; TwoToRowNumMinus1->end = *RowNum - 1;}
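By analogy, truncateRange presumably just drops the first index from whatever range it is handed; a hypothetical sketch (the role name FullRange is my own placeholder here -- the actual binding is whatever the strategy diagram dictates) might be:
truncateRange {TwoToN->beg = FullRange->beg + 1; TwoToN->end = FullRange->end;}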
One of the trickiest aspects of making a truly portable implementation is in deciding what to do with the first column of an iteration.  If there is potential concurrency in the calling parent/mainline, there may be some merit to returning that first column to the caller soon after it has been computed by cholesky and trisolve, otherwise not.  But even if so, how soon after?  Since that first column must still be read by update and updateDiag steps in the iteration, should return of the column wait until these are done, or should a copy be made for these or for the caller to facilitate an early return?  (On a message-passing system, the copy may be made anyway by virtue of communicating.)  Or maybe access should be returned to the caller immediately, with a "copy on write" made only if the caller then tries to modify the elements while they are still being accessed by the update and updateDiag steps?  Regardless of the answers we choose, building them into the algorithm will make it less portable for cases where they would be better answered differently.

Fortunately, these are exactly the issues addressed by carefully established semantics of predictability in ScalPL.  In this particular case, in the Cholesky strategy, we see that the Column1 role of Updates is predictable (with read permission and one transition), meaning that once something within Updates accesses an element, that transition will eventually be issued in Cholesky for that element, even if Updates itself doesn't issue it -- and indeed, it never does in this case.  This leaves the Director (i.e. runtime) flexibility to wait for any length of time, and/or make copies/snapshots always or on demand/write.  These decisions can be made in realtime, or (perhaps in many cases) via hints/annotations by a human planner which do not alter the behavior.

One more minor/technical point.  In this example, Column1 and TwoToN in Updates are shown as activation formals, represented with filled triangles.  This is primarily to simplify coding by allowing transition back to the initial/green control state, which is not ordinarily allowed for formal resources.  This approach does have the slight downside that Updates cannot activate until all of the elements of the first column are Altered/yellow.  (This also means that the entire Column1 role is effectively accessed by Updates as soon as it activates/binds, so predictable transitions will eventually be issued for all of those elements.)  The use of activation formals here could be avoided by adding another control state to Column1 and TwoToN, and adding a $null activity to each to transition from green to that new control state, which would then logically be the new initial control state, as suggested in Figure 9.9 in the book.

I am personally quite happy with this example.  It demonstrates the flexibility of the binding modifiers, even when dealing with triangular arrays, and of the predictability rules in solving very real issues.  Overall, I believe the diagrams are highly intuitive, and carry at least as much information as a similar page area of text would.  And I believe that this demonstrates advantages of ScalPL over other approaches for scalability, portability, correctness, and formality.

Monday, March 03, 2014

Big Data, Map-Reduce, and ScalPL

There is Big Focus on Big Data these days, and specifically tools (like Hadoop) that use approaches like Map Reduce.  The concept is fairly simple:  If one has a distributed file (or other data structure), partitioned so that all data within a particular partition is relatively local, but partitions may be relatively distant from one another, many useful operations can be performed efficiently using a pattern where items from each partition are first processed ("mapped") individually and locally, then those independent results are distributed (hashed, binned) according to some key field so that items with like keys end up physically together or close (e.g. via implied communication), where they can then be combined ("reduced") with other items having the same key.
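
As a concrete (if trivial) point of reference, here is a hedged sequential C sketch of just the pattern itself, for word counting: map each partition into (key, 1) pairs, group the pairs by key (a sort here stands in for the hashing/binning and the implied communication), then reduce each group.  The names and sizes are illustrative only.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct { char key[32]; int value; } pair_t;

    static int by_key(const void *a, const void *b)
    {
        return strcmp(((const pair_t *)a)->key, ((const pair_t *)b)->key);
    }

    /* "Map": emit one (word, 1) pair per word in one partition. */
    static size_t map_partition(const char *partition, pair_t *out)
    {
        char buf[1024];
        size_t n = 0;
        strncpy(buf, partition, sizeof buf - 1);
        buf[sizeof buf - 1] = '\0';
        for (char *w = strtok(buf, " "); w != NULL; w = strtok(NULL, " ")) {
            snprintf(out[n].key, sizeof out[n].key, "%s", w);
            out[n].value = 1;
            n++;
        }
        return n;
    }

    int main(void)
    {
        const char *partitions[] = { "to be or not", "to be" };
        pair_t pairs[256];
        size_t n = 0;

        for (size_t p = 0; p < 2; p++)            /* map each partition locally     */
            n += map_partition(partitions[p], pairs + n);

        qsort(pairs, n, sizeof *pairs, by_key);   /* "shuffle": group pairs by key  */

        for (size_t i = 0; i < n; ) {             /* reduce each group of like keys */
            size_t j = i;
            int sum = 0;
            while (j < n && strcmp(pairs[j].key, pairs[i].key) == 0)
                sum += pairs[j++].value;
            printf("%s: %d\n", pairs[i].key, sum);
            i = j;
        }
        return 0;
    }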

If ScalPL is supposed to be expressive enough to capture patterns, then it should be able to capture this one, and this popular use-case can provide some insight into the expressiveness and flexibility of ScalPL.  Optimally, the ScalPL plan will accommodate the benefits and ordering of map-reduce just mentioned for a distributed platform/architecture, but will also work on different platforms with different characteristics where it might, perhaps, benefit from more pipelining, less communication, etc.   Remember, with a proper runtime system ("Director"), a ScalPL plan is executable -- a program -- rather than just a first step to a final implementation.

Possible High-Level Context for MapReduce
Before we get started with the MapReduce plan itself, a few words about the context in which it will activate.  At this point, there is no standard (object) library in ScalPL for distributed files (or any other kind, for that matter), though constructs exist to build/standardize some.  More about that in another post.  For our purposes, the contents of a distributed file in ScalPL (FileContent in the plan here) will simply be modeled as a two dimensional resource array, with the first dimension being the block, and the second being the unit (e.g. byte) within the block, with special byte markers for end of record and end of block.  ScalPL does not itself dictate how the elements of an array are physically stored relative to one another, but for efficiency purposes here, they can easily be stored just as a normal file system would (e.g. as contiguous characters within each block, and scattered blocks).

The blocks (of course) may not be stored or indexed contiguously at some low level, but by the time the MapReduce plan accesses them, they can be made to appear sequential.  In the example above, FileBlockIndices is a sequence of indices to the blocks in the file, and FileBlockIndexCount is the number (or range) of those indices stored in FileBlockIndices.  FileBlockSize is the maximum number of bytes per block.  By using the ScalPL mapping binding modifier (1/1 here) and the anonymous (underscore) role binding from FileBlockIndices, the MapReduce plan just sees (on FileContent) a sequential collection of blocks numbered sequentially within FileBlockIndexCount.  We have that plan producing its results to KeyResults, indexed by key, and changing the control state of that key to red.  (If a result is not produced for a given key, the control state for that key remains green.)

MapReduce Strategy
The implementation of the MapReduce strategy shown to the left uses two subplans, shown in more detail below.  The first, MapBlock, is replicated for each file block (indicated by the forall asterisk), and (as shown shortly) each replicant splits and maps the records in its block and categorizes each of those result records by key into MappedKeyedRecs, the first dimension of which represents a key, and the second is open/infinite, to accommodate multiple records with the same key.  MapBlock also changes the control state (to yellow) of KeyResults corresponding to any key it produces, in effect reserving a spot in that array for when the reduced result is produced for that key.  The other strategy, ReduceKey, is activated for each key thus reserved, and reduces all the records (from MappedKeyedRecs) having that key, storing the result into the appropriate spot in KeyResults only when (the control state of) KeyStatus indicates that there are no further records for that key.  This decomposition not only matches the standard intuition (i.e. map and reduce), but also eases the embedding (on platforms which benefit from it) by grouping all of the elements which should be colocated for efficiency reasons:  MapBlock can easily be colocated with the block being processed, and ReduceKey can be colocated with all of the records with that key being reduced.  The communication between those steps occurs implicitly, where the data moves from one to the other (via MappedKeyedRecs).  More about that later.

(ERRATA:  The binding modifiers, "1", on MapBlock and ReduceKey should technically select the key field from the record, e.g. each be replaced with "1=key".)

That still leaves the $null circle at the bottom of the strategy.  One of the difficulties inherent in concurrent planning/programming is determining when an activity has finished, or should finish.  While completion of later stages in a sequential plan implies that earlier stages are also complete, a concurrent plan like this one can have data at several different stages of processing at once.  Although we could keep any reduction from even beginning until all data had been read (split) and mapped, we want to make this as general and concurrent as possible, so will instead just ensure that a reduction for any particular key won't complete until all the records with that key have been mapped.  That's what the $null activity does here:  It activates only when all of the file blocks have been fully processed (indicated by BlocksDone) and, for any key, (a) a spot has been reserved for that key's final result (in KeyResults), and (b) there are no records with that key waiting to be reduced (in MappedKeyedRecs).  In that case, the $null changes the control state of KeyStatus for that key, indicating to ReduceKey that it can finish reducing that key.

All of the new constructs introduced in the previous blog post ("ScalPL 1.1") are utilized here.  All of the arrays except for BlocksDone and FileContent are indexed in at least one dimension by a key.  This is strictly illegal (except for integer keys) in the book, but is allowed for keys of any type by #1 of that blog post.  The role binding from $null to MappedKeyedRecs is to (effectively) infinite elements, checking whether all of them are empty (green control state), which was illegal by the book, but allowed in #2 of the blog post.  And both ReduceKey and $null use the construct just introduced in #3 of the blog post, the open binding form of bindany, to find some key which allows those activities to bind, especially using the control state of KeyResults.  (Note that the binding from ReduceKey to KeyResults uses an activation binding, with the @ prefix, as described in the book section 7.3.2, to ensure that the plan will not activate until that role is ready.)

Now to delve into the substrategies, MapBlock and ReduceKey.

MapBlock Strategy
The MapBlock plan (strategy), shown here, maps all of the records in a particular block.  In it, a split plan finds record boundaries from the block (as quickly as possible) and passes those boundaries on (in no specific order, via recranges) to a Map plan, updates BlockPtr with the unchecked part of the block, and finishes.  The Map plan activates once for each record passed to it from split, processes/filters it accordingly, and passes the results (zero or more records) to another array, outrecs, again in no specific order.  [ERRATA:  Map should change the control state of MappedRecs to red.]  The $copy activity (below it) then takes the records from MappedRecs and uses the key on each record to index KeyFound, transitioning the control state of that element to flag that that key was found, and to index the first dimension of MappedKeyedRecs, where the record is deposited.  (A bindany is used for the second dimension of MappedKeyedRecs, to allow any number of records to be deposited there with the same key.)  As the book notes, $copy is generally not required to really physically copy anything here, it just renames the record contents as belonging to the new resource (array/element).

The $null activity in the lower left activates, and transitions BlockDone, when no other records will be produced (on MappedKeyedRecs) from this file block -- specifically, when (a) BlockPtr is empty by virtue of split hitting the end of the block, (b) RecBoundaries has no records waiting to be mapped, and (c) MappedRecs has no records waiting to be categorized by key.

ReduceKey Strategy
So that brings us (finally) to ReduceKey, which will be activated once for every key.  It consists of two parts:  Reduce1, which activates for every record with the given key, modifying PartialKeyResult with that record's contribution; and Reduce2, which activates once after all of the records with that key have been processed (as controlled by KeyStatus), producing the final result for that key to KeyResult.

Plans split, Map, Reduce1, and Reduce2 are not shown here.  Most or all may be implemented as tactics, e.g. in a typical computer language, and hopefully there is enough insight provided here to understand how these would simply and naturally be constructed.  The last three (at least) would be customized for the problem being solved, so optimally, they would be parameters themselves -- i.e. they would be fed into the plans using roles, and plugged into their respective contexts using $plan roles.  I have omitted those details here to focus more directly on the flow of the data.  (Another way of achieving that focus, using current ScalPL tools, would have been to simply put them on a separate drawing layer, and then hide or highlight different drawing layers at different times.)

And in a sense, that is the entire point of this post.  These plans/strategies show how these four simple data transformations fit together with appropriate communications to create the Map-Reduce paradigm.  It is not just a high-level, hand-waving approximation of a solution; it is an actual executable solution, given an appropriate run-time system.

What has not been discussed here is specific communication paradigms or mechanisms to make this all work -- something which many would consider central to the whole Map-Reduce paradigm.  The idea is that, as discussed in 5.3.1 of the book, on some particular architecture, the planner (or other helper) would annotate the contexts and/or resources in these strategies to denote where the activities and/or content state would reside at various times.  For example, in a distributed architecture, the contexts in the MapReduce strategy above would be annotated, each MapBlock context replicant tied to the block (i.e. block index) which it processes, and the ReduceKey context location computed based on the key which it processes.  The communication, then, would occur as the content state of MappedKeyedRecs moves between these embeddings.  In the best case, that communication strategy could be optimized automatically, but some human guidance/hints would not be out of the question.  The important point here, however, is that these implementation details (via these annotations) occur separate from the raw algorithms shown here, which are independent of platform.





Wednesday, January 29, 2014

ScalPL 1.1: A few enhancements to the book

[This post was slightly edited on 1/30/14 to correct errors in the examples in part 4.]

In devising any language, there are trade-offs between utility, precision, consistency, and implementability.  Scalable Planning Language (ScalPL) as defined in the book "Scalable Planning" (the version which I'll here refer to as ScalPL 1.0) is no different, and when there have been questions, I've decided to err on the conservative side:  If I couldn't convince myself that a construct was useful, precise, and fairly efficiently implementable, I typically ruled against adding it, or I have added a rule preventing its use in problematic ways.  In some cases, I have overdone it, and set out here to correct some of those, but I make few apologies:  It's easier to add constructs (or constructions) over time than to first say they are available in the language and remove them after people begin to rely on them.

Here are four constructs that I have now decided should be part of ScalPL, even though the book either does not describe them, or explicitly rules against them.  I'll heretofore call the version of ScalPL with these included ScalPL 1.1.  If you are not already intimately familiar with ScalPL 1.0 (e.g. haven't read the book), these will mean little to you (since they haven't been introduced in such detail here on the blog yet), and other posts on this blog will continue to be self-explanatory.  Some of these have been long considered (since before the book was written), and left out primarily because I didn't realize how useful or implementable they might be.  Others were simply poorly considered.  Most of them deal with resource arrays.

The four new constructs are: (1) Resource arrays that allow for non-integer indices, (2) bindall role bindings for resource arrays permitted for actons/tactics (formerly outlawed), (3) a more useful form of "bindany" role binding for resource arrays (to provide access to the index being bound to), and (4) a new form of role binding that is unpredictable (though it would normally be considered predictable).

1. New Rule:  Resource arrays allow any type for indices

In section 13.1.1 of the book (page 230), when introducing arrays, it says (in the second sentence):
Each element is uniquely identified/addressed within the array by an index consisting of a tuple (i.e. fixed-length sequence) of integers.
Since an index is only an addressing scheme to uniquely identify elements, and there is no constraint on the number of elements in an array or how those elements are to be stored relative to one another, the constraint that indices consist only of integers is not justified, and any number of languages (even Perl) have demonstrated the productive use of non-integer indices.  Since any integer is legal in ScalPL, a work-around (to use other kinds of indices) simply requires finding a function that maps the indices you want to use to distinct integers (that is, an injection).  For example, if you want to allow character strings, you could simply use the character values in some code (e.g. ASCII) concatenated together.
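
A hedged sketch of that work-around for short strings (packing up to eight ASCII characters into one 64-bit integer; the eight-character limit is purely illustrative, and distinct keys within that limit map to distinct integers):

    #include <stdint.h>
    #include <string.h>

    /* Interpret up to 8 ASCII characters as one 64-bit integer index. */
    static uint64_t string_index(const char *key)
    {
        uint64_t idx = 0;
        size_t len = strlen(key);
        if (len > 8)
            len = 8;                  /* illustrative limit only */
        for (size_t i = 0; i < len; i++)
            idx = (idx << 8) | (uint64_t)(unsigned char)key[i];
        return idx;
    }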

Since ScalPL must already permit efficient storage of sparse arrays, it must already manage the storage of non-sequentially-indexed elements, so there is really no significant benefit to limiting indices to be integers.  In ScalPL 1.1, any type of indices will be allowed.

2. New Rule:  Bindall bindings should be permitted for some actons (tactics)

In section 13.1.2 (page 232), I say that by default, binding a role of a context to an array (called a "bindall" binding) binds to all (potentially infinity) elements of that array.  Then in the second paragraph, first line, I say:
Because of this, the activity within the context in this (default, infinite) case cannot be an acton, because that would require infinitely many elements to become ready before the acton could activate, and the activation would need to result in issuing a transition to all infinity of them, taking infinite time (in the general case).
I have decided that there are certainly cases where working around this constraint would be much more difficult than just removing it.  Specifically, if the binding has nodata (i.e. neither observe nor replace) permission, then such a binding would only require that the Director check and (if necessary) transition all of the control states of all the elements at once.  Even restricting such tactic/acton bindall bindings to activate on, and transition to, only the initial control state would be sufficient to provide important functionality -- e.g. to allow an acton (perhaps $null) to activate only if an array is "empty", in the sense that all of its elements have the initial control state.  This could be accomplished without this extension by having the planner explicitly maintain a count, in a separate resource, of the number of elements having or not having a particular control state, and manually update it each time the control state of an element changes.  But since that count is already effectively maintained in the internal implementation, this is essentially wasted and confusing effort.

3. New Rule:  A new form of binding, similar to bindany

In 13.1.3 (and Figure 12.3(b)), I introduce the bindany binding, which uses a ? prefix and allows binding to any element of the array having the control state specified by its access dot.  It has at least two shortcomings: that the activity within the context cannot detect the index of the element which is eventually bound to the role; and that constraints on indices cannot be made between different roles, such as that those roles must all be bound to the same index.  In the past, I've considered addressing the first of these by offering a special function within tactics/actons which would retrieve the index to which the role was bound, but making such information about the binding (including the possibility that the role might not be bound to an array at all) available to the activity would violate the important concept that an activity's behavior is a function of only its observable content states.


It turns out that both shortcomings can be fairly easily addressed with the notation shown here, which allows a "free" role (bound to nothing, _x and y here) to be used as the primary for one or more reduction binding modifiers (as defined in 13.2 and 13.2.1).  If (and only if) there is some content state for this role (i.e. an index) such that the reductions result in secondary roles bound to ready array elements (i.e. with control state matching their access dot), then the role will be considered ready, and that content state will be observed on the role.  As described in 1, above, the content may be of type other than integer. This is somewhat similar to the "special function" approach described in the previous paragraph, but because the index information is delivered to the activity on a role (like all other legitimate information to an activity), it breaks no rules and is applicable to both strategies and tactics.  Note that _x in the example is anonymous, so is not accessible by the activity in the context, but it still ensures that the first index of resource array a equals the second index of array b, as per the associated reduction binding modifiers.

4. New Rule:  Suppressed predictability

Throughout the book, the concept of predictability is discussed.  Specifically, a role binding is considered predictable if and only if it has no replace permission (i.e. it either has nodata or observe permission only) and all of the transition bindings are to a single control state (i.e. all of the transition dots have the same color/pattern).  The reasoning is that, when a plan accesses a resource through a predictable role binding, it's almost obvious what the new content and control state of the resource will be after the access:  i.e. the same (unchanged) content state, and the control state associated with the transition binding(s).  Seemingly, the only reason the Director (or human observer) can't just predict these in advance (i.e. the reason for the "almost" above) is that the activity could fail to issue any transition (sometimes called issuing the bottom transition), leaving the resource unavailable altogether.  So ScalPL actually defines that a transition will eventually be issued for any predictable role.  This permits some very important and useful optimizations, often performed by the Director, as detailed in Chapter 4, by pretending that the transition has been issued even before it has.  These optimizations include efficient sharing and buffering, even with no effort expended by the planner.

This is especially useful for atomic activities (e.g. actons, resulting from activating tactics). The problem is, there are other cases (using strategies) when it is actually useful for the optimization not to be done, for it not to be assumed that the transition will always be issued, or at least, not at just any old time.  A common reason for wanting to not assume this is that the timing of that transition relative to others issued by the same (non-atomic) activity is important -- such as that the transition should only occur after other transitions (on other roles) have been issued from that activity.  Future editions of the book will go into more depth on this issue.

There are already "work-around" ways to explicitly suppress these predictable optimizations in these troublesome cases -- making the role binding "unpredictable" -- such as by simply changing nodata or observe permissions on those role bindings to replace (observe and update) permission, or by adding extra unused transitions to roles of a plan, which can be bound to unused control states in their parents.  But these are opaque, in effect fooling not only the Director about what is actually happening, but human reviewers as well.


So I hereby propose some new "unpredictable" syntax which simply declares that a role binding which otherwise looks predictable should nonetheless not be predicted.  A question mark (?) is especially good for suggesting unpredictability, but is already used as a role name prefix for "bindany" bindings.  So for unpredictability, I will suggest putting a question mark next to the transition binding itself, within the resource rectangle, as shown here (left top).  If colors are used to signify transition bindings, a colored question mark could potentially be used in place of the transition binding dot (as on the left bottom).

It also seems useful to declare within a strategy that any parent which binds this strategy (within a context) should recognize that certain roles are intended to be unpredictable, and that those roles should therefore be bound with unpredictable bindings.  I will suggest two different potential syntaxes:  Either prefixing the formal resource name itself with a question mark, or including a question mark in the transition box of the formal resource, as shown here (which represents a possible strategy for the context in the previous example).  Since this construct is sensibly only permitted for formal resources with a single dot in their transition box, the dot there could (again) itself be replaced with a colored question mark (if colors are used).

I still have some reservations about adding this construction, because I fear it may be over-used.  Specifically, even a predictable role for a strategy cannot be predicted to perform a transition before that role (i.e. its formal resource) is even accessed by an activity within the strategy.  So a declaration of unpredictability is really only needed when a formal resource within the strategy must be accessed early (e.g. observed), but the transition from that resource should be delayed until later (like a in the example, which may be accessed several times by multiple activities before finally being completed by wrapup).

Conclusion

Again, these are described primarily for those who have read the book.  The use of many of these will be described (hopefully shortly) in upcoming blog posts.