Saturday, March 22, 2014

Cholesky Factorization in ScalPL

(This replaces two former posts with this approximate title, which have been merged and cleaned up here.) [Errata: In the first diagram (Cholesky strategy), the dot next to "Altered" in the lower left corner should be yellow instead of red.  And in the third diagram (UpdateRow strategy), it should have been mentioned that the binding modifier rules were omitted because they are all the same: "1" (i.e. reduction in the first dimension).]

In this post, I'll address an implementation of tiled Cholesky factorization using ScalPL.  I've chosen this problem specifically because it is often used as a "real world" example when demonstrating various parallel programming models or languages.  In that vein, two references which seem especially clear (and which may serve for future comparisons) are a Concurrent Collections (CnC) implementation (Knobe) and a NUMA implementation (Jeannot).  It should be noted that the latter effectively computes the result in place, while the former apparently doesn't, but since data can be assumed to move between processors in both cases, the notion of being "in place" is not as clear as one might expect.  Other useful references include the Wikipedia and MATLAB descriptions.

First, the general algorithm in English as I understand it. (If I make mistakes here, I expect them to be minor enough not to lose the overall point, and I trust commenters to straighten me out: any errors are surely in my translation, and not in these original references.)  It iterates over the lower triangle of a symmetric square matrix -- i.e. the elements (i,j) where j (the column) is less than or equal to i (the row) -- with the assumption that the elements above the diagonal are equal to their "reflected" counterparts below.  The matrix is then assumed to be tiled, so that each element of the matrix (array) operated upon by the algorithm actually contains a fixed-size square tile of the original, larger matrix, but this aspect is below the level of detail addressed here.

The algorithm:  Each iteration first computes the element (i.e. tile) in the upper left corner, (1,1), with a routine (or "tactic", in ScalPL) called cholesky here (DPOTRF in the Jeannot paper, which uses the LAPACK kernels).  This computation relies only on the previous value of this element to produce the new value.  The algorithm then computes each of the other elements in the first column (i.e. directly below that top corner) with a routine called trisolve here (DTRSM in the Jeannot paper), updating each element using the new value in (1,1) as well as its own former value.  And then, whenever the appropriate inputs are available, the algorithm computes all the other elements (i.e. those not in the first column), where element (i,j) is updated using its previous value and elements (i,1) and (j,1) from the first column.  For non-diagonal elements, routine update here (DGEMM in the Jeannot paper) is used for this, and for diagonal elements (for which only one value from the first column is used, since i=j), routine updateDiag (DSYRK in Jeannot) is used.  When enough data has been computed from this iteration, the next iteration (or in our case, recursion) can begin, which works identically except that the whole computation is shifted diagonally down by one, so that element (2,2) from one iteration is used as the new top corner of the matrix in the next iteration, and the matrix it operates on is smaller by one in both dimensions.  That means that no element of the first column of any iteration (computed by cholesky and trisolve) will ever be accessed by subsequent iterations, so those results can theoretically be used by (passed back to) the caller ASAP, such as when there is more potential concurrency there.  Iterating/recursing continues until an iteration operates on a 1x1 matrix, at which point cholesky is performed on that element and we're done.
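Before turning to the ScalPL plan, it may help to see those dependences as a plain sequential loop nest.  The following is only a hedged sketch in C: the per-tile kernels (cholesky_tile, trisolve_tile, update_tile, updateDiag_tile) are hypothetical stand-ins for DPOTRF, DTRSM, DGEMM, and DSYRK, and the 1-based tile indexing just mirrors the description above.

void tiled_cholesky(int N, double *tile[N+1][N+1])       /* tile[i][j] holds one square tile */
{
    for (int k = 1; k <= N; k++) {                        /* one "iteration" (recursion level) */
        cholesky_tile(tile[k][k]);                        /* the corner, (1,1) of this iteration */
        for (int i = k+1; i <= N; i++)
            trisolve_tile(tile[k][k], tile[i][k]);        /* rest of this iteration's first column */
        for (int i = k+1; i <= N; i++) {
            for (int j = k+1; j < i; j++)
                update_tile(tile[i][k], tile[j][k], tile[i][j]);   /* uses (i,1) and (j,1) */
            updateDiag_tile(tile[i][k], tile[i][i]);               /* diagonal: uses only (i,1) */
        }
    }
}

The ScalPL strategies below express exactly these dependences, but without the total ordering that the loop nest artificially imposes.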

A top level ScalPL strategy (i.e. plan or program) to accomplish this is shown below. The matrix of tiles is represented by the resource rectangle at the bottom (labeled Tiles), with two hash marks on the left to represent the two dimensions, and the invocations of cholesky, trisolve, and Updates are represented by the three circles with those labels.  (We'll use the convention that capitalized names like Update represent strategies, uncapitalized represent tactics.)  The circle at the right (which is unlabeled due to having a role bound to, and thus named, $plan) represents the recursive invocation of the next iteration.  The circle at the left (truncateRange) simply initializes resource TwoToN with the range of indices of the matrix it is operating upon except for the first (i.e. 2 to N, where N is the matrix size), to provide indices for trisolve and Updates.  The remaining circle (containing the $null tactic to the right of cholesky) is there to "short circuit" and return the cholesky result in the final 1x1 case, since trisolve and Updates won't be invoked in that case.  The control state of TwoToN signifies whether this is that final case (in which case it is Only1Row/red) or the normal case (where it is AtLeast2Rows/yellow).

Cholesky strategy
As is standard, the binding modifiers (arrows within the circles) carry indices (from their tail, or primary role) to the array being indexed (at their head, or secondary role), with the binding rules (within the dashed boxes) denoting which dimensions are being indexed.  A new construction of parentheses within a binding rule is being used here (part of "ScalPL 1.1"), which means that a single source index is used for multiple dimensions, so, for example, cholesky takes the constant 1 on its primary as both the first and second dimension of its secondary, thus binding that secondary role (Tile) to resource element Tiles(1,1).  Everything else is "by the book".  The "dupall" asterisk on trisolve's top role (to TwoToN) replicates that context for each integer in that range 2 to N, then uses each integer in that range as an index (specified by the green binding modifier) to the first dimension of Tiles for its Lower role (green arrow/rule), and the constant one for its second dimension (red arrow/rule), thus binding each replicant to a different element of the first column.  Similarly, the Column1 role of Updates is bound to the first column of Tiles (rows TwoToN in green, column 1 in red), and the Rest role is bound to everything else (rows and columns TwoToN in black).    The binding rule on the rightmost circle/context (i.e. the recursive call to Cholesky) represents that the first and second dimensions of Tiles are effectively shifted/translated by 1 (the constant on the primary) so that the next iteration can continue to address the "new" first column as 1, etc.

The main difficulty to be addressed in any solution is obtaining maximum concurrency while still ensuring that elements are not accessed/read before or during the time they're updated/written.  Specifically, that includes not letting any trisolves in an iteration access element (1,1) before it is established by cholesky, not letting any of the update routines in an iteration access elements of the first column before they are established by trisolve, and not letting subsequent iterations access elements still being established by this iteration -- all while still allowing elements of the first column to be modified by the caller (by virtue of having been returned early) even before the update routines have finished accessing them.  Control states in ScalPL are purpose-built for handling this sort of issue, and in this case, those on Tiles do a good job:  LastIter (green) for elements not yet updated, Altered (yellow) for those updated but perhaps still needing to be accessed, and returned (red) for passing them back to the parent/activator when finished.

Updates strategy
As indicated by the bindings in the Cholesky strategy above, the Updates strategy (at right) gets elements 2 to N of the first column of the Tiles matrix (which it will only observe/read) as Column1, and the rest of the matrix to its right (some of which it will update) as Rest, as well as the range of indices, represented as TwoToN.  The dupall on the RowNum role replicates the context/circle activating the UpdateRow strategy once for every row 2 to N, supplying the unique row number to each replicant on RowNum, and binding the Row role for each replicant to that row.  The Column1 role for all replicants is bound to the same first column.

UpdateRow strategy
So each replicant of the UpdateRow strategy (at left) gets the first column (as Column1), one row of the rest (as Row), and the row number (as RowNum).  Tactic internalIndices initializes TwoToRowNumMinus1 with (yes) the range 2 to RowNum-1.  The left context/circle activates one replicant of update for each integer in that range -- i.e. for each element in the row except the last (which is on the diagonal) -- binding role ToUpdate to that row element, role Upper to the corresponding element of Column1, and role Lower to the element of Column1 corresponding to the row (RowNum).  The right context/circle activates updateDiag just once for the last element of the row (on the diagonal), binding FromColumn to the corresponding element of Column1 and ToUpdate to that element of Row.

These diagrams basically are the program.  All that's left is the coding of the tactics (cholesky, trisolve, update, updateDiag, truncateRange, and internalIndices) in a base language like C or Fortran.  The first four are likely simple calls to standard kernels as in the Jeannot paper (e.g. DPOTRF, DTRSM, DGEMM, and DSYRK, respectively).  The last two are a few lines each, e.g. something like:
internalIndices {TwoToRowNumMinus1->beg = 2; TwoToRowNumMinus1->end = *RowNum - 1;}
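For comparison, truncateRange and a kernel-wrapping tactic like cholesky might look something like the following sketches, where the role names N and Tile and the tile width nb are illustrative assumptions rather than names taken from the diagrams:
truncateRange {TwoToN->beg = 2; TwoToN->end = *N;}
cholesky {LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', nb, Tile, nb);}  /* DPOTRF on one nb-by-nb tile */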
One of the trickiest aspects of making a truly portable implementation is in deciding what to do with the first column of an iteration.  If there is potential concurrency in the calling parent/mainline, there may be some merit to returning that first column to the caller soon after it has been computed by cholesky and trisolve, otherwise not.  But even if so, how soon after?  Since that first column must still be read by update and updateDiag steps in the iteration, should return of the column wait until these are done, or should a copy be made for these or for the caller to facilitate an early return?  (On a message-passing system, the copy may be made anyway by virtue of communicating.)  Or maybe access should be returned to the caller immediately, with a "copy on write" made only if the caller then tries to modify the elements while they are still being accessed by the update and updateDiag steps?  Regardless of the answers we choose, building them into the algorithm will make it less portable for cases where they would be better answered differently.

Fortunately, these are exactly the issues addressed by carefully established semantics of predictability in ScalPL.  In this particular case, in the Cholesky strategy, we see that the Column1 role of Updates is predictable (with read permission and one transition), meaning that once something within Updates accesses an element, that transition will eventually be issued in Cholesky for that element, even if Updates itself doesn't issue it -- and indeed, it never does in this case.  This leaves the Director (i.e. runtime) flexibility to wait for any length of time, and/or make copies/snapshots always or on demand/write.  These decisions can be made in realtime, or (perhaps in many cases) via hints/annotations by a human planner which do not alter the behavior.

One more minor/technical point.  In this example, Column1 and TwoToN in Updates are shown as activation formals, represented with filled triangles.  This is primarily to simplify coding by allowing transition back to the initial/green control state, which is not ordinarily allowed for formal resources.  This approach does have the slight downside that Updates cannot activate until all of the elements of the first column are Altered/yellow.  (This also means that the entire Column1 role is effectively accessed by Updates as soon as it activates/binds, so predictable transitions will eventually result for all of those elements.)  The use of activation formals here could be avoided by adding another control state to Column1 and TwoToN, and adding a $null activity to each to transition from green to that new control state, which would then logically be the new initial control state, as suggested in Figure 9.9 in the book.

I am personally quite happy with this example.  It demonstrates the flexibility of the binding modifiers, even when dealing with triangular arrays, and of the predictability rules in solving very real issues.  Overall, I believe the diagrams are highly intuitive, and carry at least as much information as a similar page area of text would.  And I believe that this demonstrates advantages of ScalPL over other approaches for scalability, portability, correctness, and formality.

Monday, March 03, 2014

Big Data, Map-Reduce, and ScalPL

There is Big Focus on Big Data these days, and specifically on tools (like Hadoop) that use approaches like Map-Reduce.  The concept is fairly simple:  If one has a distributed file (or other data structure), partitioned so that all data within a particular partition is relatively local, but partitions may be relatively distant from one another, many useful operations can be performed efficiently using a pattern where items from each partition are first processed ("mapped") individually and locally; those independent results are then distributed (hashed, binned) according to some key field so that items with like keys end up physically together or close (e.g. via implied communication), where they can be combined ("reduced") with other items having the same key.

If ScalPL is supposed to be expressive enough to capture patterns, then it should be able to capture this one, and this popular use-case can provide some insight into the expressiveness and flexibility of ScalPL.  Optimally, the ScalPL plan will accommodate the benefits and ordering of map-reduce just mentioned for a distributed platform/architecture, but will also work on different platforms with different characteristics where it might, perhaps, benefit from more pipelining, less communication, etc.   Remember, with a proper runtime system ("Director"), a ScalPL plan is executable -- a program -- rather than just a first step to a final implementation.

Possible High-Level Context for MapReduce
Before we get started with the MapReduce plan itself, a few words about the context in which it will activate.  At this point, there is no standard (object) library in ScalPL for distributed files (or any other kind, for that matter), though constructs exist to build/standardize some.  More about that in another post.  For our purposes, the contents of a distributed file in ScalPL (FileContent in the plan here) will simply be modeled as a two dimensional resource array, with the first dimension being the block, and the second being the unit (e.g. byte) within the block, with special byte markers for end of record and end of block.  ScalPL does not itself dictate how the elements of an array are physically stored relative to one another, but for efficiency purposes here, they can easily be stored just as a normal file system would (e.g. as contiguous characters within each block, and scattered blocks).

The blocks (of course) may not be stored or indexed contiguously at some low level, but by the time the MapReduce plan accesses them, they can be made to appear sequential.  In the example above, FileBlockIndices is a sequence of indices to the blocks in the file, and FileBlockIndexCount is the number (or range of positions) of those indices within FileBlockIndices.  FileBlockSize is the maximum number of bytes per block.  By using the ScalPL mapping binding modifier (1/1 here) and the anonymous (underscore) role binding from FileBlockIndices, the MapReduce plan just sees (on FileContent) a collection of blocks numbered sequentially within FileBlockIndexCount.  We have that plan producing its results to KeyResults, indexed by key, and changing the control state of that key to red.  (If a result is not produced for a given key, the control state for that key remains green.)

MapReduce Strategy
The implementation of the MapReduce strategy shown to the left uses two subplans, shown in more detail below.  The first, MapBlock, is replicated for each file block (indicated by the forall asterisk), and (as shown shortly) each replicant splits and maps the records in its block and categorizes each of those result records by key into MappedKeyedRecs, the first dimension of which represents a key, and the second is open/infinite, to accommodate multiple records with the same key.  MapBlock also changes the control state (to yellow) of KeyResults corresponding to any key it produces, in effect reserving a spot in that array for when the reduced result is produced for that key.  The other strategy, ReduceKey, is activated for each key thus reserved, and reduces all the records (from MappedKeyedRecs) having that key, storing the result into the appropriate spot in KeyResults only when (the control state of) KeyStatus indicates that there are no further records for that key.  This decomposition not only matches the standard intuition (i.e. map and reduce), but also eases the embedding (on platforms which benefit from it) by grouping all of the elements which should be colocated for efficiency reasons:  MapBlock can easily be colocated with the block being processed, and ReduceKey can be colocated with all of the records with that key being reduced.  The communication between those steps occurs implicitly, where the data moves from one to the other (via MappedKeyedRecs).  More about that later.

(ERRATA:  The binding modifiers, "1", on MapBlock and ReduceKey should technically select the key field from the record, e.g. each be replaced with "1=key".)

That still leaves the $null circle at the bottom of the strategy.  One of the difficulties inherent in concurrent planning/programming is determining when an activity has finished, or should finish.  While completion of later stages in a sequential plan implies that earlier stages are also complete, a concurrent plan like this one can have data at different stages of processing at the same time.  Although we could keep any reduction from even beginning until all data had been read (split) and mapped, we want to make this as general and concurrent as possible, so will instead just ensure that a reduction for any particular key won't complete until all the records with that key have been mapped.  That's what the $null activity does here:  It activates only when all of the file blocks have been fully processed (indicated by BlocksDone) and, for any key, (a) a spot has been reserved for that key's final result (in KeyResults), and (b) there are no records with that key waiting to be reduced (in MappedKeyedRecs).  In that case, the $null changes the control state of KeyStatus for that key, indicating to ReduceKey that it can finish reducing that key.

All of the new constructs introduced in the previous blog post ("ScalPL 1.1") are utilized here.  All of the arrays except for BlocksDone and FileContent are indexed in at least one dimension by a key.  This is strictly illegal (except for integer keys) in the book, but is allowed for keys of any type by #1 of that blog post.  The role binding from $null to MappedKeyedRecs is to (effectively) infinite elements, checking whether all of them are empty (green control state), which was illegal by the book, but allowed in #2 of the blog post.  And both ReduceKey and $null use the construct just introduced in #3 of the blog post, the open binding form of bindany, to find some key which allows those activities to bind, especially using the control state of KeyResults.  (Note that the binding from ReduceKey to KeyResults uses an activation binding, with the @ prefix, as described in the book section 7.3.2, to ensure that the plan will not activate until that role is ready.)

Now to delve into the substrategies, MapBlock and ReduceKey.

MapBlock Strategy
The MapBlock plan (strategy), shown here, maps all of the records in a particular block.  In it, a split plan finds record boundaries from the block (as quickly as possible) and passes those boundaries on (in no specific order, via recranges) to a Map plan, updates BlockPtr with the unchecked part of the block, and finishes.  The Map plan activates once for each record passed to it from split, processes/filters it accordingly, and passes the results (zero or more records) to another array, outrecs, again in no specific order.  [ERRATA:  Map should change the control state of MappedRecs to red.]  The $copy activity (below it) then takes the records from MappedRecs and uses the key on each record to index KeyFound, transitioning the control state of that element to flag that that key was found, and to index the first dimension of MappedKeyedRecs, where the record is deposited.  (A bindany is used for the second dimension of MappedKeyedRecs, to allow any number of records to be deposited there with the same key.)  As the book notes, $copy is generally not required to really physically copy anything here, it just renames the record contents as belonging to the new resource (array/element).
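To make that concrete, here is a hedged sketch of what one activation of such a split tactic might do, finding one record per activation; the names Block, BlockPtr, RecRange, EOR, and EOB are illustrative assumptions, not names taken from the diagram:
split {
    int beg = BlockPtr->offset, i = beg;
    while (Block[i] != EOR && Block[i] != EOB) i++;   /* scan to the next record/block marker */
    RecRange->beg = beg;  RecRange->end = i;          /* pass this record's boundaries on */
    BlockPtr->offset = i + 1;                         /* the still-unchecked part of the block */
    /* transitions (not shown): RecRange becomes ready; BlockPtr signals more work, or done at EOB */
}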

The $null activity in the lower left activates, and transitions BlockDone, when no other records will be produced (on MappedKeyedRecs) from this file block -- specifically, when (a) BlockPtr is empty by virtue of split hitting the end of the block, (b) RecBoundaries has no records waiting to be mapped, and (c) MappedRecs has no records waiting to be categorized by key.

ReduceKey Strategy
So that brings us (finally) to ReduceKey, which will be activated once for every key.  It consists of two parts:  Reduce1, which activates for every record with the given key, modifying PartialKeyResult with that record's contribution; and Reduce2, which activates once after all of the records with that key have been processed (as controlled by KeyStatus), producing the final result for that key to KeyResult.

Plans split, Map, Reduce1, and Reduce2 are not shown here.  Most or all may be implemented as tactics, e.g. in a typical computer language, and hopefully there is enough insight provided here to understand how these would simply and naturally be constructed.  The last three (at least) would be customized for the problem being solved, so optimally, they would be parameters themselves -- i.e. they would be fed into the plans using roles, and plugged into their respective contexts using $plan roles.  I have omitted those details here to focus more directly on the flow of the data.  (Another way of achieving that focus, using current ScalPL tools, would have been to simply put them on a separate drawing layer, and then hide or highlight different drawing layers at different times.)
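As a concrete (though hedged) illustration, for a word-count-style job in which each input record is already a single word, the problem-specific tactic bodies could be nearly trivial.  The field and role names (word, key, count, InRec, OutRec, PartialKeyResult, KeyResult) are assumptions for illustration only, and PartialKeyResult is assumed to start at zero:
Map {strcpy(OutRec->key, InRec->word); OutRec->count = 1;}
Reduce1 {PartialKeyResult->count += InRec->count;}
Reduce2 {KeyResult->count = PartialKeyResult->count;}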

And in a sense, that focus on the data flow is the entire point of this post.   These plans/strategies show how these four simple data transformations fit together with appropriate communications to create the Map-Reduce paradigm.  It is not just a high-level, hand-waving approximation of a solution; it is an actual executable solution, given an appropriate run-time system.

What has not been discussed here are specific communication paradigms or mechanisms to make this all work -- something which many would consider central to the whole Map-Reduce paradigm.  The idea is that, as discussed in 5.3.1 of the book, on some particular architecture, the planner (or other helper) would annotate the contexts and/or resources in these strategies to denote where the activities and/or content state would reside at various times.  For example, in a distributed architecture, the contexts in the MapReduce strategy above would be annotated, each MapBlock context replicant tied to the block (i.e. block index) which it processes, and the ReduceKey context location computed based on the key which it processes.  The communication, then, would occur as the content state of MappedKeyedRecs moves between these embeddings.  In the best case, that communication strategy could be optimized automatically, but some human guidance/hints would not be out of the question.  The important point here, however, is that these implementation details (via these annotations) are kept separate from the raw algorithms shown here, which are independent of platform.





Wednesday, January 29, 2014

ScalPL 1.1: A few enhancements to the book

[This post was slightly edited on 1/30/14 to correct errors in the examples in part 4.]

In devising any language, there are trade-offs between utility, precision, consistency, and implementability.  Scalable Planning Language (ScalPL) as defined in the book "Scalable Planning" (the version which I'll here refer to as ScalPL 1.0) is no different, and when there have been questions, I've decided to err on the conservative side:  If I couldn't convince myself that a construct was useful, precise, and fairly efficiently implementable, I typically ruled against adding it, or added a rule preventing its use in problematic ways.  In some cases, I have overdone it, and I set out here to correct some of those, but I make few apologies:  It's easier to add constructs (or constructions) over time than to first say they are available in the language and then remove them after people begin to rely on them.

Here are four constructs that I have now decided should be part of ScalPL, even though the book either does not describe them, or explicitly rules against them.  I'll henceforth call the version of ScalPL with these included ScalPL 1.1.  If you are not already intimately familiar with ScalPL 1.0 (e.g. haven't read the book), these will mean little to you (since they haven't been introduced in such detail here on the blog yet), and other posts on this blog will continue to be self-explanatory.  Some of these have been long considered (since before the book was written), and were left out primarily because I didn't realize how useful or implementable they might be.  Others were simply poorly considered.  Most of them deal with resource arrays.

The four new constructs are: (1) Resource arrays that allow for non-integer indices, (2) bindall role bindings for resource arrays permitted for actons/tactics (formerly outlawed), (3) a more useful form of "bindany" role binding for resource arrays (to provide access to the index being bound to), and (4) a new form of role binding that is unpredictable (though it would normally be considered predictable).

1. New Rule:  Resource arrays allow any type for indices

In section 13.1.1 of the book (page 230), when introducing arrays, it says (in the second sentence):
Each element is uniquely identified/addressed within the array by an index consisting of a tuple (i.e. fixed-length sequence) of integers.
Since an index is only an addressing scheme to uniquely identify elements, and there is no constraint on the number of elements in an array or how those elements are to be stored relative to one another, the constraint that indices consist only of integers is not justified, and any number of languages (even Perl) have demonstrated the productive use of non-integer indices.  Since any integer is legal in ScalPL, a work-around (to use other kinds of indices) simply requires finding a function to map the indices you want to use into the set of integers (that is, an injection, so that distinct indices map to distinct integers).  For example, if you want to allow character strings, you could simply use the character values in some code (e.g. ASCII) concatenated together.
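For instance, a minimal sketch of that work-around in C, limited (for illustration) to keys of at most eight characters so that the result fits in a 64-bit integer:

#include <stdint.h>

uint64_t string_to_index(const char *key)         /* concatenate character codes, base 256 */
{
    uint64_t idx = 0;
    for (int i = 0; key[i] != '\0' && i < 8; i++)
        idx = (idx << 8) | (uint8_t)key[i];       /* append the next character's code */
    return idx;
}

With ScalPL 1.1, of course, no such encoding is needed: the string itself can simply be used as the index.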

Since ScalPL must already permit efficient storage of sparse arrays, it must already manage the storage of non-sequentially-indexed elements, so there is really no significant benefit to limiting indices to be integers.  In ScalPL 1.1, any type of indices will be allowed.

2. New rule:  Bindall bindings should be permitted for some actons (tactics)

In section 13.1.2 (page 232), I say that by default, binding a role of a context to an array (called a "bindall" binding) binds to all (potentially infinity) elements of that array.  Then in the second paragraph, first line, I say:
Because of this, the activity within the context in this (default, infinite) case cannot be an acton, because that would require infinitely many elements to become ready before the acton could activate, and the activation would need to result in issuing a transition to all infinity of them, taking infinite time (in the general case).
I have decided that there are certainly cases where working around this constraint would be much more difficult than just removing it.  Specifically, if the binding has nodata (i.e. neither observe nor replace) permission, then such a binding would only require that the Director check and (if necessary) transition all of the control states of all the elements at once.  Even restricting such tactic/acton bindall bindings to activate on, and transition to, only the initial control state would be sufficient to provide important functionality -- e.g. to allow an acton (perhaps $null) to activate only if an array is "empty", in the sense that all of its elements have the initial control state.  This could be accomplished without this extension by having the planner explicitly maintain a count, in a separate resource, of the number of elements having or not having a particular control state, and manually update it each time the control state of an element changes.  But since that count is already effectively maintained in the internal implementation, this would be essentially wasted and confusing effort.

3. New Rule:  A new form of binding, similar to bindany

In 13.1.3 (and Figure 12.3(b)), I introduce the bindany binding, which uses a ? prefix and allows binding to any element of the array having the control state specified by its access dot.  It has at least two shortcomings: That the activity within the context cannot detect the index of the element which is eventually bound to the role; and that constraints on indices cannot be made between different roles, such as that those roles must all be bound to the same index.  In the past, I've considered addressing the first of these by offering a special function within tactics/actons which would retrieve the index to which the role was bound, but making such information about the binding (including the possibility that the role might not be bound to an array at all) available to the activity would violate the important concept that an activity's behavior is a function of only its observable content states.


It turns out that both shortcomings can be fairly easily addressed with the notation shown here, which allows a "free" role (bound to nothing, _x and y here) to be used as the primary for one or more reduction binding modifiers (as defined in 13.2 and 13.2.1).  If (and only if) there is some content state for this role (i.e. an index) such that the reductions result in secondary roles bound to ready array elements (i.e. with control state matching their access dot), then the role will be considered ready, and that content state will be observed on the role.  As described in 1, above, the content may be of type other than integer. This is somewhat similar to the "special function" approach described in the previous paragraph, but because the index information is delivered to the activity on a role (like all other legitimate information to an activity), it breaks no rules and is applicable to both strategies and tactics.  Note that _x in the example is anonymous, so is not accessible by the activity in the context, but it still ensures that the first index of resource array a equals the second index of array b, as per the associated reduction binding modifiers.

4. New Rule:  Suppressed predictability

Throughout the book, the concept of predictability is discussed.  Specifically, a role binding is considered predictable if and only if it has no replace permission (i.e. it has either nodata or observe permission only) and all of the transition bindings are to a single control state (i.e. all of the transition dots have the same color/pattern).  The reasoning is that, when a plan accesses a resource through a predictable role binding, it's almost obvious what the new content and control state of the resource will be after the access:  i.e. the same (unchanged) content state, and the control state associated with the transition binding(s).  Seemingly, the only reason the Director (or human observer) can't just predict these in advance (i.e. the reason for the "almost" above) is that the activity could fail to issue any transition (sometimes called issuing the bottom transition), leaving the resource unavailable altogether.  So ScalPL actually defines that a transition will eventually be issued for any predictable role.  This permits some very important and useful optimizations, often performed by the Director, as detailed in Chapter 4, by pretending that the transition has been issued even before it has.  These optimizations include efficient sharing and buffering, even with no effort expended by the planner.

This is especially useful for atomic activities (e.g. actons, resulting from activating tactics). The problem is, there are other cases (using strategies) when it is actually useful for the optimization not to be done, for it not to be assumed that the transition will always be issued, or at least, not at just any old time.  A common reason for not wanting to assume this is that the timing of that transition relative to others issued by the same (non-atomic) activity is important -- such as that the transition should only occur after other transitions (on other roles) have been issued from that activity.  Future editions of the book will go into more depth on this issue.

There are already "work-around" ways to explicitly suppress these predictable optimizations in these troublesome cases -- making the role binding "unpredictable" -- such as by simply changing nodata or observe permissions on those role bindings to replace (observe and update) permission, or by adding extra unused transitions to roles of a plan, which can be bound to unused control states in their parents.  But these are obtuse, in effect fooling not only the Director about what is actually happening, but human reviewers as well.


So I hereby propose some new "unpredictable" syntax which simply declares that a role binding which otherwise looks predictable should nonetheless not be predicted.  A question mark (?) is especially good for suggesting unpredictability, but is already used as a role name prefix for "bindany" bindings.  So for unpredictability, I will suggest putting a question mark next to the transition binding itself, within the resource rectangle, as shown here (top left).  If colors are used to signify transition bindings, a colored question mark could potentially be used in place of the transition binding dot (as at bottom left).

It also seems useful to declare within a strategy that any parent which binds this strategy (within a context) should recognize that certain roles are intended to be unpredictable, and that those roles should therefore be bound with unpredictable bindings.  I will suggest two different potential syntaxes:  Either prefixing the formal resource name itself with a question mark, or including a question mark in the transition box of the formal resource, as shown here (which represents a possible xstrategy for the context in the previous example).  Since this construct is sensibly only permitted for formal resources with a single dot in their transition box, the dot there could (again) itself be replaced with a colored question mark (if colors are used).

I still have some reservations about adding this construction, because I fear it may be over-used.  Specifically, even a predictable role for a strategy cannot be predicted to perform a transition before that role (i.e. its formal resource) is even accessed by an activity within the strategy.  So a declaration of unpredictability is really only needed when a formal resource within the strategy must be accessed early (e.g. observed), but the transition from that resource should be delayed until later (like a in the example, which may be accessed several times by multiple activities before finally being completed by wrapup).

Conclusion

Again, these are described primarily for those who have read the book.  The use of many of these will be described (hopefully shortly) in upcoming blog posts.




Wednesday, December 11, 2013

Actors and ScalPL, again

I begin this post by noting that the price of the Scalable Planning book (describing ScalPL) has now been reduced to $24.95.  This is likely as cheap as you can expect to see it for some time, just barely recouping costs, and about 40% of what it sold for a little over a year ago.  So now's your chance.  (The ebook version is waiting until I am certain I can produce a quality version, including the fairly significant math notation in chapter 10.)

Some time ago, I commented on a video by Carl Hewitt regarding the Actor model (primarily theoretical aspects), and I may still comment on that further... but this post regards another video on actors, from a recent Halmstad Colloquium, by Hewitt's former (many years ago) student, Gul Agha, who wrote a book on the subject in 1986, and who has since been a professor at UIUC.  (I've had the pleasure of benefiting from Gul's gracious hospitality on a few occasions over the years.)  This particular talk regards more practical matters relating to the Actor model, with the benefit of significant hindsight.  Dr. Agha seems honest here about how some implementations have fudged the rules, and how those who haven't fudged them have paid some performance penalties.

First, in three slides, Agha summarizes some of Actors' important characteristics, and explains their merits. In the first (at 5:18), Actors are described as concurrent (with one another), communicating with asynchronous message passing and no shared state.  In the next slide (at 8:40), he outlines basic principles, specifically: Encapsulation (of state by an actor), fairness (a.k.a. liveness, a guarantee that each sent message will eventually be received), location transparency (the independence of an actor's behavior from its physical location), and mobility (apparently related to the location actually being flexible during execution).  In the third slide (at 18:10), Agha brings this all down to how an Actor is typically implemented:  as "encapsulated state + behavior ('methods') + thread of control + mailbox".

Some of the merits are argued to be decreased complexity (with message interleavings being the only potential source of nondeterminism), tolerance of communication latency (the possibility of "hiding" message latency with other computation during transport), and portability (with actor addresses being independent of location or platform).  State encapsulation within an actor is likened to data hiding, a mainstay of software engineering (and OO programming).  The model is also said to facilitate mobility (which I assume means during execution).

Some of those benefits are actually not so clear to me.  For example, regarding the fairness constraint facilitating correctness:  The Actor model permits the creation of an unlimited number of messages directed to an actor, but then also requires that every message be eventually handled (by that actor).  This implies that, regardless of how quickly an actor can handle each message, it may not be fast enough to satisfy the model's requirements.  That's a problem:  Even beyond the requirement cited by Agha of unbounded queuing, the queue could actually grow infinitely (unless the parallelism can grow infinitely).  I won't venture into the case when an actor doesn't terminate, and how/if this is formally different from an actor just refusing to handle more messages and violating fairness.

Also, while location-independent addressing (of actors) may aid mobility, mobility can actually be thwarted by the fact that an actor can maintain an unlimited and unobvious amount of persistent internal state, even while the actor is not active (i.e. while awaiting another message).  It means there are essentially two kinds of state to be moved around, in completely different ways and under different circumstances:  The state within a message, and that within an actor.  And moving intra-actor state, especially to another kind of processor architecture, can be difficult at best.

Even some of the cited merits have corresponding drawbacks.  For example, while data hiding is an important software engineering principle (in sequential programming), part of the state hidden within an actor is the set of actors with which that actor may communicate.  So the entire communication and control topology of the program is hidden within actors -- and that is actually completely antithetical to engineering principles of structured systems.  And since there is no limitation on the size or persistence of the state within an actor, the only way to characterize that hidden state (including the state representing the actors with which it may communicate) is in terms of the sequence of all of the messages that the actor has ever received throughout its history.  This makes the propagation and transformation of state during an execution very difficult to track without knowing a lot about (and keeping track of) what's happening inside of each actor.

Dr. Agha also mentions the significant overhead associated with the required "by value" messages (with payload), which is avoidable in other models for communication between co-located procedures, in cases where they could just share access to a common area ("by reference").  Agha suggests a potential solution where static analysis could theoretically determine when an actual message transfer (copy) could be avoided, implementing sharing under the covers. This static analysis is much easier said than done, and something which has been proposed for decades with other models (e.g. Linda) while never very successfully implemented (without hardware support).

These issues are hardly new.  When I was devising the F-Nets computational model around 1990 (see chapter 10 of the Scalable Planning book for a rewrite of my 1991 dissertation), it was already addressing many of these drawbacks, seen in most message-passing programs, with the rest (and others) addressed by its more complete language, ScalPL (Scalable Planning Language), shortly thereafter.  (Described in the same book and many posts in this blog.)  And, ScalPL retains the advantages of Actors cited above, but in addition:
  • ScalPL more cleanly handles data hiding by offering two kinds of "plans" (i.e. program fragments), called "tactics" and "strategies".  Tactics, which are essentially functions or methods implemented with traditional sequential programming languages, have short-lived activations/evaluations ("actons"), and don't hide or maintain any state between activations.  Strategies are graphically built in ScalPL itself out of other plans (tactics and strategies), and are generally long-lived and do maintain/hide state internally, in "resources".  Since strategies have clean interfaces, with just a few simple additional constructs, they also serve as classes and their instances (objects).
  • The stateless and functional nature of tactics in ScalPL, together with the clear graphical representation (within strategies) of how they interact with resources and therefore with each other, makes the flow of both state and control clear, to both the human and automated analyst, allowing optimizations for efficient data sharing and/or migration without delving into the innards of the tactic implementations.  These features make ScalPL potentially the closest representation to structured programming available in the concurrent/parallel world.  It is essentially an executable software engineering diagram.
  • Those same ScalPL resources also serve to preserve state (as necessary) for tactics between activations, with negligible overhead.  Since ScalPL strategies are composed only of these resources and other subplans (tactics and strategies) which access them, everything is ultimately built from just resources and tactics, leading to true mobility:  Everything that might need to be moved is either a tactic (which is stateless, and therefore basically constant data, between activations), a resource (which is designed for mobility, acting as a message when necessary), or a strategy (which is ultimately composed of the previous two).
  • Fairness in ScalPL takes the form "if a tactic is continuously activateable, then it eventually will activate".  Unlike Actors, since resources -- i.e. the connections between tactics -- are not (in essence) queues, data results of one component generally can't pile up unboundedly to be forced upon non-ready downstream tactics.  Instead, upstream components must wait (i.e. won't be activateable) for downstream results to be consumed unless circumstances permit queuing (explicitly or implicitly) of prior results.
  • In the video, Dr. Agha mentions the merits of sharing a thread among multiple actors, as an optimization, as opposed to one actor per thread.  In ScalPL, since tactics (actons) are atomic and maintain no state between activations, they are short-lived "threadlets", and an implementor is not even tempted to devote a long-lived thread to each one.
And with ScalPL, that's all with no mailboxes (or locks) to mess with, and an underlying formal computational model (similar to Petri nets), all while offering the other advantages claimed here for actors.  And although the tactics in ScalPL are described here as "atomic" and "functional", this does not imply overheads associated with checkpointing and roll-back, or the use of functional languages (e.g. lacking "update in place" semantics).  As the ScalPL book explains, the default execution model can be about as efficient as a standard function call, and implemented in your favorite imperative language.

I will be happy to receive information from, or engage in discussion with, those who find my observations here unfounded.  I find the Actors model an important step in the progress of our knowledge of parallelism, in part because it can be used to describe so many traditional message-passing programs which possess its basic attributes.  (The same cannot be said for ScalPL:  It does not pretend to explain existing programs developed without knowledge of the model.)  That said, I do hope that today's increased hardware technology and available parallelism are matched by advances in programming technologies.

Tuesday, September 10, 2013

Scalable Planning (ScalPL) Book Price Reduced

I figure the early early adopters have had their chance, so I've reduced the price of my book, "Scalable Planning: Concurrent Action from Elemental Thinking," from $65 to $45 for the later early adopters!  And if you use the 20% discount code available at elepar.com (before it goes away), that's actually more like $36 -- about half the original price.  Plenty more info about the book on this blog and at the elepar website.  Questions and feedback are welcome.

Monday, June 03, 2013

Video of ScalPL presentation to the Zissou Society

The motto "Keep Portland Weird" even applies to computer languages, if "weird" can mean "eclectic" -- just witness The Zissou Society for Programming Language Exploration... or pdxlang for short.  How eclectic? Well, Haskell would be considered practically a mainstream language here in Portland, so probably wouldn't qualify.  Anyway, ScalPL (as described on this blog and the Scalable Planning bookwas deemed weird... um, I mean eclectic enough to merit a presentation (immediately following a presentation on m4) last Thursday.  And here is the screen capture of that presentation.

This was my first attempt at screen cap, so it's not perfect.  Also, due to time constraints, the OO concepts at the end aren't especially well covered:  Look for another vid later to do a better job on that.  But if you are looking for a pretty thorough intro on what ScalPL is all about, here's a 72-minute presentation, followed by a 10-minute Q&A.  (No significant background in parallel processing is assumed.)


Feedback here is welcome.

Thursday, March 21, 2013

ScalPL and the Sequential Prison

A former colleague (and his associates) recently referred me to this video/slide presentation:  Ivan Sutherland, The Sequential Prison, at the SPLASH 2011 conference.  I met Dr. Sutherland once some months ago at an alumni event at Portland State University, and understood that there was an overlap in our work, but this presentation provided an opportunity to understand how much.

The talk contains few very concrete or completely novel recommendations, but even in its generality, it's worth a listen, and it is clear that the speaker and I are on the same wavelength more often than not.  He discusses his intuition that the form of our traditional human and computer languages, together with specific words in them, affects our approach to problem solving as a sequential endeavor, and that a more picture-oriented language would help to avoid this "sequential prison" that we've locked ourselves into.  When questioned later about functional languages, he mentions how much they seem to be built around the concept of sequences, and how this also affects the tendency to build and process things serially.

As might be apparent from my work described on this blog, I agree with all of that.  Even things we don't agree on exactly, we agree on vaguely.  For example, in the parallel (or concurrent) world, he prefers the term "configuration" over "program".  While I, also, tend to avoid terms like program and algorithm as being too sequential in heritage, I use the term "plan" in general and "strategy" as a more specific form of plan.  Ivan seems to suggest that picture-oriented languages might require multiple simple forms to convey a single concept in enough specificity to facilitate execution, while my work in ScalPL concentrates on a single, rather technically annotated form, in part as an alternative to the complexities I've seen in the multi-view forms of UML and the like.  Whether ScalPL (or more specifically, my ScalPL diagrams) are meaningful and easy to read, whether they would "leap into" Dr. Sutherland's mind as he requests, I can't say, but I do think they capture the spirit:  Pipelines do look like pipelines, concurrency does look like concurrency -- to me, anyway -- as he requests.  Other representations (such as he requests) are (can be) assembled by the planning tools.


His discussion starting at 30 minutes or so is very descriptive of ScalPL, though he seems to be addressing primarily hardware.  In fact, the sort on page 279 of my Scalable Planning book (here, to the left) seems (to me) to be exactly the sort that he seems to describe there, but where he describes it as an instruction, it is here a ScalPL strategy.  The first question (in Q&A near the end of the video) relates to the difference between hardware and software, with the old impression that the software and hardware are different beasts, that software is too dynamic, too fluid, too soft, to draw the same way we do with hardware. I have not found that to be the slightest problem in my work.

Of course, his discussion of standardization is also well taken.  ScalPL is available.



Wednesday, March 20, 2013

How ex-treme is exa-scale?

[I drafted this some months ago, but for some reason didn't post it. It still looks good to me, so here goes.]

At the SC12 conference in November, I attended a couple of BoFs (Birds of a Feather meetings) regarding exascale. The first, an Open Community Runtime (OCR) for Exascale, was helpful in providing pointers, such as to this website which in turn points to slides, which in turn led me to sites for Intel's Concurrent Collections (CnC) (including UTK) and University of Delaware Codelets/Runnemede work. But I found myself wincing at some of the things said during the BoF, and after looking at these sites and documents, I admit to being disappointed and, frankly, frustrated. This perhaps became even more pronounced in the second BoF, Resilience for Extreme-scale High-performance Computing. (In hindsight, there are others I wish I had caught.)

First, I would contend that even if we were to come up with a design for The Correct Programming Model(tm) and The Correct Runtime Support(tm) today, it could still realistically take until 2020* or so to develop the appropriate software, support, migration, software engineering techniques (debugging, tracing, patterns, methodologies, libraries, etc.) to make them useful. So there's no time to lose. And from what I was seeing, these groups are not only a long way from having The Correct design, in some ways they seem to be actually going back into history. More specifics about that in an upcoming blog entry.

(*As I write this, I find that the 2020 goal may be more like 2022, but my argument still holds.)

Second, in order to come up with The Correct design, we must understand the critical constraints, not saddle ourselves with non-critical ones and unwarranted assumptions. This is where people like Seymour Cray (starting with "a clean sheet of paper") and Steve Jobs (whether considered a visionary, big thinker, or "tweaker") excelled. Instead, in these meetings, I was hearing "requirements" such as the ability to run MPI programs effectively, or the desirability (necessity?) to use checkpoint/restart to achieve resilience. Even if backwards compatibility may be desirable in some cases, and existing practice may be a useful jumping off point, we will get nowhere by using them to guide the design.


Third, but strongly related to the above, it struck me that I was seeing lots of familiar faces in these BoFs, mostly from my days (a decade+ ago) on the MPI2 Forum. And while reconnecting with former colleagues is part of what the SC conferences are all about, using a standards-based approach is almost certainly not the best way to make significant advances in technology. New approaches will be required to achieve the goals, and standards for those can be established only after those goals are attained. (I could further argue that existing standards have artificially interfered with our progress in this field in other ways, but I'll save that rant for another day, except to say that standards committees are constantly bombarded with the reminder that their job is to standardize existing practice, often while maintaining compatibility with previous versions of the standard. That's not the recipe for quantum leaps!)

So what is the most effective way to achieve the exascale software goals? One might take a hint from Seymour Cray, in the link above: "Shunning committees, he felt that the best computers were the ones where a single architect offered a unified vision. After the machine had been delivered, it was then appropriate, Mr. Cray felt, to listen to feedback from customers and, if necessary, start over from 'a clean sheet of paper.'" I claim that there is no reason to assume that the same wouldn't also be true for software.

In other words, try some big ideas, in multiple (at least two) stages, assume that you may not succeed in the early (e.g. first) attempt, but will learn a lot nonetheless, and integrate what you learn into later stages. That is, in fact, how we fashioned the strategy for NASA's Information Power Grid (back when I had some influence there)... before that plan was apparently revised (and discarded?).

Of course, it could be argued that this "think big with master architect" approach is precisely that which was applied in DARPA's HPCS program, by funding individual companies and allowing them full control over their product/proposal. It could also be argued that that program (has) had limited success (though I would hardly call it a failure). And to those arguments, I would counter that the goals that were not conservative were underspecified. There was little advantage to the participants adding more objectives than those provided by the sponsors, and in fact, disadvantages, in that their competitors for the funds would then be working toward simpler goals.

If failure is not an option, then the goals are too conservative. Consider the manned lunar program: HUGE goals, and a really concrete way to measure if we'd achieved them or not. We were proud to meet them, but it was by no means assured. The exascale project has set some very high hurdles regarding power consumption/energy and reliability/resilience, which seem to fit into this model, but again, at least the goals I've seen on programmability are either conservative or vague or both. Of course, in a political climate, failure to meet a goal can be painful. And where there is public money, there is (and should be) politics.


And, as with the moon landing, the trade-off between the expense and the benefits of attempting to meet (or actually meeting) the goals is a separate issue from whether the goals can be met at all, and the balance is not necessarily clear-cut. If the US government is the customer, it must decide how much the goals are worth to it, and other participants must decide whether they fit their business (and/or research) model. The manned space program was run from within the government. Such a model seems unlikely today.

How about data movement? (For real, this time.)

In the previous post, I addressed the kinds of inherent constraints that an F-Net (and/or ScalPL) plan imposes (and does not impose) on data movement.  The question then becomes how to turn these constraints into an implementation:  Sooner or later, the decision to move or not move each data item must be made to meet the constraints, and something must make those decisions.

The most basic way to meet the constraints is to assign all the computations (actons) to one processor, put all the data (resources) into its local memory, and execute the whole thing sequentially.  Unlike some other models (like most message passing) with semantics that intrinsically impose additional copying and/or data movement, or that require context switching between heavyweight address spaces, this sequential execution of F-Nets requires little if any overhead over a native sequential expression of the same computation, either for data movement/copying or for context switching. In F-Nets, if a sub-computation/acton is going to complete at all, it can always be executed to completion before moving on to any other subcomputation/acton, and there is no need to save transient/intermediate state in this case.  So the only need for more heavyweight context switching is when an acton diverges ("infinitely loops") or appears that it may do so. (Most message passing requires data copying or movement even in a sequential environment, because both the sender and receiver are assumed to have their own mutable copy of the message after the communication, and both may be required to make progress for the overall computation to progress.)
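
To make the run-to-completion point concrete, here is a minimal sketch of such a sequential executor, in Python and with invented names (ScalPL itself is a graphical notation, so none of this is its actual syntax): each acton fires only when the resources it binds are in the control states it requires, runs to completion, and then transitions those control states; resources are handed over by reference, so nothing is copied.

    # Minimal sketch of sequential F-Net-style execution (hypothetical
    # representation, not the ScalPL runtime).  Assumes every acton
    # terminates, so no preemption or saved intermediate state is needed;
    # the loop ends when no acton is enabled.
    def run_sequential(actons, resources, control):
        """actons: list of (bindings, required_states, fn);
        resources: dict name -> mutable content state;
        control: dict name -> current control state."""
        fired = True
        while fired:
            fired = False
            for bindings, required, fn in actons:
                if all(control[r] == s for r, s in required.items()):
                    # Fire the acton to completion on the resources it binds
                    # (passed by reference -- no copying), then apply the
                    # control-state transitions it reports.
                    control.update(fn({r: resources[r] for r in bindings}))
                    fired = True
        return resources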

But in the more common case where we want an F-Net/ScalPL plan to exploit the parallelism of a platform, how best to decide how (and when) to embed the plan (program or strategy) onto that platform?  In the good old days, parallel programmers made several assumptions about both the platform and the algorithm to simplify this embedding process:  We assumed that the platform consisted of a fixed number of homogeneous, dedicated, fault-free general-purpose processors connected with a predictable, regular communication topology.  We also often modeled the application/plan as a set (and usually a sequence) of parallelized steps ("epochs", phases) such as array operations, where each potentially-concurrent operation in a step took about as long as any other operation in that step to complete.  Based on such assumptions, it was sometimes viable to statically map the computations and their associated data onto the processors in the platform.

Under those same assumptions, an F-Net or ScalPL plan can also be statically embedded on a platform in similar ways.  Rather than assigning resources to processors (as some languages might assign array elements under an "owner computes"-like rule), it makes far more sense to assign the computations themselves -- i.e. actons -- to processors, by annotating their contexts.  Another approach is to assign each control state of each resource to a processor.  In either case, the constraint (implied by the F-Net itself) that the content state of a resource must be colocated with the acton during its execution is assumed, and annotations which conflict with that constraint can be resolved in a number of sensible ways (e.g. majority rules).  In ScalPL, by allowing these annotations to include the index values supplied by so-called "dup bindings", such embeddings can be a function of the array elements processed, so they are no less expressive than those in HPF, etc.
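
As a toy illustration of such annotations (the rule and the names below are mine, not ScalPL syntax): if an acton's activations are identified by the index values i and j of its dup bindings, a static embedding can simply be a function from those indices to a processor, e.g. a cyclic rule over a processor grid.

    # Hypothetical static embedding rule: the activation of an acton
    # identified by dup-binding indices (i, j) is annotated with a
    # processor coordinate on a P_ROWS x P_COLS grid.
    P_ROWS, P_COLS = 4, 4

    def annotate(i, j):
        """Processor assigned to acton activation (i, j): 2-D cyclic."""
        return (i % P_ROWS, j % P_COLS)

    # The resources each activation binds are then colocated with it on
    # that processor for the duration of its execution, as the F-Net
    # semantics require.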

Even so, just as complications can crop up with deep subroutine calls in HPF, the hierarchical and dynamic nature of ScalPL means that different (perhaps recursive) activations of the same plan (strategy, program) may be best assigned to different parts of the platform, yet preferably related to the embedding of their parent activation in sensible ways -- so static labeling of contexts isn't necessarily sufficient.  This can be at least partially addressed by a simplified form of parameter passing -- i.e. providing special (integer) variables in child plans which are set equal to the annotation of the context in the parent, and which the child can then use (e.g. in mathematical expressions) when annotating its own contexts.

Even better is to represent the overall embedding as a two-step process: first, mapping the computations to an abstract platform, which is considered to have easily-understood internode properties; and second, mapping that abstract platform to a concrete/actual one so as to preserve as many of those properties as possible.  My work has focused on an abstract platform (AP) representation of nodes addressed by finite sequences of non-negative integers, allowing the AP topology to be interpreted in a number of ways -- e.g. as a sequence/cycle (if just single integers), a rectangular grid (if tuples of integers), or a tree (with the zero-length sequence representing the root, and any other sequence representing a descendant of all nodes addressed by its prefix sequences).  The annotation/parameter passing described in the previous paragraph can now be of these sequences, instead of integers, with the child able to append new suffixes to the sequence it was passed.  Mapping of the AP to the actual platform can be manual or automated, based on various assumptions. A common shorthand is to assign all nodes having the same prefix to a particular actual processor, which can be interpreted as assigning an entire row or column (if the AP is considered a grid) or subtree (if the AP is considered a tree) to that processor.  (People who are accustomed to HPF and similar embedding languages may feel this approach lacks constructs for transpose, etc., but (a) these would be handled in embedding to the AP, not in mapping the AP to the actual architecture, and (b) such a transpose, if necessary at all, is accomplished by mapping the pre-transpose computations and post-transpose computations to appropriate nodes, since mappings effectively change between steps rather than within them.)
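
A short sketch of that addressing scheme may help; it also shows the parameter passing from the previous paragraph, generalized from integers to address sequences. The representation (tuples of non-negative integers, children appending suffixes, concrete mapping by prefix) follows the description above, but the function names and the longest-matching-prefix rule as written are my own.

    # Abstract Platform (AP) nodes are addressed by finite sequences of
    # non-negative integers, represented here as tuples.
    ROOT = ()                      # zero-length sequence: root of a tree view

    def child_address(parent, *suffix):
        """A child activation extends the AP address it was passed."""
        return parent + suffix

    def map_to_processor(ap_address, prefix_table):
        """Map an AP node to a concrete processor by longest matching prefix."""
        for plen in range(len(ap_address), -1, -1):
            proc = prefix_table.get(ap_address[:plen])
            if proc is not None:
                return proc
        raise KeyError("no mapping for %r" % (ap_address,))

    # Example: the whole subtree under (0,) goes to processor 0, (1,) to 1, etc.
    prefix_table = {(): 0, (0,): 0, (1,): 1, (2,): 2}
    addr = child_address((1,), 3, 2)               # a grandchild of AP node (1,)
    print(map_to_processor(addr, prefix_table))    # -> 1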

But that leaves two important points.  The first is that the traditional simplifying assumptions for parallel static embedding (summarized in the third paragraph, above) are virtually all unrealistic these days.  Applications of concurrency are no longer confined to scientific and/or matrix operations, so their concurrency is no longer regular in shape or uniform in subcomputation duration.  This irregularity, together with increased concurrency, makes it less practical to devote entire machines (or portions thereof) to specific applications, so node sharing (yielding variable loads) must also be accommodated.  And as platforms grow ever larger, it becomes less practical to keep them homogeneous and fault-free. The bottom line is that static embedding becomes all but impossible:  These embedding decisions must be made at runtime, in real time.  Using the two-step process mentioned above, only the mapping of the AP to the concrete platform must be dynamically determined.
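
In that setting, a prefix table like the one sketched above stops being a compile-time constant and becomes a runtime structure that the scheduler can revise as loads shift or nodes fail, while the plan's embedding onto the AP stays fixed. A rough sketch, again with invented names:

    # Hypothetical runtime table mapping AP prefixes to concrete processors.
    # The plan's embedding onto the AP never changes; only this table does.
    class PrefixMap:
        def __init__(self, table):
            self.table = dict(table)     # AP prefix (tuple) -> processor id

        def lookup(self, ap_address):
            """Longest-matching-prefix lookup, as in the static sketch above."""
            for plen in range(len(ap_address), -1, -1):
                proc = self.table.get(ap_address[:plen])
                if proc is not None:
                    return proc
            raise KeyError(ap_address)

        def remap(self, prefix, new_proc):
            """Invoked when load changes or a processor fails: work under
            `prefix` lands on `new_proc` from its next firing onward."""
            self.table[prefix] = new_proc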

And the second point, related to the first, is that making such dynamic embedding decisions on a fine-grained basis is also impractical.  This is another reason for using non-trivial, large-granularity subcomputations as the basis for embedding.  That, in turn, requires that these subcomputations be more flexible and expressive than traditional dataflow -- e.g. allowed to produce multiple results (each potentially directed to a different downstream subcomputation), and results which affect only parts of larger data structures.  This is a prime motivation behind the structural properties which have been endowed upon actons (the subcomputations) in F-Nets/ScalPL.  Because these actons are typically entire subcomputations/subprograms in themselves, the embedding of each acton onto a single processor (its memory, registers, logical units, etc.) can be performed statically and locally, such as by a compiler, or if necessary, by OoO scheduling within the processor. These local, lightweight decisions are, by now, well understood.
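
To see why a richer interface than single-output dataflow matters here, consider a sketch of one acton (an invented signature, not ScalPL notation): it updates just one tile of a larger matrix resource in place, and reports several named results, each of which the enclosing plan can route to a different downstream context.

    # Hypothetical acton: touches only one tile of a larger matrix resource
    # and returns multiple named results; the plan, not the acton, decides
    # where each result (and the updated tile) flows next.
    def update_tile(matrix, i, j, col_i, col_j):
        """Update tile (i, j) using two previously computed tiles."""
        matrix[i][j] = matrix[i][j] - col_i * col_j   # placeholder arithmetic
        return {
            "updated": (i, j),   # which part of the larger structure changed
            "status": "done",    # control outcome, selecting the transition
        }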

So that is the answer to how data movement is determined within an F-Nets/ScalPL plan (i.e. a strategy, potentially consisting of sub-strategies, ultimately down to actons): a balance between explicit parameterized embedding of the strategy/plan onto an Abstract Platform, dynamic (or, in restricted cases, static) embedding of the AP onto a concrete platform, and traditional compiler/processor scheduling of processor resources for each acton within the strategy/plan.  The book goes into some depth describing an entity known as The Director, and how it can efficiently determine when to move data. There is also significant discussion of how to optimize the movement of data depending on the architecture type (e.g. shared vs. distributed memory) and the form of the strategy (e.g. with multiple readers of a single resource, readers and writers, etc.).