(This replaces two former posts with this approximate title, which have been merged and cleaned up here.) [Errata: In first diagram (Cholesky strategy), the dot next to "Altered" in lower left corner should be yellow instead of red. And in third diagram (UpdateRow strategy), it should have been mentioned that the binding modifier rules were omitted because they are all the same: "1" (i.e. reduction in the first dimension).]
In this post, I'll address an implementation of tiled Cholesky factorization using ScalPL. I've chosen this problem specifically because it is often used as a "real world" example when demonstrating various parallel programming models or languages. In that vein, two references which seem especially clear (and which may serve for future comparisons) include a
Concurrent Collections (CnC) implementation (Knobe) and a
NUMA implementation (Jeannot). It should be noted that the latter effectively computes the result in place, while the former apparently doesn't, but since data can be assumed to move between processors in both cases, the notion of being "in place" is not as clear as one might expect. Other useful references include
wikipedia and
Matlab descriptions.
First, the general algorithm in English as I understand it. (If I make mistakes here, I expect them to be minor enough not to lose the overall point, and I entrust commenters to straighten me out: It is surely due to errors in my translation, and not in these original references.) It iterates over the lower diagonal of a symmetric square matrix -- i.e. the elements (i,j) where j (the column) is less than or equal to i (the row), with the assumption that the elements above that diagonal are equal to their "reflected" counterparts below. The matrix is then assumed to be tiled, so that each element of the matrix (array) operated upon by the algorithm actually contains a fixed-size square tile of the original, larger matrix, but this aspect is below the level of detail addressed here.
The algorithm: Each iteration first computes the element (i.e. tile) in upper left corner (1,1) with a routine (or "tactic", in ScalPL) called
cholesky here (
DPOTRF in the
Jeannot paper, which uses the LAPACK kernels). This computation relies only on the previous value of this element to produce the new value. The algorithm then computes each of the other elements in the first column (i.e. directly below that top corner) with a routine called
trisolve here (
DTRSM in the Jeannot paper), updating each element using the new value in 1,1 as well as its own former value. And then, whenever the appropriate inputs are available, the algorithm computes all the other elements (i.e. those not in the first column), where element i,j is updated using its previous value and elements (i,1) and (j,1) from the first column. For non-diagonal elements, routine
update here (
DGEMM in the Jeannot paper) is used for this, and for diagonal elements (for which only one value from the first column is used, since i=j), routine
updateDiag (
DSYRK in Jeannot) is used. When enough data has been computed from this iteration, the next iteration (or in our case, recursion) can begin, which works identically except that the whole computation is shifted diagonally down one, so that element 2,2 from one iteration is used as the new top corner of the matrix in the next iteration, so the matrix it operates on is smaller by one in both dimensions. That means that no element of the first column of any iteration (computed by
cholesky and
trisolve) will ever be accessed by subsequent iterations, so those results can theoretically be used by (passed back to) the caller ASAP, such as if there is more potential concurrency there. Iterating/recursing continues until an iteration operates on a 1x1 matrix, at which point
cholesky is performed on that element and we're done.
A top level ScalPL strategy (i.e. plan or program) to accomplish this is shown below. The matrix of tiles is represented by the resource rectangle at the bottom (labeled
Tiles), with two hash marks on the left to represent the two dimensions, and the invocation of
cholesky,
trisolve, and
Updates are represented by the three circles with those labels. (We'll use the convention that capitalized names like
Update represent strategies, uncapitalized represent tactics.) The circle at the right (which is unlabeled due to having a role bound to, and thus named,
$plan) represents the recursive invocation to the next iteration. The circle at the left (
truncateRange) simply initializes resource
TwoToN with the range of indices of the matrix it is operating upon except for the first (i.e. 2 to N, where N is the matrix size), to provide indices for
trisolve and
Updates. The remaining circle (containing the
$null tactic to the right of
cholesky) is to "short circuit" and return the
cholesky result in the final 1x1 case, since
trisolve and
Updates won't be invoked in that case. The control state of
TwoToN signifies whether this is that final case (in which case it is
Only1Row/red) or the normal case (where it is
AtLeast2Rows/yellow).
|
Cholesky strategy (click to enlarge) |
As is standard, the binding modifiers (arrows within the circles) carry indices (from their tail, or primary role) to the array being indexed (at their head, or secondary role), with the binding rules (within the dashed boxes) denoting which dimensions are being indexed. A new construction of parentheses within a binding rule is being used here (part of "ScalPL 1.1"), which means that a single source index is used for multiple dimensions, so, for example,
cholesky takes the constant 1 on its primary as both the first and second dimension of its secondary, thus binding that secondary role (
Tile) to resource element
Tiles(1,1). Everything else is "by
the book". The "dupall" asterisk on
trisolve's top role (to
TwoToN) replicates that context for each integer in that range 2 to N, then uses each integer in that range as an index (specified by the green binding modifier) to the first dimension of
Tiles for its
Lower role (green arrow/rule), and the constant one for its second dimension (red arrow/rule), thus binding each replicant to a different element of the first column. Similarly, the
Column1 role of
Updates is bound to the first column of
Tiles (rows
TwoToN in green, column 1 in red), and the
Rest role is bound to everything else (rows and columns
TwoToN in black). The binding rule on the rightmost circle/context (i.e. the recursive call to
Cholesky) represents that the first and second dimensions of
Tiles are effectively shifted/translated by 1 (the constant on the primary) so that the next iteration can continue to address the "new" first column as 1, etc.
The main difficulty to be addressed in any solution is in obtaining maximum concurrency while still ensuring that elements are not being accessed/read before or during the time they're updated/written. Specifically, that includes not letting any
trisolves in an iteration access element (1,1) before it is established by
cholesky, or any of the
update routines in an iteration access elements of the first column before they are established by
trisolve, or letting subsequent iterations access elements still being established by this iteration, all while also allowing elements of the first column to be modified by the caller (by virtue of having been returned early) before the
update routines have finished accessing them. Control states in ScalPL are purpose-built for handling this sort of issue, and in this case, those on
Tiles do a good job:
LastIter (green) for elements not yet updated,
Altered (yellow) for those updated but perhaps still needing to be accessed, and returned (red) for passing them back to the parent/activator when finished.
|
Updates strategy |
As indicated by the bindings in the
Cholesky strategy above, the
Updates strategy (at right) gets elements 2 to N of the first column of the
Tiles matrix (which it will only observe/read) as
Column1, and the rest of the matrix to its right (some of which it will update) as
Rest, as well as the range of indices being represented as
TwoToN. The dupall on the
RowNum role replicates the context/circle activating the
UpdateRow strategy once for every row 2 to N, supplying the unique row number to each replicant on
RowNum, and binding the
Row role for each replicant to that row. The
Column1 role for all replicants are bound to the same first column.
|
UpdateRow strategy |
So each replicant of the
UpdateRow strategy (at left) gets the first column (as
Column1), one row of the rest (as
Row), and the row number represented as
RowNum. Tactic
internalIndices initializes
TwoToRowNumMinus1 with (yes) the range 2 to
RowNum-1. The left context/circle activates one replicant of
update for each integer in that range -- i.e. for each element in the row except the last (which is in the diagonal) -- binding role
ToUpdate to that row element, role
Upper to the corresponding element of
Column1, and role
Lower to the element of
Column1 corresponding to the row (
RowNum). The right context/circle activates
updateDiag just once for the last element of the row (in the diagonal), binding
FromColumn to the corresponding element of
Column1 and
ToUpdate to the element of
Row.
These diagrams basically
are the program. All that's left is the coding of the tactics (
cholesky,
trisolve,
update,
updateDiag,
truncateRange, and
internalIndices) in a base language like C or Fortran. The first four are likely simple calls to standard kernels as in the Jeannot paper (e.g.
DPOTRF,
DTRSM,
DGEMM, and
DSYRK, respectively). The last two are a few lines each, e.g. something like:
internalIndices {TwoToRowNumMinus1->beg = 2; TwoToRowNumMinus1->end = *RowNum - 1;}
One of the trickiest aspects of making a truly portable implementation is in deciding what to do with the first column of an iteration. If there is potential concurrency in the calling parent/mainline, there may be some merit to returning that first column to the caller soon after it has been computed by
cholesky and
trisolve, otherwise not. But even if so, how soon after? Since that first column must still be read by
update and
updateDiag steps in the iteration, should return of the column wait until these are done, or should a copy be made for these or for the caller to facilitate an early return? (On a message-passing system, the copy may be made anyway by virtue of communicating.) Or maybe access should be returned to the caller immediately, with a "copy on write" made only if the caller then tries to modify the elements while they are still being accessed by the
update and
updateDiag steps? Regardless of the answers we choose, building them into the algorithm will make it less portable for cases where they would be better answered differently.
Fortunately, these are exactly the issues addressed by carefully established semantics of predictability in ScalPL. In this particular case, in the
Cholesky strategy, we see that the
Column1 role of
Updates is predictable (with read permission and one transition), meaning that once something within
Updates accesses an element, that transition will eventually be issued in
Cholesky for that element, even if
Updates itself doesn't issue it -- and indeed, it never does in this case. This leaves the Director (i.e. runtime) flexibility to wait for any length of time, and/or make copies/snapshots always or on demand/write. These decisions can be made in realtime, or (perhaps in many cases) via hints/annotations by a human planner which do not alter the behavior.
One more minor/technical point. In this example,
Column1 and
TwoToN in
Updates are shown as activation formals, represented with filled triangles. This is primarily to simplify coding by allowing transition back to the initial/green control state, which is not ordinarily allowed for formal resources. This approach does have the slight downside that
Updates cannot activate until all of the elements of the first column are
Altered/yellow. (This also means that the entire
Column1 role is effectively accessed by
Updates as soon as it activates/binds, so predictable transitions will eventually result to all of those elements.) The use of activation formals here could be avoided by adding another control state to
Column1 and
TwoToN, and adding a
$null activity to each to transition from green to that new control state, which would then logically be the new initial control state, as suggested in Figure 9.9 in the book.)
I am personally quite happy with this example. It demonstrates the flexibility of the binding modifiers, even when dealing with triangular arrays, and of the predictability rules in solving very real issues. Overall, I believe the diagrams are highly intuitive, and carry at least as much information as a similar page area of text would. And I believe that this demonstrates advantages of ScalPL over other approaches for scalability, portability, correctness, and formality.