If ScalPL is supposed to be expressive enough to capture patterns, then it should be able to capture this one, and this popular use case can provide some insight into ScalPL's expressiveness and flexibility. Ideally, the ScalPL plan will accommodate the benefits and ordering of map-reduce just mentioned on a distributed platform/architecture, but will also work on different platforms with different characteristics, where it might benefit from more pipelining, less communication, etc. Remember, with a proper runtime system ("Director"), a ScalPL plan is executable -- a program -- rather than just a first step toward a final implementation.
[Figure: Possible High-Level Context for MapReduce]
The blocks (of course) may not be stored or indexed contiguously at some low level, but by the time the MapReduce plan accesses them, they can be made to appear sequential. In the example above, FileBlockIndices is a sequence of indices to the blocks in the file, FileBlockIndexCount is the number (or range) of those indices stored in FileBlockIndices, and FileBlockSize is the maximum number of bytes per block. By using the ScalPL mapping binding modifier (1/1 here) and the anonymous (underscore) role binding from FileBlockIndices, the MapReduce plan just sees (on FileContent) a collection of blocks numbered sequentially within FileBlockIndexCount. We have that plan producing its results to KeyResults, indexed by key, and changing the control state of that key to red. (If a result is not produced for a given key, the control state for that key remains green.)
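As a rough sketch of what that indirection buys us, the following Python stands in for the 1/1 mapping binding: the plan asks for block n and never sees the non-contiguous low-level layout. All names and the flat-byte-array storage model here are illustrative assumptions, not ScalPL or any real file system API:

```python
# Illustrative sketch only: file_block_indices plays the role of
# FileBlockIndices, file_block_index_count of FileBlockIndexCount,
# and block_size of FileBlockSize.

def read_block(storage, block_id, block_size):
    """Fetch one raw block by its low-level id (toy flat-byte-array storage)."""
    offset = block_id * block_size
    return storage[offset:offset + block_size]

def file_content(storage, file_block_indices, file_block_index_count, block_size):
    """Yield the file's blocks numbered 0..count-1, hiding the fact that
    the underlying block ids may be scattered and out of order."""
    for n in range(file_block_index_count):
        yield read_block(storage, file_block_indices[n], block_size)
```

For example, with block ids stored as [3, 0, 2], the consumer still just sees blocks 0, 1, 2 in order.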
[Figure: MapReduce Strategy]
(ERRATA: The binding modifiers, "1", on MapBlock and ReduceKey should technically select the key field from the record, e.g. each be replaced with "1=key".)
That still leaves the $null circle at the bottom of the strategy. One of the difficulties inherent in concurrent planning/programming is determining when an activity has finished, or should finish. While completion of later stages in a sequential plan implies that earlier stages are also complete, a concurrent plan like this one can have data at different stages of processing simultaneously. Although we could keep any reduction from even beginning until all data had been read (split) and mapped, we want to make this as general and concurrent as possible, so will instead just ensure that a reduction for any particular key won't complete until all the records with that key have been mapped. That's what the $null activity does here: It activates only when all of the file blocks have been fully processed (indicated by BlocksDone) and, for any key, (a) a spot has been reserved for that key's final result (in KeyResults), and (b) there are no records with that key waiting to be reduced (in MappedKeyedRecs). In that case, the $null changes the control state of KeyStatus for that key, indicating to ReduceKey that it can finish reducing that key.
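The $null activity's firing condition can be summarized as a predicate. This is only a hypothetical Python rendering for readers following the logic; in ScalPL the same test is expressed through control states and role bindings, not code:

```python
# Illustrative sketch: the argument names mirror the resources in the
# strategy (BlocksDone, KeyResults, MappedKeyedRecs), nothing more.

def key_may_finish(blocks_done, key, key_results_reserved, mapped_keyed_recs):
    """A reduction for `key` may finish only when (1) every file block has
    been fully processed, (2) a result slot has been reserved for the key,
    and (3) no mapped records with that key are still waiting to be reduced."""
    return (blocks_done
            and key in key_results_reserved
            and not mapped_keyed_recs.get(key))
```

Note that condition (3) alone is not enough: without BlocksDone, a key's queue might be momentarily empty while more records for it are still being mapped.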
All of the new constructs introduced in the previous blog post ("ScalPL 1.1") are utilized here. All of the arrays except for BlocksDone and FileContent are indexed in at least one dimension by a key. This is strictly illegal (except for integer keys) in the book, but is allowed for keys of any type by #1 of that blog post. The role binding from $null to MappedKeyedRecs is to (effectively) infinite elements, checking whether all of them are empty (green control state), which was illegal by the book, but allowed in #2 of the blog post. And both ReduceKey and $null use the construct just introduced in #3 of the blog post, the open binding form of bindany, to find some key which allows those activities to bind, especially using the control state of KeyResults. (Note that the binding from ReduceKey to KeyResults uses an activation binding, with the @ prefix, as described in the book section 7.3.2, to ensure that the plan will not activate until that role is ready.)
Now to delve into the substrategies, MapBlock and ReduceKey.
[Figure: MapBlock Strategy]
The $null activity in the lower left activates, and transitions BlockDone, when no other records will be produced (on MappedKeyedRecs) from this file block -- specifically, when (a) BlockPtr is empty by virtue of split hitting the end of the block, (b) RecBoundaries has no records waiting to be mapped, and (c) MappedRecs has no records waiting to be categorized by key.
[Figure: ReduceKey Strategy]
Plans split, Map, Reduce1, and Reduce2 are not shown here. Most or all may be implemented as tactics, e.g. in a typical computer language, and hopefully enough insight has been provided here to see how they would be constructed simply and naturally. The last three (at least) would be customized for the problem being solved, so optimally, they would be parameters themselves -- i.e. they would be fed into the plans using roles, and plugged into their respective contexts using $plan roles. I have omitted those details here to focus more directly on the flow of the data. (Another way of achieving that focus, using current ScalPL tools, would have been to simply put them on a separate drawing layer, and then hide or highlight different drawing layers at different times.)
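To see how the four transformations compose, here is a strictly sequential Python analogue, discarding all the concurrency the strategies provide. The function bodies (newline-split records, word counting) are illustrative stand-ins; only the names mirror the plans:

```python
from collections import defaultdict

def split(block):
    """Stand-in for split: find record boundaries (here, newline-separated)."""
    return block.split("\n")

def map_rec(rec):
    """Stand-in for Map: emit (key, value) pairs (here, word counting)."""
    return [(word, 1) for word in rec.split()]

def reduce1(acc, value):
    """Stand-in for Reduce1: fold one value into a key's accumulator."""
    return acc + value

def reduce2(acc):
    """Stand-in for Reduce2: finalize a key's accumulated value."""
    return acc

def map_reduce(blocks):
    mapped = defaultdict(list)           # plays the role of MappedKeyedRecs
    for block in blocks:                 # MapBlock, once per file block
        for rec in split(block):
            for key, value in map_rec(rec):
                mapped[key].append(value)
    results = {}                         # plays the role of KeyResults
    for key, values in mapped.items():   # ReduceKey, once per key
        acc = 0
        for v in values:
            acc = reduce1(acc, v)
        results[key] = reduce2(acc)
    return results
```

The ScalPL strategies express exactly this data flow, except that the MapBlock and ReduceKey loops become concurrent context replicants, with the $null activities supplying the completion signals the sequential loops get for free.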
And in a sense, that is the entire point of this post. These plans/strategies show how these four simple data transformations fit together with appropriate communications to create the Map-Reduce paradigm. It is not just a high-level, hand-waving approximation; it is an actual executable solution, given an appropriate run-time system.
Not discussed here are the specific communication paradigms or mechanisms that make this all work -- something which many would consider central to the whole Map-Reduce paradigm. The idea is that, as discussed in section 5.3.1 of the book, on some particular architecture, the planner (or other helper) would annotate the contexts and/or resources in these strategies to denote where the activities and/or content state would reside at various times. For example, in a distributed architecture, the contexts in the MapReduce strategy above would be annotated, with each MapBlock context replicant tied to the block (i.e. block index) it processes, and each ReduceKey context location computed from the key it processes. The communication, then, would occur as the content state of MappedKeyedRecs moves between these embeddings. In the best case, that communication strategy could be optimized automatically, but some human guidance/hints would not be out of the question. The important point, however, is that these implementation details (via these annotations) are kept separate from the raw algorithms shown here, which are independent of platform.
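To make those placement annotations concrete, a helper might compute locations like the following. Both functions are illustrative assumptions (round-robin block striping, hash-partitioned keys), not part of ScalPL; they merely show that the annotations can be simple functions of the block index or key:

```python
import zlib

def map_location(block_index, num_nodes):
    """Run each MapBlock replicant where its block resides (assumes, for
    illustration, that blocks are striped round-robin across nodes)."""
    return block_index % num_nodes

def reduce_location(key, num_nodes):
    """Compute each ReduceKey replicant's location from its key, so all
    MappedKeyedRecs content for one key converges on a single node.
    Uses crc32 rather than Python's built-in hash, which is salted per run."""
    return zlib.crc32(str(key).encode()) % num_nodes
```

With these annotations in place, the movement of MappedKeyedRecs from map_location(block) to reduce_location(key) is precisely the shuffle phase of conventional Map-Reduce implementations.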