A former colleague (and his associates) recently referred me to this video/slide presentation: Ivan Sutherland, The Sequential Prison, at the SPLASH 2011 conference. I met Dr. Sutherland once some months ago at an alumni event at Portland State University, and understood that there was an overlap in our work, but this presentation provided an opportunity to understand how much.
The talk contains few very concrete or completely novel recommendations, but even in its generality, it's worth a listen, and it is clear that the speaker and I are on the same wavelength more often than not. He discusses his intuition that the form of our traditional human and computer languages, together with specific words in them, affects our approach to problem solving as a sequential endeavor, and that a more picture-oriented language would help to avoid this "sequential prison" that we've locked ourselves into. When questioned later about functional languages, he mentions how much they seem to be built around the concept of sequences, and how this also affects the tendency to build and process things serially.
As might be apparent from my work described on this blog, I agree with all of that. Even things we don't agree on exactly, we agree on vaguely. For example, in the parallel (or concurrent) world, he prefers the term "configuration" over "program". While I also tend to avoid terms like program and algorithm as being too sequential in heritage, I use the term "plan" in general and "strategy" as a more specific form of plan. Ivan seems to suggest that picture-oriented languages might require multiple simple forms to convey a single concept in enough specificity to facilitate execution, while my work in ScalPL concentrates on a single, rather technically annotated form, in part as an alternative to the complexities I've seen in the multi-view forms of UML and the like. Whether ScalPL (or more specifically, my ScalPL diagrams) are meaningful and easy to read, whether they would "leap into" Dr. Sutherland's mind as he requests, I can't say, but I do think they capture the spirit: Pipelines do look like pipelines, concurrency does look like concurrency -- to me, anyway -- as he requests. Other representations (such as those he requests) can be assembled by the planning tools.
His discussion starting at about the 30-minute mark is very descriptive of ScalPL, though he seems to be addressing primarily hardware. In fact, the sort on page 279 of my Scalable Planning book (shown here, to the left) seems to me to be exactly the sort he describes there, but where he describes it as an instruction, it is here a ScalPL strategy. The first question (in the Q&A near the end of the video) relates to the difference between hardware and software, with the old impression that software and hardware are different beasts, that software is too dynamic, too fluid, too soft, to draw the same way we do with hardware. I have not found that to be the slightest problem in my work.
Of course, his discussion of standardization is also well taken. ScalPL is available.
Thursday, March 21, 2013
Wednesday, March 20, 2013
How ex-treme is exa-scale?
[I drafted this some months ago, but for some reason didn't post it. It still looks good to me, so here goes.]
At the SC12 conference in November, I attended a couple of BoFs (Birds of a Feather meetings) regarding exascale. The first, an Open Community Runtime (OCR) for Exascale, was helpful in providing pointers, such as to this website which in turn points to slides, which in turn led me to sites for Intel's Concurrent Collections (CnC) (including UTK) and University of Delaware Codelets/Runnemede work. But I found myself wincing at some of the things said during the BoF, and after looking at these sites and documents, I admit to being disappointed and, frankly, frustrated. This perhaps became even more pronounced in the second BoF, Resilience for Extreme-scale High-performance Computing. (In hindsight, there are others I wish I had caught.)
First, I would contend that even if we were to come up with a design for The Correct Programming Model(tm) and The Correct Runtime Support(tm) today, it could still realistically take until 2020* or so to develop the appropriate software, support, migration, software engineering techniques (debugging, tracing, patterns, methodologies, libraries, etc.) to make them useful. So there's no time to lose. And from what I was seeing, these groups are not only a long way from having The Correct design, in some ways they seem to be actually going back into history. More specifics about that in an upcoming blog entry.
(*As I write this, I find that the 2020 goal may be more like 2022, but my argument still holds.)
Second, in order to come up with The Correct design, we must understand the critical constraints, not saddle ourselves with non-critical ones and unwarranted assumptions. This is where people like Seymour Cray (starting with "a clean sheet of paper") and Steve Jobs (whether considered a visionary, big thinker, or "tweaker") excelled. Instead, in these meetings, I was hearing "requirements" such as the ability to run MPI programs effectively, or the desirability (necessity?) of using checkpoint/restart to achieve resilience. Even if backward compatibility may be desirable in some cases, and existing practice may be a useful jumping-off point, we will get nowhere by using them to guide the design.
Third, but strongly related to the above, it struck me that I was seeing lots of familiar faces in these BoFs, mostly from my days (a decade+ ago) on the MPI2 Forum. And while reconnecting with former colleagues is part of what the SC conferences are all about, using a standards-based approach is almost certainly not the best way to make significant advances in technology. New approaches will be required to achieve the goals, and standards for those can be established only after those goals are attained. (I could further argue that existing standards have artificially interfered with our progress in this field in other ways, but I'll save that rant for another day, except to say that standards committees are constantly bombarded with the reminder that their job is to standardize existing practice, often while maintaining compatibility with previous versions of the standard. That's not the recipe for quantum leaps!)
So what is the most effective way to achieve the exascale software goals? One might take a hint from Seymour Cray, in the link above: "Shunning committees, he felt that the best computers were the ones where a single architect offered a unified vision. After the machine had been delivered, it was then appropriate, Mr. Cray felt, to listen to feedback from customers and, if necessary, start over from 'a clean sheet of paper.'" I claim that there is no reason to assume that the same wouldn't also be true for software.
In other words, try some big ideas, in multiple (at least two) stages, assume that you may not succeed in the early (e.g. first) attempt, but will learn a lot nonetheless, and integrate what you learn into later stages. That is, in fact, how we fashioned the strategy for NASA's Information Power Grid (back when I had some influence there)... before that plan was apparently revised (and discarded?).
Of course, it could be argued that this "think big with master architect" approach is precisely the one applied in DARPA's HPCS program, by funding individual companies and allowing them full control over their product/proposal. It could also be argued that that program has had limited success (though I would hardly call it a failure). And to those arguments, I would counter that the goals that were not conservative were underspecified. There was little advantage for the participants in adding more objectives than those provided by the sponsors, and in fact there were disadvantages, in that their competitors for the funds would then be working toward simpler goals.
If failure is not an option, then the goals are too conservative. Consider the manned lunar program: HUGE goals, and a really concrete way to measure if we'd achieved them or not. We were proud to meet them, but it was by no means assured. The exascale project has set some very high hurdles regarding power consumption/energy and reliability/resilience, which seem to fit into this model, but again, at least the goals I've seen on programmability are either conservative or vague or both. Of course, in a political climate, failure to meet a goal can be painful. And where there is public money, there is (and should be) politics.
And, like the moon landing, the trade-off between expense and benefits of attempting to meet (or actually meeting) the goals is a separate issue from whether the goals can be met at all, and the balance is not necessarily clear-cut. If the US government is the customer, they must decide how much it is worth to them, and other participants must decide if it fits their business (and/or research) model. The manned space program was run from within the government. Such a model seems unlikely today.
How about data movement? (For real, this time.)
In the previous post, I addressed the kinds of inherent constraints that an F-Net (and/or ScalPL) plan imposes (and does not impose) on data movement. The question then becomes how to turn these constraints into an implementation: Sooner or later, the decision to move or not move each data item must be made to meet the constraints, and something must make those decisions.
The most basic way to meet the constraints is by assigning all the computations (actons) to one processor, putting all the data (resources) into its local memory, and executing the whole thing sequentially. Unlike some other models (like most message passing) with semantics that intrinsically impose additional copying and/or data movement, or that require context switching between heavyweight address spaces, this sequential execution of F-Nets requires little if any overhead over a native sequential expression of the same computation, either for data movement/copying or context switching. In F-Nets, if a sub-computation/acton is going to complete at all, it can always be executed to completion before moving on to any other subcomputation/acton, and there is no need to save transient/intermediate state in this case. So the only need for more heavyweight context switching is when an acton diverges ("infinitely loops") or appears it may do so. (Most message passing requires data copying or movement even in a sequential environment, because both the sender and receiver are assumed to have their own mutable copy of the message after the communication, and may both be required to make progress for the overall computation to progress.)
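To make that concrete, here is a minimal sketch in Python (my own illustrative names, not ScalPL syntax or any actual runtime) of that purely sequential case: all content state sits in one local store, and each acton runs to completion in place, so nothing is copied, moved, or checkpointed between actons.

```python
# A minimal sketch of sequential F-Net-style execution: every resource lives
# in one local store, and each acton (here just a Python callable) runs to
# completion in place, so no data is copied or moved and no intermediate
# state is ever saved.
resources = {"x": 3, "y": None}      # all content state in one local memory

def double_x(store):                 # an "acton": reads x, writes y
    store["y"] = store["x"] * 2

def report(store):                   # another acton: reads y
    print("y =", store["y"])

for acton in (double_x, report):     # run each acton to completion, in order
    acton(resources)
```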
But in the more common case, where we want an F-Net/ScalPL plan to exploit the parallelism of a platform, how best to decide how (and when) to embed the plan (program or strategy) onto that platform? In the good old days, parallel programmers made several assumptions about both the platform and the algorithm to simplify this embedding process: We assumed that the platform consisted of a fixed number of homogeneous, dedicated, fault-free general-purpose processors connected with a predictable, regular communication topology. We also often modeled the application/plan as a set (and usually a sequence) of parallelized steps ("epochs", phases) such as array operations, where each potentially-concurrent operation in a step took about as long as any other operation in that step to complete. Based on such assumptions, it was sometimes viable to statically map the computations and their associated data onto the processors in the platform.
Under those same assumptions, an F-Net or ScalPL plan can also be statically embedded on a platform in similar ways. Rather than assigning resources to processors (as some languages might assign array elements under an "owner computes"-like rule), it makes far more sense to assign the computations themselves -- i.e. actons -- to processors, by annotating their contexts. Another approach is to assign each control state of each resource to a processor. In either of these cases, the constraint (implied in the F-Net itself) that the content state of a resource must be colocated with the acton during execution is assumed, and annotations which conflict with this can be resolved in a number of sensible ways (e.g. majority rules). In ScalPL, by allowing these annotations to include the index values supplied by so-called "dup bindings", such embeddings can be a function of the array elements processed, so are no less expressive than those in HPF, etc.
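As an illustration only (the names here are hypothetical, not ScalPL's), a static embedding of this kind amounts to little more than a table mapping annotated acton contexts, possibly parameterized by a dup-binding index, to processors:

```python
# A hypothetical illustration of static embedding by annotating actons rather
# than data: each acton context carries a processor annotation, here expressed
# as a function of the "dup binding" index i, much as HPF maps array elements
# by index.
NUM_PROCS = 4

def processor_for(acton_name, i):
    # e.g. a cyclic placement keyed on the duplicated index
    return i % NUM_PROCS

embedding = {("update_row", i): processor_for("update_row", i) for i in range(8)}
print(embedding)   # {('update_row', 0): 0, ('update_row', 1): 1, ...}
```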
Even so, just as with deep subroutine calls in HPF, the hierarchical and dynamic nature of ScalPL means that different (perhaps recursive) activations of the same plan (strategy, program) may be best assigned to different parts of the platform, yet preferably related to the embedding of their parent activation in sensible ways -- so static labeling of contexts isn't necessarily sufficient. This can be at least partially addressed by a simplified form of parameter passing -- i.e. providing special (integer) variables in child plans which are set equal to the annotation of the context in the parent, and which the child can then use (e.g. in mathematical expressions) when annotating its own contexts.
Even better is to represent the overall embedding as a two-step process: the first step maps the computations to an abstract platform, which is considered to have easily-understood internode properties, and the second maps that abstract platform to a concrete/actual one, preserving as many of those properties as possible. My work has focused on an abstract platform (AP) representation of nodes addressed by finite sequences of non-negative integers, allowing the AP topology to be interpreted in a number of ways -- e.g. as a sequence/cycle (e.g. if just single integers), or a rectangular grid (e.g. if tuples of integers), or a tree (e.g. with the zero-length sequence representing the root, and any other sequence representing a descendant of all nodes addressed by its prefix sequences). The annotation/parameter passing described in the previous paragraph can now be of these sequences, instead of integers, with the child able to append new suffixes to the sequence it was passed. Mapping of the AP to the actual platform can be manual or automated, based on various assumptions. A common shorthand is to assign all nodes having the same prefix to a particular actual processor, which can be interpreted as assigning an entire row or column (if the AP is considered a grid) or subtree (if the AP is considered a tree) to that processor. (People who are accustomed to HPF and similar embedding languages may feel this approach lacks constructs for transpose, etc., but (a) these would be handled in embedding to the AP, not in mapping the AP to the actual architecture, and (b) such a transpose, if necessary at all, is imposed by mapping the pre-transpose computations and post-transpose computations to appropriate nodes, since mappings effectively change between steps rather than within them.)
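A rough sketch of that two-step scheme, under names and simplifications of my own choosing (not ScalPL notation), might look like this:

```python
# AP nodes are finite tuples of non-negative integers, interpretable as a
# cycle, a grid, or a tree; the common shorthand below sends every AP node
# sharing the same length-1 prefix to one physical processor, i.e. a whole
# row/column of the grid, or a whole subtree, lands together.
def child_address(parent, suffix):
    """A child activation appends a suffix to the AP address it was passed."""
    return parent + (suffix,)

def ap_to_processor(ap_node, num_procs):
    """Map an AP address to a concrete processor by its first coordinate."""
    return ap_node[0] % num_procs if ap_node else 0   # root goes to processor 0

root = ()                                        # the zero-length sequence is the root
leaf = child_address(child_address(root, 2), 5)  # AP node (2, 5)
print(ap_to_processor(leaf, num_procs=16))       # 2: shared with every (2, ...) node
```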
But that leaves two important points. The first is that the traditional simplifying assumptions for parallel static embedding (summarized above, in the "good old days" paragraph) are virtually all unrealistic these days. Applications for concurrency are no longer confined to scientific and/or matrix operations, so their concurrency is no longer regular in shape or uniform in subcomputation duration. And this irregularity, together with increased concurrency, makes it less practical to devote entire machines (or portions thereof) to specific applications, so node sharing (yielding variable loads) must also be accommodated. And as platforms grow ever larger, it becomes less practical to keep them homogeneous and fault-free. The bottom line is that static embedding becomes all but impossible: These embedding decisions must be made at runtime, in real time. Using the two-step process mentioned above, only the mapping of the AP to the concrete platform must be dynamically determined.
And the second point, related to the first, is that making such dynamic embedding decisions on a fine-grained basis is also impractical. This is another reason for facilitating non-trivial, large-granularity subcomputations as the basis for embedding. And that, in turn, requires that these subcomputations be more flexible and expressive than traditional data flow -- e.g. allowed to produce multiple results (each potentially directed to different downstream subcomputations), and results which affect only parts of larger data structures. This is a prime motivation behind the structural properties which have been endowed upon actons (the subcomputations) in F-Nets/ScalPL. Because these actons are typically entire subcomputations/subprograms in themselves, the embedding decisions within an acton assigned to a single processor (to memory, registers, logical units, etc.) can be performed statically and locally, such as by a compiler, or if necessary, by out-of-order scheduling within the processor. These local, lightweight decisions are, by now, well understood.
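For illustration, here is a toy example (ordinary Python, not F-Nets/ScalPL code) of the two properties just mentioned: one coarse acton emitting multiple results for different consumers, and another updating only part of a larger resource in place.

```python
# Why actons need to be more expressive than plain dataflow operators: one
# coarse-grained acton may emit several named results, each routed to a
# different downstream acton, and a result may update only part of a larger
# resource rather than redefining the whole value.
def partition_acton(table):
    evens = [v for v in table if v % 2 == 0]
    odds  = [v for v in table if v % 2 != 0]
    return {"evens": evens, "odds": odds}     # multiple results, separate consumers

def patch_acton(big_array, index, value):
    big_array[index] = value                  # in-place partial update of a resource
    return big_array

results = partition_acton([1, 2, 3, 4, 5])
print(results["evens"], results["odds"])      # [2, 4] [1, 3, 5]
print(patch_acton([0] * 5, 2, 99))            # [0, 0, 99, 0, 0]
```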
So that is the answer to how data movement is determined within an F-Nets/ScalPL plan (i.e. a strategy, potentially consisting of sub-strategies, ultimately down to actons): It is a balance between explicit parameterized embedding of a strategy/plan onto an Abstract Platform, dynamic (or, in restricted cases, static) embedding of the AP onto a concrete platform, and traditional compiler/processor scheduling of processor resources for each acton within the strategy/plan. The book goes into some depth in describing an entity known as The Director, and how it can efficiently determine when to move data. There is also significant discussion of how to optimize the movement of data depending on architecture type (e.g. shared vs. distributed memory) and the form of the strategy (e.g. with multiple readers of a single resource, readers and writers, etc.).
Monday, March 11, 2013
What about data movement?
I received a recent question from someone quite familiar with the fields of high-performance computing and data flow, so I figured others might share his concerns, and answering here might kill more birds than a personal reply. (Just in case there are privacy concerns, I'll not cite his name or his precise words.)
It included statements with which I wholeheartedly agree -- e.g. that data movement is very expensive in modern compute environments (e.g. relative to local computation). He says that the F-Net model doesn't say anything about when and where data should move, which he says is required of any modern model.
It is, in fact, a central goal of F-Nets (and ScalPL) to put as few constraints as possible on how and when data might move in order to carry out the specified plan (aka execute the program, evaluate the mapping or calculation) -- but enough constraints so that the plan can be efficiently implemented on different platforms (e.g. shared and distributed memory) by knowing (in advance, when possible) when colocation of data and program/plan fragments must occur. Even the term "data movement" might be confusing here, in that the model doesn't preclude moving parts of the plan itself instead of the data upon which those parts operate -- but those, too, can be considered as data during movement. F-Nets/ScalPL is all about expressing (without overexpressing) those constraints -- i.e. the kinds of access each subcomputation (acton) will need to which data (resources), and (assuming that each subcomputation (acton) is to be statically evaluated without migration) the data and program fragments that need to be colocated for this to occur. It is designed to be neither too much nor too little information.
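As a loose illustration (the plan layout and names here are hypothetical, not ScalPL syntax), the information such a plan carries is essentially "which acton needs what kind of access to which resources", from which colocation requirements follow, while saying nothing about where or when anything actually moves:

```python
# A hypothetical sketch of the kind of information an F-Net/ScalPL plan does
# express: for each acton, which resources it touches and what kind of access
# it needs -- exactly what an implementation must know in advance about
# colocation, without dictating data movement.
plan = {
    "smooth":  {"grid": "read_write", "params": "read_only"},
    "measure": {"grid": "read_only",  "stats":  "write_only"},
}

def must_colocate(acton):
    # during its (non-migrating) execution, the acton and the content state
    # of every resource it accesses must be brought together somewhere
    return set(plan[acton].keys()) | {acton}

print(must_colocate("smooth"))   # e.g. {'smooth', 'grid', 'params'} (set order arbitrary)
```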
One can be forgiven for assuming that satisfactory forms of portable algorithmic expression already exist, e.g. functional or dataflow languages. As a brief overview: Languages and models fall broadly into control flow and data flow camps, where the former defines what gets to evaluate next based on a program counter (the functions called or objects invoked/messaged), and the latter defines it based on which data is available to compute with. Functional languages are closely related to dataflow languages, but the interpretation of a variable is slightly different: simply a shorthand for an entity like a numeric value or subfunction, rather than a dynamic entity which becomes filled/defined or not. Traditional parallel constructs (e.g. messages or shared memory) are often a sort of hybrid between control and data flow, where there are several loci of control ("threads") that are selectively and intentionally blocked when necessary to wait for data to become available (from another thread).
Data flow approaches are traditionally facilitated by single-assignment variables (or arcs) that change state to signify when data is available there for use by downstream calculation. (Lazy and non-strict evaluation can allow evaluation to progress even when some of the potentially-needed data isn't available.) But as indicated by the nomenclature, the variables can be assigned to only once (or simply used/considered as shorthand for another expression), in essence ignoring a central feature of virtually all computing hardware, including that found in nature: mutable memory, the ability to change the contents or character of something stored without changing where it is stored. As a result, when only part of a large datum is being altered in these languages, the entire result must be defined in terms of the entire previous value -- very unlike what needs to actually happen internally for any efficiency. That is, assignment to a data flow variable has nothing much to do with updating storage, but rather with signaling the availability of the data to some part of the program. Because this signaling is integrated into the assignment, it is impossible (or unnatural) to signal availability without assignment, or to assign without signaling availability.
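A toy contrast, using my own made-up classes rather than any particular dataflow language, may make the point clearer:

```python
# A single-assignment cell couples "this value now exists" to "here is its
# entire value", so changing one element means rebuilding the whole datum in
# a new cell, whereas mutable memory lets you update in place without
# re-signaling anything.
class SingleAssignmentCell:
    def __init__(self):
        self._value, self._set = None, False
    def assign(self, value):
        if self._set:
            raise RuntimeError("already assigned")    # write-once: no in-place update
        self._value, self._set = value, True
    def value(self):
        if not self._set:
            raise RuntimeError("not yet available")   # availability == assignment
        return self._value

old = SingleAssignmentCell()
old.assign(list(range(1000)))
new = SingleAssignmentCell()                          # must be a *new* cell...
new.assign(old.value()[:500] + [42] + old.value()[501:])   # ...built from the whole old value

mutable = list(range(1000))
mutable[500] = 42                                     # mutable memory: touch one element
```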
These shortcomings are not always obvious in examples. For example, functions (or algorithms) which are primarily mathematical in nature (e.g. produce a single deterministic result from many arguments), and which don't benefit from partial assignment and/or update in place, don't really need anything more than traditional data flow. However, in real-world programming, operations often produce multiple results, perhaps each destined for different downstream computations, and each of those results often provides only part of an overall datum, a datum which may be partially or completely updated by several computations during its lifetime -- plus, any particular datum might best be updated on a first-come first-served basis, rather than in some deterministic order. These are not so natural for data flow languages and/or approaches.
F-Nets and ScalPL overcome these constraints fairly directly by providing separate constructs for the control state (i.e. accessibility) and the content state (i.e. content) of a variable. The control state is related to the flow of control, and the content state to the flow of data, and as in traditional data flow models, both are associated with the variable (called a resource), rather than with an independent control flow. By allowing manipulation of control state and content state explicitly and independently, flexibility, efficiency, and portability ensue.
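A minimal sketch (my own naming and simplifications, not ScalPL's actual definitions) of a resource carrying the two kinds of state separately:

```python
# A resource carries both a control state (who may access it, and how) and a
# content state (the data itself), so availability can be signaled without
# assigning, and content can be updated without changing accessibility.
class Resource:
    def __init__(self, content=None, control="empty"):
        self.control = control      # e.g. "empty", "full", "reading", "writing"
        self.content = content      # the actual data

    def signal(self, new_control):
        self.control = new_control  # change accessibility, content untouched

    def update(self, new_content):
        self.content = new_content  # change data, accessibility untouched

r = Resource()
r.update([0] * 8)       # fill in content without yet declaring it available
r.signal("full")        # now announce availability to downstream actons
r.content[3] = 7        # later: mutate in place without re-signaling
```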
It does introduce some complications. Can a single construct still really represent both broadcast (or multicast) and shared readers? How about buffering on distributed systems while avoiding the extra overhead of queuing in shared memory? Read and write locks? (Can making an extra copy be made semantically and syntactically identical to blocking writers waiting for readers to finish?) And all in the presence of non-terminating functions and Turing undecidability? And while hiding latency on platforms and applications where that applies? It takes some finesse in definitions, but it can all be done, and these fine points are the essence of F-Nets.
OK, fine, even assuming I've convinced you that F-Nets/ScalPL is a great portable representation, that still leaves the original question: How, when, and by what does the issue of data movement actually get settled? I'll address some of that in the next entry. (A more complete explanation is in chapters 4 and 5 of the book.)