POOMA post-2.4 parallel model

CONTEXT: In discussing the parallelism models that POOMA supports, the word `context' has a particular meaning. For our purposes, a context contains a single address space in memory, and one or more threads that view that same address space. In the currently supported parallelism models, each context contains exactly one parse thread that executes the user's code. Note that a context does not refer to a particular component of hardware. You can run a code with multiple contexts on a single shared-memory multiprocessor. The threads running in a particular context may be moved from processor to processor on some machines. On the other hand, a single context could not consist of threads running on processors that don't see the same memory.

The parallelism model supported by POOMA 2.0 was a threaded model operating on a single context. (This parallelism model is supported by the SMARTS run-time system.) One version of the user's code runs in the parse thread, and data-parallel statements are evaluated using iterates that can be executed by other threads that see the same memory space. In the threaded version, all the threads share the data and are capable of accessing any part of an array. POOMA 2.3 continues to support the threaded model and introduces a second model in which multiple copies of the user's code run in separate contexts, with the data distributed among them. A messaging layer is used to send data when the user code executing on one context needs part of an array that is owned by another context. (The messaging support is provided by the CHEETAH layer.) POOMA 2.3 does not support a combined model where multiple threads run inside each of multiple contexts. The multiple-context evaluators in POOMA were designed with the multithreaded evaluators in mind; however, there are thread-safety and message-polling issues that must be addressed in the messaging layer. The 2.4 release should address these issues and permit a parallel model with multithreaded contexts that communicate with each other through messages.

After POOMA 2.4, two new models of parallelism are supported: OpenMP thread-level parallelization, if supported by the compiler, and use of an available MPI library such as MPICH or a vendor-provided implementation. The two models, MPI and OpenMP, may be combined if the MPI implementation supports this kind of operation, which is especially useful for clusters of SMP workstations. These new modes of operation are selected with the --mpi and --openmp configure switches.
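
For example, a build enabling both models might be configured as follows (the --arch and --opt values here are illustrative placeholders; substitute the settings appropriate for your platform):

./configure --arch LINUXgcc --opt --mpi --openmp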

CHEETAH overview:

The Advanced Computing Laboratory's CHEETAH library is a messaging library that supports a set of asynchronous interfaces, such as put, get, and remote function invocation. CHEETAH also supports a very limited C++ interface. The CHEETAH library can be built using MPI as the underlying messaging layer, or on top of the MM shared-memory library for message passing.

To compile POOMA with CHEETAH, use the --messaging option when configuring. See one of the files in the scripts directory, such as "scripts/buildPoomaCheetahLinux", for an example of compiling and running a POOMA code using CHEETAH. Typically CHEETAH requires extra command-line arguments to specify the messaging library. For example:

mpirun -np 4 mycode -mpi

POOMA wraps a useful set of CHEETAH calls. It is recommended that you use the POOMA versions, since they will work when you compile codes without CHEETAH, and it is possible that the CHEETAH interface may have to change slightly to accommodate threading issues.
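
For example, the following minimal program is a sketch using only the POOMA wrappers (Pooma::initialize(), Pooma::context(), Pooma::contexts(), and Pooma::finalize()); it should behave the same whether or not CHEETAH support is compiled in, reporting context 0 of 1 in a non-messaging build:

#include "Pooma/Pooma.h"
#include <iostream>

int main(int argc, char *argv[])
{
  // Initialize POOMA (and the messaging layer, when one is configured).
  Pooma::initialize(argc, argv);

  // These wrappers are safe with or without CHEETAH.
  std::cout << "context " << Pooma::context()
            << " of " << Pooma::contexts() << std::endl;

  Pooma::finalize();
  return 0;
}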

Cross-Context parallel model:

A major difference between the threaded model and the message-passing model available in POOMA 2.3 is that multiple copies of the main program are running. For example, consider the statement:

std::cout << "My context is: " << Pooma::context() << std::endl;

If your application contained this statement and you ran it on a cluster of computers, you would see output with numbers ranging from 0 to one less than the number of contexts (assuming standard out actually makes it back to your console somehow). The POOMA Inform class is a useful tool for managing output from multiple contexts.
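
As a sketch (assuming Inform's default behavior of printing only from context 0, so the message appears once rather than once per context; the stream label is arbitrary), the statement above could instead be written:

Inform msg("myapp");
msg << "My context is: " << Pooma::context() << std::endl;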

Where a data-parallel statement appears in your code, both the threaded and the message-passing models evaluate the expression just once. For example, consider the statement:

a = b + 2 * c;

In the threaded model using SMARTS, the main thread is the only one executing the user's code. The main thread sees this statement and generates iterates to evaluate the expression on different parts of the domain. These iterates can then be run concurrently by other threads. In the message-passing model, each context is running a copy of the user's code, so multiple processors will execute this statement. Each context knows that it is responsible for evaluating only a portion of the total domain, however, so each context generates iterates for its local domains and then executes them.

Situations in which the user has to worry about the effect of the parallel model should be very rare. One important issue to keep in mind is that scalar code runs on every context. In most parallel applications, the time spent computing scalar values is small compared to the array statements, so the duplicate computations don't add significantly to run times. The cross-context model is intended to be single-program, multiple-data (SPMD), so scalar computations should produce the same value on all contexts. It is possible to violate this model, but doing so will almost certainly cause problems. For example, you could write this code:

if (Pooma::context() == 3) { a = b + 2 * c; }

This code will almost certainly fail, however, because contexts that may need to communicate values will never see the data-parallel statement.
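
Guarding purely scalar side effects, by contrast, is safe, because no context is cut off from a data-parallel statement it must participate in. An illustrative pattern is restricting console output to a single context:

if (Pooma::context() == 0) { std::cout << "finished time step" << std::endl; }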

Remote Engines:

The original engine type introduced in POOMA 2.0, the Brick engine, causes the same data to be allocated and computed on all contexts. This design results in a surprising lack of parallelism. Rather than convert the Brick engine into a truly parallel object, it was decided in POOMA 2.3 to use the Brick engine as a component of a genuinely parallel engine. There may be situations where you wish to store duplicate copies of data on every context, in which case you would continue to use a Brick engine. POOMA 2.3 introduces the Remote engine. Remote engines are wrappers for other block-type engines, and a Remote engine is allocated on only one context. When a Remote engine is used in computations where its data is needed on another context, a message is sent containing the necessary data. When a Remote engine is used as the patch engine for a MultiPatch engine, the result is an array that is decomposed into blocks distributed among different contexts. For an example of an array containing multiple patches of remote bricks, see the code in examples/Doof2d. To write truly parallel applications in POOMA, you need to use a MultiPatch array with Remote engines as the patch engines.
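
The following is a minimal sketch modeled on that example (the domain size and patch counts are arbitrary, and the exact headers may differ in your installation):

#include "Pooma/Pooma.h"
#include "Pooma/Arrays.h"

int main(int argc, char *argv[])
{
  Pooma::initialize(argc, argv);

  // A 1000x1000 domain split into a 4x4 grid of patches.
  Interval<1> n(1000);
  Interval<2> domain(n, n);
  UniformGridPartition<2> partition(Loc<2>(4, 4));

  // DistributedTag spreads the patches across the contexts;
  // ReplicatedTag would instead duplicate them on every context.
  UniformGridLayout<2> layout(domain, partition, DistributedTag());

  // Each patch is a Remote<Brick>, so its data lives on exactly one
  // context and is messaged to other contexts on demand.
  Array<2, double, MultiPatch<UniformTag, Remote<Brick> > >
    a(layout), b(layout), c(layout);

  b = 1.0;
  c = 2.0;
  a = b + 2 * c;   // evaluated patch by patch, in parallel across contexts

  Pooma::finalize();
  return 0;
}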