
POOMA Tutorial 4
Further Topics

Contents:
    Introduction
    Block And Evaluate
    Small Vectors and Tensors
        A Note on Tensor Accumulators
    Using Multiple Patches
    Guard Layers
    Taking Layout Into Account
    Component Views
    Summary

Introduction

One of POOMA's most attractive features is its high performance on both single-processor and shared-memory multiprocessor machines. As future releases of the library will also support distributed-memory multicomputers and networks of workstations, POOMA's authors have had to think very carefully about how to obtain the best possible performance across a wide range of applications on different architectures.

The heart of the problem POOMA's authors face is that while data-parallel programming is a natural way to express many scientific and numerical algorithms, straightforward implementations of it do exactly the wrong thing on modern RISC architectures, whose performance depends critically on the re-use of data loaded into cache. If a program evaluates A+B+C for three arrays A, B, and C by adding A to B, then adding C to that calculation's result, performance suffers both because of the overhead of executing two loops instead of one, and (more importantly) because every value in the temporary array that stores the result of A+B has to be accessed twice: once to write it, and once to read it back in. As soon as this array is too large to fit into cache, the program's performance drops dramatically.

The first section of this tutorial explains what POOMA does to solve this problem. Subsequent sections discuss other advanced aspects of POOMA, such as reduction functions that will execute efficiently regardless of how arrays are laid out in memory, and the use of traits classes to provide programs with information about POOMA objects.

Block And Evaluate

POOMA tries to resolve the tension between how programmers want to express their ideas, and what runs most efficiently on modern architectures, by delaying the evaluation of expressions until enough is known about their context to ensure that they can be evaluated efficiently. It does this by blocking calculations into units called iterates, and putting those iterates into a queue of work that is still to be done. Each iterate is a portion of a computation, over a portion of a domain. POOMA tracks data dependencies between iterates dynamically to ensure that out-of-order computation cannot occur.

Depending on the switches specified during configuration when the library is installed, and the version of the POOMA library that a program is linked against, POOMA will run in one of four different modes. In the first mode, the work queue doesn't actually exist. Instead, the single thread of execution in the program evaluates iterates as soon as they are "queued" (i.e., work is done immediately). The result is that all of the calculations in a statement are completed by the time execution reaches the semi-colon at the end of that statement.

In its second mode, POOMA maintains the work queue, but all work is done by a single thread. The queue is used because the explicit parceling up of work into iterates gives POOMA an opportunity to re-order or combine those iterates. While the overhead of maintaining and inspecting the work queue can slow down operations on small arrays, it makes operations on large arrays much faster.

For example, consider the four function calls that perform red/black relaxation in the second tutorial. In order to get the highest possible performance out of the cache, all four of these expressions should be evaluated on a given cache block before any of the expressions are evaluated for the next cache block. Managing this by hand is a nightmare, both because cache size varies from machine to machine (even when those machines come from the same manufacturer), and because very slight changes in the dimensions of arrays can tip them in or out of cache. POOMA's array classes and overloaded operators do part of the job by creating appropriately-sized iterates; its work queue does the rest of the job by deciding how best to evaluate them. The net result is higher performance for less programmer effort.

POOMA's third and fourth modes of operation are multi-threaded. Each thread in a pool takes iterates from the work queue when and as they become available. Iterates are evaluated independently; the difference between the two modes is that one is synchronous and blocks after evaluating each data-parallel statement, while the other is asynchronous and permits out-of-order execution. The table below summarizes these four modes, along with the configuration arguments used to produce each.

Mode                      Description                              Configuration
1. Synchronous Serial     Conventional sequential execution        --serial
2. Asynchronous Serial    Serial work queue                        --parallel --sched serialAsync
3. Synchronous Parallel   Multithreaded; blocks after each         --parallel --sched sync
                          data-parallel statement
4. Asynchronous Parallel  Multithreaded, out-of-order execution    --parallel --sched async

A very important function in POOMA's work allocation system is Pooma::blockAndEvaluate(). It is one of only two functions that expose the library's internal parallelism and cache optimizations to users. While POOMA automatically calls it at the right times in most cases, there are a few situations in which programmers should call it explicitly.

If evaluation has been deferred, the statements being evaluated are not guaranteed to have completed until blockAndEvaluate() is called. POOMA does this by itself inside operator<<, reductions, and so on, but there is one place where the performance overhead of that check would be unacceptably high: indexing with integers. If blockAndEvaluate() were called inside every use of operator(), it would be impossible to write efficient serial loops.

This means that when a program is running in modes 2-4 (i.e., using either parallelism or potentially asynchronous execution), it must call blockAndEvaluate() before subscripting arrays with integers. Failure to do so can lead to race conditions and other hard-to-find errors.

Typical uses of blockAndEvaluate() look like:

Array<2> A(N, N);
A = 0;
Pooma::blockAndEvaluate();
A(N/2, N/2) = 1;

or:

Loc<2> center(N/2, N/2);
Pooma::blockAndEvaluate();
A(center) = 1;

Without the calls, the code might not parallelize correctly. If, however, code like the following is used instead:

Interval<2> center(Interval<1>(N/2, N/2), Interval<1>(N/2, N/2));
A(center) = 1;

then correct execution is guaranteed, because this assignment will be handled using all of POOMA's parallel machinery. Of course, the safe version is somewhat slower, and should not be used inside a time-critical loop, since it would implicitly be doing locking and unlocking on every call.

It can be very tedious to place blockAndEvaluate() calls in code that mixes scalar and data-parallel statements. It is easier and less error-prone to simply turn off asynchronous operation temporarily. This is accomplished by calling Pooma::blockingExpressions(true) at the beginning of such a block and then calling Pooma::blockingExpressions(false) at the end.
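
A minimal sketch of such a block, reusing the array A and size N from the examples above:

Pooma::blockingExpressions(true);    // every statement now completes in place
for (int i = 0; i < N; ++i)
{
    A(i, i) = 1.0;                   // scalar indexing is safe here
}
A = 2.0 * A;                         // data-parallel statements still work
Pooma::blockingExpressions(false);   // restore deferred evaluation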

Small Vectors and Tensors

POOMA includes two "tiny" classes that are optimized to represent small vectors and tensors. Not surprisingly, these are called Vector and Tensor; their declarations are:

template<int Size, class T = double, class EngineTag = Full>
struct Vector;

template<int Size, class T = double, class EngineTag = Full>
struct Tensor;

The size parameters specify the fixed size(s) of the objects, and are used as follows:

Vector<3> v;         // 3-component vector of doubles.

Vector<2, int> vi;   // 2-component vector of ints.

Tensor<2, int> t;    // 2×2 tensor of ints.

Note that these classes use engine abstractions, just like their grown-up Array counterparts. The only engine class available for Vector in this release is Full, which signals that all elements of the vector or tensor are stored. For Tensor, in addition to Full, POOMA provides Antisymmetric, Symmetric, and Diagonal classes to use for the EngineTag parameter. The names of these classes describe their mathematical meaning. The following table shows the definitions of the tensor symmetries, the indexing convention, and the way the data values are stored internally in the Tensor<Dim,T,EngineTag> classes. Note that only values that cannot be computed from other values are stored, but the user can still index non-Full Tensor objects as if they had all elements stored.

EngineTag      Tensor Structure      (i,j) Indices        Array Storage of Elements

Full           | x00  x01  x02 |     | 0,0  0,1  0,2 |    | x_m[0] x_m[3] x_m[6] |
               | x10  x11  x12 |     | 1,0  1,1  1,2 |    | x_m[1] x_m[4] x_m[7] |
               | x20  x21  x22 |     | 2,0  2,1  2,2 |    | x_m[2] x_m[5] x_m[8] |

Symmetric      | x00  x10  x20 |     | 0,0  0,1  0,2 |    | x_m[0]               |
               | x10  x11  x21 |     | 1,0  1,1  1,2 |    | x_m[1] x_m[3]        |
               | x20  x21  x22 |     | 2,0  2,1  2,2 |    | x_m[2] x_m[4] x_m[5] |

Antisymmetric  |  0  -x10 -x20 |     | 0,0  0,1  0,2 |    |                      |
               | x10   0  -x21 |     | 1,0  1,1  1,2 |    | x_m[0]               |
               | x20  x21   0  |     | 2,0  2,1  2,2 |    | x_m[1] x_m[2]        |

Diagonal       | x00   0    0  |     | 0,0  0,1  0,2 |    | x_m[0]               |
               |  0   x11   0  |     | 1,0  1,1  1,2 |    |        x_m[1]        |
               |  0    0   x22 |     | 2,0  2,1  2,2 |    |               x_m[2] |
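
For example, here is a minimal sketch of indexing a non-Full tensor; the scalar-fill constructor and write-through behavior are assumed to parallel the Diagonal example below:

Tensor<3, double, Symmetric> s(0.0);  // all six stored values set to 0.0
s(2, 0) = 5.0;         // writes the single value stored for (2,0) and (0,2)
double m = s(0, 2);    // reads back 5.0 from the same stored element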

The code below (included in the release as examples/Tiny) is a short example of how Vector and Tensor classes can be used:

01 #include "Pooma/Arrays.h"
02
03 int main(
04      int                 argc,           // argument count
05      char*               argv[]          // argument list
06 ){
07  // Initialize POOMA.
08  Pooma::initialize(argc, argv);
09
10  // Make an array of 100 3D ray vectors.
11  Loc<1> patches(25);
12  UniformGridLayout<1> layout(Interval<1>(100), patches);
13  Array< 1, Vector<3>, MultiPatch<UniformTag,Brick> > rays(layout);
14  
15  // Set the third component of all of the vectors to zero.
16  rays.comp(2) = 0.0;
17  
18  // Starting some scalar code, must block.
19  Pooma::blockAndEvaluate();
20  
21  // Fill the vectors with a random value for the first component.
22  for (int i = 0; i<100; i++)
23  {
24    rays(i)(0) = rand() / static_cast<double>(RAND_MAX);
25  }
26
27  // Define a unit vector pointing in the y direction.
28  Vector<3> n(0.0, 1.0, 0.0);
29    
30  // Set the second component so that the length is one.
31  rays.comp(1) = sqrt(1.0 - rays.comp(0) * rays.comp(0));
32
33  // Reflect the rays off of a plane perpendicular to the y axis.  
34  rays += -2.0 * dot(rays, n) * n;
35  
36  // Define a diagonal tensor:
37  Tensor<3,double,Diagonal> xyflip2(0.0);
38  xyflip2(0,0) = -2.0; 
39  xyflip2(1,1) = -2.0;
40
41  // Tensor-Vector dot product multiplies x and y components by -2.0:
42  rays = dot(xyflip2, rays);
43 
44  // Output the rays.
45  std::cout << rays << std::endl;
46  
47  // Clean up and leave.
48  Pooma::finalize();
49  return 0;
50 }

As line 13 of this code shows, programs can declare POOMA Arrays with elements other than basic arithmetic types like int or double. In particular, Vector, Tensor, and complex are explicitly supported. Please contact freepooma-devel@nongnu.org for information on using other, more complicated types.

The Array::comp() method used on line 16 does component forwarding. The expression rays.comp(2) returns an Array of double values that supports writing into the third component (index 2) of each vector element of rays. This is a data-parallel statement that works in a way analogous to the loop at lines 22-25, except that the POOMA evaluator will calculate patches in parallel. Similarly, if a program had an array of tensors T, it could change the element in the 0th row, 1st column of every tensor with T.comp(0, 1). Note that, unlike Array, both Vector and Tensor always index from zero. Component forwarding is intimately related to the notion of component views, which are discussed below.
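
For example, a sketch of forwarding into a (hypothetical) one-dimensional array of tensors T:

Array< 1, Tensor<2> > T(10);  // ten 2x2 tensors of doubles
T.comp(0, 1) = 3.0;           // set element (0,1) of every tensor in T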

Line 24 shows that, as expected, the ith component of a Vector V can be accessed for both reading and writing using the syntax V(i); Tensor element access requires two subscripts. Thus, the first subscript in the expression rays(i)(0) returns the ith element of the array rays (which is a Vector<3>), while the second subscript returns the zeroth component of that vector.

Line 28 shows that Vectors can be initialized with Size element values. Similarly, instances of Tensor can be initialized with Size*Size element values.
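
For example, a minimal sketch:

Vector<3> v(1.0, 2.0, 3.0);         // three element values
Tensor<2> t(1.0, 2.0, 3.0, 4.0);    // four (2x2) element values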

The data-parallel expression on line 31 shows that the usual math functions can be applied to entire arrays. The unary and binary functions supported are:

acos   asin   atan    ceil   cos     cosh
exp    fabs   floor   log    log10   sin
sinh   sqrt   tan     tanh   imag    real
abs    arg    norm[1]
ldexp  pow    fmod    atan2  dot[2]  polar[3]

[1] complex<T> only
[2] Vector and Tensor only
[3] complex<T> only

Line 34 is a data-parallel expression on vectors. In addition to dot product, the normal arithmetic functions involving Vector and Tensor are supported (see the note on tensor accumulators below for exceptions), as are the following named functions on vectors and tensors:

norm(Vector<D,T,E> &v):
Returns a scalar of type T, equal to sqrt(dot(v, v)).

norm2(Vector<D,T,E> &v):
Returns a scalar of type T, equal to dot(v, v).

trace(Tensor<D,T,E> &t):
Returns a scalar of type T, equal to the trace of the tensor t (sum of diagonal elements).

det(Tensor<D,T,E> &t):
Returns a scalar of type T, equal to the determinant of the tensor t.

transpose(Tensor<D,T,E> &t):
Returns a tensor of type Tensor<D,T,E>, equal to the transpose of the tensor (element (i,j) of the transpose is equal to element (j,i) of the input tensor t).

template<class OutputEngineTag, int D, class T, class E>
Tensor<D,T,OutputEngineTag> &symmetrize(Tensor<D,T,E> &t)
:
Returns a tensor of type Tensor<D,T,OutputEngineTag>, applying an appropriate symmetrizing operation to convert from the symmetry of the input EngineTag (for example, Full) to the symmetry of the OutputEngineTag (for example, Antisymmetric). This is invoked using explicit template instantiation for the desired OutputEngineTag. For example:
Tensor<2,double,Full> t(1.0, 2.0, 3.0, 4.0);
Tensor<2,double,Antisymmetric> at = symmetrize<Antisymmetric>(t);
std::cout << " t = " << t << std::endl;
std::cout << "at = " << at << std::endl;
produces the output:
 t = ( ( 1 3 )( 2 4 ) )
at = ( ( 0 0.5 )( -0.5 0 ) )

dot(Vector&, Tensor&):
Returns a Vector via matrix-vector product of the arguments.

dot(Tensor&, Vector&):
Returns a Vector via matrix-vector product of the arguments.

dot(Tensor&, Tensor&):
Returns a Tensor via matrix-matrix product of the two arguments.

outerProduct(Vector&, Vector&):
Returns a Tensor (with EngineTag=Full) via outer (tensor) product of the arguments.
These functions also operate on Arrays of Tensor and Vector elements (and DynamicArrays and Fields).
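
As a minimal illustration of a few of these functions, with values worked out by hand (element order as in the symmetrize() example above):

Vector<3> v(3.0, 4.0, 0.0);
double len  = norm(v);             // sqrt(dot(v, v)) == 5
double len2 = norm2(v);            // dot(v, v) == 25
Tensor<2> t(1.0, 2.0, 3.0, 4.0);
double tr = trace(t);              // 1 + 4 == 5
double d  = det(t);                // 1*4 - 3*2 == -2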

Lines 37-39 show construction of a diagonal tensor using the Tensor class with Diagonal for the EngineTag parameter; line 37 constructs it with all (diagonal) values equal to 0.0, then lines 38 and 39 assign the first two elements along the diagonal to -2.0. Line 42 illustrates the Tensor-Vector dot product, returning a Vector.

This release of POOMA does not offer double-dot products, cross products or any other vector or tensor operations; these are being considered for future releases.

Finally, as line 45 shows, arrays of vectors can be output like arrays of any other type.

A Note on Tensor Accumulators

Accumulation operators such as operator*=() acting on Tensor<D,T,EngineTag> may produce a result whose symmetry, and hence EngineTag, differs from that of the object being accumulated into. For example,

Tensor<2,double,Antisymmetric> t1, t2;
// ... assign values
t1 *= t2;
is incorrect, because the result of multiplying the two antisymmetric tensors would be a symmetric tensor, whose values cannot be stored in the left-hand-side object t1, which is antisymmetric. For this reason, the only accumulation operators currently defined for Tensor types are operator+=() and operator-=(), which do not change the symmetry. A consequence of this is that the only reduction operator acting on Arrays of Tensor elements which works is the sum() reduction.
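
A minimal sketch of the accumulators that are defined (assuming, as with the Diagonal constructor in the earlier example, that a tensor can be filled from a single scalar value):

Tensor<2, double, Antisymmetric> t1(1.0), t2(2.0);  // one stored value each
t1 += t2;    // fine: a sum of antisymmetric tensors is antisymmetric
t1 -= t2;    // likewise
// t1 *= t2; // not defined: the product would not be antisymmetric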

Using Multiple Patches

Our next Laplace equation solver uses the class MultiPatch to help POOMA take advantage of whatever parallelism is available. An array with a MultiPatch engine breaks the total domain into a series of blocks. Such an array is defined as follows:

// Define the total domain.
Interval<2> totalDomain(100, 100);

// Define the number of patches in each dimension
// (here a 10×10 grid of patches, each 10×10 elements).
Loc<2> patches(10, 10);

// Create a UniformGridLayout.
UniformGridLayout<2> layout(totalDomain, patches);

// Create the array containing 100 Brick patches, each 10×10.
Array< 2, double, MultiPatch<UniformTag,Brick> > A(layout);

The Interval declaration specifies the total logical domain of the array being created. The 10×10 Loc is then used in the UniformGridLayout declaration to specify that the total domain is to be managed using a total of 100 patches. When the Array A is finally declared, Array's third template parameter is explicitly instantiated using MultiPatch, and the layout object layout is used as a constructor parameter.

Once all of this has been done, A can be used like any other array. However, if a data-parallel expression uses multi-patch arrays, POOMA's evaluator automatically computes values for the patches in parallel. This means that the relaxation program shown below (included in the release as examples/Solvers/UMPJacobi) would be able to take full advantage of multiple processors, if the machine it was being run on had them, but would be equally efficient on a conventional uniprocessor:

01 #include "Pooma/Arrays.h"
02 
03 #include <iostream>
04 
05 const int N = 18; // The size of each side of the domain. 
06 
07 template<class T1, class E1, class T2, class E2>
08 void
09 ApplyJacobi(
10     const Array<2, T1, E1>      & V,	// to be relaxed
11     const ConstArray<2, T2, E2> & b,	// fixed term
12     const Range<2>              & IJ	// region of calculation
13 ){
14     V(IJ) = 0.25 * (V(IJ+Loc<2>(1,0)) + V(IJ+Loc<2>(-1,0)) +
15                     V(IJ+Loc<2>(0,1)) + V(IJ+Loc<2>(0,-1)) - b(IJ));
16 }
17 
18 template<class T1, class E1>
19 void
20 ApplyPeriodic(
21     const Array<2, T1, E1>      & V  // to be wrapped
22 ){
23     // Get the horizontal and vertical extents of the domain.
24     Interval<1> I = V.domain()[0],
25                 J = V.domain()[1];
26 
27     // Copy each of the four slices in turn.
28     V(I.first(), J) = V(I.last()-1, J);
29     V(I.last(),  J) = V(I.first()+1,J);
30     V(I, J.first()) = V(I, J.last()-1);
31     V(I, J.last())  = V(I, J.first()+1);
32 }
33 
34 int main(
35     int                 argc,           // argument count
36     char *              argv[]          // argument vector
37 ){
38     // Initialize POOMA.
39     Pooma::initialize(argc,argv);
40 
41     // The domain with guard cells on the boundary.
42     Interval<2> guarded( Interval<1>(0, N+1), Interval<1>(0, N+1) );
43 
44     // Create the layouts.
45     UniformGridLayout<2> guardedLayout( guarded, Loc<2>(4, 4) );
46 
47     // The array we'll be solving for.
48     Array<2, double, MultiPatch<UniformTag,Brick> > V(guardedLayout);
49     V = 0.0;
50 
51     // The right hand side of the equation.
52     Array<2, double, MultiPatch<UniformTag,Brick> > b(guardedLayout);
53     b = 0.0;
54     
55     // Must block since we're doing some scalar code here (see Tutorial 4).
56     Pooma::blockAndEvaluate();
57     b(3*N/4,   N/4) = -1.0;
58     b(  N/4, 3*N/4) =  1.0;
59 
60     // The interior domain, now with stride 2.
61     Range<2> IJ( Range<1>(1, N-1, 2), Range<1>(1, N-1, 2) );
62 
63     // Iterate 200 times.
64     for (int i=0; i<200; ++i)
65     {
66         ApplyJacobi(V, b, IJ);
67         ApplyJacobi(V, b, IJ+Loc<2>(1,1)); 
68         ApplyJacobi(V, b, IJ+Loc<2>(1,0));
69         ApplyJacobi(V, b, IJ+Loc<2>(0,1));
70         ApplyPeriodic(V);
71     }
72 
73     // Print out the result.
74     std::cout << V << std::endl;
75 
76     // Clean up and report success.
77     Pooma::finalize();
78     return 0;
79 }

A program can go one step further and take advantage of the fact that MultiPatch is itself templated. The first template parameter, LayoutTag, specifies the type of domain decomposition that is performed. If UniformTag is specified, then all of the blocks are assumed to have the same size. If GridTag is specified, then the domain decomposition can consist of non-uniform blocks, still arranged in a Dim-dimensional grid (see examples/Solvers/GMPGuardedJacobi). Future releases will include a tile layout that can cover the domain with blocks that are not necessarily arranged on a grid.

The second template parameter specifies the type of Engine used in the patches. If MultiPatch<UniformTag, CompressibleBrick> is used as a template parameter in an Array declaration, then POOMA does not actually allocate memory for a patch as long as all of the values in that patch are known to be the same. For example, if a wave propagation program initializes all but a few array elements to zero, then each patch whose elements are all zero is stored as a single compressed value, and is expanded into full storage automatically, on demand, when its values start to differ. This can save significant execution time in the early stages of such calculations.
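
Declaring such an array requires only a different engine tag; a minimal sketch, reusing the layout object from the fragment above:

Array< 2, double, MultiPatch<UniformTag,CompressibleBrick> > C(layout);
C = 0.0;   // each patch holds one repeated value, so all stay compressed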

Note that POOMA can deal with MultiPatch arrays having different layouts. However, best performance is obtained when all layouts in an expression are the same (though some may have guard layers, as discussed in the following section).

Another variation on this program that uses threads explicitly is presented in the appendix. This program is more complex than the one above, but also has tighter control over what happens and when it happens.

Guard Layers

Multipatch arrays do present a complication for the evaluation of expressions. Evaluating stencils such as those involved in the Jacobi iteration becomes tricky near the edge of a patch, since data is required from a neighboring patch. This is handled by evaluating the strips near the edge separately from the bulk of the patch. Since the overhead of evaluating a patch is roughly constant, these small sub-patch evaluations hurt efficiency.

One mechanism for fixing this problem is to introduce guard (or ghost) layers. This is done by having the individual patches overlap slightly. Each patch still "owns" the same data as before, but surrounds that data with a layer of guards. These guards duplicate data that is owned by other patches, and can only be read, not written. The evaluator can then be written as a single loop over the entire owned portion of the patch, with the stencil terms reading from the guard layers. POOMA takes care of keeping the data in the guard layers in sync with the neighboring patches.

The guards described above are known as internal guards. POOMA also supports the notion of external guards. For Array objects, external guards are simply syntactic sugar for declaring a layer of cells around the domain of interest. POOMA Field objects hide the external guards and use them to calculate boundary conditions.

One can modify the Jacobi example simply by passing two GuardLayers objects to the layout constructor, one specifying the internal guards and another specifying the external guards:

// Specify the internal guards
GuardLayers<2> igls(1);

// Specify no external guards
GuardLayers<2> egls(0);

// Define the number of the patches.
Loc<2> patches(4, 4);

// Create a UniformGridLayout with internal guards.
UniformGridLayout<2> guardedLayout( guarded, patches, igls, egls );

Complete examples using guard cells are presented in the UMPGuardedJacobi and GMPGuardedJacobi examples in examples/Solvers.

POOMA can support different guard layers for each axis, and for both the high and low faces along each axis. These are specified by initializing the GuardLayers object with two raw int arrays, such as:

int lower[] = { 2, 0, 1 };
int upper[] = { 0, 0, 1 };

GuardLayers<3> internal(lower, upper);

This code fragment initializes internal to have two guard layers on the lower side of the first dimension, and one layer on each of the lower and upper sides of the third dimension.

Taking Layout Into Account

We now examine how to construct a loop-based reduction engine that takes into account some of the different ways POOMA arrays can be laid out in memory. Some aspects of this example are left unexplained, or glossed over, since the main intent of this example is to show how intermediate or advanced users of the library can tailor it to their needs.

The most common array layout in POOMA is called a brick layout, and is signaled by the use of the class Brick as an engine specifier in template instantiation. Conceptually, a brick is a dense, rectangular patch of multi-dimensional space, such as the area [0..10]×[0..10]. Programs written by the typical user access the elements of bricks using nested loops, the indices of which sweep through the brick's extent along a particular axis. Production code employs more complicated access loops in order to take full advantage of cache behavior.

The three functions accumulateWithLoop() defined below are the guts of the general-purpose adding routine that we will build up in this example. Each routine loops over the axes of an array of a particular dimension; the C++ compiler knows which version of the overloaded function to instantiate by matching the actual dimension of the array being summed against the dimension value specified as the first template argument to ConstArray (i.e., 1, 2, or 3). The real version of this code has seven versions of accumulateWithLoop(), since POOMA arrays can have up to seven dimensions. Note that these functions have to be written explicitly, since there is no way in C++ to create entirely new statements (such as new nested loops) through template expansion.

template<class T, class E>
inline T accumulateWithLoop(
    const ConstArray<1,T,E>& x
){
    T sum = 0;
    int first0 = x.first(0), last0 = x.last(0);
    for (int i0=first0; i0<=last0; ++i0)
        sum += x(i0);
    return sum;
}

template<class T, class E>
inline T accumulateWithLoop(
    const ConstArray<2,T,E>& x
){
    T sum = 0;
    int first0 = x.first(0), f1 = x.first(1);
    int last0 = x.last(0),  l1 = x.last(1);
    for (int i1=f1; i1<=l1; ++i1)
        for (int i0=first0; i0<=last0; ++i0)
            sum += x(i0, i1);
    return sum;
}

template<class T, class E>
inline T accumulateWithLoop(
    const ConstArray<3,T,E>& x
){
    T sum = 0;
    int first0 = x.first(0), f1 = x.first(1), f2 = x.first(2);
    int last0 = x.last(0),  l1 = x.last(1),  l2 = x.last(2);
    for (int i2=f2; i2<=l2; ++i2)
        for (int i1=f1; i1<=l1; ++i1)
            for (int i0=first0; i0<=last0; ++i0)
                sum += x(i0, i1, i2);
    return sum;
}

The next step is to write four versions of the interface function that will actually be called by users. These four functions appear the same from a user's point of view (i.e., the syntax that a programmer types in to invoke these functions is indistinguishable). The first version uses explicit specialization to pattern-match arrays that have actual Brick engines:

template<int D, class T>
T accumulate(
    const ConstArray<D,T,Brick>& x
){
    return accumulateWithLoop(x);
}

This function just calls whichever version of accumulateWithLoop() handles arrays of dimension D. Since accumulateWithLoop() is an inline function, this one extra function call will be eliminated by the compiler when this code is optimized.

The second version of this function handles arrays whose engines are BrickViews, rather than Bricks. Recall that a BrickView is an alias for a subset of the elements in an actual Brick. The template class BrickView takes a dimension and a Boolean as template arguments; the Boolean specifies whether the BrickView can assume a unit stride along its first dimension. Taking a view of a Brick leads to this parameter being true; otherwise, it is false.

template<int D1, class T, int D2, bool S>
T accumulate(
    const ConstArray< D1, T, BrickView<D2,S> >& x
){
    return accumulateWithLoop(x);
}

The third version of accumulate() is the one that makes this example interesting:

template<int D, class T>
T accumulate(
    const ConstArray< D, T, MultiPatch<UniformTag,Brick> >& x
){
    typename UniformGridLayout<D>::iterator
        i = x.message(GetGridLayoutTag<D>()).begin(),
        e = x.message(GetGridLayoutTag<D>()).end();
    T sum = 0;
    while (i != e)
    {
        sum += accumulate(x(*i));
        ++i;
    }
    return sum;
}

Instances of the class UniformGridLayout store information about the patches that make up a uniform multi-patch. (They do other things as well; please see the POOMA documentation for the full list.) The first three lines of the function shown above declare a pair of iterators, which the function then uses to iterate through the patches of the array. The expression x(*i) accesses a single patch; this patch is then passed to whichever version of accumulate() is appropriate for patches of that kind.

Our final version of accumulate() exists to ensure that arrays using other storage mechanisms can still be summed. When the C++ compiler expands templates, it takes a more-specific match in preference to a less-specific match. Thus, since the unconstrained template parameter E is used as the third parameter in the template parameter list below, instead of a concrete engine tag class such as Brick, the compiler will only choose this version of accumulate() when no other version will do:

template<int D, class T, class E>
T accumulate(
    const ConstArray<D,T,E>& x
){
    return accumulateWithLoop(x);
}

It is important to note that if the specialized versions of accumulate() had not been defined, this generic version would return the correct answer for any kind of array storage, including MultiPatch. The only advantage of looping over patches explicitly is that it yields better cache usage, and hence higher performance, since patch sizes are usually chosen so that the whole of each patch will fit into cache at once. POOMA therefore allows programmers to make sure that their code is working correctly before they start tuning it, and to tune programs incrementally based on the results of profiling.
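
Because all four interface functions share the same calling syntax, user code does not change as engine types do. A hypothetical usage sketch (the layout object is assumed to be declared as in the MultiPatch fragment earlier):

Array<2, double, Brick> plain(100, 100);
Array< 2, double, MultiPatch<UniformTag,Brick> > patched(layout);
double s1 = accumulate(plain);     // dispatches to the Brick version
double s2 = accumulate(patched);   // dispatches to the per-patch version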

For an example of a program that uses ideas like these, but manages threads explicitly, see the appendix.

Component Views

It is often useful to create an array of a structured type, such as a Vector<3>, and then select a view consisting of corresponding elements of that type, such as the Z component of the position that each Vector<3> represents. Such component views are closely related conceptually to the component forwarding introduced earlier. POOMA allows programs to create such views where the array type is itself a POOMA type. For example, suppose a program contains the following statements:

Array<2, Vector<3> > a(10, 10);
Array<2> b(10, 10);
b = a.comp(2);

The right-hand side of the assignment statement is a view of the third components of all of a's vector elements. This is implemented as an Array whose engine is a component forwarding engine. Data is only accessed on demand: the expression a.comp(2) does not copy values out of a into temporary storage.

If the source array of a component view is writable, then that component view can appear on either side of the assignment operator. For example:

Interval<1> I(5);
a(2, I).comp(1) = 3.14;

sets the second component of all of the vector elements in the slice to 3.14. The class ComponentView can also be used to create an object that stores the view, as in:

typename ComponentView<Loc<1>, Array<2, Vector<3> > >::Type_t va = a.comp(1);

Here, the argument "Loc<1>" indicates that the component is singly-indexed. Up to 7 indices are supported, since programs can make Arrays with Array elements.
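
Following the same pattern, a hypothetical sketch of a doubly-indexed component view over an array of tensors:

Array< 2, Tensor<3> > at(10, 10);
typename ComponentView< Loc<2>, Array< 2, Tensor<3> > >::Type_t
    vt = at.comp(0, 1);   // views element (0,1) of every tensor in at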

Summary

POOMA does its best to insulate programmers from the details of parallelism and modern memory hierarchies, but eventually these issues must be dealt with if high performance is to be achieved. This tutorial has therefore introduced some of the characteristics and capabilities of the POOMA library which developers must take into account in order to get the best possible performance from modern parallel computers, and some of the techniques (such as traits classes) which are used to implement the library.

Copyright © Los Alamos National Laboratory 1998-2000