POOMA Tutorial 3
Calculating Residuals

Contents:
    Introduction
    Implementing Reduction Using Loops
    More on Domains
    Some Subtleties
    Using Built-In Reduction Functions
    A Look Under the Hood
    Summary

Introduction

It is easy to assign a scalar to an array in POOMA. Assigning an array to a scalar is a bit more complicated, since the array's values must be combined using some operator. Combinations of this kind are called reductions; common reduction operators include addition, maximum, and logical OR.

This tutorial shows how to perform reductions on arrays by querying their shape and size. Functions that do this are much more flexible than functions that rely on hard-coded dimensions and extents.

Implementing Reduction Using Loops

The programs shown so far have performed a fixed number of relaxation steps, with no regard for the actual rate at which the calculation is converging. A better strategy is to relax the system until the residual error is less than some threshold, while capping the number of iterations in order to avoid the program looping forever if it is given an ill-conditioned problem.

The program below (included in the release as examples/Solvers/Residuals) evaluates the residual error by summing the squares of the pointwise differences between the left and right sides of the usual update equation (line 16 in the code below). Lines 67-68 calculate this difference array, and pass it to the function sum_sqr().

01  #include "Pooma/Arrays.h"
02
03  #include <iostream>
04
05  // The size of each side of the domain.
06  const int N = 20;
07
08  // Apply a Jacobi iteration on the given domain.
09  void
10  ApplyJacobi(
11      const Array<2>       & V,              // the domain
12      const ConstArray<2>  & b,              // constant condition
13      const Range<1>       & I,              // first axis subscript
14      const Range<1>       & J               // second axis subscript
15  ){
16      V(I,J) = 0.25 * (V(I+1,J) + V(I-1,J) + V(I,J+1) + V(I,J-1) - b(I,J));
17  }
18  
19  // Calculate the sum of squares of all the elements in a 2D ConstArray.
20  template<class ValueType, class EngineTag>
21  ValueType sum_sqr(const ConstArray<2, ValueType, EngineTag> &A)
22  {
23      ValueType sum = 0;
24  
25      int first_0 = A.domain()[0].first(),
26          last_0  = A.domain()[0].last(),
27          first_1 = A.domain()[1].first(),
28          last_1  = A.domain()[1].last();
29  
30      for (int index_0=first_0; index_0<=last_0; ++index_0)
31      {
32          for (int index_1=first_1; index_1<=last_1; ++index_1)
33          {
34              ValueType value = A(index_0, index_1);
35              sum += value * value;
36          }
37      }
38      return sum;
39  }
40  
41  int
42  main(
43      int                 argc,           // argument count
44      char*               argv[]          // argument list
45  ){
46      // Initialize POOMA.
47      Pooma::initialize(argc, argv);
48
49      // The array we'll be solving for.
50      Array<2> V(N, N);
51      V = 0.0;
52  
53      // The right hand side of the equation.
54      Array<2> b(N, N);
55      b = 0.0;
56      b(N/2, N/2) = -1.0;
57  
58      // The interior domain.
59      Interval<1> I(1, N-2), J(1, N-2);
60  
61      // Iterate until converged, or a max of 1000 time steps.
62      double residual = 1.0; // anything greater than threshold
63      int iteration;
64      for (iteration=0; iteration<1000 && residual>1e-6; ++iteration)
65      {
66          ApplyJacobi(V, b, I, J);
67          residual = sum_sqr(V(I+1,J) + V(I-1,J) + V(I,J+1) + V(I,J-1)
68                             - (b(I,J) + 4.0*V(I,J)));
69      }
70  
71      // Print out the result.
72      std::cout << "Iterations = " << iteration << std::endl;
73      std::cout << "Residual = "   << residual  << std::endl;
74      std::cout << V << std::endl;
75
76      // Clean up and report success.
77      Pooma::finalize();
78      return 0;
79  }

The function sum_sqr() takes a 2-dimensional array of arbitrary type, with an arbitrary engine, as its argument. Templating this function by the value type of the array means that the function can be used efficiently for arrays of other types, such as int, without any changes. Templating on the engine tag type is at least as important, for reasons that will be discussed below.

Line 23 declares the function's result variable. This declaration uses the type parameter ValueType, so that sum_sqr will work for arrays of any base type supporting addition, product, and assignment (the three operations applied to sum and the values read from the array). Note that sum_sqr() is defined to return a ValueType as well.

Lines 25-28 then determine the extent of the array A. The method Array::domain() returns an instance of the templated class Domain, which records the extent of the array's domain along each axis. Subscripting the result of A.domain() with 0 or 1 returns a temporary object that represents the size of the specified axis; the first() and last() calls on lines 25-28 therefore record the starting and ending indices of the array in both dimensions, regardless of the array's type or storage mechanism.

Lines 25-28 could also be written:

int first_0 = A.first(0),
    last_0  = A.last(0),
    first_1 = A.first(1),
    last_1  = A.last(1);

since Array provides short cuts for accessing domain extents.

It is important to note that the indices to sum_sqr()'s argument A are contiguous; that is, they run from 0 to an upper bound with unit stride in both dimensions. One of the many purposes of the Array class is to map logical, user-level indices to an actual data area. Once an array section has been selected by using Intervals or Ranges as indices, that section appears to users to be compact and contiguous. Thus, the following (very contrived) code first sets every third element of a vector to 3, then sets every ninth element to 9, since the recursive call to setThree() selects every third element from its argument, which itself is every third argument of the original array:

const int N = 20;
Range<1> stride(0, N-1, 3);

void setThree(Array<1> a, double value, bool recurse)
{
    a = value;
    if (recurse){
        Range<1> newStride(0, (N-1)/3, 3);
        setThree(a(newStride), value*value, false);
    }
}

int main(int argc, char* argv[])
{
    Array<1> a(N);
    a = 0;
    setThree(a(stride), 3, true);
}

The rest of this function is straightforward. The nested loops beginning on line 30 traverse the array along both axes; the assignment on line 34 reads a value from the array, while that on line 35 accumulates the square of each value in sum.

More on Domains

POOMA provides a few useful shortcuts for working with domains, which can be used to generalize routines that manipulate arrays of varying sizes and shapes. The current version of the library provides three "wildcard" domains:

AllDomain<Dim> does not take any constructor arguments. It is interpreted to mean "the whole of the relevant domain".
LeftDomain<Dim>'s constructors take either a set of Dim integers, or a Loc<Dim>. It interprets these as the right endpoint of a new domain, and is used to specify a left (low-indexed) subdomain within a larger domain.
RightDomain<Dim> is similar to LeftDomain, but is interpreted as the left endpoint of a (high-indexed) subdomain within a larger domain.

Wildcards are used to take a view of an existing Array in a way that is relative to the existing domain. For example, suppose a program has defined an Array<2> on the domain [1:10:1, 5:8:1]:

Array<2> A(Interval<1>(1,10), Interval<1>(5,8));

The following expression would take a view of this array that included the elements [3:6:1, 5:8:1] (i.e., only some of the first dimension, but all of the second):

A(Interval<1>(3,6), AllDomain<1>());

Note that the parentheses after AllDomain<1> are necessary because this statement is constructing an unnamed instance of this templated class.

If a program wants to take a view that starts with a given endpoint on one end, and uses the existing endpoint on the other end, it must use the LeftDomain or RightDomain wildcards. For example:

A(LeftDomain<1>(6), RightDomain<1>(7));

accesses the elements of A in the domain [1:6:1, 7:8:1], where A is the same array declared above. Note that these wildcard domains are inclusive at both ends: LeftDomain uses the left portion of the existing domain, chopping it off at the given right endpoint, while RightDomain uses the right portion of the existing domain, starting from the given left endpoint.

Domain wildcards can be used in combination with Array methods such as first() and last() to get views that refer to just the left or right edges of a domain. For example:

A(LeftDomain<1>(A.first(0) + 1), AllDomain<1>())

refers to the width-2 domain on left edge of the first dimension of the array A.

Finally, Array and ConstArray overload operator(), and provide a method read(), to return a view of the array's entire domain. The Array version of operator() returns a writable view; otherwise, the view is read-only. These methods are useful because they allow programmers to write zero-based algorithms for arrays, no matter their domain. For example, the following copies elements of b into a:

Array<1> a(Interval<1>(-4, 0)), b(5);
for (int i = 0; i < 5; i++)
  a()(i) = b()(i);

Some Subtleties

Returning to the implementation of reductions once again, the most interesting thing about the sum_sqr() function is not its implementation, but what gets passed to it. The function call on lines 67-68 binds sum_sqr()'s argument A to the result of the expression:

V(I+1,J) + V(I-1,J) + V(I,J+1) + V(I,J-1) - (b(I,J) + 4.0*V(I,J)));

Most languages that support whole-array operations, such as Fortran 90, would create a full-sized temporary array by evaluating this expression at every point, and then pass that temporary variable to sum_sqr(). POOMA does not do this; what it does instead is the key to its high performance.

Recall once again that arrays in POOMA do not actually store data, but instead act as handles on engines that know how to return values given sets of indices. Some engines reference data storage directly; that is, they translate a set of indices into a value by looking up the value corresponding to those indices in memory. However, POOMA also contains expression engines, which use expression templates to calculate array values on demand.

When an expression like the one above is encountered, POOMA does not calculate all of its values at once. Instead, the expansion of the overloaded operators used in the expression creates an expression engine as the program is compiled. Whenever the array wrapper around this engine is subscripted, the engine calculates and returns the corresponding value.

This technique is called "lazy evaluation" and is the reason why the body of the inner loop of sum_sqr() is written as:

ValueType value = A(index_0, index_1);
sum += value * value;

If the body of this loop was instead written as:

sum += A(index_0, index_1) * A(index_0, index_1);

then the expression engine would evaluate the value of A at each location twice, since subscripting A is what triggers element evaluation.

The existence of expression engines is one of the reasons why sum_sqr(), and other functions that use POOMA, should template the engine type as well as the data type of their arguments. If the engine type of sum_sqr()'s A argument was not templated, it would default to BrickEngine, which is the engine that manages a dense, contiguous block of memory. The call on lines 67-68 would therefore be evaluated by constructing an expression engine (good), then evaluating it at each location in order to fill in the argument A (bad).

One mistake that is commonly made by programmers who are first starting to use POOMA is to forget that different arguments to a function can have different data or engine types. Consider, for example, a function whose job it is to compute the sum of the squares of the elementwise difference between two vectors. The natural way to write it (assuming that the lengths of the vectors are known to be the same) is:

template<class ValueType, class EngineTag>
ValueType sum_sqr_diff(
    const ConstArray<1, ValueType, EngineTag>& Left,
    const ConstArray<1, ValueType, EngineTag>& Right
){
    ValueType sum = 0;

    int first = Left.first(0),
        last  = Left.last(0);

    for (int index=first; index<=last; ++index)
    {
        ValueType value = Left(index) - Right(index);
        sum += value * value;
    }

    return sum;
}

However, if sum_sqr_diff() is written this way, the compiler can only instantiate it when the data types and the engines of its arguments are exactly the same. This means that a call like:

Array<1, int>   intvec(10);
Array<1, float> floatvec(10);
double result = sum_sqr_diff(intvec, floatvec);

would fail to compile, since the template type argument ValueType cannot simultaneously match int, float, and double. Similarly, if one argument to sum_sqr_diff() was a plain old Array, while another argument was an expression, the compiler would either have to force full evaluation of the expression (in order to get something with the same engine type as the plain old array), or give up and report an error.

The most general way to define this function is as shown below. Both the data and engine types of the arguments are independent, so that the compiler has the degrees of freedom it needs to instantiate this function for a wide variety of arguments:

template<class LeftValueType,  class LeftEngineTag,
         class RightValueType, class RightEngineTag>
double sum_sqr_diff(
    const ConstArray<1, LeftValueType,  LeftEngineTag>&  Left,
    const ConstArray<1, RightValueType, RightEngineTag>& Right
){
    double sum = 0;

    int first = Left.first(0),
        last  = Left.last(0);

    for (int index=first; index<=last; ++index)
    {
        double value = Left(index) - Right(index);
        sum += value * value;
    }

    return sum;
}

But what's this? Everything important in this function is templated, except its double return type. If a user-defined extra-precision numerical type is used in the array arguments, the accumulator will have lower precision than the values being accumulated. Why isn't this function defined as:

// ILLEGAL
template<class LeftValueType,  class LeftEngineTag,
         class RightValueType, class RightEngineTag,
         class ReturnType>
ReturnType sum_sqr_diff(
    const ConstArray<1, LeftValueType,  LeftEngineTag>&  Left,
    const ConstArray<1, RightValueType, RightEngineTag>& Right
){
    ReturnType sum = 0;
    // body of function
    return sum;
}

so that the compiler can, for example, instantiate the function with a return type of QuadPrecision when presented with the following:

Array<2, QuadPrecision> A(N, N);
Array<2, double>        B(N, N);
initialization
QuadPrecision result = sum_sqr_diff(A, B);

The unfortunate answer is that overloading and templates in C++ doesn't work that way. To take a simple example, suppose an application has a set of functions for writing to a file, such as put(char), put(char*), and put(int). When the code put(x) is seen, the compiler uses only the type of the argument x to figure out which function to call. There is no way for the compiler to distinguish between:

ostream s = put(x);

and

FILE* s=put(x);

because the return type of put() is not considered by the compiler during template instantiation. While there are good technical reasons for this, it is one of the biggest obstacles that the implementers of the POOMA library (and other templated libraries) have had to face. As the workarounds used in the library itself are too complex for these introductory tutorials, the best solution for newcomers to the library is either to use a high-precision type like double, or to use the type of one of the arguments to the function or method, and hope that it will be sufficiently precise.

Using Built-In Reduction Functions

The sum_sqr() function on lines 20-39 above uses object-oriented techniques to achieve generality, but still has the loops of a C or Fortran 77 program. These loops not only clutter the code, they also do not exploit any parallelism that the hardware this program is running on might offer. A much better solution is to use one of POOMA's built-in reduction functions, in this case sum(). The program that does this is included in the release as examples/Solvers/Residual2; the key change, to sum_sqr(), is shown below:

template<class ValueType, class EngineTag>
ValueType sum_sqr(const ConstArray<2, ValueType, EngineTag>& A)
{
    return sum(A * A);
}

As might be expected, POOMA provides many other reduction functions:

`sum`	sum all the elements in an array
`prod`	multiply all of the elements in an array
`max`	find the maximum value in an array
`min`	find the minimum value in an array
`all`	returns true if all of the array's elements are non-zero
`any`	returns true if any of the array's elements are non-zero
`bitOr`	does a bitwise or of all of the elements
`bitAnd`	does a bitwise and of all of the elements

Note that since names such as bitor and bitand are actually reserved keywords in C++, some of these functions have names such as bitOr and bitAnd.

POOMA presently puts its reduction operators, along with most of the other things it defines, in the global namespace. Since all of these operators are templated on POOMA classes, there is very little chance of collision with other functions with the same names. While it would be better programming practice to put everything into a namespace, like POOMA's initialize() and finalize() functions, some compilers still have trouble with the combination of templates and namespaces. Once these compilers are brought up to full ANSI/ISO compliance, all POOMA functions and classes will be placed in the Pooma:: namespace.

Of course, the obvious next step is to get rid of sum_sqr() entirely, and move the residual calculation into the main loop:

Array<2> temp;
for (iteration=0; iteration<1000 && residual>1e-6; ++iteration)
{
    ApplyJacobi(V, b, I, J);
    temp = V(I+1,J) + V(I-1,J) + V(I,J+1) + V(I,J-1) - (b(I,J) + 4.0*V(I,J));
    residual = sum(temp * temp);
}

This is tempting, but wrong: if you compare the performance of this version of the program to that of the original, you will find that this one is significantly slower. The reason is that the assignment to temp in the middle of the loop above does not create an expression engine, but instead allocates and fills in an array. Only by passing an expression to a templated function can a program give the compiler an opportunity to capture enough information about the expression to create an array which is bound to an expression engine.

Life would clearly be better if there was some way to declare temporary array variables that were guaranteed to be bound to expression engines, instead of storage engines. However, in order to do this, programmers would have to specify the type of the temporary array exactly. By the time the residual expression above has been expanded, its type definition is several thousand characters long; the complexity of the types of longer expressions grows very, very quickly.

A Look Under the Hood

By now, you may be curious about how POOMA does what it does. This section therefore takes a look at the implementation of reduction operators; while many details are omitted, it should give you some idea of how the library is structured, and why some of its features appear the way they do.

We have three requirements for a global reduction function such as sum(): it must be able to reduce arrays of arbitrary size, it must be able to reduce arrays of arbitrary type, and it must efficiently use the same underlying machinery as other reductions. The first two criteria need no justification; the third one is a software engineering concern. If each reduction function has to be completely self-contained (i.e., if all of the parallelization and looping code has to be duplicated), then maintaining the library will be difficult. On the other hand, we cannot afford to write a generic reduction routine that takes a function pointer or an object with a virtual method as an argument, since the cost of indirection inside an inner loop is unacceptable.

POOMA's authors solve these problems by using a trait class to represent each primitive reduction operation. A trait class is a class whose only purpose is to be used to instantiate other templated classes. Each class in a family of trait classes defines constants, enumeration elements, and methods with identical names and signatures, so that they can be used interchangeably.

Trait classes are not part of the C++ language definition, but are instead a way of using template instantiation as an abstraction mechanism in yet another way. Instead of overriding virtual functions inherited from parent classes, templated classes can use constants and methods supplied by template parameters in generic ways. For example, either of the classes Red and Green in the code below can be used to instantiate Blue, but when the values of those different instances are printed, they display different values:

struct Red
{
    enum { Val = 123; }
};

struct Green
{
    enum { Val = 456; }
};

template<class T>
class Blue
{
  public:
    Blue() : val_(T::Val) {}
    const unsigned int val_;
};

int main()
{
    Blue<Red>   br;
    Blue<Green> bg;
    std::cout << br.val_ << std::endl;
    std::cout << bg.val_ << std::endl;
    return 0;
}

Note that the example above declares Red and Green as structs instead of as classes. The only difference between the two kinds of declarations is that a struct's members are public by default, while those of a class are private. This saves one line each in the definitions of Red and Green.

POOMA defines a family of trait classes representing the C++ assignment operators. These classes, which are part of the Portable Expression Template Engine (PETE), are defined as follows:

PETE_DEFINE_ASSIGN_OP((a = b),   OpAssign)
PETE_DEFINE_ASSIGN_OP((a += b),  OpAddAssign)
PETE_DEFINE_ASSIGN_OP((a -= b),  OpSubtractAssign)
PETE_DEFINE_ASSIGN_OP((a *= b),  OpMultiplyAssign)
PETE_DEFINE_ASSIGN_OP((a /= b),  OpDivideAssign)
PETE_DEFINE_ASSIGN_OP((a %= b),  OpModAssign)
PETE_DEFINE_ASSIGN_OP((a |= b),  OpBitwiseOrAssign)
PETE_DEFINE_ASSIGN_OP((a &= b),  OpBitwiseAndAssign)
PETE_DEFINE_ASSIGN_OP((a ^= b),  OpBitwiseXorAssign)
PETE_DEFINE_ASSIGN_OP((a <<= b), OpLeftShiftAssign)
PETE_DEFINE_ASSIGN_OP((a >>= b), OpRightShiftAssign)

where each use of the macro PETE_DEFINE_ASSIGN_OP defines a struct with the same members:

#define PETE_DEFINE_ASSIGN_OP(Expr,Op)                      \
struct Op {                                                 \
    Op() {}                                                 \
                                                            \
    Op(const Op&) {}                                        \
                                                            \
    enum {                                                  \
        tag = BinaryUseLeftTag                              \
    };                                                      \
                                                            \
    template<class T1, class T2>                            \
    inline typename BinaryReturn<T1, T2, Op>::Type_t        \
    operator()(                                             \
        T1&       a,                                        \
        const T2& b                                         \
    ) const {                                               \
        return Expr;                                        \
    }                                                       \
};

This struct has four members: a default constructor, a copy constructor, a constant (defined using an enum) called tag, and a templated method operator(). (The default and copy constructors might appear to be unnecessary, but omitting them results in Uninitialized Memory Read (UMR) warnings from memory checking tools such as Purify.) As discussed earlier, templated methods are instantiated when and as required, just like templated functions, but are still members of their containing class. In this case, the templated operator() takes a destination argument a of one type, and a source argument b of another type, and performs the expression Expr (such as "a+=b") on them.

Thus, whenever OpAddAssign or another class in this family of trait classes is used in an expression, the templated method operator() is instantiated with the appropriate types (such as int for the source, and double for the destination). Since this method is not virtual, there is no abstraction penalty: code using any particular instantiation of this templated method will run at maximum speed.

The only feature of this macro that has not yet been explained is the use of BinaryReturn. This is another trait class, whose only purpose is to define the type of the result of applying an operation Op to values of types T1 and T2. For example, BinaryReturn<int,float,OpAdd> defines Type_t to be float, while BinaryReturn<double,float,OpMul> defines Type_t to be double. Again, POOMA uses template instantiation as an abstraction mechanism, so that logically repetitive code does not have to be physically replicated. (This is an illustration of how to solve the problem discussed earlier of how to generalize the return type of a function so that it is adequate for the input argument types.)

With all of this machinery in place, sum() is now easy to build:

template<int Dim, class T, class EngineTag>
inline T sum(
    const ConstArray<Dim, T, EngineTag>& a
){
    return globalReduction(a, T(0), OpAddAssign());
}

The function just passes the array, a value to initialize the sum with, and the reduction operation to be applied to a generic templated function called globalReduction(). The first argument to this function is the array; the second is an initialization value (which is also the value returned if the array is empty), and the third specifies the operation to apply. By the time globalReduction() is expanded, OpAddAssign::operator() has been inlined by POOMA's expression template machinery. Note that the initialization value must be an identity element for the operation; while zero works in most cases, some operations (such as logical and bitwise OR) must use other values.

Summary

POOMA achieves high performance using expression engines, which are constructed automatically during compilation, and which evaluate complex array expressions on demand in order to avoid creation of temporary arrays. In order to support mixed data types, and the use of both arrays and array expressions as arguments, user-defined functions should be templated separately by both the data type and engine type of all of their arguments. Unfortunately, C++ does not support templatization on function return type, which can make it difficult to write fully-generic functions. Finally, POOMA provides several built-in reduction functions, such as summation, product, and logical combination. These are implemented using a generic framework, which can be extended by knowledgeable users.

[Prev]

[Home]

[Next]

POOMA Tutorial 3Calculating Residuals