Calculating Residuals

**Contents:**

Introduction

Implementing Reduction Using Loops

More on Domains

Some Subtleties

Using Built-In Reduction Functions

A Look Under the Hood

Summary

It is easy to assign a scalar to an array in POOMA.
Assigning an array to a scalar is a bit more complicated, since the
array's values must be combined using some operator. Combinations of
this kind are called *reductions*; common reduction operators
include addition, maximum, and logical OR.

This tutorial shows how to perform reductions on arrays by querying their shape and size. Functions that do this are much more flexible than functions that rely on hard-coded dimensions and extents.

The programs shown so far have performed a fixed number of relaxation steps, with no regard for the actual rate at which the calculation is converging. A better strategy is to relax the system until the residual error is less than some threshold, while capping the number of iterations in order to avoid the program looping forever if it is given an ill-conditioned problem.

The program below (included in the release as
`examples/Solvers/Residuals`) evaluates the residual error by
summing the squares of the pointwise differences between the left and
right sides of the usual update equation (line 16 in the code
below). Lines 67-68 calculate this difference array, and pass it to
the function `sum_sqr()`.

01 #include "Pooma/Arrays.h" 02 03 #include <iostream> 04 05 //The size of each side of the domain.06 const int N = 20; 07 08 //Apply a Jacobi iteration on the given domain.09 void 10 ApplyJacobi( 11 const Array<2> & V, //the domain12 const ConstArray<2> & b, //constant condition13 const Range<1> & I, //first axis subscript14 const Range<1> & J //second axis subscript15 ){ 16 V(I,J) = 0.25 * (V(I+1,J) + V(I-1,J) + V(I,J+1) + V(I,J-1) - b(I,J)); 17 } 18 19 //Calculate the sum of squares of all the elements in a 2D ConstArray.20 template<class ValueType, class EngineTag> 21 ValueType sum_sqr(const ConstArray<2, ValueType, EngineTag> &A) 22 { 23 ValueType sum = 0; 24 25 int first_0 = A.domain()[0].first(), 26 last_0 = A.domain()[0].last(), 27 first_1 = A.domain()[1].first(), 28 last_1 = A.domain()[1].last(); 29 30 for (int index_0=first_0; index_0<=last_0; ++index_0) 31 { 32 for (int index_1=first_1; index_1<=last_1; ++index_1) 33 { 34 ValueType value = A(index_0, index_1); 35 sum += value * value; 36 } 37 } 38 return sum; 39 } 40 41 int 42 main( 43 int argc, //argument count44 char* argv[] //argument list45 ){ 46 //Initialize POOMA.47 Pooma::initialize(argc, argv); 48 49 //The array we'll be solving for.50 Array<2> V(N, N); 51 V = 0.0; 52 53 //The right hand side of the equation.54 Array<2> b(N, N); 55 b = 0.0; 56 b(N/2, N/2) = -1.0; 57 58 //The interior domain.59 Interval<1> I(1, N-2), J(1, N-2); 60 61 //Iterate until converged, or a max of 1000 time steps.62 double residual = 1.0; //anything greater than threshold63 int iteration; 64 for (iteration=0; iteration<1000 && residual>1e-6; ++iteration) 65 { 66 ApplyJacobi(V, b, I, J); 67 residual = sum_sqr(V(I+1,J) + V(I-1,J) + V(I,J+1) + V(I,J-1) 68 - (b(I,J) + 4.0*V(I,J))); 69 } 70 71 //Print out the result.72 std::cout << "Iterations = " << iteration << std::endl; 73 std::cout << "Residual = " << residual << std::endl; 74 std::cout << V << std::endl; 75 76 //Clean up and report success.77 Pooma::finalize(); 78 return 0; 79 }

The function `sum_sqr()` takes a 2-dimensional array of
arbitrary type, with an arbitrary engine, as its argument. Templating
this function by the value type of the array means that the function
can be used efficiently for arrays of other types, such as `int`,
without any changes. Templating on the engine tag type is at least as
important, for reasons that will be discussed below.

Line 23 declares the function's result variable. This
declaration uses the type parameter `ValueType`, so that
`sum_sqr` will work for arrays of any base type supporting
addition, product, and assignment (the three operations applied to
`sum` and the values read from the array). Note that
`sum_sqr()` is defined to return a `ValueType` as
well.

Lines 25-28 then determine the extent of the array `A`. The
method `Array::domain()` returns an instance of the
templated class `Domain`, which records the extent of
the array's domain along each axis. Subscripting the result of
`A.domain()` with 0 or 1 returns a temporary object that
represents the size of the specified axis; the `first()` and
`last()` calls on lines 25-28 therefore record the starting and
ending indices of the array in both dimensions, regardless of the
array's type or storage mechanism.

Lines 25-28 could also be written:

int first_0 = A.first(0), last_0 = A.last(0), first_1 = A.first(1), last_1 = A.last(1);

since `Array` provides short cuts for accessing
domain extents.

It is important to note that the indices to `sum_sqr()`'s
argument `A` are contiguous; that is, they run from 0 to an upper bound
with unit stride in both dimensions. One of the many purposes of the
`Array` class is to map logical, user-level indices to
an actual data area. Once an array section has been selected by using
`Interval`s or `Range`s as indices, that
section appears to users to be compact and contiguous. Thus, the
following (very contrived) code first sets every third element of a
vector to 3, then sets every ninth element to 9, since the recursive
call to `setThree()` selects every third element from its
argument, which itself is every third argument of the original array:

const int N = 20; Range<1> stride(0, N-1, 3); void setThree(Array<1> a, double value, bool recurse) { a = value; if (recurse){ Range<1> newStride(0, (N-1)/3, 3); setThree(a(newStride), value*value, false); } } int main(int argc, char* argv[]) { Array<1> a(N); a = 0; setThree(a(stride), 3, true); }

The rest of this function is straightforward. The nested loops
beginning on line 30 traverse the array along both axes; the
assignment on line 34 reads a value from the array, while that on
line 35 accumulates the square of each value in `sum`.

POOMA provides a few useful shortcuts for working with domains, which can be used to generalize routines that manipulate arrays of varying sizes and shapes. The current version of the library provides three "wildcard" domains:

`AllDomain<Dim>`does not take any constructor arguments. It is interpreted to mean "the whole of the relevant domain".`LeftDomain<Dim>`'s constructors take either a set of`Dim`integers, or a`Loc<Dim>`. It interprets these as the right endpoint of a new domain, and is used to specify a left (low-indexed) subdomain within a larger domain.`RightDomain<Dim>`is similar to`LeftDomain`, but is interpreted as the left endpoint of a (high-indexed) subdomain within a larger domain.

Wildcards are used to take a view of an existing
`Array` in a way that is relative to the existing
domain. For example, suppose a program has defined an
`Array<2>` on the domain `[1:10:1, 5:8:1]`:

Array<2> A(Interval<1>(1,10), Interval<1>(5,8));

The following expression would take a view of this array that
included the elements `[3:6:1, 5:8:1]` (i.e., only some of
the first dimension, but all of the second):

A(Interval<1>(3,6), AllDomain<1>());

Note that the parentheses after `AllDomain<1>` are
necessary because this statement is constructing an unnamed instance
of this templated class.

If a program wants to take a view that starts with a given endpoint
on one end, and uses the existing endpoint on the other end, it must
use the `LeftDomain` or `RightDomain`
wildcards. For example:

A(LeftDomain<1>(6), RightDomain<1>(7));

accesses the elements of `A` in the domain
`[1:6:1, 7:8:1]`, where `A` is the same array
declared above. Note that these wildcard domains are inclusive at
both ends: `LeftDomain` uses the left portion of the
existing domain, chopping it off at the given right endpoint, while
`RightDomain` uses the right portion of the existing
domain, starting from the given left endpoint.

Domain wildcards can be used in combination with
`Array` methods such as `first()` and
`last()` to get views that refer to just the left or right
edges of a domain. For example:

A(LeftDomain<1>(A.first(0) + 1), AllDomain<1>())

refers to the width-2 domain on left edge of the first dimension
of the array `A`.

Finally, `Array` and `ConstArray` overload
`operator()`, and provide a method `read()`, to return a
view of the array's entire domain. The `Array` version of
`operator()` returns a writable view; otherwise, the view is
read-only. These methods are useful because they allow programmers to
write zero-based algorithms for arrays, no matter their domain. For
example, the following copies elements of `b` into `a`:

Array<1> a(Interval<1>(-4, 0)), b(5); for (int i = 0; i < 5; i++) a()(i) = b()(i);

Returning to the implementation of reductions once again, the most
interesting thing about the `sum_sqr()` function is not its
implementation, but what gets passed to it. The function call on
lines 67-68 binds `sum_sqr()`'s argument `A` to the
result of the expression:

V(I+1,J) + V(I-1,J) + V(I,J+1) + V(I,J-1) - (b(I,J) + 4.0*V(I,J)));

Most languages that support whole-array operations, such as
Fortran 90, would create a full-sized temporary array by
evaluating this expression at every point, and then pass that
temporary variable to `sum_sqr()`. POOMA does not do
this; what it does instead is the key to its high performance.

Recall once again that arrays in POOMA do not actually
store data, but instead act as handles on engines that know how to
return values given sets of indices. Some engines reference data
storage directly; that is, they translate a set of indices into a value by
looking up the value corresponding to those indices in memory.
However, POOMA also contains *expression engines*,
which use expression
templates to calculate array values on demand.

When an expression like the one above is encountered, POOMA
does *not* calculate all of its values at once. Instead, the
expansion of the overloaded operators used in the expression creates
an expression engine as the program is compiled. Whenever the array
wrapper around this engine is subscripted, the engine calculates and
returns the corresponding value.

This technique is called "lazy evaluation" and is the reason why
the body of the inner loop of `sum_sqr()` is written as:

ValueType value = A(index_0, index_1); sum += value * value;

If the body of this loop was instead written as:

sum += A(index_0, index_1) * A(index_0, index_1);

then the expression engine would evaluate the value of `A`
at each location twice, since subscripting `A` is what triggers
element evaluation.

The existence of expression engines is one of the reasons why
`sum_sqr()`, and other functions that use POOMA, should
template the engine type as well as the data type of their arguments.
If the engine type of `sum_sqr()`'s `A` argument was not
templated, it would default to `BrickEngine`, which is the
engine that manages a dense, contiguous block of memory. The call on
lines 67-68 would therefore be evaluated by constructing an expression
engine (good), then evaluating it at each location in order to fill in
the argument `A` (bad).

One mistake that is commonly made by programmers who are first starting to use POOMA is to forget that different arguments to a function can have different data or engine types. Consider, for example, a function whose job it is to compute the sum of the squares of the elementwise difference between two vectors. The natural way to write it (assuming that the lengths of the vectors are known to be the same) is:

template<class ValueType, class EngineTag> ValueType sum_sqr_diff( const ConstArray<1, ValueType, EngineTag>& Left, const ConstArray<1, ValueType, EngineTag>& Right ){ ValueType sum = 0; int first = Left.first(0), last = Left.last(0); for (int index=first; index<=last; ++index) { ValueType value = Left(index) - Right(index); sum += value * value; } return sum; }

However, if `sum_sqr_diff()` is written this way, the compiler
can only instantiate it when the data types *and* the engines of
its arguments are exactly the same. This means that a call like:

Array<1, int> intvec(10); Array<1, float> floatvec(10); double result = sum_sqr_diff(intvec, floatvec);

would fail to compile, since the template type argument
`ValueType` cannot simultaneously match `int`,
`float`, and `double`. Similarly, if one argument to
`sum_sqr_diff()` was a plain old `Array`, while
another argument was an expression, the compiler would either have to
force full evaluation of the expression (in order to get something
with the same engine type as the plain old array), or give up and
report an error.

The most general way to define this function is as shown below. Both the data and engine types of the arguments are independent, so that the compiler has the degrees of freedom it needs to instantiate this function for a wide variety of arguments:

template<class LeftValueType, class LeftEngineTag, class RightValueType, class RightEngineTag> double sum_sqr_diff( const ConstArray<1, LeftValueType, LeftEngineTag>& Left, const ConstArray<1, RightValueType, RightEngineTag>& Right ){ double sum = 0; int first = Left.first(0), last = Left.last(0); for (int index=first; index<=last; ++index) { double value = Left(index) - Right(index); sum += value * value; } return sum; }

But what's this? Everything important in this function is
templated, except its `double` return type. If a user-defined
extra-precision numerical type is used in the array arguments, the
accumulator will have lower precision than the values being
accumulated. Why isn't this function defined as:

//ILLEGALtemplate<class LeftValueType, class LeftEngineTag, class RightValueType, class RightEngineTag, class ReturnType> ReturnType sum_sqr_diff( const ConstArray<1, LeftValueType, LeftEngineTag>& Left, const ConstArray<1, RightValueType, RightEngineTag>& Right ){ ReturnType sum = 0; //body of functionreturn sum; }

so that the compiler can, for example, instantiate the function
with a return type of `QuadPrecision` when presented with the
following:

Array<2, QuadPrecision> A(N, N); Array<2, double> B(N, N);initializationQuadPrecision result = sum_sqr_diff(A, B);

The unfortunate answer is that overloading and templates in C++
doesn't work that way. To take a simple example, suppose an
application has a set of functions for writing to a file, such as
`put(char)`, `put(char*)`, and `put(int)`. When
the code `put(x)` is seen, the compiler uses *only* the
type of the argument `x` to figure out which function to call.
There is no way for the compiler to distinguish between:

ostream s = put(x);

and

FILE* s=put(x);

because the return type of `put()` is *not*
considered by the compiler during template instantiation. While there
are good technical reasons for this, it is one of the biggest
obstacles that the implementers of the POOMA library (and other
templated libraries) have had to face. As the workarounds used in the
library itself are too complex for these introductory tutorials, the
best solution for newcomers to the library is either to use a
high-precision type like `double`, or to use the type of one of
the arguments to the function or method, and hope that it will be
sufficiently precise.

The `sum_sqr()` function on lines 20-39 above uses
object-oriented techniques to achieve generality, but still has the
loops of a C or Fortran 77 program. These loops not only clutter
the code, they also do not exploit any parallelism that the hardware
this program is running on might offer. A much better solution is to
use one of POOMA's built-in reduction functions, in this case
`sum()`. The program that does this is included in the release
as `examples/Solvers/Residual2`; the key change, to
`sum_sqr()`, is shown below:

template<class ValueType, class EngineTag> ValueType sum_sqr(const ConstArray<2, ValueType, EngineTag>& A) { return sum(A * A); }

As might be expected, POOMA provides many other reduction functions:

sum
| sum all the elements in an array |

prod
| multiply all of the elements in an array |

max
| find the maximum value in an array |

min
| find the minimum value in an array |

all
| returns true if all of the array's elements are non-zero |

any
| returns true if any of the array's elements are non-zero |

bitOr
| does a bitwise or of all of the elements |

bitAnd
| does a bitwise and of all of the elements |

POOMA presently puts its reduction operators, along with
most of the other things it defines, in the global namespace. Since
all of these operators are templated on POOMA classes, there
is very little chance of collision with other functions with the same
names. While it would be better programming practice to put
everything into a namespace, like POOMA's
`initialize()` and `finalize()` functions, some
compilers still have trouble with the combination of templates and
namespaces. Once these compilers are brought up to full ANSI/ISO
compliance, all POOMA functions and classes will be placed in
the `Pooma::` namespace.

Of course, the obvious next step is to get rid of
`sum_sqr()` entirely, and move the residual calculation
into the main loop:

Array<2> temp; for (iteration=0; iteration<1000 && residual>1e-6; ++iteration) { ApplyJacobi(V, b, I, J); temp = V(I+1,J) + V(I-1,J) + V(I,J+1) + V(I,J-1) - (b(I,J) + 4.0*V(I,J)); residual = sum(temp * temp); }

This is tempting, but wrong: if you compare the performance of this
version of the program to that of the original, you will find that
this one is significantly slower. The reason is that the assignment
to `temp` in the middle of the loop above does not create an
expression engine, but instead allocates and fills in an array. Only
by passing an expression to a templated function can a program give
the compiler an opportunity to capture enough information about the
expression to create an array which is bound to an expression engine.

Life would clearly be better if there was some way to declare temporary array variables that were guaranteed to be bound to expression engines, instead of storage engines. However, in order to do this, programmers would have to specify the type of the temporary array exactly. By the time the residual expression above has been expanded, its type definition is several thousand characters long; the complexity of the types of longer expressions grows very, very quickly.

By now, you may be curious about how POOMA does what it does. This section therefore takes a look at the implementation of reduction operators; while many details are omitted, it should give you some idea of how the library is structured, and why some of its features appear the way they do.

We have three requirements for a global reduction function such as
`sum()`: it must be able to reduce arrays of arbitrary size, it
must be able to reduce arrays of arbitrary type, and it must
efficiently use the same underlying machinery as other reductions.
The first two criteria need no justification; the third one is a
software engineering concern. If each reduction function has to be
completely self-contained (i.e., if all of the parallelization and
looping code has to be duplicated), then maintaining the library will
be difficult. On the other hand, we cannot afford to write a generic
reduction routine that takes a function pointer or an object with a
virtual method as an argument, since the cost of indirection inside an
inner loop is unacceptable.

POOMA's authors solve these problems by using a *trait
class* to represent each primitive reduction operation. A trait
class is a class whose only purpose is to be used to instantiate other
templated classes. Each class in a family of trait classes defines
constants, enumeration elements, and methods with identical names and
signatures, so that they can be used interchangeably.

Trait classes are not part of the C++ language definition, but are
instead a way of using template instantiation as an abstraction
mechanism in yet another way. Instead of overriding virtual functions
inherited from parent classes, templated classes can use constants and
methods supplied by template parameters in generic ways. For example,
either of the classes `Red` and `Green` in the code
below can be used to instantiate `Blue`, but when the
values of those different instances are printed, they display
different values:

struct Red { enum { Val = 123; } }; struct Green { enum { Val = 456; } }; template<class T> class Blue { public: Blue() : val_(T::Val) {} const unsigned int val_; }; int main() { Blue<Red> br; Blue<Green> bg; std::cout << br.val_ << std::endl; std::cout << bg.val_ << std::endl; return 0; }

Note that the example above declares `Red` and
`Green` as `struct`s instead of as `class`es.
The only difference between the two kinds of declarations is that a
`struct`'s members are `public` by default, while those
of a `class` are `private`. This saves one line each in
the definitions of `Red` and `Green`.

POOMA defines a family of trait classes representing the C++ assignment operators. These classes, which are part of the Portable Expression Template Engine (PETE), are defined as follows:

PETE_DEFINE_ASSIGN_OP((a = b), OpAssign) PETE_DEFINE_ASSIGN_OP((a += b), OpAddAssign) PETE_DEFINE_ASSIGN_OP((a -= b), OpSubtractAssign) PETE_DEFINE_ASSIGN_OP((a *= b), OpMultiplyAssign) PETE_DEFINE_ASSIGN_OP((a /= b), OpDivideAssign) PETE_DEFINE_ASSIGN_OP((a %= b), OpModAssign) PETE_DEFINE_ASSIGN_OP((a |= b), OpBitwiseOrAssign) PETE_DEFINE_ASSIGN_OP((a &= b), OpBitwiseAndAssign) PETE_DEFINE_ASSIGN_OP((a ^= b), OpBitwiseXorAssign) PETE_DEFINE_ASSIGN_OP((a <<= b), OpLeftShiftAssign) PETE_DEFINE_ASSIGN_OP((a >>= b), OpRightShiftAssign)

where each use of the macro `PETE_DEFINE_ASSIGN_OP` defines
a `struct` with the same members:

#define PETE_DEFINE_ASSIGN_OP(Expr,Op) \ struct Op { \ Op() {} \ \ Op(const Op&) {} \ \ enum { \ tag = BinaryUseLeftTag \ }; \ \ template<class T1, class T2> \ inline typename BinaryReturn<T1, T2, Op>::Type_t \ operator()( \ T1& a, \ const T2& b \ ) const { \ return Expr; \ } \ };

This `struct` has four members: a default constructor, a
copy constructor, a constant (defined using an `enum`) called
`tag`, and a templated method `operator()`. (The
default and copy constructors might appear to be unnecessary, but
omitting them results in Uninitialized Memory Read (UMR) warnings from
memory checking tools such as Purify.) As discussed
earlier, templated methods
are instantiated when and as required, just like templated functions,
but are still members of their containing class. In this case, the
templated `operator()` takes a destination argument `a`
of one type, and a source argument `b` of another type, and
performs the expression `Expr` (such as "`a+=b`") on
them.

Thus, whenever `OpAddAssign` or another class in this family
of trait classes is used in an expression, the templated method
`operator()` is instantiated with the appropriate types (such
as `int` for the source, and `double` for the
destination). Since this method is not virtual, there is no
abstraction penalty: code using any particular instantiation of this
templated method will run at maximum speed.

The only feature of this macro that has not yet been explained is
the use of `BinaryReturn`. This is another trait
class, whose only purpose is to define the type of the result of
applying an operation `Op` to values of types `T1` and
`T2`. For example,
`BinaryReturn<int,float,OpAdd>` defines `Type_t`
to be `float`, while
`BinaryReturn<double,float,OpMul>` defines
`Type_t` to be `double`. Again, POOMA uses
template instantiation as an abstraction mechanism, so that logically
repetitive code does not have to be physically replicated. (This is
an illustration of how to solve the problem discussed earlier of how
to generalize the return type of a function so that it is adequate for
the input argument types.)

With all of this machinery in place, `sum()` is now easy to
build:

template<int Dim, class T, class EngineTag> inline T sum( const ConstArray<Dim, T, EngineTag>& a ){ return globalReduction(a, T(0), OpAddAssign()); }

The function just passes the array, a value to initialize the sum
with, and the reduction operation to be applied to a generic templated
function called `globalReduction()`. The first argument to
this function is the array; the second is an initialization value
(which is also the value returned if the array is empty), and the
third specifies the operation to apply. By the time
`globalReduction()` is expanded,
`OpAddAssign::operator()` has been inlined by POOMA's
expression template machinery. Note that the initialization value must
be an identity element for the operation; while zero works in most
cases, some operations (such as logical and bitwise OR) must use other
values.

POOMA achieves high performance using expression engines,
which are constructed automatically during compilation, and which
evaluate complex array expressions on demand in order to avoid
creation of temporary arrays. In order to support mixed data types,
and the use of both arrays and array expressions as arguments,
user-defined functions should be templated separately by both the data
type and engine type of all of their arguments. Unfortunately, C++
does not support templatization on function return type, which can
make it difficult to write fully-generic functions. Finally,
POOMA provides several built-in reduction functions, such as
summation, product, and logical combination. These are implemented
using a generic framework, which can be extended by knowledgeable
users.

[Prev] | [Home] | [Next] |