 # POOMA Tutorial 1A Laplace Solver Using Simple Jacobi Iteration

Contents:
Introduction
Laplace's Equation
A Sequential Solution
Using Intervals
Some Refinements
A Note on Affinity
Summary

## Introduction

This tutorial introduces two of the most commonly used classes in POOMA: Array, which is used to store data, and Interval, which is used to specify a subsection of an array. The key ideas introduced in this tutorial are:

• the use of whole-array operations, such as scalar-to-array assignment and elementwise addition; and
• the use of intervals to specify array sections.

## Laplace's Equation

Our first POOMA program solves Laplace's equation on a regular grid using simple Jacobi iteration. Laplace's equation in two dimensions is:

d2V/dx2 + d2V/dy2 = 0

where V is, for example, the electric potential in a flat metal sheet. If we approximate the second derivatives in X and Y using a difference equation, we obtain:

V(i, j) = (V(i+1, j) + V(i, j+1) + V(i-1, j) + V(i, j-1)) / 4

i.e. the voltage at any point is the average of the voltages at neighboring points. This formulation also gives us a way to solve this equation numerically: given any initial guess for the voltage V0, we can calculate a new guess V1 by using V0 on the right hand side of the equation above. We can then use the calculated V1 to calculate a new guess V2, and so on.

This process, called Jacobi iteration, is the simplest in a family of relaxation methods than can be used to solve a wide range of problems. All relaxation methods iterate toward convergence, and use some kind of nearest-neighbor updating scheme, or stencil. The stencil for Jacobi iteration, for example, consists of five points arranged in a cross; other, larger stencils lead to different update rules, and different convergence rates. One of the main goals of POOMA was to make it easy for programmers to specify and implement stencil-based algorithms of this kind.

If we add charged particles to the system, we obtain Poisson's equation:

d2V/dx2 + d2V/dy2 = ß

where ß specifies the charge distribution. The solution to this equation can also be calculated using a relaxation method such as Jacobi iteration; the update equation is:

V(i, j)new = (V(i+1, j) + V(i, j+1) + V(i-1, j) + V(i, j-1) - ß(i, j))/4

## A Sequential Solution

Our first version of Jacobi iteration models a flat plate with a unit charge in its center using a 20×20 array. It uses POOMA's arrays conventionally, by looping over their elements, and is included in the release as examples/Solvers/Sequential. There is nothing wrong with using the library this way---POOMA's arrays are still very fast, and memory-efficient---but when compared with the refined program shown later, this code is longer and harder to read.

```#include "Pooma/Arrays.h"
#include <iostream>

// The size of each side of the domain.
const int N = 20;

int
main(
int                 argc,           // argument count
char*               argv[]          // argument list
){
// Initialize POOMA.
Pooma::initialize(argc, argv);

// The array we'll be solving for
Array<2> V(N, N);

// The right hand side of the equation (spike in the center)
Array<2> b(N, N);

// Initialize.
for (int i=0; i<N; ++i){
for (int j=0; j<N; ++j){
V(i, j) = 0.0;
b(i, j) = 0.0;
}
}
b(N/2, N/2) = -1.0;

// Iterate 200 times.
Array<2> temp(N, N);
for (int iteration=0; iteration<200; ++iteration)
{
// Use interior of V to fill temp
for (int i=1; i<N-1; ++i){
for (int j=1; j<N-1; ++j){
temp(i, j) = 0.25*(V(i+1,j) + V(i-1,j) + V(i,j+1) + V(i,j-1) - b(i,j));
}
}
// Use temp to fill V
for (int i=1; i<N-1; ++i){
for (int j=1; j<N-1; ++j){
V(i, j) = temp(i, j);
}
}
}

// Print out the result
for (int j=0; j<N; ++j){
for (int i=0; i<N; ++i){
std::cout << V(i, j) << " ";
}
std::cout << std::endl;
}

// Clean up POOMA and report successful execution.
Pooma::finalize();
return 0;
}
```

## Using Intervals

The program shown above is not much of an advance over its C equivalent. The programmer is still required to loop over data elements explicitly, even though these loops all take the same form. A better implementation of Jacobi iteration is shown below (and included in the release as examples/Solvers/SimpleJacobi). This version uses Interval objects to specify index ranges, which eliminates the need for the explicit loops of the first version.

```01  #include "Pooma/Arrays.h"
02
03  #include <iostream>
04
05  // The size of each side of the domain.
06  const int N = 20;
07
08  int
09  main(
10      int                 argc,           // argument count
11      char*               argv[]          // argument list
12  ){
13      // Initialize POOMA.
14      Pooma::initialize(argc, argv);
15
16      // The array we'll be solving for
17      Array<2> V(N, N);
18      V = 0.0;
19
20      // The right hand side of the equation (spike in the center)
21      Array<2> b(N, N);
22      b = 0.0;
23      b(N/2, N/2) = -1.0;
24
25      // Specify the interior of the domain
26      Interval<1> I(1, N-2), J(1, N-2);
27
28      // Iterate 200 times
29      for (int iteration=0; iteration<200; ++iteration)
30      {
31          V(I,J) = 0.25*(V(I+1,J) + V(I-1,J) + V(I,J+1) + V(I,J-1) - b(I,J));
32      }
33
34      // Print out the result
35      std::cout << V << std::endl;
36
37      // Clean up POOMA and report success.
38      Pooma::finalize();
39      return 0;
40  }
```

The first three lines of this program include the header file needed to write POOMA programs, and the standard C++ I/O streams header file. Pooma/Arrays.h includes the header files that define POOMA's arrays. These arrays do not themselves contain data, but instead are handles on data containers through which programs can read, write, and apply operations to N-dimensional sets of elements. As we shall see, it is possible for many arrays to refer to the same underlying data in different ways.

Pooma/Arrays.h also includes all the declarations of the basic POOMA library interface functions. These general routines are used to initialize, query, and shut down the POOMA library environment, including the underlying run-time system.

The next statement in this program, at line 6, defines the size of our problem domain. In order to keep this code simple, this size is made a constant, and the array is made square. Real applications will usually employ variable-sized domains, and put off decisions about the actual sizes of arrays until run-time.

The function Pooma::initialize(), which is called at line 14, initializes some of POOMA's internal data structures. This function looks for certain POOMA-specific arguments in the program's command-line argument list, strips them out, and returns a possibly-shortened list. Programs should call Pooma::initialize() before calling any functions or methods from the POOMA library that might do operations in parallel. (They can alternatively use Pooma::Options, as described in another tutorial.) In practice, this means that it is generally a bad idea to declare POOMA objects as global variables, even if the program is not parallel when it is first written, since their presence can impede future portability.

Line 17 actually declares an array. The first thing to notice is that the rank of the array (i.e., the number of dimensions it has) is a template parameter to the class Array, while the initial dimensions of the array are given as constructor parameters. If we wanted to create a 3-dimensional array, we could change this line to be something like:

```Array<3> V(SizeX, SizeY, SizeZ);
```

When the array V is created, POOMA realizes that some storage has to be created for it as well, and so it creates an actual data area at this point. When the assignment statement on the next line (line 18) is executed, POOMA sees an array target, but a scalar value, so it fills the whole array with the scalar's value.

Lines 21-23 create and initialize the array that stores the charge distribution term ß. This array's values are fixed: there is a single negative unit charge in the center of the domain, and no other charges anywhere else. Note how line 22 uses scalar-to-array assignment, while line 23 assigns to a single element of the array b using conventional subscripting. POOMA's arrays have many advanced features, but they also support mundane operations, such as reading or writing a particular location.

Line 26 introduces the Interval class. An Interval specifies a contiguous range of index values; the integer template argument to Interval specifies the interval's rank, while the constructor arguments specify the low and high ends of the interval's value. Thus, since N is fixed at 20 in this program, both I and J specify the one-dimensional interval from 1 to 19 inclusive.

Intervals are used to select sections of arrays using a Fortran 90-like syntax. Intervals and integers may be freely mixed when indexing an array; if any index in an expression is an interval, the result is a temporary alias for the specified array section. This alias is itself an array, since arrays are just lightweight handles on underlying data storage objects. The expression V(I,J) therefore returns a temporary array which aliases the interior of the same storage used by the array V.

Note that since the array V is square, the program could have declared a single Interval spanning 1..N-2, and used it to index V along both axes. However, the code is easier to read, and easier to modify to handle non-square domains, if two separate Intervals are used.

The loop spanning lines 29-32 performs the Jacobi relaxation. As discussed earlier, this consists of repeatedly averaging the charge distribution b at each location, and the values in V that are adjacent to that location, and then updating the location with that average. These calculations are done in parallel; that is, they appear to be calculated simultaneously for all elements in the array. This is accomplished by using the Intervals declared on line 26 to select sections of V with appropriate offsets, and then relying on overloaded addition and subtraction operators to combine these sections. For example, the expression V(I,J+1) selects those elements V(i,j) of V for which i is in the range 1..N-2, and j is in the range 2..N-1 (i.e., the domain is offset in the second dimension by one). As can be inferred, arithmetic on Intervals works in the obvious way: for example, adding an integer adjusts all the elements of the Interval upward or downward.

Note that the assignment on line 31 automatically creates a temporary copy of the array V, so that values are not read while they are being overwritten. POOMA automatically detects cases in which the stencils on the reading side of an assignment overlap the stencils on the assignment's writing side, and creates temporaries as needed to avoid conflicts. The program shown in the next tutorial avoids the creation of temporaries simply by using non-overlapping stencils.

The statement on line 35 prints out the whole of the array V. POOMA overloads the usual stream operators (<< and >>) to handle most of the objects in the library sensibly. In this case, the output expression prints the elements of V a row at a time, putting each row on a separate line. Finally, line 38 calls the cleanup routine Pooma::finalize(), which complements the earlier call to Pooma:initialize() on line 14, and returns 0 to indicate successful completion.

## Some Refinements

One thing that isn't shown in the program above is the precision of the calculations. To find out what this is, we can inspect the declaration of the class Array in the POOMA header file Array.h:

```template < int Dim,
class T         = POOMA_DEFAULT_ELEMENT_TYPE,
class EngineTag = POOMA_DEFAULT_ENGINE_TYPE >
class Array : ...
{
...
};
```

The class Array has three template parameters: the number of dimensions, the element data type, and an engine tag that specifies how the underlying data is actually stored. We will discuss engines and engine tags in more detail in subsequent tutorials.

What makes this templated class declaration different from others we have seen so far is that default values are supplied for two of its three type parameters. The macros POOMA_DEFAULT_ELEMENT_TYPE and POOMA_DEFAULT_ENGINE_TYPE are defined in the header file PoomaConfiguration.h. The first specifies the default element type of arrays, while the second specifies their default storage mechanism. The default for the first is double, while the default for the second specifies dense, rectangular storage.

There are therefore two ways to change the precision of the calculations in the program above. One is to re-define POOMA_DEFAULT_ARRAY_ELEMENT_T:

```#undef POOMA_DEFAULT_ELEMENT_TYPE
#define POOMA_DEFAULT_ELEMENT_TYPE float
#include "Pooma/Arrays.h"
```

The "undefinition" is needed because some compilers automatically read a "prefix file" before any other headers. This #define must come before any of POOMA's header files are included to ensure that all instantiations of all POOMA classes are done with the same default in effect.

The second, and more modular, way to change the precision of this Laplace solver is to specify the data types of the arrays explicitly:

```    Array<2, float> V(N, N);
Array<2, float> b(N, N);
```

This is generally considered better practice, as it is clear at the point of declaration what the data type of each array is, and because it makes it easier for programmers to combine classes that have been written independently. Other aspects of POOMA can and should be changed in the same way. (For example, the default engine type could be re-defined to make parallel evaluation the default.)

It is also generally considered good practice to use typedefs to ensure the consistency of array definitions. For example, the Laplace solver could be written as follows:

```    typedef double LaplaceDataType_t;
typedef Array<2, LaplaceDataType_t> LaplaceArrayType_t;
LaplaceArrayType_t V(N, N);
LaplaceArrayType_t b(N, N);
```

Declaring types explicitly in this way might seem unnecessarily fussy in a small program such as this. However, all programs have a tendency to grow, and finding and modifying dozens of object declarations after the fact is much more tedious and error-prone than defining a type once, in one place, and then using it consistently through the rest of the program.

One final note on this program: it might seem cumbersome to declare the array on line 17, then initialize it with an assignment on the next line, instead of providing an initial value for the array's elements with an extra constructor argument. POOMA requires this in order to avoid ambiguity regarding what is a dimension, and what is an initial value. Since a single templated class Array is used for arrays of any dimension up to seven, it must provide constructors taking up to seven arguments which between them specify the array's dimensions. If we let Sub1, Sub2, and so on represent classes that can legally be used to specify dimensions (such as int or Interval), then Array must have constructors like the ones shown below:

```template < int Dim,
class T         = POOMA_DEFAULT_ELEMENT_TYPE,
class EngineTag = POOMA_DEFAULT_ENGINE_TYPE >
class Array : ...
{
public :
template<class Sub1>
Array(const Sub1& s1);

template<class Sub1, class Sub2>
Array(const Sub1& s1, const Sub2& s2);

template<class Sub1, class Sub2, class Sub3>
Array(const Sub1& s1, const Sub2& s2, const Sub3& s3);

etc.
};
```

Suppose that Array also included constructors that took an initial value for the array's elements as an argument:

```template < int Dim,
class T         = POOMA_DEFAULT_ELEMENT_TYPE,
class EngineTag = POOMA_DEFAULT_ENGINE_TYPE >
class Array : ...
{
public :
template<class Sub1>
Array(const Sub1& s1, T initial_value);

template<class Sub1, class Sub2>
Array(const Sub1& s1, const Sub2& s2, T initial_value);

etc.
};
```

Declarations such as the following would then be ambiguous:

```Array<2, int> w(8, 5);
```

since the compiler would not be able to tell whether the two arguments were to be interpreted as dimensions, or as a dimension and an initializer. If C++ provided a way to "hide" constructors based on the value of a template argument, so that only the constructors for N-dimensional arrays could be called for Array<N>, this problem wouldn't arise. Since there is no such mechanism, POOMA requires programmers to specify initial values by wrapping them in a templated class. This is done as shown in the following declaration:

```Array<2> w(5, 7, modelElement(3.14));
```

The function modelElement() does nothing except return an instance of ModelElement<T>, where T is the type of modelElement()'s argument. The ModelElement class in its turn only exists to provide enough type information for the compiler to distinguish between initializers and dimensions; the corresponding constructors of Array are:

```template < int Dim,
class T         = POOMA_DEFAULT_ARRAY_ELEMENT_T,
class EngineTag = POOMA_DEFAULT_ARRAY_ENGINE >
class Array : ...
{
public :

// constructors for 1-dimensional arrays
template<class Sub1>
Array(const Sub1& s1);

template<class Sub1>
Array(const Sub1& s1, ModelElement<T> initial_value);

// constructors for 2-dimensional arrays
template<class Sub1, class Sub2>
Array(const Sub1& s1, const Sub2& s2);

template<class Sub1, class Sub2>
Array(const Sub1& s1, const Sub2& s2, ModelElement<T> initial_value);

etc.
};
```

Note that the function modelElement() is just a programming convenience: its only real purpose is to save programmers the trouble of typing:

```Array<2> w(5, 7, ModelElement<double>(3.14));
```

## A Note on Affinity

In some shared-memory machines, such as SGI Origins, every processor can access memory everywhere in the machine, but there is still a difference between "local" and "remote" memory. The memory chips are physically located with particular processors, so when processor 0 accesses memory that is actually stored with processor 127, the access is on average about 3-4 times slower than if processor 0 accesses its own memory. This only arises on very large machines---computers with up to 8 processors generally have truly symmetric memory.

When a program dynamically allocates memory on such a machine, the pages get mapped into the memory that is located with the CPU that first touches the memory. That is not necessarily the CPU that requested the allocation, since many pages could be allocated in one logical operation and pointers to them could be handed to other CPUs before being dereferenced.

Thus, both memory and threads can have an affinity for particular processors. A chunk of memory has affinity for a particular CPU, and the thread scheduler can give a thread affinity for a CPU.

The difficulty that arises is that if the thread that is running the user's code initializes the memory for an Array with the modelElement() function mentioned in the first tutorial, all of the memory gets mapped to the CPU where that thread is running, instead of to a CPU across the machine.

One solution to this problem would be for the constructor that takes a ModelElement to generate the iterates that fill the memory, and then farm them out to the proper threads, so that the memory is mapped where the program actually wants it. This optimization is not in this release of POOMA, but will be considered for future releases.

## Summary

This tutorial has shown that POOMA's Array class can be indexed sequentially, like a normal C or C++ array. It can also be indexed using Interval objects, each of which specifies a contiguous range of indices. When an Array is indexed using an Interval, the result itself acts like an array. Overloaded operators can be used to perform arithmetic and assignment on both arrays and selected array sections. Finally, the elementary data type of arrays can be changed globally by redefining a macro, or for individual arrays by overriding the default value of the Array template's second type parameter. The latter is considered better programming practice, particularly when typedef is used to localize the type definition.