 Contents:
Introduction
Notation
Example
Summary

## Introduction

Indirect addressing is a fundamental operation in many numerical and scientific algorithms. Instead of accessing array elements with loop indices directly, indirect addressing uses the element of one array (sometimes called an index table or indirection table) as indices for another array. These index tables can store either static information (such as the neighbors of points in an unstructured mesh), or dynamic information (such as the sorting order of the elements in a vector).

This tutorial shows how to perform indirect addressing in POOMA, and discusses some of the subtleties and complexities that arise when indirection and multithreading are combined.

## Notation

Suppose that the array X contains the following values:

 a b c d

while the array J contains:

 3 0 1 2

A program could re-order the elements of X while copying them into another array Y using the following loop:

```for (int i=0; i<4; ++i)
{
Y = X(J(i));
}
```

The effect of this would be to fill Y with the following values:

 d a b c

POOMA allows this operation to be expressed more economically, simply by using the array J as a subscript on the array X directly:

```Y = X(J);
```

Indirect addressing on the source side of an assignment is sometimes called "pull" addressing, since the index array's values are being used to "pull" values from the source into the destination. POOMA also supports "push" addressing, in which the index array is used on the destination side of the assignment. The syntax for this is simply:

```Y(J) = X;
```

which is equivalent to the loop:

```for (int i=0; i<4; ++i)
{
Y(J(i)) = X;
}
```

This operation would fill Y with:

 b c d a

So long as J is a strict permutation of the indices 0...N-1, this will have the same effect as the loop shown above. The effects of this statement if J has repeated or missing elements are discussed in the next section.

One-dimensional arrays of integers can be used as subscripts to one-dimensional arrays of any other type, but how are multi-dimensional arrays to be subscripted? In POOMA, the answer is to use arrays whose elements are of type Loc. An Array<Loc<2>>, for example, can be used to re-order the elements of a two-dimensional array, since each element of the index array can act as a coordinate in the data array. Similarly, an Array<Loc<3>> can be used to subscript a 3-dimensional array of any type. Future releases of POOMA will support higher-dimensional indirect addressing as well.

Indirect addressing is a very powerful tool, but must be used carefully. The most important consideration is that the order of data movement during indirection is not defined. If indirection is performed using an index table that sends many values to a single location, for example, then there is no way to predict which of those values will be written into that location.

However, indirect addressing can always be used to read values safely, and to thereby perform a scatter operation. Suppose a source array S contains the values [3.14, 2.71], while an index array IA contains the values [0,1,0,0,1,1]. The expression S(IA) yields:

```[3.14, 2.71, 3.14, 3.14, 2.71, 2.71]
```

and can always be used safely on the right-hand side of an expression. This works because the domain of a(b) is the domain of b. In expressions like a(b) = c, the domains of c and b have to match, but the domain of a can be arbitrary.

## Example

The example for this tutorial is a 1-dimensional Fast Fourier Transform (FFT) that shuffles data using indirect addressing. This FFT implementation is not efficient---it recalculates trigonometric constants repeatedly, for example, rather than pre-calculating and storing them---but it does illustrate the power of indirection.

The source for this example is included in the release in the file examples/Indirection/FFT/FFT.cpp. The main body of this program initializes POOMA, creates and fills an array of complex values, transforms it, and prints the result of that transform, as shown below:

```138  int main(int argc, char* argv[])
139  {
140    Pooma::initialize(argc, argv);
141
142    int size = 16;
143
144    Array<1, complex<double>, Brick> a(size);
145
146    int i;
147    for (i = 0; i < size; ++i)
148    {
149      a(i) = sin(4*i*Pi/size);
150    }
151
152    std::cout << a << std::endl;
153
154    fft(a);
155
156    std::cout << a << std::endl;
157
158    Pooma::finalize();
159    return 0;
160  }
```

The key statement is on line 154, where the fft() function is invoked. The program contains two overloaded versions of this function. The first version, shown below, determines the level of the FFT (i.e. the number of recursive steps the calculation requires), then invokes the second version, which actually performs the calculations. (Note that in this simple example, the input array's length is required to be a power of two. Also, Pi is defined as a static const double equal to the value of pi computed from the expression acos(-1.0) because some compilers do not define mathematical constants such as M_PI in the <math.h> header file.)

```117  void fft(const Array<1, complex<double>, Brick> &array)
118  {
119    int size = array.domain().size();
120
121    // Determine size as power of 2
122    int level = -1;
123    while (size > 1)
124    {
125      PAssert(!(size & 1));
126      ++level;
127      size /= 2;
128    }
129
130    if (level >= 0)
131      fft(array, level);
132  }
```

The second version of fft() does the real number-crunching. If the computation has reached its final stage, odd and even elements are combined directly (lines 106-111). If the computation is still recurring, the elements are shuffled, a half-sized transform is applied on each subsection, and the results are combined (lines 100-102). All of these operations use indirect addressing to move data values around. Most of the rest of the program can be viewed as infrastructure needed to make this data movement simple and efficient.

```083  void fft(const Array<1,complex<double>,Brick> &array, int level)
084  {
085    Interval<1> domain = array.domain();
086
087    if (level > 0)
088    {
089      ConstArray<1, int,             IndexFunction<LeftMap> >    left(domain);
090      ConstArray<1, int,             IndexFunction<RightMap> >   right(domain);
091      ConstArray<1, int,             IndexFunction<ShuffleMap> > shuffle(domain);
092      ConstArray<1, complex<double>, IndexFunction<TrigFactor> > trig(domain);
093
094      left.engine().setFunctor(LeftMap(level));
095      right.engine().setFunctor(RightMap(level));
096      shuffle.engine().setFunctor(ShuffleMap(level));
097      trig.engine().setFunctor(TrigFactor(level));
098
099      // Shuffle values, compute n/2 length ffts, combine results.
100      array = array(shuffle);
101      fft(array, level-1);
102      array = array(left) + trig*array(right);
103    }
104    else
105    {
106      int size = domain.size();
107      Range<1> left (0, size-2, 2),
108               right(1, size-1, 2);
109
110      array(left) += array(right);
111      array(right) = array(left) - 2.0 * array(right);
112    }
113  }
```

The shuffling step on line 100 uses an indirection array called shuffle to pull values into the right positions. This array, which is declared on line 91, is a ConstArray of integers. Instead of storing the values, however, the array calculates them on the fly using an IndexFunction engine, which is bound to the array on line 96. The IndexFunction engine works as one would expect: having been specialized with a user-defined class with an overloaded operator(), the engine transforms an index i into some new value by calling that operator(). In this case, the specializing class is ShuffleMap, which is shown below:

```003  struct ShuffleMap
004  {
005    ShuffleMap(int n = 0)
006      : degree_m(n)
007    {
008      nbit_m = 1 << n;
009      mask1_m = nbit_m - 1;
011    }
012
013    int operator()(int i) const
014    {
015      return
017        | ( (mask1_m & i) << 1 )
018        | ( (nbit_m & i) ? 1 : 0 );
019    }
020
022  };
```

Similar engines are used to select and combine elements of the arrays after the sub-FFTs have been performed. These use the overloaded operator() in the classes LeftMap and RightMap, shown below:

```029  struct LeftMap
030  {
031    LeftMap(int n = 0)
032      : nbit_m(~(1 << n))
033    { }
034
035    int operator()(int i) const
036    {
037      return (nbit_m & i);
038    }
039
040    int nbit_m;
041  };
042
043  struct RightMap
044  {
045    RightMap(int n = 0)
046      : nbit_m(1 << n)
047    { }
048
049    int operator()(int i) const
050    {
051      return (nbit_m | i);
052    }
053
054    int nbit_m;
055  };
```

Finally, an IndexFunction engine is also used to calculate the trigonometric weights used in combining. This IndexFunction is an extreme example of trading time for space: it does not store anything, but repeatedly recalculates factors on demand.

```065  struct TrigFactor
066  {
067    TrigFactor(int n = 0)
068      : n_m(1 << n)
069    { }
070
071    complex<double> operator()(int i) const
072    {
073      return complex<double>(cos(Pi*i/n_m), sin(Pi*i/n_m));
074    }
075
076    int n_m;
077  };
```

## Summary

Efficient support for indirect addressing---the use of the values in one array to change the way another array's elements are accessed---is one of the features that characterizes full-strength numerical libraries. This release of POOMA supports indirect addressing in both "push" and "pull" modes using conventional data-parallel syntax, without compromising the performance of regular index operations.