Compute Layer

Overview

The Compute layer contains the functions that perform the actual processing. All functions in this layer are prefixed with rc_ for RAPP Compute, separating them from the API functions that all start with rapp_. A function may have two different implementations, a generic and a vector version. The generic implementation can run on any platform, but the vector version requires a platform-specific vector backend.

Common Functionality

There is a small layer of hardware abstractions to shield the implementations from all the platform-specific details:

rc_platform.h: Platform-specific constants such as native word size, endianness and alignment.
rc_stdbool.h: Portable C99 stdbool.h.
rc_word.h: Word operations independent of endianness and native word size.

There is also some platform-independent functionality:

rc_util.h: Common utilities such as MIN(), MAX(), CLAMP().
rc_table.h: Static lookup tables used in more than one place.

The common functionality is located in the compute/common directory, except for rc_platform.h and rc_stdbool.h, which are exported to the API layer and therefore reside in compute/include.

Compute API

All headers for the Compute layer API are located in the compute/include directory. The header rappcompute.h includes everything that is exported, so the API layer should only include this file.

The interface of the Compute layer is:

rc_stdbool.h: Portable C99 stdbool.h.
rc_platform.h: Platform-specific constants.
rc_malloc.h: Aligned memory allocator.
rc_bitblt_wm.h: Bitblit, word-misaligned (bit-level).
rc_bitblt_wa.h: Bitblit, word-aligned.
rc_bitblt_vm.h: Bitblit, vector-misaligned (byte-level).
rc_bitblt_va.h: Bitblit, vector-aligned.
rc_pixop.h: Pixelwise arithmetic operations.
rc_type.h: Type conversions.
rc_thresh.h: Thresholding.
rc_stat.h: Sum and sum-of-squares statistics.
rc_moment_bin.h: Binary image moments.
rc_filter.h: Fixed-filter convolutions.
rc_morph_bin.h: Binary morphology primitives.
rc_pad.h: 8-bit padding.
rc_pad_bin.h: Binary padding.
rc_reduce.h: 8-bit 2x spatial reduction.
rc_reduce_bin.h: Binary 2x spatial reduction.
rc_expand_bin.h: Binary 2x spatial expansion.
rc_rotate.h: 8-bit 90 degree rotation.
rc_rotate_bin.h: Binary 90 degree rotation.
rc_margin.h: Binary image logical margins.
rc_crop.h: Binary region cropping.
rc_fill.h: Connected-components seed fill.
rc_contour.h: Contour chain code generation.
rc_rasterize.h: Chain code line rasterization.
rc_cond.h: Conditional pixelwise operations.
rc_gather.h: Conditional 8-bit gather.
rc_gather_bin.h: Conditional binary gather.
rc_scatter.h: Conditional 8-bit scatter.
rc_scatter_bin.h: Conditional binary scatter.
rc_integral.h: 8-bit to 8, 16, and 32-bit integral images.
rc_integral_bin.h: Binary to 8, 16, and 32-bit integral images.

Implementation Principles

Both the generic and the vector implementations follow two basic rules:

Minimize the amount of redundant code.
Minimize the use of conditional compilations with #ifdef.

This has two implications. First, preprocessor macros are used extensively. In the word and vector interfaces they are used for efficiency (inlining), but in the actual function implementations they serve as templates. These template macros perform everything that is common to a family of functions, and accept other macros as arguments to alter the computation performed in the inner loop. For example, the generic implementations of the double-operand pixelwise arithmetic operations are almost identical for all functions; they differ only in the arithmetic operation performed. The template macros are usually private to the source file, but the thresholding templates are not, since they are used for both thresholding and u8-to-binary conversion.
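The template-macro idea can be illustrated with a minimal sketch. All names here (EX_PIXOP_TEMPLATE, EX_OP_ADDS, EX_OP_SUBS, ex_pixop_adds_u8, ex_pixop_subs_u8) are hypothetical and chosen for illustration only; the actual templates in the RAPP sources are not shown in this section and may look different.

```c
#include <stdint.h>

/* Hypothetical per-pixel operations, passed to the template as macros. */
#define EX_OP_ADDS(a, b) ((a) + (b) > 255 ? 255 : (a) + (b)) /* saturated add */
#define EX_OP_SUBS(a, b) ((a) > (b) ? (a) - (b) : 0)         /* saturated sub */

/* Hypothetical template: the loop and the load/store code are shared
 * by the whole function family; only the operation OP differs. */
#define EX_PIXOP_TEMPLATE(name, OP)                       \
    void name(uint8_t *dst, const uint8_t *src, int len)  \
    {                                                     \
        int i;                                            \
        for (i = 0; i < len; i++) {                       \
            int d = dst[i];                               \
            int s = src[i];                               \
            dst[i] = (uint8_t)(OP(d, s));                 \
        }                                                 \
    }

/* Instantiate two functions from the same template. */
EX_PIXOP_TEMPLATE(ex_pixop_adds_u8, EX_OP_ADDS)
EX_PIXOP_TEMPLATE(ex_pixop_subs_u8, EX_OP_SUBS)
```

Adding another operation to the family then only requires one more operation macro and one more instantiation line, which is how the redundancy is kept down.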

The second implication is the use of ordinary if/else statements instead of preprocessor #ifdef. We rely on the compiler to optimize away branches that are never taken, i.e. those whose condition is a compile-time constant. Usually the conditions are related to the word and vector sizes, and also to loop unrolling factors.
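A minimal sketch of this style follows. EX_WORD_SIZE and ex_splat_u8 are hypothetical names standing in for a platform constant and a helper function; they are not taken from RAPP.

```c
#include <stdint.h>

/* Hypothetical stand-in for a platform word-size constant (in bytes). */
#define EX_WORD_SIZE 4

/* Replicate one byte into every byte of an EX_WORD_SIZE-byte word. */
uint64_t ex_splat_u8(uint8_t value)
{
    uint64_t word = value;

    word |= word << 8;
    word |= word << 16;

    /* Ordinary if instead of #if: the condition is a compile-time
     * constant, so an optimizing compiler can drop the dead branch. */
    if (EX_WORD_SIZE == 8) {
        word |= word << 32;
    }
    return word;
}
```

Unlike code hidden behind #ifdef, both branches are still parsed and type-checked in every configuration, so a branch that is unused on one platform cannot silently stop compiling.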

Generic Implementation

The generic implementations are located in the directory compute/generic. They have access to all the common functionality. To make things easier to test and maintain, one should try to avoid using the RC_BIG_ENDIAN/RC_LITTLE_ENDIAN constants, and instead rely on the word interface.
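To illustrate why going through a word interface helps, here is a minimal sketch in which the endianness test appears in exactly one macro. EX_WORD_SHL and ex_shift_pixels_left are hypothetical names, not RAPP's word interface, and the sketch assumes the GCC/Clang predefined macros __BYTE_ORDER__ and __ORDER_BIG_ENDIAN__ rather than the constants from rc_platform.h.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical word operation: shift the word contents towards lower
 * memory addresses. This is the only place that looks at endianness.
 * Assumption: GCC/Clang byte-order macros; otherwise little-endian. */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
#define EX_WORD_SHL(word, bits) ((uint32_t)((word) << (bits)))
#else
#define EX_WORD_SHL(word, bits) ((uint32_t)((word) >> (bits)))
#endif

/* Shift a group of four 8-bit pixels one pixel to the left in memory,
 * filling with zero. The caller never tests endianness itself. */
void ex_shift_pixels_left(uint8_t pix[4])
{
    uint32_t word;

    memcpy(&word, pix, sizeof word);  /* load the pixels as one word */
    word = EX_WORD_SHL(word, 8);      /* endianness-independent shift */
    memcpy(pix, &word, sizeof word);  /* store the result back */
}
```

Because the function body is identical on both kinds of platform, it needs to be tested only once, which is the maintenance benefit the paragraph above refers to.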

All functions in the Compute layer API must have a generic implementation as a fallback, with one exception. The vector-aligned and vector-misaligned bitblits are only available in a vector version, since they can be implemented with the soft-SIMD (SWAR) vector backend on all platforms. Using soft-SIMD, they essentially degenerate into their word-aligned equivalents.

Vector Implementation

The vector implementations are located in the directory compute/vector. They use the same common functionality as the generic implementations, but also have access to the vector interface. An implementation of this interface is called a vector backend.

The vector interface is not restricted to SIMD-capable architectures. There is a soft-SIMD backend that emulates SIMD operations in software, on top of the word interface.
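As a minimal sketch of what emulating SIMD in software means, the function below adds four packed 8-bit lanes held in one 32-bit word, using only ordinary integer operations. It shows the general SWAR technique; the name ex_swar_add_u8x4 is hypothetical and the code is not taken from the RAPP soft-SIMD backend.

```c
#include <stdint.h>

/* Add four packed 8-bit lanes in one 32-bit word.
 * Each lane wraps modulo 256 and never carries into its neighbour. */
uint32_t ex_swar_add_u8x4(uint32_t a, uint32_t b)
{
    const uint32_t msb = 0x80808080u;  /* top bit of every lane */

    /* Add the low 7 bits of each lane. Any carry ends up in the
     * lane's own top bit, so it cannot spill into the next lane. */
    uint32_t low = (a & ~msb) + (b & ~msb);

    /* Add the top bits without carry by using XOR. */
    return low ^ ((a ^ b) & msb);
}
```

For example, adding 0x01010180 to 0x01FF7F80 gives 0x02008000: the lanes 0xFF + 0x01 and 0x80 + 0x80 both wrap to 0x00 without disturbing their neighbours.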

A particular vector backend may not implement all operations specified by the interface. The vector implementations must therefore guard each function with an #ifdef that is conditional on the vector operations it uses. This way, only the functions whose prerequisites in terms of vector operations are all fulfilled will actually be compiled.
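The guarding pattern can be sketched as follows. The backend here is made up for the example: ex_vec_t, EX_VEC_OR and EX_VEC_AVG are hypothetical names, and the backend is assumed to provide an OR operation but no average operation.

```c
#include <stdint.h>

/* Hypothetical vector backend: provides a vector type and an OR
 * operation, but deliberately leaves EX_VEC_AVG undefined. */
typedef uint32_t ex_vec_t;
#define EX_VEC_OR(dst, a, b) ((dst) = (a) | (b))

/* Compiled: the only vector operation it needs is available. */
#ifdef EX_VEC_OR
void ex_or_vec(ex_vec_t *dst, const ex_vec_t *src, int len)
{
    int i;
    for (i = 0; i < len; i++) {
        EX_VEC_OR(dst[i], dst[i], src[i]);
    }
}
#endif

/* Skipped: this backend does not define EX_VEC_AVG, so the function
 * is never seen by the compiler and a fallback must be used instead. */
#ifdef EX_VEC_AVG
void ex_avg_vec(ex_vec_t *dst, const ex_vec_t *src, int len)
{
    int i;
    for (i = 0; i < len; i++) {
        EX_VEC_AVG(dst[i], dst[i], src[i]);
    }
}
#endif
```

A function that needs several vector operations would test all of them in its guard, for example with #if defined(...) && defined(...).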

Selecting Implementation

There is a mechanism for selecting implementations automatically. The header file rc_impl_cfg.h provides two macros, RC_IMPL(function-name, unrollable) and RC_UNROLL(function-name). The first macro expands to either 0 or 1 and can be tested with #if. The second one expands to an unroll factor of 1, 2 or 4. The following example demonstrates their use:

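  #if RC_IMPL(rc_example_u8, 1)  /* Compile only if this implementation is selected */
  void
  rc_example_u8(uint8_t *buf, int len)
  {
      int i;
      /* Note: len must be a multiple of the unroll factor, */
      /* or the unrolled statements run past the end of buf. */
      for (i = 0; i < len;) {
          buf[i] = ~buf[i]; i++;
          /* RC_UNROLL() is a compile-time constant, so the compiler */
          /* can remove the branches that do not apply. */
          if (RC_UNROLL(rc_example_u8) >= 2) {
              buf[i] = ~buf[i]; i++;
          }
          if (RC_UNROLL(rc_example_u8) == 4) {
              buf[i] = ~buf[i]; i++;
              buf[i] = ~buf[i]; i++;
          }
      }
  }
  #endif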

This mechanism must be used if either of the following applies:

There is more than one implementation of the same function (generic/vector).
An implementation is unrollable, i.e. uses the RC_UNROLL() facility.

There is one version of the rc_impl_cfg.h file for generic implementations and one for vector implementations, located in compute/generic and compute/vector, respectively. They define the RC_IMPL() and RC_UNROLL() macros differently based on the content of a platform-specific configuration header called rapptune.h. This header is generated automatically by the tuning process.
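The following is a rough sketch of how such a pair of macros could be built on top of generated tuning data. It is not the content of rc_impl_cfg.h or rapptune.h: every name with an EX_ or ex_ prefix is hypothetical, the layout of the tuning data is an assumption, and the sketch ignores the unrollable argument.

```c
#include <stdint.h>

/* Assumed shape of generated tuning data: one implementation choice
 * and one unroll factor per function. Purely hypothetical. */
#define EX_IMPL_GEN 0
#define EX_IMPL_VEC 1
#define ex_invert_u8_IMPL   EX_IMPL_GEN
#define ex_invert_u8_UNROLL 2

/* Hypothetical generic-side configuration: select the function if the
 * tuning data asks for the generic implementation. A vector-side
 * header would compare against EX_IMPL_VEC instead. */
#define EX_IMPL(name, unrollable) (name##_IMPL == EX_IMPL_GEN)
#define EX_UNROLL(name)           (name##_UNROLL)

#if EX_IMPL(ex_invert_u8, 1)
/* len is assumed to be a multiple of the unroll factor. */
void ex_invert_u8(uint8_t *buf, int len)
{
    int i;
    for (i = 0; i < len;) {
        buf[i] = (uint8_t)~buf[i]; i++;
        if (EX_UNROLL(ex_invert_u8) >= 2) {
            buf[i] = (uint8_t)~buf[i]; i++;
        }
    }
}
#endif
```

The point of the sketch is only that the same source line, #if EX_IMPL(...), can give different results in the generic and vector directories because each directory supplies its own definition of the macro.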

Internal References

RAPP is not intended to contain layers of functionality, but sometimes it is necessary to call one RAPP function from the implementation of another. Do not use the internal rc_ name directly if either function has implementations selected by RC_IMPL, because tuning will then malfunction or yield incorrect results. Instead, use rc_stat_max_bin and its counterpart rc_stat_max_bin__internal as a template, and see how they are used in RC_INTEGRAL_SUM_BIN in compute/generic/rc_integral_bin.c.

Influential Definitions

There are a few preprocessor definitions that affect the implementation. They are to be defined only by the build system.

RAPP_USE_SIMD
Use the SIMD vector backend instead of the SWAR one. The --enable-backend configure-time option determines which backend to use.

RAPP_FORCE_GENERIC
Force the generic implementations to be used everywhere, overriding the configuration in rapptune.h.

RAPP_FORCE_SWAR
Force the vector implementations with the SWAR (soft-SIMD) vector backend to be used everywhere, overriding the configuration in rapptune.h.

RAPP_FORCE_SIMD
Force the vector implementations with the SIMD vector backend to be used everywhere, overriding the configuration in rapptune.h.

RAPP_FORCE_UNROLL={1,2,4}
Force all implementations to use a specific unroll factor, overriding the configuration in rapptune.h.

RAPP_FORCE_SIZE={2,4,8}
Force the default word size to be a specific value instead of the native machine word size.

RAPP_FORCE_EXPORT
Force all Compute layer API symbols to be exported in the final library, instead of the default hidden visibility.

The RAPP_FORCE family of parameters is intended for special purposes, such as tuning and regression tests, and is not used when building the final library.
