Fixmath  1.4
Fixmath User's Manual

# Introduction

Fixmath is a library of fixed-point math operations and functions. It is designed to be portable, flexible and fast on platforms without floating-point support:

• The library is written in ANSI-C (C89).
• The library is thread-safe.
• There is no hard-coded fixed-point format. The user can set the binary point freely on a per-operation basis.
• Simple arithmetic operations are inlined.
• Performance is favoured over last-bit precision.

# The Fixed-Point Representation

All fixed-point numbers are represented as 32-bit signed integers. Since the numbers are signed, we think of the most significant bit as the sign bit, even though it is not a true sign bit in two's complement representation. For a fixed-point number, we also have a binary point, separating the integer bits from the fraction bits. If we have n integer bits and m fraction bits, the format is

x = s n.m,

where s is the sign bit, and we have the constraint n + m = 31. In this library the fixed-point formats are often written as Qn.m or Q.m. The binary point is implicit in the sense that it is not stored in the number itself. It is up to the user to keep track of the binary points in different computations.

For clarity, there is a fixed-point type fixed_t that is always a 32-bit signed integer. Other integers may also be used, but they will be converted to 32-bits before the operation. There is also a type for representing reciprocal values, fx_rdiv_t, see the section on reciprocal value division.

# Naming Conventions

All functions are prefixed with fx_. Otherwise, the names follow the naming convention of the math library in C99 as much as possible. This means that functions are named as their floating-point equivalents, with a suffix x for fixed-point.

A fixed-point number is generally specified as a representation xval, and the number of fraction bits (binary point), frac.

# Precision

The error bound is measured in ulp, unit in the last place, which is the smallest representable value. The error for the conversions and the arithmetic operations is within 0.5 ulp. For the remaining math functions the error is higher. The error bounds for these functions are determined empirically using random tests.

Forcing all math functions to be correct to 0.5 ulp would result a severe performance hit. In normal use cases it is generally acceptable to loose one or two bits of precision if this makes the operation several times faster. If a higher precision is needed, choose a fixed-point format with more fraction bits.

# Special Numbers

There are no special number such as NaN and Inf in the IEEE-754 floating-point standard. For any fixed-point format, all possible bit patterns in the 32-bit representation are valid numbers.

# Saturating Arithmetics

There is currently no support for saturating arithmetics. If the computed result is outside the representable range, it will wrap around.

# Integer Operations

Some fixed-point operations are independent of the number of fraction bits. This means that they are the same as in the special case of zero fraction bits, i.e. integers. Since the fixed-point numbers are represented as integers, the normal integer operations in C may be used in these cases. This applies to the following:

• Zero: integral 0 is always 0 in all fixed-point formats.
• Tests: ==, !=, <, <= > and >= can be used if both operands have the same binary point.
• Addition: +, +=, -, -= can be used if both operands have the same binary point.
• Sign: unary + and - always work.

# Error Handling

No checks or error handling is performed by the conversions and the arithmetic operations. In most operations an error condition will just produce an erroneous result, but fx_divx() and fx_rdivx() may trigger divide-by-zero and NULL pointer reference errors, respectively.

The math functions check the arguments and sets the errno variable on errors, similar to what the corresponding functions in the C math library does.

For completeness, fx_addx() and fx_subx() are also provided for addition and subtraction. They are equal to the integer operations + and -, respectively.

# Performance

## Fixed-Point Formats

For optimal performance, the frac arguments should be constants in all conversions and arithmetic operations.

## Constants

Using fx_ftox() and fx_dtox() with constant arguments will generate a fixed-point compile-time constant. No run-time floating-point operations will be performed.

## Math Functions

Use the base-2 logarithm and exponential functions instead of the natural or base-10 versions. It is also more efficient to compute 2ylog2(x) instead of using fx_powx(). This is because of extra overhead imposed by the normalization and the checks that must be done in order to minimize the loss of precision in the intermediate results. The user who knows the data range and fractional format can cut corners here.

## Benchmarks

There is a benchmark application in the prof directory. It runs performance tests for all API functions and some floating-point equivalents. The result is of course platform-dependent, but for CPUs without floating-point support in hardware, the fixed-point arithmetic operations are around 10 times faster than the corresponding software-emulated floating-point operations. For many of the math functions, the speed-up is around 100 times.

# Implementation

All operations are implemented as functions. This makes it possible to take the address of the function for all operations. However, some functions may also be implemented as macros, and they may evaluate the frac argument twice. For this reason it is an error to call a function with a frac argument that has side effects, e.g. frac++. The conversion operations fx_ftox() and fx_dtox() may also evaluate floating-point argument twice.

# Portability

The library is written in ANSI-C, but it requires the C99 stdint.h header. The correctness tests require also the C99 stdbool.h header and the Check library . The benchmark application, which is not built by default, requires the POSIX sigaction() and setitimer() facilities.

# Contents

The library contain the following functionality:

# Usage

## Including and Linking

To use the fixmath library, include the header <fixmath/fixmath.h>, and link the application with -lfixmath, or use pkg-config to determine the compiler and linker flags.

## Basic Example

Assume that we want to compute the 2-norm of a data vector, i.e. All data points are in the range [0-1]. The floating-point implementation is shown below.

* float norm2(float *buf, int len)
* {
* float sum = 0.0f;
* int k;
* for (k = 0; k < len; k++) {
* float val = buf[k];
* sum += val*val;
* }
* return sqrtf(sum);
* }
*

When converting the operation to fixed-point, we need to choose a fixed-point format, i.e. the number of integer and fraction bits. It is of course dependent on the range of the data set. In this example, we use 16 bits for the fraction part, i.e. Q.16.

* fixed_t norm2(fixed_t *buf, int len)
* {
* fixed_t sum = 0;
* int k;
* for (k = 0; k < len; k++) {
* fixed_t val = buf[k];
* sum += fx_mulx(val, val, 16);
* }
* return fx_sqrtx(sum, 16);
* }
*

## Advanced Use

In the previous example, the data array was given in fixed-point format. Now we want to compute the same result on the original data that is stored as 8-bit unsigned integers. Since the range of the input data now is in the range [0-255], we want to normalize them by dividing with 255.

The implementation below computes the sum of squares as integers. The normalization is performed at by converting the numerator and the denominator to the fixed-point format before computing the normalizing division. Unfortunately, the converted operands are too large for the Q.16 fixed-point format, and will overflow.

* fixed_t norm2(uint8_t *buf, int len)
* {
* fixed_t num;
* fixed_t den;
* int sum = 0;
* int k;
* for (k = 0; k < len; k++) {
* int val = buf[k];
* sum += val*val;
* }
* den = fx_itox(255*255, 16);
* num = fx_itox(sum, 16);
* num = fx_divx(num, den, 16);
* return fx_sqrtx(num, 16);
* }
*

We can solve the problem by performing the fixed-point division and conversion simultaneously. The number of fraction bits of the division result is fx - fy + frac, where f1 and f2 are the number of fraction bits for the first and second operands, respectively. In our case, both f1 and f2 are zero. If we use 16 bits as the frac argument, we will get 16 fraction bits for the result. The complete implementation is shown below.

* fixed_t norm2(uint8_t *buf, int len)
* {
* fixed_t tmp;
* int sum = 0;
* int k;
* for (k = 0; k < len; k++) {
* int val = buf[k];
* sum += val*val;
* }
* tmp = fx_divx(sum, 255*255, 16);
* return fx_sqrtx(tmp, 16);
* }
*

## Reciprocal Value Division

We want to normalize the 8-bit data to the range [0-1] and store it in a Q.16 fixed-point format. To do this, we normalize and convert the data in one step, as shown in the next code example.

* void byte2fix(fixed_t *fix, const uint8_t *byte, int len)
* {
* int k;
* for (k = 0; k < len; k++) {
* fix[k] = fx_divx(byte[k], 255, 16);
* }
* }
*

When benchmarking the above implementation, it turns out to be quite slow. This caused by the per-data point division in the inner loop. In a floating-point implementation one would multiply each data point by the reciprocal value 1/255 instead of dividing by 255. When using fixed-point arithmetics, the reciprocal value may very well be outside the representable range. For this reason there is a special data type for storing reciprocal values, fx_rdiv_t. The functions fx_invx() and fx_isqrtx() both accept an optional fx_rdiv_t argument that can be used for performing multiple division by the same value. In the code below we compute the reciprocal value outside the loop, and then use the fx_rdivx() operation, which is really a multiplication, instead of the fx_div() division operation.

* void byte2fix(fixed_t *fix, const uint8_t *byte, int len)
* {
* fx_rdiv_t rdiv;
* int k;
*
* fx_invx(255, 16, &rdiv);
* for (k = 0; k < len; k++) {
* fix[k] = fx_rdivx(byte[k], &rdiv);
* }
* }
*

# References

 Check, http://check.sourceforge.net