Fixmath
1.4

Fixmath is a library of fixed-point math operations and functions. It is designed to be portable, flexible and fast on platforms without floating-point support.
All fixed-point numbers are represented as 32-bit signed integers. Since the numbers are signed, we think of the most significant bit as the sign bit, even though it is not a true sign bit in the two's complement representation. For a fixed-point number, we also have a binary point, separating the integer bits from the fraction bits. If we have n integer bits and m fraction bits, the format is

    s nn...n . mm...m

where s is the sign bit, and we have the constraint n + m = 31. In this library the fixed-point formats are often written as Qn.m or Q.m. The binary point is implicit in the sense that it is not stored in the number itself. It is up to the user to keep track of the binary points in different computations.
For clarity, there is a fixed-point type fixed_t that is always a 32-bit signed integer. Other integers may also be used, but they will be converted to 32 bits before the operation. There is also a type for representing reciprocal values, fx_rdiv_t; see the section on reciprocal value division.
All functions are prefixed with fx_. Otherwise, the names follow the naming convention of the C99 math library as much as possible. This means that functions are named as their floating-point equivalents, with a suffix x for fixed-point.
A fixed-point number is generally specified as a representation xval, and the number of fraction bits (binary point), frac.
The error bound is measured in ulp (unit in the last place), which is the smallest representable value. The error for the conversions and the arithmetic operations is within 0.5 ulp. For the remaining math functions the error is higher. The error bounds for these functions are determined empirically using random tests.
Forcing all math functions to be correct to 0.5 ulp would result in a severe performance hit. In normal use cases it is generally acceptable to lose one or two bits of precision if this makes the operation several times faster. If a higher precision is needed, choose a fixed-point format with more fraction bits.
There are no special numbers such as NaN and Inf as in the IEEE 754 floating-point standard. For any fixed-point format, all possible bit patterns in the 32-bit representation are valid numbers.
There is currently no support for saturating arithmetic. If the computed result is outside the representable range, it will wrap around.
Some fixed-point operations are independent of the number of fraction bits. This means that they are the same as in the special case of zero fraction bits, i.e. integers. Since the fixed-point numbers are represented as integers, the normal integer operations in C may be used in these cases. This applies to the following:

- The number 0 is always 0 in all fixed-point formats.
- The comparison operators ==, !=, <, <=, > and >= can be used if both operands have the same binary point.
- The addition and subtraction operators +, +=, - and -= can be used if both operands have the same binary point.
- The unary + and - (negation) operators always work.

No checks or error handling is performed by the conversions and the arithmetic operations. In most operations an error condition will just produce an erroneous result, but fx_divx() and fx_rdivx() may trigger divide-by-zero and NULL pointer dereference errors, respectively.
The math functions check their arguments and set the errno variable on errors, similar to what the corresponding functions in the C math library do.
For completeness, fx_addx() and fx_subx() are also provided for addition and subtraction. They are equal to the integer operations + and -, respectively.
For optimal performance, the frac arguments should be constants in all conversions and arithmetic operations.
Using fx_ftox() and fx_dtox() with constant arguments will generate a fixed-point compile-time constant. No run-time floating-point operations will be performed.
Use the base-2 logarithm and exponential functions instead of the natural or base-10 versions. It is also more efficient to compute 2^(y log2(x)) instead of using fx_powx(). This is because of extra overhead imposed by the normalization and the checks that must be done in order to minimize the loss of precision in the intermediate results. The user who knows the data range and fractional format can cut corners here.
There is a benchmark application in the prof directory. It runs performance tests for all API functions and some floating-point equivalents. The result is of course platform-dependent, but for CPUs without floating-point support in hardware, the fixed-point arithmetic operations are around 10 times faster than the corresponding software-emulated floating-point operations. For many of the math functions, the speedup is around 100 times.
All operations are implemented as functions. This makes it possible to take the address of the function for all operations. However, some functions may also be implemented as macros, and these may evaluate the frac argument twice. For this reason it is an error to call a function with a frac argument that has side effects, e.g. frac++. The conversion operations fx_ftox() and fx_dtox() may also evaluate the floating-point argument twice.
The library is written in ANSI C, but it requires the C99 stdint.h header. The correctness tests also require the C99 stdbool.h header and the Check library [1]. The benchmark application, which is not built by default, requires the POSIX sigaction() and setitimer() facilities.
The library contains the following functionality:
To use the Fixmath library, include the header <fixmath/fixmath.h>, and link the application with -lfixmath, or use pkg-config to determine the compiler and linker flags.
Assume that we want to compute the 2-norm of a data vector, i.e. sqrt(x_1^2 + x_2^2 + ... + x_N^2).
All data points are in the range [0, 1]. The floating-point implementation is shown below.
When converting the operation to fixed-point, we need to choose a fixed-point format, i.e. the number of integer and fraction bits. It is of course dependent on the range of the data set. In this example, we use 16 bits for the fraction part, i.e. Q.16.
In the previous example, the data array was given in fixed-point format. Now we want to compute the same result on the original data that is stored as 8-bit unsigned integers. Since the input data is now in the range [0, 255], we want to normalize it by dividing by 255.
The implementation below computes the sum of squares as integers. The normalization is performed by converting the numerator and the denominator to the fixed-point format before computing the normalizing division. Unfortunately, the converted operands are too large for the Q.16 fixed-point format and will overflow.
We can solve the problem by performing the fixed-point division and conversion simultaneously. The number of fraction bits of the division result is f_1 - f_2 + frac, where f_1 and f_2 are the number of fraction bits of the first and second operands, respectively. In our case, both f_1 and f_2 are zero. If we use 16 as the frac argument, we will get 16 fraction bits in the result. The complete implementation is shown below.
We want to normalize the 8-bit data to the range [0, 1] and store it in a Q.16 fixed-point format. To do this, we normalize and convert the data in one step, as shown in the next code example.
When benchmarking the above implementation, it turns out to be quite slow. This is caused by the per-data-point division in the inner loop. In a floating-point implementation one would multiply each data point by the reciprocal value 1/255 instead of dividing by 255. When using fixed-point arithmetic, the reciprocal value may very well be outside the representable range. For this reason there is a special data type for storing reciprocal values, fx_rdiv_t. The functions fx_invx() and fx_isqrtx() both accept an optional fx_rdiv_t argument that can be used for performing multiple divisions by the same value. In the code below we compute the reciprocal value outside the loop, and then use the fx_rdivx() operation, which is really a multiplication, instead of the fx_divx() division operation.
[1] Check, http://check.sourceforge.net