Many HPC codes make use of floating-point (FP) numbers for their simulations. When FP numbers get very close to zero (e.g., on the order of 1E-38 for single precision), they might no longer be representable with full precision in binary form and become subnormal numbers. Subnormals are often handled via a special path in hardware, so FP arithmetic with subnormals requires significantly more clock cycles. As a consequence, calculating with subnormal numbers leads to a lower IPC than calculating with normal numbers.
The performance problem of subnormal numbers originates from how FP numbers are represented in binary form, as defined by the IEEE 754 standard [1]. IEEE 754 defines formats with different precision (number of bits), such as single-precision representations (32-bit) and double-precision representations (64-bit), but there are also smaller (half-precision) and larger (quadruple-precision) ones. The three components of an IEEE 754 single-precision FP number (32-bit) are the sign (1 bit), the exponent (8 bits), and the mantissa (23 bits). The larger the number of available bits for exponent and mantissa, the larger the range and the higher the precision of representable FP numbers. For single precision, the smallest normal magnitude according to IEEE 754 is \(1.18 \cdot 10^{-38}\) and the largest finite magnitude is \(3.4 \cdot 10^{38}\). Magnitudes larger than \(3.4 \cdot 10^{38}\) overflow to “infinity”. For magnitudes smaller than \(1.18 \cdot 10^{-38}\), the value can be (1) directly cut off to zero (named “flush to zero”) or (2) interpreted as a subnormal, where magnitudes down to \(1.40 \cdot 10^{-45}\) are still representable, but with lower precision. Subnormals can avoid numerical problems that a hard cutoff to zero may introduce.
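To make these limits concrete, the following sketch prints them using the constants from <float.h> and classifies a value that falls below the normal range (FLT_TRUE_MIN requires C11; the snippet is for illustration only):

#include <stdio.h>
#include <float.h>
#include <math.h>

int main(void) {
    printf("smallest normal   : %e\n", FLT_MIN);      // ~1.18e-38
    printf("smallest subnormal: %e\n", FLT_TRUE_MIN); // ~1.40e-45 (C11)
    printf("largest finite    : %e\n", FLT_MAX);      // ~3.40e+38

    float x = FLT_MIN / 2.0f; // below the normal range
    printf("FLT_MIN / 2 is %s\n",
           fpclassify(x) == FP_SUBNORMAL ? "subnormal" : "not subnormal");
    return 0;
}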
As subnormal numbers have mantissas with leading zeros (there is no implicit leading one), floating-point pipelines in hardware have to treat them specially. This may require additional clock cycles, ranging from a few cycles up to hundreds, depending on the CPU or GPU [2]. Therefore, any FP operation involving subnormals typically takes significantly longer than the same operation on normals (unless the hardware is explicitly instructed to always flush subnormals to zero).
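On x86 CPUs, such a flush can be requested at run time by setting the FTZ (“flush to zero”) and DAZ (“denormals are zero”) bits of the MXCSR register. A minimal sketch using the SSE intrinsics (assuming an x86 target; this trades accuracy near zero for avoiding the slow subnormal handling):

#include <xmmintrin.h> // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h> // _MM_SET_DENORMALS_ZERO_MODE

// Flush results that would be subnormal to zero (FTZ) and treat
// subnormal inputs as zero (DAZ) for the calling thread.
static void enable_ftz_daz(void) {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}

Note that some compilers already set these bits at program startup when fast-math style optimization flags are used.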
HW co-design: In Intel microarchitectures, FP operations with subnormals require around a hundred additional cycles [2, p. 167], while in AMD Zen they require only a few additional cycles [2, p. 234]. The Intel optimization manual [3] also mentions that calculations with subnormals (underflow condition) trigger an expensive assist via microcode.
A typical symptom of subnormal FP calculations is a low IPC combined with a relatively low number of FLOPS. Thus, if the IPC and the number of achieved FLOPS are low while the memory subsystem (caches, main memory) is not the bottleneck, this is an indicator of this pattern. For Intel CPUs, there is a hardware performance counter named “ASSISTS:FP” [3] that counts how often microcode support for FP operations was required, e.g., when an FP operation involving subnormals is encountered. Other than that, the application code itself might be modified to check for FP exceptions (underflow) [4] or, more fine-grained, to use library functions such as fpclassify(myfloat) in C [5] or ieee_is_normal(myfloat) in Fortran [6] to check whether a given value myfloat is subnormal.
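The following sketch illustrates both approaches in C; the helper names and the placement of the checks are only examples:

#include <fenv.h>  // fetestexcept, feclearexcept, FE_UNDERFLOW
#include <math.h>  // fpclassify, FP_SUBNORMAL
#include <stdio.h>

// Coarse check: did any operation since the last reset underflow?
void check_underflow(const char *where) {
    if (fetestexcept(FE_UNDERFLOW))
        fprintf(stderr, "FP underflow occurred in %s\n", where);
    feclearexcept(FE_UNDERFLOW);
}

// Fine-grained check: is this particular value subnormal?
int is_subnormal(float myfloat) {
    return fpclassify(myfloat) == FP_SUBNORMAL;
}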
The following code is a simple synthetic 1D stencil kernel that illustrates the problem:
float *u = (float *)malloc(N * sizeof(float));
float *u_new = (float *)malloc(N * sizeof(float));
// initialization of u and u_new omitted
// 1D stencil with a halo of one cell on each side (u[0] and u[N-1] are not updated)
for (int iter = 0; iter < ITER; iter++) {
    for (int i = 1; i < N - 1; i++) {
        // will lead to subnormal calculations after some iterations
        u_new[i] = (u[i - 1] + u[i] + u[i + 1]) * 0.1;
    }
    // swap arrays
    float *tmp = u; u = u_new; u_new = tmp;
}
The multiplication with 0.1 causes the values in the array to shrink quickly and, therefore, leads to calculations with subnormals after some iterations.
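One way to observe this onset is to count the subnormal entries after each sweep, for example with a small diagnostic helper like the following sketch (count_subnormals is not part of the original kernel):

#include <math.h>  // fpclassify, FP_SUBNORMAL
#include <stdio.h>

// Diagnostic helper: how many entries of the array are subnormal?
static long count_subnormals(const float *a, int n) {
    long cnt = 0;
    for (int i = 0; i < n; i++)
        if (fpclassify(a[i]) == FP_SUBNORMAL)
            cnt++;
    return cnt;
}

// Example use after the inner loop (before the swap) of the kernel above:
//   printf("iter %d: %ld subnormal values\n", iter, count_subnormals(u_new, N));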
In real applications, the calculation with subnormals might occur in many different situations and might not be as obvious as in the example above. For example, the input data could already consist of small values, and values may become subnormal only temporarily during the computation. A real-world application that may run into subnormals is DENISE Black Edition [7], a 2D full waveform inversion code relying on finite-difference modeling. The code simulates how waves propagate through a medium. It is parallelized with MPI and uses a geometric decomposition to distribute the work. Depending on the input model, the computed values become subnormal in different parts of the grid in some iterations. This also means that the same compute kernel takes different amounts of time, depending on the number of subnormals present in the subgrid. The following trace shows the effect on an execution with 12 MPI processes:
In every iteration, each rank calls the same update method and then exchanges halo values (ghost cells) with its neighbors using MPI_Sendrecv_replace. The trace shows that most ranks stall in MPI_Sendrecv_replace waiting for the exchange to complete. Only ranks 1 and 4 are always busy because the same update method takes several times longer on them. For those ranks, the IPC is significantly lower (1.17 vs. 4.75). Also, the hardware performance counter “ASSISTS:FP” shows a high value, indicating a high number of floating-point assists. On a higher level, the affected ranks alternate during the computation, as visible in the following trace:
The temporarily low IPC for some invocations of the update method leads to an alternating load imbalance, which manifests as a low serialization efficiency for this particular code.
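The communication pattern described above can be sketched as follows; this is a minimal illustration assuming a 1D decomposition with one halo cell per side (the function iterate and all details are placeholders, not the actual DENISE code):

#include <mpi.h>

// One iteration on each rank: local stencil update, then in-place halo
// exchange with the left and right neighbor via MPI_Sendrecv_replace.
// If the update is slowed down by subnormals on some ranks, the other
// ranks stall in the blocking exchange. Boundary ranks pass MPI_PROC_NULL.
void iterate(float *u, float *u_new, int n_local,
             int left, int right, MPI_Comm comm) {
    // local update (same kernel as above); slower on ranks whose
    // subgrid currently contains many subnormals
    for (int i = 1; i < n_local - 1; i++)
        u_new[i] = (u[i - 1] + u[i] + u[i + 1]) * 0.1;

    // stage own boundary values in the halo cells ...
    u_new[0] = u_new[1];
    u_new[n_local - 1] = u_new[n_local - 2];

    // ... and exchange them in place (tag 0 = leftward, tag 1 = rightward)
    MPI_Sendrecv_replace(&u_new[0], 1, MPI_FLOAT,
                         left, 0, left, 1, comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv_replace(&u_new[n_local - 1], 1, MPI_FLOAT,
                         right, 1, right, 0, comm, MPI_STATUS_IGNORE);
    // the caller swaps u and u_new afterwards, as in the serial kernel
}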