Avoiding subnormal floating-point calculations

Pattern addressed: Calculating with subnormal numbers

Many HPC codes use floating-point (FP) numbers in their simulations. When FP numbers get very close to zero (e.g., on the order of 1E-38 for single precision), they can no longer be represented with full precision in binary form and become subnormal numbers. Subnormals are often handled specially in hardware, so FP arithmetic with subnormals can require significantly more clock cycles. As a consequence, calculating with subnormal numbers leads to a lower IPC than calculating with normal numbers.

Required condition: The code calculates with subnormal floating-point numbers, as detected via hardware performance counters or by checking for floating-point exceptions.
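How counters and FP exceptions are queried depends on the platform. As a platform-independent illustration of what is being detected, the following minimal Python sketch classifies double-precision values as subnormal by comparing them against the smallest normal value. The helper names is_subnormal and count_subnormals and the sample values are hypothetical and only serve as illustration; this is not a replacement for the counter-based detection.

```python
import sys

# Smallest positive normal double (about 2.2e-308). Any nonzero value with
# a smaller magnitude is subnormal.
MIN_NORMAL = sys.float_info.min


def is_subnormal(x: float) -> bool:
    """Return True if x is a nonzero double below the normal range."""
    return x != 0.0 and abs(x) < MIN_NORMAL


def count_subnormals(values) -> int:
    """Count subnormal entries, e.g. in a dumped simulation field."""
    return sum(1 for v in values if is_subnormal(v))


# Illustrative data: two of the five values are subnormal.
samples = [1.0, 0.0, 1e-310, -3e-320, 2.3e-308]
print(count_subnormals(samples))
```

Such a check could be applied to intermediate fields written out by the application to locate the phase in which subnormals first appear.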

Subnormal floating-point (FP) calculations should be avoided, as they can slow down floating-point operations significantly. This best practice discusses three ways to avoid subnormals in an application: using the flush-to-zero (FTZ) mode of the CPU, selecting an appropriate FP data type, and applying a numerical solution such as softening.

Flush-to-zero

To avoid delays in FP calculations due to subnormals, the FP units of CPUs can be configured to use “flush-to-zero” (FTZ) and “denormals are zero” (DAZ). FTZ ensures that any FP operation with a subnormal result returns 0; DAZ ensures that any subnormal input to an FP operation is treated as 0. This behavior can be controlled during program execution via registers (MXCSR [1] on x86 and FPCR [2] on ARM64). To avoid adding CPU instructions to the program manually, the behavior can also be controlled with the compiler flags -mdaz-ftz (GCC [3] / Clang [4]) or -ftz (Intel [5]). For GCC/Clang, FTZ/DAZ has to be enabled explicitly, independent of the chosen optimization level. For the Intel compilers, FTZ/DAZ is turned on by default except when compiling with -O0 [5]. For other compilers, the manual should describe how to enable FTZ/DAZ.

While FTZ/DAZ is an effective measure to avoid additional cycles in floating-point operations, it might also have effects on the computation results. This might range from slight inaccuracies up to wrong results. For example, a subnormal FP value used as a divisor should not be treated as zero. Although it is possible to enable and disable FTZ/DAZ at any time during the program execution, it requires a lot of additional programming effort. In that case, it might be better to check for an alternative data type or a numerical solution.
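The difference between the two modes, and the divisor problem mentioned above, can be illustrated with a small software emulation. The following Python sketch does not switch any hardware mode; the functions flush, ieee_divide and divide are hypothetical helpers that only mimic the described semantics for double precision, and real hardware may differ in details.

```python
import math
import sys

MIN_NORMAL = sys.float_info.min  # smallest positive normal double


def flush(x: float) -> float:
    """Replace a subnormal value by a zero of the same sign."""
    if x != 0.0 and abs(x) < MIN_NORMAL:
        return math.copysign(0.0, x)
    return x


def ieee_divide(a: float, b: float) -> float:
    """Divide like IEEE 754 hardware: x/0 gives inf or NaN instead of raising."""
    if b == 0.0:
        if a == 0.0 or math.isnan(a):
            return math.nan
        sign = math.copysign(1.0, a) * math.copysign(1.0, b)
        return math.copysign(math.inf, sign)
    return a / b


def divide(a: float, b: float, daz: bool = False, ftz: bool = False) -> float:
    """Emulated division: DAZ flushes the inputs, FTZ flushes the result."""
    if daz:
        a, b = flush(a), flush(b)
    q = ieee_divide(a, b)
    return flush(q) if ftz else q


# Subnormal divisor: a finite quotient without DAZ, infinity with DAZ.
print(divide(1e-300, 1e-310))
print(divide(1e-300, 1e-310, daz=True))

# Subnormal result: kept without FTZ, replaced by zero with FTZ.
print(divide(1e-300, 1e10))
print(divide(1e-300, 1e10, ftz=True))
```

In the first pair of calls, the quotient is an ordinary finite number as long as the subnormal divisor is kept, but becomes infinite once DAZ replaces the divisor with zero, which is the kind of wrong result the text warns about.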

Using an appropriate data type

If the floating-point data type is only single or half precision and FTZ/DAZ would lead to inaccurate results, it is advisable to choose a data type with higher precision to avoid subnormals. If the data type is already double precision, moving to an even higher-precision data type such as float128 [6] is typically not helpful: such types are not implemented in hardware, so operations on them are significantly slower.
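The effect of the data type can be checked directly. The following Python sketch, using only the standard library, rounds a value to single precision and compares it against the normal range of both formats. The helper names to_float32 and is_subnormal and the value 1e-40 are chosen for illustration only.

```python
import struct

FLT_MIN_NORMAL = 2.0 ** -126    # smallest positive normal single-precision value
DBL_MIN_NORMAL = 2.0 ** -1022   # smallest positive normal double-precision value


def to_float32(x: float) -> float:
    """Round x to the nearest single-precision value (returned as a Python float)."""
    return struct.unpack("f", struct.pack("f", x))[0]


def is_subnormal(x: float, min_normal: float) -> bool:
    """Return True if x is nonzero and below the given format's normal range."""
    return x != 0.0 and abs(x) < min_normal


x = 1e-40
x32 = to_float32(x)

print(is_subnormal(x32, FLT_MIN_NORMAL))  # True: subnormal in single precision
print(is_subnormal(x, DBL_MIN_NORMAL))    # False: normal in double precision
print(abs(x32 - x) / x)                   # relative rounding error in single precision
```

The last line also shows the loss of precision: the relative error of the single-precision subnormal is far larger than the rounding error of a normal single-precision value.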

Numerical solution

When subnormals occur in an application because values get close to zero, they can also be avoided by always adding a small constant to the calculated result. In N-body simulations, for example, this method is called softening [7]. It prevents particles that are very close to each other, i.e., at a subnormal distance, from being assigned an unrealistically high force, with the additional side effect of avoiding subnormal numbers. The effectiveness of this method depends on the concrete application, as it might also alter the final results of the computation.
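As a minimal sketch of the idea, assume a pair force proportional to 1 / (r^2 + eps^2), where eps is a softening length. The function name softened_inverse_square and the values of r and eps below are illustrative assumptions, not taken from a particular N-body code.

```python
import sys

MIN_NORMAL = sys.float_info.min  # smallest positive normal double


def is_subnormal(x: float) -> bool:
    """Return True if x is a nonzero double below the normal range."""
    return x != 0.0 and abs(x) < MIN_NORMAL


def softened_inverse_square(r: float, eps: float) -> float:
    """Return 1 / (r^2 + eps^2), the softened inverse-square factor of a pair force."""
    return 1.0 / (r * r + eps * eps)


r = 1e-160   # two particles almost on top of each other (illustrative)
eps = 1e-3   # softening length (illustrative)

print(is_subnormal(r * r))               # True: the squared distance is subnormal
print(is_subnormal(r * r + eps * eps))   # False: the softened denominator is normal
print(softened_inverse_square(r, eps))   # bounded by roughly 1 / eps^2
```

Note that in this sketch the intermediate product r * r is still subnormal; softening keeps the denominator and everything computed from it in the normal range and bounds the force by about 1 / eps^2.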

Effects on applications

In the application DENISE Black Edition [8], which runs into subnormals for certain inputs, using FTZ reduces the execution time significantly (speedup of more than a factor of 2) because the serialization of the execution is resolved, as traces with 12 MPI processes confirm. The trace without FTZ shows a large amount of MPI waiting time (red) in a halo exchange and only short periods of useful computation (green).

After enabling FTZ via the compiler flag, the serialization is resolved and the MPI waiting time (red) in the halo exchange is much smaller, since the imbalances caused by temporary subnormal calculations are eliminated.

The POP metrics for the relevant method confirm that the serialization is resolved by enabling FTZ:

DENISE Black sh Method, 12 MPI processes	FTZ disabled	FTZ enabled
Execution Time (s)	154.52	65.46
Parallel Efficiency	0.42	0.86
- Load Balance Efficiency	0.94	0.97
- Communication Efficiency	0.45	0.89
-- Serialization Efficiency	0.46	0.98
-- Transfer Efficiency	0.97	0.91
IPC	3.19	4.89
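The numbers in the table can be cross-checked against the multiplicative POP efficiency model, in which Communication Efficiency is the product of Serialization and Transfer Efficiency, and Parallel Efficiency is the product of Load Balance and Communication Efficiency. The following Python sketch recomputes the derived metrics from the table values; since the table entries are rounded to two digits, agreement is only expected to within about 0.01.

```python
# Values copied from the table above: (FTZ disabled, FTZ enabled).
time_s = (154.52, 65.46)
load_balance = (0.94, 0.97)
serialization = (0.46, 0.98)
transfer = (0.97, 0.91)

# Speedup of the method when FTZ is enabled.
speedup = time_s[0] / time_s[1]

# Communication Efficiency = Serialization Efficiency * Transfer Efficiency
communication = tuple(s * t for s, t in zip(serialization, transfer))

# Parallel Efficiency = Load Balance Efficiency * Communication Efficiency
parallel = tuple(lb * c for lb, c in zip(load_balance, communication))

print(f"speedup: {speedup:.2f}")
print("communication efficiency:", communication)
print("parallel efficiency:", parallel)
```

The recomputed values show that the gain in Parallel Efficiency comes almost entirely from Serialization Efficiency, while Load Balance changes little and Transfer Efficiency even drops slightly.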

References

[1] Intel 64 and IA-32 Architectures Software Developer’s Manual, Vol. 1, 10-4, https://cdrdv2.intel.com/v1/dl/getContent/671200
[2] https://developer.arm.com/documentation/ddi0601/2025-09/AArch64-Registers/FPCR--Floating-point-Control-Register
[3] https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html#index-mdaz-ftz
[4] https://clang.llvm.org/docs/ClangCommandLineReference.html#cmdoption-clang-mdaz-ftz
[5] https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2025-0/ftz-qftz.html
[6] https://www.boost.org/doc/libs/latest/libs/multiprecision/doc/html/boost_multiprecision/tut/floats/float128.html
[7] Optimal softening for force calculations in collisionless N-body simulations, https://doi.org/10.1046/j.1365-8711.2000.03316.x
[8] https://github.com/daniel-koehn/DENISE-Black-Edition

Recommended in program(s): DENISE Black Edition subnormals kernel