The kernel was executed with and without flush-to-zero (FTZ) mode enabled to quantify the performance impact of subnormal floating-point handling. To compare the slowdown across hardware designs, the evaluation was run on three CPUs with different microarchitectures: Intel Xeon 8468, AMD EPYC 9655, and NVIDIA GH200.
The commands `make perf_ipc` and `make perf_ipc_ftz` were used to run the kernel compiled without and with FTZ, respectively. Each reports the execution time of the kernel and the achieved instructions per cycle (IPC). For the Intel CPU, the number of floating-point assists was measured using the `perf_assists` and `perf_assists_ftz` make targets. Each configuration was run ten times and the results were averaged. Execution was pinned to a single core using `taskset`.
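The repeat-and-average procedure can be sketched as a small driver script. This is a hypothetical harness, not the one used for the measurements: the function name `run_pinned` is an assumption, it relies on the Linux-only `os.sched_setaffinity` in place of `taskset`, and it measures the wall time of the whole command rather than the kernel time reported by the make targets.

```python
import os
import statistics
import subprocess
import time


def run_pinned(cmd, core, runs=10):
    """Run `cmd` `runs` times pinned to one core; return the mean wall time in seconds.

    Hypothetical helper: the timing includes process start-up and any
    overhead of `make`, so it is only a rough stand-in for the kernel time.
    """
    def pin():
        # Executed in the child process just before exec (Linux only);
        # roughly what `taskset -c <core>` does.
        os.sched_setaffinity(0, {core})

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, preexec_fn=pin,
                       stdout=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)
```

With the make targets named above, a call such as `run_pinned(["make", "perf_ipc"], core=0)` would average ten pinned runs of the non-FTZ build, assuming core 0 is available to the process.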
In total, the kernel executes approximately 19 billion FLOPs out of 26 billion instructions per run. The Intel CPU reported around 455 million floating-point assists when running without FTZ and zero assists when running with FTZ, confirming that subnormals indeed occur in the kernel. The following table summarizes the execution time and IPC for each CPU (average over 10 runs):
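To illustrate what a subnormal is and what FTZ does to it, the following minimal sketch emulates the flush in software. It uses IEEE 754 doubles because that is what Python floats are; the precision used by the kernel is not stated here, and the helper names `is_subnormal` and `flush_to_zero` are assumptions for illustration. Real FTZ is a hardware mode set in a floating-point control register, not a function call.

```python
import math
import sys

# Smallest positive normal double (2**-1022); anything nonzero below it is subnormal.
SMALLEST_NORMAL = sys.float_info.min


def is_subnormal(x):
    """True if x is nonzero and smaller in magnitude than the smallest normal double."""
    return x != 0.0 and abs(x) < SMALLEST_NORMAL


def flush_to_zero(x):
    """Software emulation of FTZ applied to a result: subnormals become signed zero."""
    return math.copysign(0.0, x) if is_subnormal(x) else x


tiny = SMALLEST_NORMAL / 4          # gradual underflow: small but still nonzero
print(is_subnormal(tiny))           # True
print(flush_to_zero(tiny))          # 0.0
print(flush_to_zero(1.0))           # 1.0
```

Without FTZ, values such as `tiny` keep some precision at the cost of slower handling on some CPUs; with FTZ they are replaced by zero.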
| CPU | no FTZ - Time (s) | FTZ - Time (s) | Speedup of FTZ | no FTZ - IPC | FTZ - IPC |
|---|---|---|---|---|---|
| Intel Xeon 8468 | 16.91 | 1.64 | 10.3x | 0.41 | 4.23 |
| AMD EPYC 9655 | 1.88 | 1.56 | 1.2x | 3.04 | 3.65 |
| NVIDIA GH200 | 1.77 | 1.77 | 1.0x | 5.11 | 5.11 |
The Intel CPU suffers the largest slowdown from subnormals and therefore benefits most from enabling FTZ, with a speedup of about 10x. Subnormal handling on the AMD CPU is far less costly, so FTZ yields only a 1.2x speedup. On the NVIDIA GH200, subnormals do not affect the execution time at all, indicating that its Grace CPU handles them most efficiently of the three.
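The speedup column, and a rough per-assist cost on the Intel CPU, follow from the reported numbers by simple arithmetic. The sketch below uses only values from the table and the text. The per-assist figures are back-of-the-envelope estimates that assume the entire slowdown is caused by assists and that both builds execute about 26 billion instructions, neither of which was measured directly.

```python
# Execution times from the table: CPU -> (seconds without FTZ, seconds with FTZ).
times = {
    "Intel Xeon 8468": (16.91, 1.64),
    "AMD EPYC 9655": (1.88, 1.56),
    "NVIDIA GH200": (1.77, 1.77),
}

speedup = {cpu: no_ftz / ftz for cpu, (no_ftz, ftz) in times.items()}
for cpu, s in speedup.items():
    print(f"{cpu}: {s:.1f}x")        # 10.3x, 1.2x, 1.0x

# Intel only: approximate counts quoted in the text.
assists = 455e6          # floating-point assists without FTZ
flops = 19e9             # floating-point operations per run
instructions = 26e9      # instructions per run (assumed equal for both builds)

# Extra time attributed to each assist (assumes assists explain the whole gap).
ns_per_assist = (16.91 - 1.64) / assists * 1e9

# Extra cycles per assist, using cycles = instructions / IPC for each build.
extra_cycles = instructions / 0.41 - instructions / 4.23
cycles_per_assist = extra_cycles / assists

print(f"{ns_per_assist:.0f} ns per assist")          # 34 ns per assist
print(f"{cycles_per_assist:.0f} cycles per assist")  # 126 cycles per assist
print(f"{assists / flops:.1%} assists per FLOP")     # 2.4% assists per FLOP
```

Under these assumptions, only a few percent of the floating-point operations trigger an assist, yet each one costs on the order of a hundred cycles, which is consistent with the tenfold slowdown.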
In summary, the results show that the cost of subnormals depends strongly on the CPU microarchitecture. Enabling FTZ is therefore recommended for performance portability whenever the application does not need the additional precision that subnormals provide.