This kernel is extracted from the hotspot of DENISE Black Edition, a 2D time-domain isotropic (visco)elastic finite-difference (FD) modeling and full-waveform inversion (FWI) code. It comes from the part of the code that calculates the propagation of SH waves. It showcases the significant slowdown caused by subnormal floating-point calculations when they are not handled properly.
The kernel is written in C and based on the functions update_s_elastic_PML_SH and update_v_PML_SH, which together account for around 90 percent of the execution time. The functions are simplified to include only the parts relevant to the subnormal floating-point operations; in particular, the halo exchange via MPI communication is removed and the input parameters are simplified. The subnormal floating-point values occur in the calculation of vzx / vzy in update_s_elastic_PML_SH and sxz_x / syz_y in update_v_PML_SH.
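To make the term concrete, the following is a minimal sketch of how subnormal values in a field could be detected and counted. It assumes single-precision IEEE 754 (binary32) data; the function names is_subnormal_f32 and count_subnormals are hypothetical and are not taken from the kernel. The check inspects the bit pattern directly (exponent field zero, mantissa non-zero), so its result does not depend on the floating-point mode in effect.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 if x is a subnormal binary32 value.
 * A subnormal has an all-zero exponent field and a non-zero mantissa. */
int is_subnormal_f32(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);          /* well-defined type punning */
    uint32_t exponent = (bits >> 23) & 0xFFu;
    uint32_t mantissa = bits & 0x7FFFFFu;
    return exponent == 0u && mantissa != 0u;
}

/* Hypothetical helper: counts the subnormal entries in values[0..n-1]. */
size_t count_subnormals(const float *values, size_t n) {
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        if (is_subnormal_f32(values[i])) {
            ++count;
        }
    }
    return count;
}
```

Calling count_subnormals on a field after each time step gives a per-iteration count of subnormal values.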
A single version of the kernel is provided. It can be compiled with or without flush-to-zero (FTZ) enabled to highlight the performance difference that subnormal floating-point handling makes. Details on how to compile and run the kernel are provided in the corresponding version description.
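As an illustration of what FTZ changes, the sketch below toggles the mode at run time instead of at compile time. It assumes an x86 CPU with SSE arithmetic, where FTZ and the related denormals-are-zero (DAZ) mode are control bits in the MXCSR register. The helper names are hypothetical and are not part of the kernel, which enables FTZ through the -mdaz-ftz compiler option instead.

```c
#include <float.h>
#include <pmmintrin.h>  /* _MM_SET_DENORMALS_ZERO_MODE */
#include <xmmintrin.h>  /* _MM_SET_FLUSH_ZERO_MODE */

/* Hypothetical helper: subnormal results are flushed to zero (FTZ) and
 * subnormal inputs are treated as zero (DAZ) for the calling thread. */
void enable_ftz_daz(void) {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}

/* Hypothetical helper: restores gradual underflow (IEEE 754 default). */
void disable_ftz_daz(void) {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_OFF);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_OFF);
}

/* Computes FLT_MIN * 0.5f at run time. The exact result is subnormal,
 * so it is non-zero by default and zero when FTZ is active.
 * volatile keeps the compiler from folding the product at compile time. */
float half_of_flt_min(void) {
    volatile float tiny = FLT_MIN;
    volatile float half = 0.5f;
    return tiny * half;
}
```

With the mode disabled, half_of_flt_min() returns a small positive subnormal; after enable_ftz_daz() it returns 0.0f. The MXCSR setting is per thread, so a multi-threaded code would have to set it in every thread.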
- A C compiler that supports -mdaz-ftz, either GCC or Clang
- perf for measuring the hardware performance counters

To compile the kernel, use the provided Makefile, which uses GCC by default. The Makefile includes the following compile targets:

- kernel: Compiles the kernel with default settings, FTZ disabled.
- kernel_ftz: Compiles the kernel with flush-to-zero (FTZ) mode, enabled by compiling with -mdaz-ftz.
- kernel_nooutput: As kernel, but without verbose output during execution.
- kernel_ftz_nooutput: As kernel_ftz, but without verbose output during execution.
- all: Compiles all versions of the kernel.

The following run targets are available in the Makefile:

- run: Runs the kernel compiled with FTZ disabled and outputs, for each iteration, the execution time and the number of subnormals occurring in update_s_elastic_PML_SH (for reference).
- run_ftz: Runs the kernel with FTZ enabled.
- perf_ipc: Runs the kernel without FTZ and measures instructions per cycle (IPC) using perf.
- perf_ipc_ftz: Runs the kernel with FTZ enabled and measures IPC using perf.
- perf_assists (only available for Intel CPUs): Runs the kernel without FTZ and measures floating-point assists using perf.
- perf_assists_ftz (only available for Intel CPUs): Runs the kernel with FTZ enabled and measures floating-point assists using perf.