This repository contains a small application that reproduces performance issues caused by GPU affinity. It repeatedly performs a single-precision a times x plus y (SAXPY) operation: \(Y \leftarrow a X + Y\).
It uses OpenMP target offloading to perform this operation on the GPU. It also uses MPI to divide large vectors into smaller parts that are computed in parallel. Each MPI rank uses one device to compute its partial result.
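As an illustration of how a large vector might be divided among ranks, the following is a minimal sketch and not the repository's actual code. The type `chunk_range` and the function `chunk_for_rank` are hypothetical names introduced here; in the real application the rank and rank count would come from MPI, but the sketch takes them as plain arguments so it can be read on its own.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type (not from the repository): the half-open
 * index range [begin, end) of the vector owned by one rank. */
typedef struct {
    size_t begin;
    size_t end;
} chunk_range;

/* Split n elements as evenly as possible across nranks ranks.
 * The first (n % nranks) ranks each receive one extra element. */
static chunk_range chunk_for_rank(size_t n, int rank, int nranks)
{
    assert(nranks > 0 && rank >= 0 && rank < nranks);

    size_t base = n / (size_t)nranks;   /* elements every rank gets */
    size_t rem  = n % (size_t)nranks;   /* leftover elements        */
    size_t r    = (size_t)rank;

    chunk_range c;
    c.begin = r * base + (r < rem ? r : rem);
    c.end   = c.begin + base + (r < rem ? 1 : 0);
    return c;
}
```

Each rank would then run SAXPY only on the elements in its own range, so the partial results together cover the whole vector exactly once.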
This kernel has two versions:
The kernel uses the device clause to specify a device to offload to. Further, the --cpu-bind option of the srun command is used to specify which task runs on which NUMA domain.
The clang compiler with OpenMP target offloading support is required, along with an MPI implementation.
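To show what offloading SAXPY with a device clause can look like, here is a minimal sketch under stated assumptions; it is not the kernel from this repository. The function name `saxpy_on_device` and its parameters are hypothetical. If the file is compiled without OpenMP support, the pragma is ignored and the loop simply runs on the host.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: computes y[i] = a * x[i] + y[i] for i in [0, n)
 * on the OpenMP device with number dev. */
static void saxpy_on_device(int dev, float a, const float *x, float *y, size_t n)
{
    assert(x != NULL && y != NULL);

    /* device(dev) selects the offload target; x is copied to the device,
     * y is copied to the device and back to the host afterwards. */
    #pragma omp target teams distribute parallel for device(dev) \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

In an MPI program, `dev` would typically be derived from the rank, so that each rank offloads to its own GPU.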
Before building, edit line 2 of src/Makefile so that the given GPU architecture matches the hardware on your target system. The default value is sm_90, which matches Nvidia H100 GPUs.
To build this program, navigate to the src folder and run make:
$ cd src
$ make
This will generate a binary called kernel.exe.
You can execute this kernel by running:
$ mpirun -np <number of ranks> ./kernel.exe <size of vectors x and y>
The number of ranks should match the number of GPUs in the system. Rank i offloads to device i.