GPU SAXPY

Program's name: GPU SAXPY
Programming language(s): C
Programming model(s): MPI · OpenMP

This repository contains a small application that reproduces performance issues caused by GPU affinity. It repeatedly performs a single-precision "a times x plus y" (SAXPY) operation: \(Y \leftarrow a X + Y\)

It uses OpenMP target offloading to perform this operation on the GPU. It also uses MPI to divide large vectors into smaller parts, which are computed in parallel. Each MPI rank uses one device to compute its partial result.
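The following is a minimal sketch of the two ideas above, not the repository's actual code. The function names `chunk_bounds` and `saxpy` and the even-split strategy are assumptions made for illustration; in the real program the rank and rank count would come from MPI.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: compute the start index and length of the part
 * owned by `rank` when `n` elements are split across `nranks` ranks.
 * The first `n % nranks` ranks receive one extra element. */
void chunk_bounds(size_t n, int nranks, int rank,
                  size_t *start, size_t *count)
{
    assert(nranks > 0 && rank >= 0 && rank < nranks);
    size_t base = n / (size_t)nranks;
    size_t rest = n % (size_t)nranks;
    size_t r = (size_t)rank;
    *count = base + (r < rest ? 1 : 0);
    *start = r * base + (r < rest ? r : rest);
}

/* y <- a * x + y, offloaded to the default device.
 * No device clause is given, so the OpenMP runtime picks the device.
 * If the compiler has no OpenMP support, the pragma is ignored and
 * the loop simply runs on the host. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Each rank would call `chunk_bounds` with its own rank number and then run `saxpy` on its part of the vectors only.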

This kernel has two versions:

  • The version on the main branch uses no CPU binding and does not specify the device to which the SAXPY kernel is offloaded. It therefore has no control over which CPU offloads to which device.
  • The version on the cpu-binding branch uses a device clause to specify the device to which the kernel is offloaded. In addition, the --cpu-bind option of the srun command specifies which task runs on which NUMA domain.
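To illustrate the difference between the two versions, here is a hedged sketch of a kernel with an explicit device clause. The names `device_for_rank` and `saxpy_on_device` are hypothetical, and the modulo mapping is an assumption; the cpu-binding branch may select the device differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mapping from MPI rank to device number.
 * With as many ranks as devices, rank r maps to device r. */
int device_for_rank(int rank, int num_devices)
{
    assert(rank >= 0 && num_devices > 0);
    return rank % num_devices;
}

/* y <- a * x + y, offloaded to the device given by `device`.
 * The device clause pins the kernel to one specific device instead
 * of leaving the choice to the OpenMP runtime. */
void saxpy_on_device(int device, size_t n, float a,
                     const float *x, float *y)
{
    #pragma omp target teams distribute parallel for device(device) \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Pinning the device only fixes half of the affinity problem. The CPU side still has to be bound, for example with the --cpu-bind option, so that each rank runs on the NUMA domain closest to its device.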

Build instructions

The clang compiler with OpenMP target offloading support is required, along with an MPI implementation.

Before building, edit line 2 of src/Makefile so that the given GPU architecture matches the hardware on your target system. The default value is sm_90, which matches NVIDIA H100 GPUs.

To build this program, navigate to the src folder and run make:

$ cd src
$ make

This will generate a binary called kernel.exe.

Executing the program

You can execute this kernel by running:

$ mpirun -np <number of ranks> ./kernel.exe <size of vectors x and y>

The number of ranks should match the number of GPUs in the system. Rank x will offload to device x.