GPU affinity on one node with 8 GPUs (LUMI-G)

This experiment compares runtimes with the default CPU binding against runtimes with manual binding of the MPI tasks to NUMA domains.

Launch configuration

Both runs were launched with 8 MPI tasks, filling up one node with 8 GPUs. Each task offloaded to one GPU.

The run that binds the tasks to NUMA domains was launched as follows:

srun --cpu-bind=map_ldom:3,3,1,1,0,0,2,2 ./kernel.exe 8000000000

System

LUMI GPU node

  • CPU: 1x 64-core AMD EPYC 7A53 “Trento”
  • GPU: 4x AMD MI250X (8 usable devices, one per GCD)
  • GPU NUMA affinity:

    GPU          0  1  2  3  4  5  6  7
    NUMA domain  3  3  1  1  0  0  2  2
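
The list passed to map_ldom in the srun command above follows from this affinity table. The sketch below derives it, assuming that rank i offloads to GPU i; the function name map_ldom_option is hypothetical and only used for illustration.

```python
# GPU -> NUMA domain affinity, copied from the table above (index = GPU id).
GPU_NUMA_DOMAIN = [3, 3, 1, 1, 0, 0, 2, 2]


def map_ldom_option(gpu_of_rank, gpu_numa_domain=GPU_NUMA_DOMAIN):
    """Build a --cpu-bind value that places each rank in its GPU's NUMA domain.

    gpu_of_rank[i] is the GPU that rank i offloads to (hypothetical helper).
    """
    domains = [gpu_numa_domain[gpu] for gpu in gpu_of_rank]
    return "--cpu-bind=map_ldom:" + ",".join(str(d) for d in domains)


# Assumption: rank i offloads to GPU i.
option = map_ldom_option(range(8))
print(f"srun {option} ./kernel.exe 8000000000")
```

If the ranks were assigned to GPUs in a different order, only the gpu_of_rank argument would need to change.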

Results

The results of the two runs are shown in the plot below. The runtime of the target region differs significantly depending on the CPU/GPU combination that executes it.

LUMI Results

This table shows the mapping between tasks and CPUs for the two runs:

MPI rank                          0       1       2       3       4       5       6       7
CPU / NUMA domain (no binding)    1 / 0   2 / 0   3 / 0   4 / 0   5 / 0   6 / 0   7 / 0   8 / 0
CPU / NUMA domain (with binding)  49 / 3  57 / 3  17 / 1  25 / 1  1 / 0   9 / 0   33 / 2  41 / 2
Device / NUMA domain              0 / 3   1 / 3   2 / 1   3 / 1   4 / 0   5 / 0   6 / 2   7 / 2
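
To read the table more easily, the following sketch lists the ranks whose CPU sits in the same NUMA domain as their device. The values are copied from the table; the helper name local_ranks is hypothetical.

```python
# NUMA domain of the CPU each rank ran on, copied from the table above.
CPU_NUMA_NO_BINDING = [0, 0, 0, 0, 0, 0, 0, 0]
CPU_NUMA_WITH_BINDING = [3, 3, 1, 1, 0, 0, 2, 2]
# NUMA domain of the device each rank offloads to.
DEVICE_NUMA = [3, 3, 1, 1, 0, 0, 2, 2]


def local_ranks(cpu_numa, device_numa):
    """Return the ranks whose CPU and device share a NUMA domain."""
    return [rank
            for rank, (cpu_dom, dev_dom) in enumerate(zip(cpu_numa, device_numa))
            if cpu_dom == dev_dom]


print("no binding:  ", local_ranks(CPU_NUMA_NO_BINDING, DEVICE_NUMA))
print("with binding:", local_ranks(CPU_NUMA_WITH_BINDING, DEVICE_NUMA))
```

Without binding only the ranks that offload to devices 4 and 5 are local to their device; with binding every rank is.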

Memory channel usage: The large difference between the default binding and the optimal binding is mostly caused by congestion of the memory channels. With the default binding, all MPI ranks run in the same NUMA domain, so all data copies to the GPUs go through a single memory channel.
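
The imbalance can be made explicit by counting the ranks per NUMA domain for each run. This is a minimal sketch using the NUMA domains from the task-to-CPU table; ranks_per_domain is a hypothetical helper, and the count only indicates how many ranks share a domain, not the measured bandwidth.

```python
from collections import Counter

# NUMA domain of the CPU each rank ran on, copied from the task-to-CPU table.
CPU_NUMA_NO_BINDING = [0, 0, 0, 0, 0, 0, 0, 0]
CPU_NUMA_WITH_BINDING = [3, 3, 1, 1, 0, 0, 2, 2]


def ranks_per_domain(cpu_numa):
    """Count how many ranks run in each NUMA domain, sorted by domain id."""
    return dict(sorted(Counter(cpu_numa).items()))


print("no binding:  ", ranks_per_domain(CPU_NUMA_NO_BINDING))
print("with binding:", ranks_per_domain(CPU_NUMA_WITH_BINDING))
```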

To isolate the affinity effect itself, a second experiment compares the optimal binding with a suboptimal binding. The suboptimal binding still distributes the processes evenly over the NUMA domains (two MPI ranks per NUMA domain) to maximize memory channel usage. However, for most ranks the CPU core is not in the NUMA domain closest to the GPU that the rank offloads to. The suboptimal binding is map_cpu:1,9,17,25,33,41,49,57, which corresponds to NUMA domains 0,0,1,1,2,2,3,3 according to the CPU-to-NUMA mapping in the table above. The results are shown below:
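
The two map_cpu lists can be compared with a short sketch. It assumes that the 64 cores are numbered contiguously with 16 cores per NUMA domain, which is consistent with the CPU / NUMA pairs in the task-to-CPU table but is not stated explicitly in this document. The helper names are hypothetical.

```python
# Assumption: 64 cores, 4 NUMA domains, cores numbered contiguously.
CORES_PER_NUMA_DOMAIN = 16

# GPU -> NUMA domain affinity from the System section (index = GPU id = rank).
GPU_NUMA_DOMAIN = [3, 3, 1, 1, 0, 0, 2, 2]

OPTIMAL_CPUS = [49, 57, 17, 25, 1, 9, 33, 41]
SUBOPTIMAL_CPUS = [1, 9, 17, 25, 33, 41, 49, 57]


def numa_domain_of(cpu):
    """Map a core id to its NUMA domain under the contiguous-numbering assumption."""
    return cpu // CORES_PER_NUMA_DOMAIN


def summarize(cpus):
    """Return (NUMA domain per rank, ranks that are local to their GPU)."""
    domains = [numa_domain_of(cpu) for cpu in cpus]
    local = [rank for rank, dom in enumerate(domains)
             if dom == GPU_NUMA_DOMAIN[rank]]
    return domains, local


for name, cpus in (("optimal", OPTIMAL_CPUS), ("suboptimal", SUBOPTIMAL_CPUS)):
    domains, local = summarize(cpus)
    print(f"{name:10s} domains={domains} local ranks={local}")
```

Both bindings place two ranks in each NUMA domain, so any remaining runtime difference comes from the distance between the CPU and the GPU rather than from memory channel congestion.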

LUMI binding comparison