GPU affinity on one node with 8 GPUs (LUMI-G)

This experiment shows the difference in runtime between running with the default CPU binding and running with manual binding of the tasks to NUMA domains.

Launch configuration

Both runs were launched with 8 MPI tasks, filling up one node with 8 GPUs. Each task offloaded to one GPU.

The run that binds the tasks to the NUMA domains was launched as follows:

srun --cpu-bind=map_ldom:3,3,1,1,0,0,2,2 ./kernel.exe 8000000000

System

LUMI GPU node

CPU: 1x 64-core AMD EPYC 7A53 “Trento”
GPU: 4x AMD MI250X, exposing 8 usable devices (GCDs)
GPU NUMA affinity:

GPU	0	1	2	3	4	5	6	7
NUMA domain	3	3	1	1	0	0	2	2
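
The map_ldom list in the launch command follows directly from this table. A minimal sketch of that derivation, assuming that the task with node-local rank i offloads to GPU i (the helper name build_map_ldom is hypothetical and only used for illustration):

```shell
#!/bin/sh
# NUMA domain of GPU 0..7, copied from the affinity table.
gpu_numa="3 3 1 1 0 0 2 2"

# Build the argument for srun --cpu-bind.
# Assumption: local rank i offloads to GPU i, so entry i of the list
# is the NUMA domain of GPU i.
build_map_ldom() {
    printf 'map_ldom:%s\n' "$(echo "$gpu_numa" | tr ' ' ',')"
}

echo "srun --cpu-bind=$(build_map_ldom) ./kernel.exe 8000000000"
```

If tasks select their GPU in a different order, the list has to be permuted accordingly.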

Results

The results of the two runs are shown in the plot below. The runtime of the target region differs significantly depending on the CPU/GPU combination that executes it.

[Figure: LUMI results]

This table shows the mapping between tasks and CPUs for the two runs:

MPI Rank	0	1	2	3	4	5	6	7
CPU / NUMA domain (no binding)	1 / 0	2 / 0	3 / 0	4 / 0	5 / 0	6 / 0	7 / 0	8 / 0
CPU / NUMA domain (with binding)	49 / 3	57 / 3	17 / 1	25 / 1	1 / 0	9 / 0	33 / 2	41 / 2
Device / NUMA domain	0 / 3	1 / 3	2 / 1	3 / 1	4 / 0	5 / 0	6 / 2	7 / 2

Memory channel usage: The large difference between the default binding and the optimal binding is mostly caused by congestion of the memory channels. With the default binding, all MPI ranks run on the same NUMA domain (domain 0), so all data copies to the GPUs share a single memory channel.
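
The imbalance can be made visible by counting the ranks per NUMA domain for the two CPU lists in the table above. This is only a sketch: it assumes 16 consecutive cores per NUMA domain, which is consistent with the CPU / NUMA pairs in the table but is not stated explicitly, and ranks_per_domain is a hypothetical helper name.

```shell
#!/bin/sh
# Count how many ranks land in each NUMA domain for a comma-separated
# list of core IDs (one core per rank).
# Assumption: cores 0-15 are domain 0, 16-31 domain 1, and so on.
ranks_per_domain() {
    for cpu in $(echo "$1" | tr ',' ' '); do
        echo $(( cpu / 16 ))
    done | sort -n | uniq -c | awk '{ printf "NUMA %s: %s ranks\n", $2, $1 }'
}

echo "no binding:"
ranks_per_domain 1,2,3,4,5,6,7,8

echo "with binding:"
ranks_per_domain 49,57,17,25,1,9,33,41
```

Under this assumption the first list places all 8 ranks in domain 0, while the second places 2 ranks in each of the 4 domains.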

To isolate the actual affinity effect, a second experiment compares the optimal binding with a suboptimal binding. The suboptimal binding still distributes the processes evenly over the NUMA domains (2 MPI ranks per NUMA domain) to maximize memory channel usage. However, the CPU cores and NUMA domains used are not matched to the GPUs that the MPI ranks offload to. The suboptimal binding is map_cpu:1,9,17,25,33,41,49,57, which, according to the CPU / NUMA pairs in the table above, corresponds to the NUMA domains 0,0,1,1,2,2,3,3. The results are shown below:
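
A binding list can be checked against the GPU affinity table before launching. The following sketch compares, rank by rank, the NUMA domain of the bound core with the NUMA domain of the GPU. It assumes that rank i offloads to GPU i and that each NUMA domain holds 16 consecutive cores (consistent with the tables above, but not stated there); cpu_to_numa and check_binding are hypothetical helper names.

```shell
#!/bin/sh
# NUMA domain of GPU 0..7, copied from the affinity table.
gpu_numa="3 3 1 1 0 0 2 2"

# Assumption: cores 0-15 are domain 0, 16-31 domain 1, and so on.
cpu_to_numa() {
    echo $(( $1 / 16 ))
}

# Print core and GPU domains for every rank of a comma-separated
# map_cpu list, then the number of ranks whose core is in the same
# NUMA domain as their GPU.
check_binding() {
    rank=0
    matches=0
    for cpu in $(echo "$1" | tr ',' ' '); do
        gpu_dom=$(echo "$gpu_numa" | cut -d ' ' -f $((rank + 1)))
        cpu_dom=$(cpu_to_numa "$cpu")
        if [ "$cpu_dom" -eq "$gpu_dom" ]; then
            matches=$((matches + 1))
        fi
        echo "rank $rank: core $cpu (NUMA $cpu_dom), GPU $rank (NUMA $gpu_dom)"
        rank=$((rank + 1))
    done
    echo "local ranks: $matches"
}

check_binding 1,9,17,25,33,41,49,57   # suboptimal list from the text
check_binding 49,57,17,25,1,9,33,41   # cores from the "with binding" row
```

Under these assumptions the check reports 8 local ranks for the cores of the optimal binding and 2 for the suboptimal list, so most ranks in the suboptimal run reach their GPU from a remote NUMA domain.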

[Figure: LUMI binding comparison]
