GPU affinity on one node with 4 GPUs (Claix23)

This experiment shows the difference in runtime between running with the default CPU binding and with the MPI tasks manually bound to the NUMA domains of their GPUs.

Launch configuration

Both runs were launched with 4 MPI tasks on a single node with 4 GPUs; each task offloaded to one GPU.

The run with the tasks bound to the NUMA domains of their GPUs was launched as follows:

srun --cpu-bind=map_ldom:0,2,4,6 ./kernel.exe 8000000000
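
The source of kernel.exe is not shown here. As a rough sketch (an assumption, not the actual code), each MPI rank could select its GPU before offloading like this, given OpenMP target offload and one rank per GPU:

    /* Minimal sketch (not the actual kernel.exe): each MPI rank selects
     * one GPU based on its rank before offloading the target region. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* With 4 ranks on one node, the global rank equals the node-local rank. */
        int num_devices = omp_get_num_devices();
        if (num_devices > 0)
            omp_set_default_device(rank % num_devices);

        printf("rank %d -> device %d of %d\n",
               rank, omp_get_default_device(), num_devices);

        /* ... run the offloaded target region on the selected device ... */

        MPI_Finalize();
        return 0;
    }

Any equivalent mechanism (e.g. setting CUDA_VISIBLE_DEVICES per rank) achieves the same one-task-per-GPU mapping.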

System

CLAIX-2023-ML node

  • CPU: 2x Intel Xeon 8468 (Sapphire Rapids, 2.1 GHz, 48 cores each)
  • GPU: 4x NVIDIA H100 96 GB HBM2e
  • GPU NUMA affinity:
    GPU            0   1   2   3
    NUMA domain    0   2   4   6
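
The GPU-to-NUMA mapping above can be checked at runtime. A minimal sketch using NVML's CPU-affinity query together with libnuma (an assumption about the tooling, not part of the experiment; link with -lnvidia-ml -lnuma; the reported CPU mask may span more than one NUMA domain, so only the first CPU's node is printed):

    /* Sketch: derive each GPU's NUMA domain from its ideal CPU affinity
     * as reported by NVML, using libnuma to map a CPU to its node. */
    #include <nvml.h>
    #include <numa.h>
    #include <stdio.h>

    #define CPUSET_WORDS 16   /* bitmask words, enough for 1024 CPUs */

    int main(void)
    {
        unsigned int count;
        if (nvmlInit() != NVML_SUCCESS || nvmlDeviceGetCount(&count) != NVML_SUCCESS)
            return 1;

        for (unsigned int i = 0; i < count; i++) {
            nvmlDevice_t dev;
            unsigned long cpuset[CPUSET_WORDS] = {0};
            if (nvmlDeviceGetHandleByIndex(i, &dev) != NVML_SUCCESS ||
                nvmlDeviceGetCpuAffinity(dev, CPUSET_WORDS, cpuset) != NVML_SUCCESS)
                continue;

            /* Report the NUMA node of the first CPU in the affinity mask. */
            for (int cpu = 0; cpu < CPUSET_WORDS * 64; cpu++) {
                if (cpuset[cpu / 64] & (1UL << (cpu % 64))) {
                    printf("GPU %u: CPU %d / NUMA domain %d\n",
                           i, cpu, numa_node_of_cpu(cpu));
                    break;
                }
            }
        }
        nvmlShutdown();
        return 0;
    }

nvidia-smi topo -m also prints the CPU and NUMA affinity of each GPU without any code.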

Results

The results of the two runs are shown in the plot below. The runtime of the target region differs for MPI ranks 2 and 3: without CPU binding, these processes run on cores of a different socket than the GPU they are using, which leads to lower bandwidth and higher latency for the host-device transfers.

[Plot: CLAIX23 runtime of the target region per MPI rank, with and without CPU binding]

The following table shows the mapping between the MPI tasks and the CPU cores, NUMA domains, and GPUs they execute on:

MPI Rank                             0        1        2        3
CPU / NUMA domain (no binding)       0 / 0    12 / 1   24 / 2   36 / 3
CPU / NUMA domain (with binding)     1 / 0    24 / 2   49 / 4   72 / 6
Device / NUMA domain                 0 / 0    1 / 2    2 / 4    3 / 6
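
The CPU and NUMA placement in the first rows of this table can be obtained by having each rank report where it is running. A minimal, Linux-specific sketch (not the code used for the experiment; link with -lnuma):

    /* Sketch: each MPI rank reports the CPU core and NUMA domain it runs on. */
    #define _GNU_SOURCE
    #include <mpi.h>
    #include <numa.h>
    #include <sched.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int cpu  = sched_getcpu();          /* core the rank currently runs on */
        int node = numa_node_of_cpu(cpu);   /* NUMA domain of that core */

        printf("MPI rank %d: CPU %d / NUMA domain %d\n", rank, cpu, node);

        MPI_Finalize();
        return 0;
    }

Running it once with and once without --cpu-bind reproduces the two CPU / NUMA domain rows.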

With a benchmark that only transfers data and performs no computation, the bandwidth between CPU and GPU can be measured in both directions. The results for CLAIX23 are shown below:

[Plot: CLAIX23 CPU-GPU transfer bandwidth in both directions]
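
The transfer benchmark itself is not listed here. A minimal sketch of such a measurement with OpenMP target update directives (buffer size, data type, and single repetition are arbitrary choices for illustration) could look like this:

    /* Sketch of a transfer-only benchmark (not the code behind the plot):
     * time host->device and device->host copies of one large buffer. */
    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const size_t n = 1UL << 28;               /* 2^28 doubles = 2 GiB */
        double *buf = malloc(n * sizeof(double));
        if (!buf)
            return 1;
        /* First touch places the host buffer in the NUMA domain of the
         * core this process runs on, which is what the binding controls. */
        for (size_t i = 0; i < n; i++)
            buf[i] = 1.0;

        /* Allocate the device copy outside the timed region. */
        #pragma omp target enter data map(alloc: buf[0:n])

        double t0 = omp_get_wtime();
        #pragma omp target update to(buf[0:n])    /* host -> device */
        double t1 = omp_get_wtime();
        #pragma omp target update from(buf[0:n])  /* device -> host */
        double t2 = omp_get_wtime();

        #pragma omp target exit data map(delete: buf[0:n])

        double gib = (double)(n * sizeof(double)) / (1024.0 * 1024.0 * 1024.0);
        printf("H2D: %.2f GiB/s   D2H: %.2f GiB/s\n",
               gib / (t1 - t0), gib / (t2 - t1));

        free(buf);
        return 0;
    }

In practice the transfers would be repeated and averaged; the point of the experiment is that the measured bandwidth depends on whether the process and its host buffer sit in a NUMA domain close to the GPU.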