GPU affinity on one node with 4 GPUs (Claix23)

This experiment shows the difference in runtime between running with the default CPU binding and with the MPI tasks manually bound to the NUMA domains of their GPUs.

Launch configuration

Both runs were launched with 4 MPI tasks on a single node with 4 GPUs; each task offloaded to one GPU.

The run with the tasks bound to the NUMA domains of their GPUs was launched as follows:

srun --cpu-bind=map_ldom:0,2,4,6 ./kernel.exe 8000000000
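
The source of kernel.exe is not shown here. As a rough sketch (an assumption, not the actual code), each MPI rank could select its GPU before offloading like this, given OpenMP target offload and one rank per GPU:

    /* Minimal sketch (not the actual kernel.exe): each MPI rank selects
     * one GPU based on its rank before offloading the target region. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* With 4 ranks on one node, the global rank equals the node-local rank. */
        int num_devices = omp_get_num_devices();
        if (num_devices > 0)
            omp_set_default_device(rank % num_devices);

        printf("rank %d -> device %d of %d\n",
               rank, omp_get_default_device(), num_devices);

        /* ... run the offloaded target region on the selected device ... */

        MPI_Finalize();
        return 0;
    }

Any equivalent mechanism (e.g. setting CUDA_VISIBLE_DEVICES per rank) achieves the same one-task-per-GPU mapping.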

System

CLAIX-2023-ML node

  • CPU: 2x Intel Xeon 8468 (Sapphire Rapids, 2.1 GHz, 48 cores each)
  • GPU: 4x NVIDIA H100 96 GB HBM2e
  • GPU NUMA affinity:
    GPU            0   1   2   3
    NUMA domain    0   2   4   6
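
The GPU-to-NUMA mapping above can be checked at runtime. A minimal sketch using NVML's CPU-affinity query together with libnuma (an assumption about the tooling, not part of the experiment; link with -lnvidia-ml -lnuma; the reported CPU mask may span more than one NUMA domain, so only the first CPU's node is printed):

    /* Sketch: derive each GPU's NUMA domain from its ideal CPU affinity
     * as reported by NVML, using libnuma to map a CPU to its node. */
    #include <nvml.h>
    #include <numa.h>
    #include <stdio.h>

    #define CPUSET_WORDS 16   /* bitmask words, enough for 1024 CPUs */

    int main(void)
    {
        unsigned int count;
        if (nvmlInit() != NVML_SUCCESS || nvmlDeviceGetCount(&count) != NVML_SUCCESS)
            return 1;

        for (unsigned int i = 0; i < count; i++) {
            nvmlDevice_t dev;
            unsigned long cpuset[CPUSET_WORDS] = {0};
            if (nvmlDeviceGetHandleByIndex(i, &dev) != NVML_SUCCESS ||
                nvmlDeviceGetCpuAffinity(dev, CPUSET_WORDS, cpuset) != NVML_SUCCESS)
                continue;

            /* Report the NUMA node of the first CPU in the affinity mask. */
            for (int cpu = 0; cpu < CPUSET_WORDS * 64; cpu++) {
                if (cpuset[cpu / 64] & (1UL << (cpu % 64))) {
                    printf("GPU %u: CPU %d / NUMA domain %d\n",
                           i, cpu, numa_node_of_cpu(cpu));
                    break;
                }
            }
        }
        nvmlShutdown();
        return 0;
    }

nvidia-smi topo -m also prints the CPU and NUMA affinity of each GPU without any code.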

Results

The results of the two runs are shown in the plot below. The runtime of the target region differs for MPI ranks 2 and 3: without CPU binding, these processes run on cores of a different socket than the GPU they are using, which leads to lower bandwidth and higher latency for the host-device transfers.

[Plot: CLAIX23 runtime of the target region per MPI rank, with and without CPU binding]

The following table shows the mapping between the MPI tasks and the CPU cores, NUMA domains, and GPUs they execute on:

MPI Rank                             0        1        2        3
CPU / NUMA domain (no binding)       0 / 0    12 / 1   24 / 2   36 / 3
CPU / NUMA domain (with binding)     1 / 0    24 / 2   49 / 4   72 / 6
Device / NUMA domain                 0 / 0    1 / 2    2 / 4    3 / 6
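
The CPU and NUMA placement in the first rows of this table can be obtained by having each rank report where it is running. A minimal, Linux-specific sketch (not the code used for the experiment; link with -lnuma):

    /* Sketch: each MPI rank reports the CPU core and NUMA domain it runs on. */
    #define _GNU_SOURCE
    #include <mpi.h>
    #include <numa.h>
    #include <sched.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int cpu  = sched_getcpu();          /* core the rank currently runs on */
        int node = numa_node_of_cpu(cpu);   /* NUMA domain of that core */

        printf("MPI rank %d: CPU %d / NUMA domain %d\n", rank, cpu, node);

        MPI_Finalize();
        return 0;
    }

Running it once with and once without --cpu-bind reproduces the two CPU / NUMA domain rows.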

With a benchmark that only transfers data and performs no computation, the bandwidth between CPU and GPU can be measured in both directions. The results for CLAIX23 are shown below:

[Plot: CLAIX23 CPU-GPU transfer bandwidth in both directions]
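
The transfer benchmark itself is not listed here. A minimal sketch of such a measurement with OpenMP target update directives (buffer size, data type, and single repetition are arbitrary choices for illustration) could look like this:

    /* Sketch of a transfer-only benchmark (not the code behind the plot):
     * time host->device and device->host copies of one large buffer. */
    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const size_t n = 1UL << 28;               /* 2^28 doubles = 2 GiB */
        double *buf = malloc(n * sizeof(double));
        if (!buf)
            return 1;
        /* First touch places the host buffer in the NUMA domain of the
         * core this process runs on, which is what the binding controls. */
        for (size_t i = 0; i < n; i++)
            buf[i] = 1.0;

        /* Allocate the device copy outside the timed region. */
        #pragma omp target enter data map(alloc: buf[0:n])

        double t0 = omp_get_wtime();
        #pragma omp target update to(buf[0:n])    /* host -> device */
        double t1 = omp_get_wtime();
        #pragma omp target update from(buf[0:n])  /* device -> host */
        double t2 = omp_get_wtime();

        #pragma omp target exit data map(delete: buf[0:n])

        double gib = (double)(n * sizeof(double)) / (1024.0 * 1024.0 * 1024.0);
        printf("H2D: %.2f GiB/s   D2H: %.2f GiB/s\n",
               gib / (t1 - t0), gib / (t2 - t1));

        free(buf);
        return 0;
    }

In practice the transfers would be repeated and averaged; the point of the experiment is that the measured bandwidth depends on whether the process and its host buffer sit in a NUMA domain close to the GPU.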