This experiment shows the difference in runtime between running with the default CPU binding and manually binding the tasks to NUMA domains.
Both runs were launched with 4 MPI tasks, fully occupying one node with 4 GPUs; each task offloaded to one GPU.
The run with the tasks bound to NUMA domains was launched as follows:
srun --cpu-bind=map_ldom:0,2,4,6 ./kernel.exe 8000000000
Mapping of GPUs to NUMA domains on a CLAIX-2023-ML node:
GPU | 0 | 1 | 2 | 3 |
---|---|---|---|---|
NUMA domain | 0 | 2 | 4 | 6 |
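The source of `kernel.exe` is not included in this text; as an illustration only, the timed target region can be pictured roughly along the following lines (a minimal sketch in C with OpenMP offloading; MPI setup is omitted and the meaning of the command-line argument is an assumption):

```c
/* Minimal sketch of a timed OpenMP target region (an assumption about what
 * kernel.exe does; the real source is not shown here). The map clause moves
 * the array to the GPU and back, so the placement of the host process
 * relative to the GPU's NUMA domain affects the measured runtime. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    /* Problem size taken from the command line; its exact meaning in
     * kernel.exe is an assumption. */
    size_t n = (argc > 1) ? strtoull(argv[1], NULL, 10) : 100000000ULL;

    double *a = malloc(n * sizeof(double));
    for (size_t i = 0; i < n; ++i)
        a[i] = 1.0;                        /* first touch on the host side */

    double t0 = omp_get_wtime();
    #pragma omp target teams distribute parallel for map(tofrom: a[0:n])
    for (size_t i = 0; i < n; ++i)
        a[i] = 2.0 * a[i];
    double runtime = omp_get_wtime() - t0;

    printf("target region: %.3f s\n", runtime);
    free(a);
    return 0;
}
```

The `map(tofrom:)` clause moves the array over the CPU-GPU link in both directions; this is exactly the traffic that slows down when the host process runs in a NUMA domain far from its GPU.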
The results of the two runs are shown in the plot below. The runtime of the target region differs for MPI ranks 2 and 3: without CPU binding, these processes run on cores that sit on a different socket than the GPU they are using, which leads to lower bandwidth and higher latency for host-device transfers.
The following table shows the mapping between the MPI tasks and their execution units:
MPI Rank | 0 | 1 | 2 | 3 |
---|---|---|---|---|
CPU / NUMA domain (no binding) | 0 / 0 | 12 / 1 | 24 / 2 | 36 / 3 |
CPU / NUMA domain (with binding) | 1 / 0 | 24 / 2 | 49 / 4 | 72 / 6 |
Device / NUMA domain | 0 / 0 | 1 / 2 | 2 / 4 | 3 / 6 |
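A mapping like the one in the table can be collected by having every rank report the core it runs on, the NUMA domain of that core, and its default OpenMP device. A minimal sketch, assuming glibc's `sched_getcpu()` and libnuma's `numa_node_of_cpu()` are available (link with `-lnuma`; per-rank GPU selection is assumed to be configured elsewhere, e.g. by the batch environment):

```c
/* Each MPI rank reports the core it is running on, that core's NUMA domain,
 * and its default OpenMP target device. Sketch only; link with -lnuma. */
#define _GNU_SOURCE
#include <sched.h>    /* sched_getcpu() */
#include <numa.h>     /* numa_node_of_cpu() */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int cpu  = sched_getcpu();            /* core the rank currently runs on */
    int node = numa_node_of_cpu(cpu);     /* NUMA domain of that core */
    int dev  = omp_get_default_device();  /* device the rank offloads to */

    printf("rank %d: CPU %d, NUMA domain %d, device %d\n",
           rank, cpu, node, dev);

    MPI_Finalize();
    return 0;
}
```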
Using a benchmark that performs no computation and only transfers data, the bandwidth between the CPU and the GPU can be measured in both directions; a sketch of such a measurement follows. The results for CLAIX-2023 are shown below.
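As an illustration, the transfers can be timed with `omp_target_memcpy`; the following minimal sketch assumes that approach (buffer size and output format are illustrative, and a real measurement would repeat the transfers and average):

```c
/* Transfer-only bandwidth sketch: times one host->device and one
 * device->host copy with omp_target_memcpy and prints GB/s per direction.
 * Not the actual benchmark referenced in the text. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    const size_t bytes = (size_t)1 << 30;       /* 1 GiB per transfer */
    int host = omp_get_initial_device();
    int dev  = omp_get_default_device();

    char *h_buf = malloc(bytes);
    memset(h_buf, 0, bytes);                    /* first touch places host pages */
    void *d_buf = omp_target_alloc(bytes, dev);

    double t0 = omp_get_wtime();                /* host -> device */
    omp_target_memcpy(d_buf, h_buf, bytes, 0, 0, dev, host);
    double h2d = omp_get_wtime() - t0;

    t0 = omp_get_wtime();                       /* device -> host */
    omp_target_memcpy(h_buf, d_buf, bytes, 0, 0, host, dev);
    double d2h = omp_get_wtime() - t0;

    printf("H2D: %.2f GB/s  D2H: %.2f GB/s\n",
           bytes / h2d / 1e9, bytes / d2h / 1e9);

    omp_target_free(d_buf, dev);
    free(h_buf);
    return 0;
}
```

Touching the host buffer before the measurement matters: with first-touch allocation the pages end up in the NUMA domain of the core that initializes them, so the measured bandwidth reflects the binding discussed above.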