This version implements a distributed matrix-matrix multiplication where computation is spread across GPUs and data is exchanged using MPI collectives. Two variants are included: one using explicit data transfers between host and GPU, and the other leveraging the GPU-aware features of common MPI libraries.
Without GPU-aware MPI, data needs to reside in host memory for MPI communication. Hence, whenever data computed on the GPUs has to be aggregated on a single process, it must first be transferred from GPU to CPU. This variant can be executed with
mpirun matmul KERNEL UNAWARE
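
A minimal sketch of this staging pattern is shown below. The buffer names, sizes, and the choice of MPI_Reduce as the collective are illustrative assumptions, not the exercise's actual code:

```c
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

/* Without GPU-aware MPI: copy the device result to a host buffer
 * before handing it to the MPI collective (names are hypothetical). */
void gather_result_unaware(const double *d_partial, double *h_result,
                           size_t n, int root, MPI_Comm comm)
{
    /* Stage the device data in host memory first... */
    double *h_partial = (double *)malloc(n * sizeof(double));
    cudaMemcpy(h_partial, d_partial, n * sizeof(double),
               cudaMemcpyDeviceToHost);

    /* ...then communicate using host pointers only. */
    MPI_Reduce(h_partial, h_result, (int)n, MPI_DOUBLE, MPI_SUM, root, comm);

    free(h_partial);
}
```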
With GPU-aware MPI, the application developer no longer needs to issue explicit data transfers. Instead, device buffers can be used directly in MPI calls. Data transfers between host and GPU are either removed entirely or optimized by the MPI library, leading to improved runtime performance. The GPU-aware MPI variant can be executed with
mpirun matmul KERNEL AWARE