Porting code to GPU (iterative kernel execution)

Pattern addressed: Identifying suitable MPI programs for execution on GPU

This text outlines criteria for identifying MPI programs that are suitable for implementation on GPUs. The first criterion is the structure of the code: an iterative structure is favorable. The second is the algorithm itself, which should exhibit a high degree of inherent parallelism. The third is the size of the data set, which should not exceed the memory available on the GPUs.

This text describes a programming pattern for GPUs where the computation is performed iteratively. Between iterations, data exchange among MPI ranks takes place. The code implements classical domain decomposition where each patch is processed by a dedicated MPI rank using a single GPU. In an ideal case, the existing domain decomposition can be reused while the computation is performed by the GPU instead of the CPU.

General

The described programming pattern is very similar to the traditional pattern for CPU codes in that multiple MPI ranks run on multiple compute nodes. The difference is that the GPU is not regarded as an accelerator for certain parts of the code but as the main computing unit. Initially, each MPI rank uses exactly one GPU, which keeps the code simple and avoids load imbalance within a single MPI rank. In addition, each MPI rank can use multiple CPU threads to perform computations during the execution of the GPU kernels and during the data exchange among MPI ranks, when the GPUs are idle.
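
One common way to realize the one-GPU-per-rank mapping is to derive a node-local rank and use it to select a device, as in the following C sketch. It assumes MPI-3 and the OpenCL host API; the function name pick_device and the fixed device array size are illustrative choices, not part of the pattern.

   #include <mpi.h>
   #include <CL/cl.h>

   /* Select one OpenCL GPU device per MPI rank based on the node-local rank
      (sketch; error handling omitted for brevity). */
   static cl_device_id pick_device(void)
   {
       int local_rank;
       MPI_Comm node_comm;

       /* Ranks sharing a compute node obtain consecutive node-local ranks. */
       MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                           MPI_INFO_NULL, &node_comm);
       MPI_Comm_rank(node_comm, &local_rank);
       MPI_Comm_free(&node_comm);

       cl_platform_id platform;
       clGetPlatformIDs(1, &platform, NULL);

       cl_uint num_devices;
       clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, NULL, &num_devices);

       cl_device_id devices[16];      /* assumes at most 16 GPUs per node */
       cl_uint n = num_devices < 16 ? num_devices : 16;
       clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, n, devices, NULL);

       /* One GPU per rank: map the node-local rank round-robin onto the devices. */
       return devices[local_rank % n];
   }

Splitting MPI_COMM_WORLD with MPI_COMM_TYPE_SHARED yields one communicator per shared-memory node, so the node-local rank can be mapped directly onto the GPUs of that node.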

Code structure

The code consists of three phases: initialization, iterative computation and finalization. These phases are explained in the following.

1. Initialization:

In this phase, the GPU environment is initialized. In an OpenCL environment, this includes the creation of the device context, the program binary, the queues, the kernels and the memory objects. The data set is loaded and distributed among the GPUs using common domain decomposition techniques. The data is copied to the GPU memory.
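
As a rough illustration of this phase, the following C sketch creates a minimal OpenCL environment for one rank and uploads its local patch. The names gpu_state, computation_a, local_patch and patch_bytes are placeholders, and error handling is omitted.

   #include <stddef.h>
   #include <CL/cl.h>

   /* Per-rank GPU handles created during initialization (names are illustrative). */
   struct gpu_state {
       cl_context       context;
       cl_command_queue queue;
       cl_program       program;
       cl_kernel        kernel_a;
       cl_mem           d_patch;
   };

   /* Initialize the OpenCL environment for one MPI rank and upload its local patch.
      'source' holds the kernel source code, 'local_patch'/'patch_bytes' describe
      the rank's sub-domain. */
   static struct gpu_state gpu_init(cl_device_id device, const char *source,
                                    const void *local_patch, size_t patch_bytes)
   {
       struct gpu_state s;
       cl_int err;

       s.context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
       /* clCreateCommandQueueWithProperties requires OpenCL 2.0; older
          implementations provide clCreateCommandQueue instead. */
       s.queue = clCreateCommandQueueWithProperties(s.context, device, NULL, &err);

       /* Build the program and create a kernel used in the iterative phase. */
       s.program = clCreateProgramWithSource(s.context, 1, &source, NULL, &err);
       clBuildProgram(s.program, 1, &device, NULL, NULL, NULL);
       s.kernel_a = clCreateKernel(s.program, "computation_a", &err);

       /* Allocate device memory and copy the rank's patch of the decomposed domain. */
       s.d_patch = clCreateBuffer(s.context, CL_MEM_READ_WRITE, patch_bytes, NULL, &err);
       clEnqueueWriteBuffer(s.queue, s.d_patch, CL_TRUE, 0, patch_bytes,
                            local_patch, 0, NULL, NULL);
       return s;
   }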

2. Iterative computation:

The main computational work is performed in this phase. The host enqueues kernels that run on the GPU and then waits for their completion. The next step is to transfer the data that is required for the inter-domain communication among all GPUs. In addition, the host may upload data that is required as input for the next iteration. A convergence test is performed in each iteration to decide whether another iteration is needed. The second phase of a typical host program would look like this:

while (iterating == true)
   enqueue GPU kernel for computation_a
   enqueue GPU kernel for computation_b

   while (waiting for completion of GPU kernels)
      do something useful on the CPU, e.g. compute boundary conditions, perform I/O, check convergence

   data_exchange_with_neighbours
   upload required data to GPU

   computation_c on CPU

   iterating = convergence_test
done
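
Translated to the OpenCL host API and MPI, the loop above might be sketched as follows. The helper functions and the variable names are illustrative placeholders for application-specific code, and the OpenCL handles are assumed to come from the initialization phase; this is a sketch, not a definitive implementation.

   #include <stddef.h>
   #include <mpi.h>
   #include <CL/cl.h>

   /* Application-specific placeholders, not part of the pattern itself. */
   extern void do_cpu_work_overlapping_gpu(void);          /* boundary conditions, I/O, ... */
   extern void exchange_halo_with_neighbours(void *halo);  /* e.g. MPI_Sendrecv per neighbour */
   extern void computation_c_on_cpu(void);
   extern int  converged(void);

   /* Sketch of the iterative phase for one rank (error handling omitted).
      'halo'/'halo_bytes' describe the boundary data exchanged between ranks. */
   static void iterate(cl_command_queue queue, cl_kernel kernel_a, cl_kernel kernel_b,
                       cl_mem d_halo, void *halo, size_t halo_bytes, size_t global_size)
   {
       int iterating = 1;
       while (iterating) {
           /* Enqueue the GPU kernels for this iteration. */
           clEnqueueNDRangeKernel(queue, kernel_a, 1, NULL, &global_size, NULL, 0, NULL, NULL);
           clEnqueueNDRangeKernel(queue, kernel_b, 1, NULL, &global_size, NULL, 0, NULL, NULL);

           /* Useful CPU work can overlap the kernel execution. */
           do_cpu_work_overlapping_gpu();
           clFinish(queue);             /* wait for completion of the GPU kernels */

           /* Download boundary data, exchange it with the neighbouring ranks via MPI,
              and upload the received halo for the next iteration. */
           clEnqueueReadBuffer(queue, d_halo, CL_TRUE, 0, halo_bytes, halo, 0, NULL, NULL);
           exchange_halo_with_neighbours(halo);
           clEnqueueWriteBuffer(queue, d_halo, CL_TRUE, 0, halo_bytes, halo, 0, NULL, NULL);

           computation_c_on_cpu();

           /* Global convergence test: iterate again only if some rank has not converged. */
           int local_continue = !converged();
           MPI_Allreduce(&local_continue, &iterating, 1, MPI_INT, MPI_LOR, MPI_COMM_WORLD);
       }
   }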

3. Finalization:

The data set is saved and the GPU environment is cleaned up.
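
A matching finalization sketch, again assuming the illustrative handles created during initialization, downloads the result and releases the OpenCL objects:

   #include <stddef.h>
   #include <CL/cl.h>

   /* Finalization sketch: download the result and release the OpenCL objects
      created during initialization (error handling omitted, names illustrative). */
   static void gpu_finalize(cl_context context, cl_command_queue queue, cl_program program,
                            cl_kernel kernel_a, cl_mem d_patch,
                            void *local_patch, size_t patch_bytes)
   {
       /* Blocking read: copy the rank's patch back to host memory so it can be saved. */
       clEnqueueReadBuffer(queue, d_patch, CL_TRUE, 0, patch_bytes,
                           local_patch, 0, NULL, NULL);

       clReleaseMemObject(d_patch);
       clReleaseKernel(kernel_a);
       clReleaseProgram(program);
       clReleaseCommandQueue(queue);
       clReleaseContext(context);
   }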

Recommended in program(s): CPU to GPU, CPU version

Implemented in program(s): CPU to GPU, OpenCL version