Identifying suitable MPI programs for execution on GPU

This text outlines criteria that can be used to identify MPI programs that can be implemented on GPUs. The first criterion considered is the structure of the code where an iterative structure is favorable. The second criterion is the algorithm itself, that should have a high level of inherent parallelism. The third criterion is the size of the data set that should not exceed the memory available on the GPUs.

Code structure

The computational problem allows partitioning of the problem using the MPI programming model. The algorithm exhibits a structure of repeating computational tasks, e.g. a computation within a loop as shown below. In this example, computation_a and computation_b perform the computations of the underlying algorithm computation_c occurs after data was exchanged among neighbouring patches. Finally, a test is performed to decide if another iteration is computed.

while (iterating == true)
   computation_a
   computation_b
   data_exchange_with_neighbours
   computation_c
   iterating = convergence_test
done

Codes that do not perform iterative computations are not unsuitable for execution on GPUs in general but they are not addressed by the programming model described in the recommended best practice. The computational work within a single iteration should be significantly more compared to the computation performed during data exchange among the MPI ranks between two iterations. Referring to the example above, the majority of the computational work should be performed in routines computation_a, computation_b and computation_c and not in data_exchange_with_neighbours.

An ideal candidate for execution on a GPU is code that performs computations in a loop because these code segments can be translated to a GPU kernel easily. One common example is the nested loop below.

for (x = 0..nx)
   for (y = 0..ny)
      for (z = 0..nz)
         result(x,y,z) = function_a(data_a(x,y,z),data_b(x,y,z),...)
      done
   done
done

The three loops can be collapsed in to a single loop and each iteration can be executed by a single thread of the GPU. A GPU kernel in pseudo code would look like this:

id=global_thread_id

// compute x,y and z from global id
x = id / (ny * nz)
y = (id - x * ny * nz) / nz
z = id - x * ny * nz - y * nz

// compute one element of the result
result(x,y,z) = function_a(data_a(x,y,z),data_b(x,y,z),...)

Code segments that are not written as a single loop can usually be rewritten in a form that is more suitable for parallel execution. The usage of the OpenMP programming model indicates regions that are suitable for execution on a GPU. Most likely, a part of the CPU code can not be implemented efficiently on a GPU, e.g. book keeping or sequential regions. Implementing these parts creates a challenge for the programmer.

Algorithm

The algorithm must have a high degree of inherent parallelism, i.e. many computations can be performed in parallel, independent of each other. In order to write GPU kernels that exploit the parallelism, a deep understanding of the algorithm is necessary. The reference implementation should be written in a way that helps understanding the algorithm. It is often helpful to study a version of the reference code that has not been manually optimized for maximum performance in order to get a better understanding of the algorithm.

Some ideas used for the CPU version of a parallel code can also be used for the GPU version, for example the domain decomposition scheme. On the other hand, common concerns against porting a code from CPU to GPU are not prohibiting the porting to GPU. Some common concerns are:

  • The algorithm should not be judged by the computations performed in the reference implementation. GPU can not only execute vectorized code efficiently but any code.
  • If a code performs a significant amount of random memory access, that should not lead to the conclusion that the algorithm is not suitable for execution on a GPU because the random memory access rate of a GPU is in usually higher than that of a CPU.
  • Data transfer between the host and the GPU is often considered a bottle neck. However, the bandwidth between the host and the GPU is often the same as the bandwidth between the host and the network card. As with a data transfer via the network, a data transfer between host and GPU can be overlapped with computations on the GPU.

Size of data set

The amount of data (D) that is used in the computation must not exceed the amount of memory available on a single GPU (G). In current GPUs, G is about 16 to 64 GB. D must include all data that is necessary for a single MPI rank to perform the computational task within a domain decomposition scheme. Contributions to D are for example, global arrays required for the computation, like boundary conditions but also cell data local to the MPI rank. The total memory available on the GPUs might be smaller than the total memory available on the CPU nodes, simply because there might be less GPU nodes available.

Recommended best-practice(s):