Overlap communication and packing/unpacking tasks with TAMPI

To apply after: Parallelize packing and unpacking regions

It is good practice to taskify these operations so that they can execute concurrently, and well before the actual MPI call in the case of packs or well after it in the case of unpacks. These operations are typically small and strongly memory-bandwidth bound. For that reason we recommend not parallelizing each of them with a fork-join parallel worksharing loop, as the granularity would be very fine and the overhead would have a significant impact. The size usually varies a lot between packs/unpacks to/from different processes; using a sufficiently large grain size allows the larger messages to be parallelized while small buffers execute as a single task (see the dependence flow between the pack and MPI_Isend() operations in the main example below).
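As a rough illustration of that grain-size trade-off, the sketch below packs the buffer of one neighbor as a single dependence-carrying task and uses a taskloop with a grainsize inside it, so large buffers are filled by several child tasks while small buffers effectively run as one chunk. The helper gather_element(), the grain size value and the array-section dependence are illustrative assumptions, not part of the pattern.

#include <stddef.h>

// Hypothetical per-element gather; stands in for the application's packing logic.
double gather_element(const double *data, int n, size_t i);

// Pack the send buffer of neighbor 'n' as one dependence-carrying task.
// The taskloop's implicit taskgroup completes every chunk before the task
// finishes and releases its dependence to the matching MPI_Isend() task
// (which would use depend(in: buf[0:count])).
void pack_task(double *buf, const double *data, int n, size_t count)
{
   #pragma omp task depend(out: buf[0:count])
   {
      #pragma omp taskloop grainsize(4096)
      for (size_t i = 0; i < count; i++)
         buf[i] = gather_element(data, n, i);
   }
}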

An easy way to implement communication and computation overlap is to leverage the existing communication and computational tasks, rely on task dependencies for their synchronization, and link the application with the TAMPI library.

The TAMPI library intercepts the blocking MPI services and inserts Task Scheduling Points while the service executes. If the send or receive operation has not finished when the MPI service is invoked, the runtime can start or resume the execution of a different task at that point.
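For example, a receive can be written as a task around the blocking MPI_Recv(): with TAMPI linked in (and the threading level described in the next paragraph), a task whose message has not arrived yet is suspended at that call and the thread moves on to other ready tasks, resuming the receive task once the data is available. The function and argument names in this sketch are illustrative.

#include <mpi.h>

// Sketch: post the receive for neighbor 'n' as a task around a blocking call.
// TAMPI intercepts MPI_Recv(); if the message is not ready, the task yields
// instead of keeping the thread blocked inside MPI.
void recv_task(double *r_buf, int count, int n, int tag)
{
   #pragma omp task depend(out: r_buf[0:count])
   MPI_Recv(r_buf, count, MPI_DOUBLE, n, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}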

As the communication tasks can potentially be executed by any thread, we need to ensure that MPI provides the MPI_THREAD_MULTIPLE mode, which supports concurrent invocation of MPI calls from multiple threads. In fact, TAMPI's extra functionality extends the threading support with a new level, MPI_TASK_MULTIPLE, which allows any thread to execute a communication task and turns blocking operations into non-blocking ones, inserting a runtime entry point that enables task switching.
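In practice this means initializing MPI with the extended level through MPI_Init_thread(), as in the minimal sketch below (error handling is reduced to an abort for brevity):

#include <mpi.h>
#include <TAMPI.h>

int main(int argc, char **argv)
{
   // MPI_TASK_MULTIPLE (defined by TAMPI) implies MPI_THREAD_MULTIPLE and
   // additionally enables the task-aware handling of blocking MPI calls.
   int provided;
   MPI_Init_thread(&argc, &argv, MPI_TASK_MULTIPLE, &provided);
   if (provided != MPI_TASK_MULTIPLE)
      MPI_Abort(MPI_COMM_WORLD, 1);   // task-aware blocking mode not available

   // ... taskified packing, communication and unpacking ...

   MPI_Finalize();
   return 0;
}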

The resulting code leverages the taskification of the packing/unpacking and communication services, and then substitutes the call to MPI_Waitall with TAMPI_Waitall(). The code may use either blocking or non-blocking services, so a version with MPI_Send() and MPI_Recv() will also work, as long as dependences synchronize producer and consumer tasks (in the following code the order between the receive and unpacking operations is guaranteed by the TAMPI_Waitall service); a sketch of such a blocking variant is shown after the example below.

#include <mpi.h>
#include <TAMPI.h>

#pragma omp parallel
{
   #pragma omp for
   for (int i=0; i<SIZE; i++) compute(data);

   #pragma omp single
   {
      for (int n=0; n<n_neighs; n++) {
         #pragma omp task
         MPI_Irecv(&r_buff[n], n, ...);

         #pragma omp task depend(out: s_buff[n])
         pack( s_buff[n], data, n ); // packing sending buffer for neighbor 'n'
         #pragma omp task depend(in: s_buff[n])
         MPI_Isend(&s_buff[n], n, ...);
      }
      TAMPI_Waitall();

      for (int n=0; n<n_neighs; n++) {
         #pragma omp task
         unpack(r_buff[n], data, n); // unpacking receiving buffer for neighbor 'n'
      }
   } // end of single
} // end of parallel
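Under the same assumptions, the blocking variant mentioned above could replace the body of the single region with the following sketch, where dependences on r_buff[n] (rather than a waitall) order each receive with its unpack:

for (int n=0; n<n_neighs; n++) {
   #pragma omp task depend(out: r_buff[n])
   MPI_Recv(&r_buff[n], n, ...);      // blocking, but task-aware under TAMPI

   #pragma omp task depend(out: s_buff[n])
   pack( s_buff[n], data, n );
   #pragma omp task depend(in: s_buff[n])
   MPI_Send(&s_buff[n], n, ...);      // blocking, but task-aware under TAMPI

   #pragma omp task depend(in: r_buff[n])
   unpack(r_buff[n], data, n);
}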

You will also need to compile and link your code with the TAMPI Library:

$ mpicxx -cxx=clang++ -fopenmp -I${TAMPI_HOME}/include your_program.cpp -o your_program.bin -ltampi -L${TAMPI_HOME}/lib

You may obtain further information at: https://github.com/bsc-pm/tampi