We consider it good practice to taskify these operations, allowing them to execute concurrently and well before the actual MPI call in the case of packs, or well after it in the case of unpacks. These operations are typically not very large and are strongly memory-bandwidth bound. For that reason we think it is good practice not to parallelize each of them with a fork-join parallel do, as the granularity would be very fine and the overhead would have a significant impact. The size typically varies a lot between packs/unpacks to/from different processes. Using a sufficiently large grain size may allow some of these operations to be parallelized when the message is large, while executing as a single task for small buffers (see the dependence flow between the pack and MPI_Isend() operations in the following code).
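A minimal sketch of such a taskified pack/send phase, before introducing TAMPI, is shown below. The names s_buff, data, n_neighs and the pack() helper are the illustrative ones used in this section; the message size COUNT, datatype, tag and request handling are assumptions, and a plain MPI_Waitall completes the sends.
#include <mpi.h>

#define COUNT 1024                      /* assumed message size (illustrative) */

void pack(double *buf, const double *data, int neigh);   /* assumed helper */

/* Each pack runs as its own task and may execute well before the actual send;
   the matching MPI_Isend task is released only once that pack has completed,
   thanks to the depend clauses on s_buff[n]. Since any thread may execute the
   send tasks, this already requires MPI_THREAD_MULTIPLE (discussed below). */
void pack_and_send(double **s_buff, const double *data, int n_neighs)
{
   MPI_Request req[n_neighs];

   #pragma omp parallel
   #pragma omp single
   for (int n = 0; n < n_neighs; n++) {
      #pragma omp task depend(out: s_buff[n])
      pack(s_buff[n], data, n);         /* fills the buffer for neighbor 'n' */

      #pragma omp task depend(in: s_buff[n])
      MPI_Isend(s_buff[n], COUNT, MPI_DOUBLE, n, 0, MPI_COMM_WORLD, &req[n]);
   }                                    /* implicit barrier: all tasks done   */

   MPI_Waitall(n_neighs, req, MPI_STATUSES_IGNORE);
}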
An easy way to implement communication/computation overlap consists of leveraging the existing communication and computational tasks, relying on task dependencies to handle their synchronization, and linking the application with the TAMPI library.
The TAMPI library intercepts the blocking MPI services, allowing task scheduling points to be inserted while the service executes. If the send or receive operation has not finished when the MPI service is invoked, the runtime is able to start or resume the execution of a different task at that point.
As the communication tasks can potentially be executed by any thread, we need to ensure that MPI provides the MPI_THREAD_MULTIPLE mode, which supports the concurrent invocation of MPI calls from multiple threads. In fact, the extra functionality of the TAMPI library extends the threading level to what the library implementers call MPI_TASK_MULTIPLE (i.e., any thread may execute communication tasks, and blocking operations are transformed into non-blocking ones, additionally inserting a runtime entry point that allows task switching).
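For instance, a TAMPI-enabled program would typically request this extended threading level when initializing MPI. The sketch below uses the same tampi.h header as the listing that follows; the error handling is illustrative.
#include <mpi.h>
#include <tampi.h>      /* provides the MPI_TASK_MULTIPLE threading level */
#include <stdio.h>

int main(int argc, char *argv[])
{
   int provided;
   MPI_Init_thread(&argc, &argv, MPI_TASK_MULTIPLE, &provided);
   if (provided != MPI_TASK_MULTIPLE) {
      fprintf(stderr, "Error: MPI_TASK_MULTIPLE is not supported\n");
      MPI_Abort(MPI_COMM_WORLD, 1);
   }

   /* ... taskified computation, packing/unpacking and communication ... */

   MPI_Finalize();
   return 0;
}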
The resulting code leverages the taskification of the packing/unpacking and communication services, and then substitutes the call to MPI_Waitall with TAMPI_Waitall(). The code may include either blocking or non-blocking services, so a version with MPI_Send() and MPI_Recv() will also work, as long as dependences synchronize producer and consumer tasks (in the following code the order between the receive and unpacking operations is guaranteed by the TAMPI_Waitall service).
#include <tampi.h>

#pragma omp parallel
{
   #pragma omp for
   for (int i=0; i<SIZE; i++) compute(data);   // computational phase

   #pragma omp single
   {
      for (int n=0; n<n_neighs; n++) {
         #pragma omp task
         MPI_Irecv(&r_buff[n], n, ...);

         #pragma omp task depend(out: s_buff[n])
         pack(s_buff[n], data, n);              // packing sending buffer for neighbor 'n'

         #pragma omp task depend(in: s_buff[n])
         MPI_Isend(&s_buff[n], n, ...);
      }
      TAMPI_Waitall();   // guarantees all receives have completed before the unpack tasks below are created

      for (int n=0; n<n_neighs; n++) {
         #pragma omp task
         unpack(r_buff[n], data, n);            // unpacking receiving buffer for neighbor 'n'
      }
   } // end of single
} // end of parallel
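As mentioned above, a variant that issues blocking MPI_Send()/MPI_Recv() calls inside the tasks also works once the application is linked with TAMPI. A minimal sketch of the receive side under that assumption follows; the message size COUNT, datatype, tag and the unpack() signature are illustrative.
#include <mpi.h>

#define COUNT 1024                      /* assumed message size (illustrative) */

void unpack(double *buf, double *data, int neigh);   /* assumed helper */

/* Receive side of the blocking variant: TAMPI turns the blocking MPI_Recv
   inside each task into a task scheduling point, and the depend clauses
   order every receive before its corresponding unpack. */
void recv_and_unpack(double **r_buff, double *data, int n_neighs)
{
   #pragma omp parallel
   #pragma omp single
   for (int n = 0; n < n_neighs; n++) {
      #pragma omp task depend(out: r_buff[n])
      MPI_Recv(r_buff[n], COUNT, MPI_DOUBLE, n, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

      #pragma omp task depend(in: r_buff[n])
      unpack(r_buff[n], data, n);       /* consumes the buffer from neighbor 'n' */
   }
}
Note that no explicit wait is needed on this receive path: the depend clauses on r_buff[n] are enough to release each unpack task once its receive has returned.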
You will also need to compile your code against the TAMPI headers and link it with the TAMPI library:
$ mpicxx -cxx=clang++ -fopenmp -I${TAMPI_HOME}/include your_program.cpp -o your_program.bin -ltampi -L${TAMPI_HOME}/lib
You may obtain further information at: https://github.com/bsc-pm/tampi