MPI programmers often use non-blocking calls to overlap communication and computation. In such codes, the MPI process communicates with its neighbours through a sequence of stages: 1) an optional pack of the data (if needed); 2) a set of non-blocking send/receive operations, which may potentially overlap with each other; 3) a wait for the communications (potentially split between send and receive requests); and 4) the computational phase.
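As a point of reference, the sketch below shows a minimal baseline version of this sequence, in which both waits are executed before the computational phase; the pack/unpack/Computation helpers, the neighbour list and the message sizes are illustrative assumptions that mirror the pseudo-code used in the rest of this section:

#include <mpi.h>

// Illustrative helpers, assumed to be provided by the application.
void pack(double *data, double **s_buffer);
void unpack(double **r_buffer, double *data);
void Computation(double *data);

void baseline_exchange(int ITERS, int n_neighbours, const int *neighbour, int count,
                       double **s_buffer, double **r_buffer, double *data)
{
    MPI_Request irecv_reqs[n_neighbours], isend_reqs[n_neighbours];

    for (int it = 0; it < ITERS; it++) {
        pack(data, s_buffer);                                        // 1) optional pack
        for (int n = 0; n < n_neighbours; n++) {                     // 2) non-blocking operations
            MPI_Irecv(r_buffer[n], count, MPI_DOUBLE, neighbour[n], 0,
                      MPI_COMM_WORLD, &irecv_reqs[n]);
            MPI_Isend(s_buffer[n], count, MPI_DOUBLE, neighbour[n], 0,
                      MPI_COMM_WORLD, &isend_reqs[n]);
        }
        MPI_Waitall(n_neighbours, irecv_reqs, MPI_STATUSES_IGNORE);  // 3) wait for receives...
        MPI_Waitall(n_neighbours, isend_reqs, MPI_STATUSES_IGNORE);  //    ...and sends (no overlap)
        unpack(r_buffer, data);
        Computation(data);                                           // 4) computational phase
    }
}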
The recommended best practice in this case consists of postponing the wait on the send requests as much as possible, in order to overlap the send operations with the computational phase. There are several ways to delay this operation.
First, we can modify the program itself, and thus the lexicographical order of the send waitall. Its new placement can be postponed until the sending buffer is reused, which in this case is just before the packing service (in the next iteration):
for (it = 0; it < ITERS; it++) {
    // Complete the sends of the previous iteration right before the sending buffer
    // is reused; assuming the requests are initialised to MPI_REQUEST_NULL, the
    // first iteration's wait is a no-op. Argument lists are abbreviated.
    MPI_Waitall (isend_reqs);
    pack(data, s_buffer);
    for (n = 0; n < n_neighbours; n++) {
        MPI_Irecv (n, r_buffer[n], irecv_reqs[n]);
        MPI_Isend (n, s_buffer[n], isend_reqs[n]);
    }
    MPI_Waitall (irecv_reqs);
    unpack(r_buffer, data);
    Computation(data); // Parallelized with OpenMP
} // End of the loop on ITERS
MPI_Waitall (isend_reqs); // Complete the sends posted in the last iteration
Since the waitall for the sends of a given iteration is executed at the beginning of the next iteration, we also need to guarantee that the sends posted in the last iteration are completed. We therefore include a waitall operation immediately following the loop, although it could be postponed until the next use of the sending buffer.
A second alternative consists of keeping the lexicographical order but taskifying the wait, so that its execution (actually, its completion) is postponed until it is really required. This approach needs a proper level of MPI threading support (see the initialization sketch further below) and leverages task synchronization mechanisms to guarantee that the task has finished:
for (it = 0; it < ITERS; it++) {
    // Ensure the wait-for-send task of the previous iteration has finished before
    // the sending buffer is reused (no-op in the first iteration). The depend
    // clause on taskwait requires OpenMP 5.0.
    #pragma omp taskwait depend(out: s_buffer)
    pack(data, s_buffer);
    for (n = 0; n < n_neighbours; n++) {
        MPI_Irecv (n, r_buffer[n], irecv_reqs[n]);
        MPI_Isend (n, s_buffer[n], isend_reqs[n]);
    }
    MPI_Waitall (irecv_reqs);
    // Task completing the sends; it may run concurrently with the unpack and
    // the computational phase.
    #pragma omp task depend(out: s_buffer)
    MPI_Waitall (isend_reqs);
    unpack(r_buffer, data);
    Computation(data); // Parallelized with OpenMP
} // End of the loop on ITERS
#pragma omp taskwait depend(out: s_buffer) // Wait for the last iteration's task
As in the previous case, the taskwait is executed at the beginning of the next iteration, and we also need to guarantee that the task created in the last iteration finishes its execution. In the pseudo-code above we include this final taskwait immediately following the loop, but it could be postponed until any further use of the sending buffer.
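Regarding the MPI threading support mentioned above: since the wait-for-send task may be executed by a thread other than the one that posted the sends, the MPI library should be initialised with a thread level that allows this. The following is a minimal sketch, assuming MPI_THREAD_MULTIPLE is required (a lower level such as MPI_THREAD_SERIALIZED may suffice depending on how the runtime schedules MPI calls):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;

    // Request full multi-threaded support, since the task calling MPI_Waitall
    // may run on a different thread than the one issuing the MPI_Isend calls.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not available (got %d)\n", provided);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    // ... iterative exchange/computation loop with the taskified waits ...

    MPI_Finalize();
    return 0;
}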
SW Co-design (MPI): the MPI progression engine policy should be taken into account. If a progression thread is not available, the non-blocking send operations may not progress until the next entry into an MPI call.
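One common workaround, shown here only as an illustrative sketch, is to enter the MPI library periodically during the computational phase so that the outstanding sends can progress; the Compute_chunk helper and the splitting of the computation into n_chunks pieces are assumptions made for the example:

#include <mpi.h>

// Assumed helper that performs one piece of the computational phase.
void Compute_chunk(double *data, int chunk);

void compute_with_progress(int n_chunks, int n_neighbours,
                           MPI_Request isend_reqs[], double *data)
{
    for (int c = 0; c < n_chunks; c++) {
        Compute_chunk(data, c);                    // part of the computational phase

        int flag;                                  // entering MPI gives the library a
        MPI_Testall(n_neighbours, isend_reqs,      // chance to progress the pending sends
                    &flag, MPI_STATUSES_IGNORE);
    }
}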
If programmers opt for the task version of this approach, it is also worth mentioning that the wait-for-send tasks can leverage any priority mechanism available in the task-based runtime system, assigning them a low priority value.
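In OpenMP, for example, this can be expressed with the priority clause on the wait-for-send task of the listing above (keeping its abbreviated argument list); note that the clause is only a scheduling hint and takes effect only when OMP_MAX_TASK_PRIORITY is set to a value greater than zero:

// Hint the runtime to schedule the wait-for-send task late: priority(0) is the
// lowest priority value.
#pragma omp task depend(out: s_buffer) priority(0)
MPI_Waitall (isend_reqs);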
Recommended in program(s): False communication-computation overlap (original)
Implemented in program(s): False communication-computation overlap (postpone-wait)