# Resources for Co-design at POP CoE

## Postpone the execution of non-blocking send wait operations

Waiting for non-blocking send operations can prevent computational progress: MPI programmers often use non-blocking calls to overlap communication and computation. In such codes, each MPI process communicates with its neighbours through a sequence of stages: 1) an optional pack of the data (if needed); 2) a set of non-blocking send/receive operations, which may overlap with one another; 3) a wait for communications (potentially split into send and receive requests); and 4) the computational phase.
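As a point of reference, the baseline pattern described above can be sketched in pseudocode, using the same simplified MPI signatures as the listings below (buffer and request names are illustrative):

```
for (it = 0; it < ITERS; it++) {
    pack(data, s_buffer);                     // stage 1: optional pack

    for (n = 0; n < n_neighbours; n++) {      // stage 2: non-blocking exchanges
        MPI_Irecv(n, r_buffer[n], irecv_reqs[n]);
        MPI_Isend(n, s_buffer[n], isend_reqs[n]);
    }

    MPI_Waitall(irecv_reqs);                  // stage 3: wait for communications
    MPI_Waitall(isend_reqs);                  //   (the send wait blocks progress here)

    unpack(r_buffer, data);

    Computation(data);                        // stage 4: computational phase
}
```

Note that the wait on the send requests sits on the critical path even though the send buffer is not needed again until the pack of the next iteration.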

The recommended best practice in this case consists of postponing the wait on the send buffer as long as possible, so that the send operations can potentially overlap with the computational phase. There are several ways to delay this operation.

First, we can modify the program itself, and thus the lexicographical order of the send waitall call. The operation can be postponed until the send buffer is next reused, which in this case is just before the packing step of the next iteration:

```
for (it = 0; it < ITERS; it++) {
    pack(data, s_buffer);

    for (n = 0; n < n_neighbours; n++) {
        MPI_Irecv(n, r_buffer[n], irecv_reqs[n]);
        MPI_Isend(n, s_buffer[n], isend_reqs[n]);
    }

    MPI_Waitall(irecv_reqs);

    unpack(r_buffer, data);

    Computation(data);  // Parallelized with OpenMP

    MPI_Waitall(isend_reqs);  // Completes the sends issued in this iteration
} // End of the loop on ITERS
```


With the wait placed at the very end of the loop body, the sends of every iteration, including the last one, complete inside the loop. If the wait is postponed further, to the beginning of the loop body so that it runs just before the pack of the next iteration reuses the buffer, then the last iteration no longer has a matching wait: a waitall must be added immediately after the loop, although it too could be postponed until the next use of the send buffer.
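A sketch of that further-postponed variant, assuming the send requests are initialized to `MPI_REQUEST_NULL` so that the wait in the first iteration completes trivially:

```
for (it = 0; it < ITERS; it++) {
    MPI_Waitall(isend_reqs);   // Completes the sends of the previous iteration
    pack(data, s_buffer);      // Safe: the send buffer is no longer in flight

    // ... non-blocking exchanges, receive wait, unpack, computation ...
}
MPI_Waitall(isend_reqs);       // Completes the sends of the final iteration
```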

A second alternative consists of keeping the lexicographical order but taskifying the call, in order to postpone its execution (actually its completion) until it is actually required. This approach requires an appropriate level of MPI threading support, and leverages a task synchronization mechanism to guarantee that the task has finished:

```
for (it = 0; it < ITERS; it++) {
    pack(data, s_buffer);

    for (n = 0; n < n_neighbours; n++) {
        MPI_Irecv(n, r_buffer[n], irecv_reqs[n]);
        MPI_Isend(n, s_buffer[n], isend_reqs[n]);
    }

    MPI_Waitall(irecv_reqs);

    #pragma omp task         // Taskified wait: completion deferred by the runtime
    MPI_Waitall(isend_reqs);

    unpack(r_buffer, data);

    Computation(data);  // Parallelized with OpenMP

    #pragma omp taskwait     // Guarantee the send wait has finished before
                             // pack() reuses the buffer in the next iteration
} // End of the loop on ITERS
```
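Regarding the threading support mentioned above: if OpenMP tasks may issue MPI calls from different threads, the MPI library must be initialized with a sufficient thread support level. A minimal sketch, assuming `MPI_THREAD_MULTIPLE` is required (a lower level such as `MPI_THREAD_SERIALIZED` may suffice if only one thread at a time calls MPI):

```
int provided;
MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
if (provided < MPI_THREAD_MULTIPLE) {
    // The library cannot guarantee the requested level: fall back or abort
    MPI_Abort(MPI_COMM_WORLD, 1);
}
```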