MPI programmers often use non-blocking calls to overlap communication and computation. In such codes, the MPI process communicates with its neighbors through a sequence of stages: 1) an optional packing of the data (if needed); 2) a set of non-blocking send/receive operations, which can potentially overlap with one another; 3) a wait for the communications to complete (potentially split between send and receive requests); and 4) the computational phase.
A frequent code structure could be represented as follows:
for (it = 0; it < ITERS; it++) {
    pack(data, s_buffer);
    for (n = 0; n < n_neighbours; n++) {
        MPI_Irecv(n, r_buffer[n], irecv_reqs[n]);
        MPI_Isend(n, s_buffer[n], isend_reqs[n]);
    }
    MPI_Waitall(irecv_reqs);
    MPI_Waitall(isend_reqs);
    unpack(r_buffer, data);
    Computation(data); // Parallelized with OpenMP
} // End of the loop on ITERS
The main problem with the pseudo-code above is that it treats the receive operations, whose completion is needed before the computational phase, and the send operations, which in general are not required in order to start the computational phase, in the same way. The waitall on the send requests may therefore delay the continuation of the program unnecessarily.
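A minimal sketch of a restructured loop is shown below. It uses the same placeholder routines and buffers as the pseudo-code above; the message count, datatype, tag, communicator and neighbour ranks (count, MPI_DOUBLE, tag, comm, neighbour[n]) are illustrative assumptions, not details taken from any particular application:
for (it = 0; it < ITERS; it++) {
    pack(data, s_buffer);
    for (n = 0; n < n_neighbours; n++) {
        MPI_Irecv(r_buffer[n], count, MPI_DOUBLE, neighbour[n], tag, comm, &irecv_reqs[n]);
        MPI_Isend(s_buffer[n], count, MPI_DOUBLE, neighbour[n], tag, comm, &isend_reqs[n]);
    }
    /* Only the receives must complete before unpacking and computing */
    MPI_Waitall(n_neighbours, irecv_reqs, MPI_STATUSES_IGNORE);
    unpack(r_buffer, data);
    Computation(data); // Parallelized with OpenMP; overlaps with the still-pending sends
    /* s_buffer is not reused until pack() in the next iteration, so the wait
       on the send requests can be postponed until this point (or even later) */
    MPI_Waitall(n_neighbours, isend_reqs, MPI_STATUSES_IGNORE);
} // End of the loop on ITERS
The only constraint is that the send requests must be completed before the send buffers are overwritten again, which here happens in the pack() call of the next iteration.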
This pattern was observed, for example, in the analysis of the IFS weather code and of the PySDC parallel-in-time solver.
The following Figure shows the details of one of the communication phases of the IFS application:
The light green MPI calls are the waitall operations for the non-blocking sends issued by the process. They are quite long (tens of milliseconds in a run with 420 MPI processes x 4 OpenMP threads). It would be worth exploring the possibility of postponing them and advancing the computation that follows them. Performing these waitalls just before the following alltoallv phase (shown in gold in the Figure) has the potential to reduce the time spent in these calls.