The original version of the False communication-computation overlap kernel calls MPI_Waitall twice in a row: first for the MPI_Irecv requests, then for the MPI_Isend requests. At each communication point, one neighbour sends a big message and receives a small one, while the other neighbour does the opposite. The process sending the small message finishes its pack sooner and sends its message earlier. The other process finishes its pack later, so its receive completes quickly, and it then starts waiting for its own send to complete. This wait is not necessary at this point and only adds to the execution time.
The following code snippet identifies the source of the problem:
pack(...);
for (n = 0; n < n_neighbours; n++) {
    MPI_Irecv(r_buffer, rSIZE, ..., neighbours[n], ..., &irecv_req[n]);
    MPI_Isend(s_buffer, sSIZE, ..., neighbours[n], ..., &isend_req[n]);
}
if (n_neighbours) {
    MPI_Waitall(n_neighbours, irecv_req, irecv_stat);
    MPI_Waitall(n_neighbours, isend_req, isend_stat); // This wait is the problem
}
unpack(...);
computation(...);