The False communication-computation overlap kernel is a synthetic program that reproduces a communication/computation pattern among several MPI processes.
The kernel first defines the execution parameters, either from the input or from predefined values. Then it runs the setup phase: it initializes the MPI environment, allocates the data structures (including the buffers), and computes the neighbours for the MPI communications. Even ranks communicate with the following rank (rank+1) and odd ranks with the previous one (rank-1), so each process has at most one neighbour. Right after the setup phase, the program enters the main loop of the algorithm.
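As an illustration only, a minimal sketch of this neighbour selection could look as follows; the variable partner is introduced just for the example, while neighbours and n_neighbours follow the names used in the pseudo-code below:

    /* Hypothetical sketch of the neighbour computation: even ranks pair
       with rank+1, odd ranks with rank-1; a rank whose partner falls
       outside the communicator ends up with no neighbour at all. */
    int rank, size, n_neighbours = 0;
    int neighbours[1];

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int partner = (rank % 2 == 0) ? rank + 1 : rank - 1;
    if (partner >= 0 && partner < size)
        neighbours[n_neighbours++] = partner;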
The main loop is divided into several sub-phases. It starts by determining the message size of each process, based on an input parameter, to generate an imbalance between processes. Next, it packs the data to be sent to the neighbours. After that, each rank communicates with its neighbours through non-blocking send and receive calls and waits for all of them to complete. When the communication phase ends, the data is unpacked and the program performs the main computation. All of this is then repeated with the send and receive message sizes inverted for each process, and finally the program performs a barrier to simulate an MPI collective.
The following pseudo-code summarizes this behaviour:
for (i = 0; i < ITERS; i++) {
    /* Alternate the send/receive sizes per rank to create imbalance */
    rSIZE = ((rank+i) % 2) ? SIZE*RATIO : SIZE;
    sSIZE = ((rank+i) % 2) ? SIZE : SIZE*RATIO;
    pack(data, s_buffer, sSIZE);
    for (n = 0; n < n_neighbours; n++) {
        MPI_Irecv(r_buffer, rSIZE, ..., neighbours[n], ..., &irecv_req[n]);
        MPI_Isend(s_buffer, sSIZE, ..., neighbours[n], ..., &isend_req[n]);
    }
    if (n_neighbours) {
        MPI_Waitall(n_neighbours, irecv_req, irecv_stat);
        MPI_Waitall(n_neighbours, isend_req, isend_stat);
    }
    unpack(r_buffer, data, rSIZE);
    computation(...);

    /* Half of the iteration: now the send/receive message sizes are inverted */
    rSIZE = ((rank+i) % 2) ? SIZE : SIZE*RATIO;
    sSIZE = ((rank+i) % 2) ? SIZE*RATIO : SIZE;
    pack(data, s_buffer, sSIZE);
    for (n = 0; n < n_neighbours; n++) {
        MPI_Irecv(r_buffer, rSIZE, ..., neighbours[n], ..., &irecv_req[n]);
        MPI_Isend(s_buffer, sSIZE, ..., neighbours[n], ..., &isend_req[n]);
    }
    if (n_neighbours) {
        MPI_Waitall(n_neighbours, irecv_req, irecv_stat);
        MPI_Waitall(n_neighbours, isend_req, isend_stat);
    }
    unpack(r_buffer, data, rSIZE);
    computation(...);

    /* Barrier at the end of the iteration to simulate an MPI collective */
    MPI_Barrier(...);
}
The main issue with this kernel is the wait for the non-blocking send operations, which prevents the program from proceeding with the computation earlier. Since the send buffer is not used during the computation, the MPI_Waitall on the send requests can be delayed to avoid this, as sketched below.
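A possible restructuring of one half-iteration is sketched below, reusing the names of the pseudo-code above; it is an illustration of the idea rather than the actual kernel code. Only the wait on the receive requests is kept before the unpack, while the wait on the send requests is postponed until after the computation:

    for (n = 0; n < n_neighbours; n++) {
        MPI_Irecv(r_buffer, rSIZE, ..., neighbours[n], ..., &irecv_req[n]);
        MPI_Isend(s_buffer, sSIZE, ..., neighbours[n], ..., &isend_req[n]);
    }
    if (n_neighbours) {
        /* Only the receives are needed before unpacking */
        MPI_Waitall(n_neighbours, irecv_req, irecv_stat);
    }
    unpack(r_buffer, data, rSIZE);
    computation(...);   /* overlaps with the still-pending sends */
    if (n_neighbours) {
        /* The send buffer is not touched by the computation,
           so its wait can be postponed until here */
        MPI_Waitall(n_neighbours, isend_req, isend_stat);
    }

With this reordering the computation overlaps with the completion of the pending sends, which is precisely the overlap the original kernel fails to exploit.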