Overlapping computation and communication

Pattern addressed: Low transfer efficiency following large computation

Many parallel algorithms on distributed-memory systems contain a pattern in which a computational phase is serially followed by collective communication to share the computed results. This pattern is often embedded in an iterative process and therefore repeats many times during the run of the algorithm.

Parallel applications often exhibit a pattern in which computation and communication are serialized. For example, the following pseudocode performs a computation on some data_A and, after this computation has finished, communicates some data_B between processes:

...
compute(data_A)
communicate(data_B)
...

However, if the data_B that are communicated are different from the data_A on which the computation is performed, asynchronous execution becomes possible. If the communication library and the hardware in use support offloading of communication, the communication can then be performed at the same time as the computation. This is called overlapping of communication and computation.

To achieve overlap, the communication has to be initiated before the computation, and its completion has to be awaited afterwards. This can be achieved with different techniques. One approach is to split the communication into separate initiation and completion steps. The code then looks as follows:

...
communicate_init(data_B)
compute(data_A)
communicate_wait(data_B)
...

This method is, e.g., provided by MPI’s non-blocking, persistent, and one-sided operations.
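As an illustration of the persistent variant (which is not shown in the MPI examples further below), the following sketch sets up a halo exchange once and reuses it across iterations. The buffer names, the peer rank, the tag, num_iterations, and the compute() routine are placeholders chosen for this sketch, analogous to the examples below.

/* Sketch: overlap via MPI persistent operations (hypothetical halo exchange).
 * halo_send/halo_recv of size H, data of size N, the peer rank, and
 * compute() are assumed to exist as in the MPI examples further below. */

MPI_Request requests[2];

/* set up send and receive once; the requests can be reused in every iteration */
MPI_Recv_init(halo_recv, H, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &requests[0]);
MPI_Send_init(halo_send, H, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &requests[1]);

for (int iter = 0; iter < num_iterations; ++iter) {
    MPI_Startall(2, requests);                     /* initiate communication */
    compute(data, N);                              /* overlap with computation */
    MPI_Waitall(2, requests, MPI_STATUSES_IGNORE); /* complete communication */
}

MPI_Request_free(&requests[0]);
MPI_Request_free(&requests[1]);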

Another approach is taskification of the code. Here, the communication and the computation are encapsulated into tasks, which are executed by a runtime system that takes the dependencies between tasks into account. The code then looks as follows:

...
run_as_task(communicate(data_B))
run_as_task(compute(data_A))
...

This method is, e.g., provided by the hybrid MPI+OmpSs model.
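As a rough sketch of this idea, the fragment below uses OpenMP task dependencies rather than the OmpSs pragma syntax itself; data_A, data_B, the peer rank, and compute() are placeholders. Because MPI is called from within a task, the MPI library is assumed to have been initialized with MPI_THREAD_MULTIPLE support.

/* Sketch: taskified overlap expressed with OpenMP task dependencies.
 * This only mirrors the MPI+OmpSs idea; OmpSs uses its own pragma syntax. */
#pragma omp parallel
#pragma omp single
{
    /* communication task: touches only data_B */
    #pragma omp task depend(inout: data_B[0:H])
    MPI_Sendrecv_replace(data_B, H, MPI_DOUBLE, peer, 0, peer, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* computation task: touches only data_A, so the runtime may overlap it */
    #pragma omp task depend(inout: data_A[0:N])
    compute(data_A, N);

    #pragma omp taskwait /* both tasks have finished here */
}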

Overlapping communication and computation in MPI

The MPI standard provides several communication models for exchanging data between processes. Most often, message-based communication using send and receive is used; here, the non-blocking communication routines allow overlapping of communication and computation. An alternative is the more advanced one-sided communication model, which is closer to the hardware and offers support for communication offloading.

In the following, a code example for non-blocking send and receive based communication with MPI is presented:

/* Simple MPI example demonstrating the use of non-blocking communication */

double data[N]; /* data on which computation is performed */
double halo_send[H]; /* data which have to be sent */
double halo_recv[H]; /* place for data to be received */

/* initiate communication of halo data */
MPI_Request requests[2];
MPI_Irecv(halo_recv, H, ..., &requests[0]);
MPI_Isend(halo_send, H, ..., &requests[1]);
/* perform computation */
compute(data, N);
/* wait for communication to complete */
MPI_Waitall(2, requests, MPI_STATUSES_IGNORE);

In the following, a code example using one-sided communication with MPI is presented. In this approach, each process exposes a memory window to which other processes can directly write or from which they can read.

/* Simple MPI example demonstrating the use of one-sided communication
 * This example uses memory-fence-based synchronization, but MPI also
 * provides more fine-grained alternatives using locks. See the MPI
 * standard for more details.
 */

double data[N]; /* data on which computation is performed */
double halo_send[H]; /* data which have to be sent */
double halo_recv[H]; /* place for data to be received */

MPI_Comm comm = MPI_COMM_WORLD;
MPI_Win win;
/* expose halo_recv to all other processes allowing them to put data into it */
MPI_Win_create(halo_recv, H * sizeof(double), ..., comm, &win);
...
MPI_Win_fence(0, win); /* start communication epoch */
MPI_Put(halo_send, H, ..., win); /* initiate transfer of data */
compute(data, N); /* perform computation */
MPI_Win_fence(0, win); /* end communication epoch */
...
MPI_Win_free(&win);

Co-Design note:

Achieving true overlap of communication and computation may not be straightforward, even if the hardware supports communication offloading. The parallel programming models have to cope with several issues that can prevent the runtime implementations from actually overlapping the two.

SW Co-design (MPI): A high-quality MPI implementation will ensure communication progress even outside of the application's MPI calls, using the runtime for the hardware over which the communication takes place.

For example, the messages to be transferred may be larger than the hardware buffers available in the network devices, or collective operations may require computations to be performed in reduction steps. The runtime system therefore has to perform the communication in smaller chunks, and the communication of each chunk may require some CPU involvement. In such cases, the simple initiate-and-wait approach does not achieve the desired overlap; instead, the CPU has to be triggered from time to time to progress the communication. Different strategies exist for triggering the CPU: on the communication-library side, an event-driven approach and so-called progress threads are the two most common techniques.

In the case of hybrid MPI+X models, however, further complications arise, and solutions for them are part of current research. A frequently seen workaround on the user side is to insert manual calls into the MPI library from within the computation to trigger communication progress. This complicates the user code, and the chosen points at which progress is triggered may not be optimal. In pseudocode notation this approach looks as follows:

...
communicate_init(data_B)
for chunk in data_A { // additional code splitting original computation into pieces
   compute_chunk(chunk)
   communicate_trigger_progress() // call into runtime to trigger communication progress
}
communicate_wait(data_B)
...
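With plain MPI, this workaround is often implemented by calling MPI_Test or MPI_Testall on the outstanding requests between chunks of the computation; each test call gives the library an opportunity to progress the pending transfer. A minimal sketch, with the chunk size CHUNK, the peer rank, and the halo buffers chosen for illustration as in the earlier examples:

/* Sketch: manual progress by testing outstanding requests between chunks.
 * CHUNK, the peer rank, and the buffers are illustrative placeholders. */
int flag;
MPI_Request requests[2];
MPI_Irecv(halo_recv, H, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &requests[0]);
MPI_Isend(halo_send, H, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &requests[1]);

for (int i = 0; i < N; i += CHUNK) {
    int len = (i + CHUNK < N) ? CHUNK : N - i;
    compute(&data[i], len); /* work on one chunk of the data */
    /* give the MPI library a chance to progress the pending communication */
    MPI_Testall(2, requests, &flag, MPI_STATUSES_IGNORE);
}

MPI_Waitall(2, requests, MPI_STATUSES_IGNORE); /* ensure completion */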

Recommended in program(s): BEM4I miniApp · Dumux Dune kernel, alltoallv

Implemented in program(s): Dumux Dune kernel, isend · Dumux Dune kernel, issend