Sequences of independent computations with expensive global communication

It is frequent to find codes where multiple independent operations are executed in sequence. If the communication phases of each operation constitute a high percentage of its total execution time, the overall performance will be low compared with the real potential of the application.

The following synthetic skeleton illustrates a frequent code structure that generates this pattern:

for (i = 0; i < ITERS; i++) {
   compute_1(<params[i]>); // OP1
   compute_2(<params[i]>); // OP2
}

In this skeleton, there are dependencies within each iteration (i.e., OP1 -> comm. -> OP2 -> comm.), but different iterations are completely independent. The loop body is composed of two phases, each one involving a computational sub-phase with its related communication operation (i.e., 1st phase: compute 1 with communication A; 2nd phase: compute 2 with communication B). Note that this pattern is not restricted to the lexicographical loop structure; it applies to the dynamic execution of such operations. The main problem with this pattern is that it is almost impossible to overlap the computation of one sub-phase with the communication of the other, because the two phases share data: doing so would overwrite the buffer before it can be sent, or send the buffer before it has been computed. On the other hand, different iterations remain independent of each other as long as each iteration uses its own data and storage.

HW Co-design: increasing the network bandwidth would reduce the weight of communication relative to computation. The result would be a very low-latency, high-bandwidth network. However, the cost of this solution can be very high, it could make the application very sensitive to noise, and it is not always affordable.

A good example of this pattern is the Quantum Espresso application. In this case, a sequence of independent Fast Fourier Transforms is parallelized across all processes, so the programmer can leverage the fact that these FFTs can run concurrently (as long as each one uses different storage) to taskify the whole algorithm.

Although other local (i.e., intra-loop-iteration) improvements to the communication phases could also partially apply, they are discouraged as a primary strategy, because the dependencies between loop-body phases prevent fully leveraging their benefits. Among these improvements we can find any refactoring of an individual communication phase (e.g., changing the order of communications, postponing waits for non-blocking sends, taskifying the packing/unpacking of data, etc.). Changing the order of communications requires the presence of an additional pattern causing contention on a certain MPI rank, which will not always be present. Postponing waits for non-blocking communication can only cover a small set of instructions (in between loop phases), since the pattern explicitly involves a chain of dependencies between phases. Finally, taskifying the packing/unpacking algorithm can be used only if such operations appear in the code. The bottom line of this pattern is to highlight the possibility of working at a coarser grain of parallelism, keeping these local improvements as a second level of exploitation (nesting).

Recommended best-practice(s):
Related program(s):
  • FFTXlib (2.0)
  • FFTXlib (ompss)