Home Resources for Co-Design Patterns High-communication-between-ranks

High weighted communication in between ranks

Usual symptom(s):

Transfer Efficiency: The Transfer Efficiency (TE) measures inefficiencies due to time in data transfer. (more...)

This pattern can be observed in parallel algorithms where a large fraction of the runtime is spent calculating data on individual processes and then communicating that data between processes. An example of such a parallel algorithm is molecular dynamics, where N particles interact with each other. The equation of motion to be solved is

\[\bf {F}= m \sum \bf {a}\]

\(\bf {F}\) is the vector of the total force, N is the total number of particles, \(\bf {a}_{i,j}\) is the acceleration vector and it has to be computed for every particle w.r.t. every other.

A frequently used approach is to use the Atom Decomposition algorithm [1], where the force computation is split among different processes:

 for process = 1, M {
    1. compute partial forces for particles assigned to this process
    2. send particle forces needed by other processes and receive particle forces needed by this process
    3. update position and velocity for particles assigned to this process
  }

where M is the total number of MPI processes. In this way, when the number of particles is large the communication due to MPI calls becomes a significant bottleneck, given that the number of MPI calls (point 2.) scales quadratically with the number of particles.

In Figure 1, the speed-up plot is shown for both cases. The larger the number of nodes, the more important the MPI overhead. This effect depends on the number of particles as well, the larger case shows a more severe slow down with respect to the smaller case.

speedup_pattern Figure 1: Speedup plot. The larger (N=100,000) and smaller cases (N=10,000) are shown in blue and red, respectively. The dark-green continuous line represents the ideal scaling, while the light-green continuous line represents the 80% ideal scaling.

In Table 1, the Global, Transfer and Serialisation efficiencies are shown for the N=10,000 particles case for 1, 2 and 4 nodes. In all cases, the MPI transfer efficiency is very low while the Serialisation efficiency is quite high but it slowly decreases with increasing number of nodes. The overall effect is a drastic reduction of the performance and very poor Global efficiency.

Number of Processes	48	96	192
Global Efficiency	0.05	0.05	0.06
MPI Transfer Efficiency	0.12	0.13	0.16
MPI Serialisation Efficiency	0.12	0.13	0.89

Table 1: POP metrics for the smaller cases (N=10,000).

[1] Plimpton et al. (1995) [https://doi.org/10.1006/jcph.1995.1039]

Recommended best-practice(s):

Replicating computation to avoid communication (gemm)

Related program(s):

High weighted communication in between ranks

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreements No 676553 (POP1) and 824080 (POP2).

Currently, the project receives funding from the European High-Performance Computing Joint Undertaking (JU) under grant agreement No 101143931 (POP3). The JU receives support from the European Union's Horizon Europe research and innovation programme and Spain, Germany, France, Portugal and the Czech Republic.