Solving dynamic load balance problems

Pattern addressed: Dynamic load imbalance in MPI

This pattern focuses on the temporal evolution of load imbalances (LI). One possible scenario is the growth of LI over time, i.e. a gradual reduction of computational performance. This reduction might happen so slowly that it does not immediately catch the programmer's attention.

In order to dynamically balance the computational work among MPI ranks, a small set of routines must be implemented that perform the following tasks:

  1. measure computational load, i.e. computational cost per work package for each MPI rank
  2. add work packages to and remove work packages from a single MPI rank
  3. transfer work packages among MPI ranks
  4. determine the optimized distribution of work packages based on measured computational cost and current number of work packages

The concept of dynamic load balancing (DLB) is universal in the sense that it can be applied to all MPI programs that perform some kind of work distribution, e.g. domain decomposition. The definition of a single work package is application specific. In the following description of the routines listed above, a work package corresponds to a set of cells in a CFD solver, for illustration. The routines realizing dynamic load balancing are closely related to the domain decomposition scheme through the re-distribution of work packages. The routines listed above require an understanding of the data structures organizing the work packages and the domain decomposition. One MPI rank orchestrates the load balancing; this rank is referred to as the "load balance master" (LBM) from here on.
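
The exact layout of these data structures is application specific. As a purely illustrative sketch in C (names and fields are assumptions, not taken from any particular code), a CFD solver following this pattern might organize its work packages roughly as follows:

    /* Hypothetical data structures for work packages (cells) and the part
     * of the decomposition owned by one MPI rank; illustrative only. */
    #include <stddef.h>

    typedef struct {
        double state[5];   /* e.g. conserved variables of one cell           */
        long   global_id;  /* position of the cell in the global cell string */
    } Cell;

    typedef struct {
        Cell  *cells;      /* work packages currently owned by this rank     */
        size_t n_cells;    /* current number of work packages (Nw)           */
        size_t capacity;   /* allocated size of the cells array              */
        double t_compute;  /* compute time of the last phase (Tw)            */
    } Domain;

The sketches accompanying the individual functions below build on these two structures.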

Function 1: measurement of computational load

Each MPI rank performs time measurements to determine the compute time it spends on its work items (Tw); together with the number of work items (Nw), this yields the average time per work item, Tw/Nw. How this measurement is implemented in detail depends on the application. For example, there could be a loop processing work items whose execution time is measured by calling the function gettimeofday. The LBM collects Tw and Nw from all ranks.
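
A minimal sketch of this measurement, assuming the Cell/Domain structures above and a hypothetical, application-specific routine process_work_item(), could look as follows. Rank 0 is used as the LBM here, and the collection of Tw and Nw uses plain MPI_Gather calls.

    #include <mpi.h>
    #include <sys/time.h>

    void process_work_item(Cell *c);   /* application specific, assumed to exist */

    static double wall_time(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + 1.0e-6 * tv.tv_usec;
    }

    /* Measure the compute time Tw of this rank while it processes its Nw work
     * items and let the LBM gather Tw and Nw from all ranks. */
    void measure_load(Domain *dom, int lbm_rank, MPI_Comm comm,
                      double *tw_all, int *nw_all)
    {
        double t0 = wall_time();
        for (size_t i = 0; i < dom->n_cells; ++i)
            process_work_item(&dom->cells[i]);
        dom->t_compute = wall_time() - t0;

        int nw = (int)dom->n_cells;
        MPI_Gather(&dom->t_compute, 1, MPI_DOUBLE, tw_all, 1, MPI_DOUBLE,
                   lbm_rank, comm);
        MPI_Gather(&nw, 1, MPI_INT, nw_all, 1, MPI_INT, lbm_rank, comm);
    }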

Function 2: remove and add work items

Each MPI rank has to be able to add and to remove work packages. In the example of the CFD solver, this means that each MPI rank must be able to add and to remove cells from its domain. This process is usually the most time consuming of the four routines because data structures need to be reorganized. For example, certain pre-processing computations might have to be updated, data arrays might have to be extended, data items might have to be inserted, and so on. During this process, the memory usage of the MPI process might increase, and care must be taken not to exceed the amount of available memory. If free memory is scarce, the process of adding and removing work packages can be performed one rank at a time within a single node. The removal and the addition of cells are usually separated by the transfer of work packages among MPI ranks. At this point, it becomes clear that dynamic load balancing itself comes at a cost and that the process can be rather unbalanced in the time it requires on each MPI rank. Therefore, the frequency of load balancing needs to be controlled in order to maximize the saved CPU time.
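
How cells are added and removed depends entirely on the solver's data structures. Assuming the simple contiguous cell array of the Domain sketch above, a heavily simplified version of this function might look like the following; a real solver would additionally rebuild connectivity, halo layers and preprocessed coefficients.

    #include <stdlib.h>
    #include <string.h>

    /* Append n received cells to the domain; returns -1 if the array cannot
     * be grown, so the caller can react before exceeding available memory. */
    int add_cells(Domain *dom, const Cell *incoming, size_t n)
    {
        if (dom->n_cells + n > dom->capacity) {
            size_t new_cap = 2 * (dom->n_cells + n);
            Cell *tmp = realloc(dom->cells, new_cap * sizeof(Cell));
            if (!tmp) return -1;
            dom->cells = tmp;
            dom->capacity = new_cap;
        }
        memcpy(dom->cells + dom->n_cells, incoming, n * sizeof(Cell));
        dom->n_cells += n;
        return 0;
    }

    /* Remove n cells starting at index first, e.g. after they have been sent
     * to another rank, and close the resulting gap. */
    void remove_cells(Domain *dom, size_t first, size_t n)
    {
        memmove(dom->cells + first, dom->cells + first + n,
                (dom->n_cells - first - n) * sizeof(Cell));
        dom->n_cells -= n;
    }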

Function 3: transfer work packages among MPI ranks

The implementation of these routines is usually straightforward. First, the work packages marked for removal are collected. Second, their data is transferred to the new MPI rank. Finally, the receiving rank stores the work packages before adding them to its data structures. All MPI ranks need this functionality.
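
With the string decomposition described below, every rank exchanges cells with at most two neighbors, so the transfer can be expressed as one combined send/receive per neighbor. The following sketch treats the cells as raw bytes, which is only valid if Cell contains no pointers; a real implementation would pack the data or use a derived MPI datatype.

    #include <mpi.h>

    /* Exchange work packages with one neighbouring rank: send n_send cells
     * and receive n_recv cells in a single call. */
    void transfer_cells(const Cell *send_buf, int n_send,
                        Cell *recv_buf, int n_recv,
                        int neighbour, MPI_Comm comm)
    {
        MPI_Sendrecv(send_buf, n_send * (int)sizeof(Cell), MPI_BYTE,
                     neighbour, 0,
                     recv_buf, n_recv * (int)sizeof(Cell), MPI_BYTE,
                     neighbour, 0,
                     comm, MPI_STATUS_IGNORE);
    }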

Function 4: determine optimized distribution of work packages

The LBM is responsible for computing the new distribution of work packages. This new distribution is better than the previous distribution and reduces load imbalance. The aim is not necessarily to achieve perfect load balance through a single step of dynamic load balancing. In fact, it might not be possible to achieve perfect load balance with the scheme described below if the load is distributed very inhomogeneously.

First of all, the LBM collects Tw and Nw from all ranks. The next step is to compute the number of work packages to be transferred from MPI rank i to MPI rank j (Tij). This is the point where the domain decomposition, and more precisely the shapes into which the domain is divided, becomes relevant. The following discussion again uses a CFD solver as an example.

All Tij should have the following properties:

  • Tij=0 if Tji != 0, i.e. for any pair of MPI ranks cells are transferred in one direction only
  • Tij=Tji=0 for MPI ranks, i.e. domains, that are not adjacent in physical space

In order to minimize the work performed during dynamic load balancing, cells should be exchanged only among neighboring MPI ranks, i.e. neighboring domains. At the same time, exchanging work packages only among neighbors still allows reaching a steady state with balanced load. Domain decomposition patterns that are suitable for dynamic load balancing are, for example, slices along an axis and cylindrical sectors. However, when the number of domains increases, the slicing approach runs into the limitation that each slice has to be at least one layer of cells thick. Dividing into cylindrical sectors has the disadvantage that neighboring domains no longer connect physically once the domains become small due to an increasing number of domains.

The solution to the limitations described above is to arrange the cells in a string: the string starts in one corner of the computational domain, cells are added to the end of the string, and the sequence of the cells follows rows, columns or aisles. The domain is decomposed by cutting the string of cells into sub strings. In theory, a sub string could consist of a single cell, so the number of domains could be as large as the number of cells. The image below shows a square domain with cells labeled from 1 to 36. The domain is partitioned by cutting the string so that, for example, cells 1 to 8 form domain 1, cells 9 to 18 form domain 2, and so on. The concept also works in 3D.

[Figure: string_domain, a 6x6 square domain with cells labeled 1 to 36, partitioned by cutting the cell string into sub strings]
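
The mapping from cells to domains then reduces to bookkeeping along the string. The following self-contained example (zero-based indices and illustrative cut points) prints which rank owns each cell of the 6x6 domain above:

    #include <stdio.h>

    int main(void)
    {
        const int nx = 6, ny = 6;
        /* cut[r] = index of the first cell of rank r in the string;
         * rank 0 owns cells 0..7, rank 1 owns cells 8..17, and so on. */
        const int cut[5] = { 0, 8, 18, 27, 36 };

        for (int j = 0; j < ny; ++j) {
            for (int i = 0; i < nx; ++i) {
                int k = j * nx + i;   /* position of cell (i,j) in the string */
                int owner = 0;
                while (k >= cut[owner + 1]) ++owner;
                printf("%d ", owner);
            }
            printf("\n");
        }
        return 0;
    }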

With the conditions discussed above, the computation of the new distribution of work packages reduces to computing the number of work packages that need to be exchanged between neighboring domains. The number of work packages to transfer between domain i and domain i+1 is referred to as S[i], where i=0..n-2. If S[i]>0, the transfer is from domain i to domain i+1, and in the opposite direction if S[i]<0. The initial strategy for the computation of S[i] is to aim for equal compute time among MPI ranks (a code sketch follows the list):

  • For each rank, compute the average compute time per work item C[i]=Tw[i]/Nw[i].
  • Compute the target compute time Tt=sum(Tw[i])/n where n is the number of domains.
  • S[0]=(Tw[0]-Tt)/C[0] // for domain 0
  • S[i]=(Tw[i]+S[i-1]*C[i]-Tt)/C[i] // for domain 1 to n-2
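
A minimal sketch of this computation, executed on the LBM with the gathered arrays Tw and Nw (total compute time and number of work packages per rank), could look as follows; in practice the resulting S[i] would be rounded to whole work packages.

    /* Compute the transfer counts S[i] between neighbouring domains i and i+1.
     * S[i] > 0: move work from domain i to domain i+1; S[i] < 0: the reverse. */
    void compute_transfers(int n, const double *Tw, const int *Nw, double *S)
    {
        double Tt = 0.0;                    /* target compute time per rank */
        for (int i = 0; i < n; ++i)
            Tt += Tw[i];
        Tt /= n;

        for (int i = 0; i < n - 1; ++i) {
            double C = Tw[i] / Nw[i];       /* average cost per work package on rank i */
            double incoming = (i > 0) ? S[i - 1] * C : 0.0;
            S[i] = (Tw[i] + incoming - Tt) / C;
        }
    }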

Dynamic load balancing algorithm

Some criterion needs to be defined that triggers the execution of the load balancing algorithm, for example a certain number of cycles or a period of time. In any case, all MPI ranks must enter the code that performs DLB. The following steps are performed by the DLB algorithm; a skeleton implementation follows the list:

  1. One rank becomes the load balancing master
  2. All ranks call the function “measurement of computational load”
  3. The LBM collects Tw and Nw from all ranks and determines the imbalance in computation time. If the imbalance is below a certain threshold, all MPI ranks exit the DLB code here.
  4. The LBM executes the function “determine optimized distribution of work packages” and distributes all S[i] to all MPI ranks.
  5. All MPI ranks evaluate S[i] to determine the number of work packages each rank has to receive and to send. In detail, rank i needs to evaluate S[i] and S[i-1]. The exceptions are rank 0, which evaluates S[0] only, and rank n-1, which evaluates S[n-2] only.
  6. All MPI ranks remove the work packages that are to be sent from their data structures and prepare them for send.
  7. All MPI ranks perform send and receive of work packages.
  8. All MPI ranks add the received work packages to their data structures.
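
A skeleton tying these steps together, based on the sketches above (the 10% imbalance threshold and the choice of rank 0 as LBM are illustrative assumptions), might look like this:

    #include <mpi.h>
    #include <stdlib.h>

    void dynamic_load_balance(Domain *dom, MPI_Comm comm)
    {
        int rank, n;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &n);
        const int lbm = 0;                        /* step 1: rank 0 acts as LBM */

        double *Tw = malloc(n * sizeof(double));  /* significant on the LBM only */
        int    *Nw = malloc(n * sizeof(int));
        double *S  = malloc(n * sizeof(double));

        measure_load(dom, lbm, comm, Tw, Nw);     /* steps 2-3: measure Tw, Nw and gather them at the LBM */

        int do_balance = 0;
        if (rank == lbm) {                        /* step 3: check the imbalance */
            double tmin = Tw[0], tmax = Tw[0];
            for (int i = 1; i < n; ++i) {
                if (Tw[i] < tmin) tmin = Tw[i];
                if (Tw[i] > tmax) tmax = Tw[i];
            }
            do_balance = (tmax > 1.1 * tmin);
            if (do_balance)
                compute_transfers(n, Tw, Nw, S);  /* step 4 */
        }
        MPI_Bcast(&do_balance, 1, MPI_INT, lbm, comm);
        if (do_balance) {
            MPI_Bcast(S, n - 1, MPI_DOUBLE, lbm, comm);  /* step 4: distribute S[i] */
            /* steps 5-8: each rank evaluates S[rank-1] and S[rank], removes the
             * cells to send (remove_cells), exchanges them with its neighbours
             * (transfer_cells) and adds the received cells (add_cells). */
        }
        free(Tw); free(Nw); free(S);
    }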

Recommended in program(s): RankDLB Master

Implemented in program(s): RankDLB V1