Dynamic load imbalance in MPI

This pattern focuses on the temporal evolution of load imbalances (LI). One possible scenario is the growth of LI over time, i.e. a gradual reduction of the computational performance. This reduction may happen at such a slow rate that it does not immediately catch the attention of the programmer.

Dynamic LI can have several causes:

  • the algorithm, e.g. work travels among MPI ranks
  • hardware inhomogeneities, e.g. temperature, different CPUs among nodes, power capping
  • the operating system, e.g. compute nodes shared by multiple users

Independent of the cause, dynamic LI can be detected in two different ways that will be discussed below:

  1. monitoring performance during runtime
  2. tracing the application

Detecting dynamic load imbalance by monitoring performance during runtime

As MPI applications are usually designed to achieve high computational performance, these codes often have a mechanism to measure the performance at runtime. Possible metrics are, for example, the cycle time and the number of processed elements per second. Dynamic LI will cause a degradation of these metrics. The image below shows an example that was created using the kernel rank-dlb with dynamic load balancing disabled. The metric used in this example is the number of processed elements per second. The load imbalance is created by MPI rank 0, which continuously generates additional work elements.

[Figure: processed elements per second over time with dynamic load balancing disabled]

This technique causes very little overhead and can detect LI that develops over long periods of time. However, it relies on the programmer monitoring a metric that is sensitive to LI.
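A minimal sketch of such a runtime monitor is shown below. It assumes a hypothetical per-cycle kernel process_elements() that stands in for the application's real compute routine; the cycle count, reporting interval and counter names are illustrative only, not part of the rank-dlb kernel.

#include <mpi.h>
#include <stdio.h>

/* Stand-in for the application's per-cycle work; in a real code this would be
 * the existing compute kernel and would return the number of elements it
 * processed in this cycle. */
static long process_elements(void) { return 1000; }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int ncycle       = 10000;   /* number of compute cycles        */
    const int report_every = 100;     /* cycles between two reports      */
    long      elements     = 0;       /* elements processed on this rank */
    double    t_last       = MPI_Wtime();

    for (int icycle = 0; icycle < ncycle; icycle++) {
        elements += process_elements();

        if ((icycle + 1) % report_every == 0) {
            long total = 0;
            MPI_Reduce(&elements, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

            double t_now = MPI_Wtime();
            if (rank == 0)   /* a gradual drop in this rate hints at growing LI */
                printf("cycle %6d: %.3e elements/s\n",
                       icycle + 1, (double)total / (t_now - t_last));

            elements = 0;    /* restart the counter for the next interval */
            t_last   = t_now;
        }
    }

    MPI_Finalize();
    return 0;
}

The only design decision is the length of the reporting interval: it should be long enough to average out noise, but short enough that a slow degradation of the rate becomes visible.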

Detecting dynamic load imbalance by tracing the application - without code modifications

Dynamic LI can also be detected in the trace of an application. The effect of LI is a change in the duration of calls to user functions and to MPI functions. The problem with this approach is that tracing is usually performed with the intention of keeping the runtime short. Therefore, the time covered by the trace might simply be too short to reveal the presence of dynamic LI.

The trace below shows a few cycles of the kernel rank-dlb. The calls to MPI_Allgather at the end of the section belong to the code block that checks whether the load is balanced or not. The green regions are the useful computation and the blue parts indicate calls to MPI_Waitall. As the load imbalance is created by rank 0 generating additional work, rank 0 spends very little time in MPI_Waitall.

[Figure: trace of a few cycles of the rank-dlb kernel without load balancing]

In order to detect dynamic LI, the time spent in the various routines must be summed over all ranks for a certain period of time and compared across different wall-clock times. The values obtained for this example are listed in the table below.

routine          t = 40 s    t = 200 s
process_data       86.3 %       51.5 %
MPI_Waitall        12.2 %       43.6 %
MPI_Allgather       1.5 %        4.9 %

The time spent in process_data decreases over time while the time spent in MPI_Waitall increases, i.e. the computation becomes less efficient. The evolution of the LI must be continuous, otherwise it cannot be reduced by dynamic load balancing. The graph below shows a continuous increase of the time spent in MPI_Wait for MPI rank 6, which indicates that the load imbalance is growing continuously.

[Figure: time spent in MPI_Wait by MPI rank 6 over wall-clock time]
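The aggregation itself is simple arithmetic that the trace-analysis tool performs for each interval. The standalone sketch below only illustrates this step with made-up per-rank exclusive times for a four-rank run; the numbers are placeholders, not measured values.

#include <stdio.h>

#define NROUTINES 3
#define NRANKS    4

int main(void)
{
    const char *routine[NROUTINES] = { "process_data", "MPI_Waitall", "MPI_Allgather" };

    /* exclusive seconds per routine (rows) and per rank (columns) for one
     * interval of the trace -- illustrative placeholder values */
    const double t[NROUTINES][NRANKS] = {
        { 9.1, 8.2, 8.0, 7.9 },   /* process_data  */
        { 0.5, 1.4, 1.6, 1.7 },   /* MPI_Waitall   */
        { 0.4, 0.4, 0.4, 0.4 },   /* MPI_Allgather */
    };

    double total = 0.0, sum[NROUTINES] = { 0.0 };

    for (int r = 0; r < NROUTINES; r++)
        for (int p = 0; p < NRANKS; p++) {
            sum[r] += t[r][p];
            total  += t[r][p];
        }

    /* repeating this for intervals at different wall-clock times and comparing
     * the shares reveals whether the imbalance grows */
    for (int r = 0; r < NROUTINES; r++)
        printf("%-14s %5.1f %%\n", routine[r], 100.0 * sum[r] / total);

    return 0;
}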

It is also possible to keep the tracing overhead low and still cover long runtimes by instrumenting the source code, which is described in the next section.

Detecting dynamic load imbalance by tracing the application - with Score-P user instrumentation

If the source code is available and the computation is performed in a loop, user instrumentation can be used to detect dynamic LI. This technique produces very little overhead and allows the collection of traces for long time periods. Let’s assume the loop performing the computation looks like this:

for (icycle = 0; icycle < ncycle; icycle++) {
   compute();
}

The loop can be broken up into multiple smaller loops, each of which is enclosed in a Score-P user region:

/* requires #include <scorep/SCOREP_User.h> and compilation with "scorep --user" */

icycle = 0;

SCOREP_USER_REGION_DEFINE(handle_block_a)
SCOREP_USER_REGION_BEGIN(handle_block_a, "block_a", SCOREP_USER_REGION_TYPE_COMMON)
for (; icycle < ncycle/nblock*1; icycle++) {   /* first block of cycles */
   compute();
}
SCOREP_USER_REGION_END(handle_block_a)

SCOREP_USER_REGION_DEFINE(handle_block_b)
SCOREP_USER_REGION_BEGIN(handle_block_b, "block_b", SCOREP_USER_REGION_TYPE_COMMON)
for (; icycle < ncycle/nblock*2; icycle++) {   /* second block of cycles */
   compute();
}
SCOREP_USER_REGION_END(handle_block_b)

SCOREP_USER_REGION_DEFINE(handle_block_c)
SCOREP_USER_REGION_BEGIN(handle_block_c, "block_c", SCOREP_USER_REGION_TYPE_COMMON)
for (; icycle < ncycle/nblock*3; icycle++) {   /* third block of cycles */
   compute();
}
SCOREP_USER_REGION_END(handle_block_c)

...
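If many blocks are needed, writing them out by hand becomes tedious. The sketch below shows a more compact variant under the assumption that the "by name" user-region macros (SCOREP_USER_REGION_BY_NAME_BEGIN/_END) are available in the installed Score-P version; compute() again stands for the application's kernel from above.

#include <stdio.h>
#include <scorep/SCOREP_User.h>

extern void compute(void);   /* the application's compute kernel, as above */

/* Sketch only: splits ncycle cycles into nblock named regions, assuming the
 * Score-P "by name" macros are available. */
void run_blocked(int ncycle, int nblock)
{
    int icycle = 0;

    for (int iblock = 1; iblock <= nblock; iblock++) {
        char name[32];
        snprintf(name, sizeof(name), "block_%03d", iblock);   /* region name built at runtime */

        SCOREP_USER_REGION_BY_NAME_BEGIN(name, SCOREP_USER_REGION_TYPE_COMMON)
        for (; icycle < ncycle / nblock * iblock; icycle++) {
            compute();
        }
        SCOREP_USER_REGION_BY_NAME_END(name)
    }
}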

For each of the Score-P user regions it is possible to compute the POP metrics using CubeGUI. This technique has been applied to the kernel rank-dlb and the following efficiency values (in percent) have been obtained for the same example as above.

block #                          1    2    3    4    5    6    7    8    9   10   11
Parallel efficiency             98   95   86   79   72   66   61   56   54   54   54
- Load balance                  98   95   86   79   72   66   61   56   54   54   54
- Communication efficiency     100  100  100  100  100  100  100  100  100  100  100
- - Serialization efficiency   100  100  100  100  100  100  100  100  100  100  100
- - Transfer efficiency        100  100  100  100  100  100  100  100  100  100  100

The generation of additional work elements stopped in block 8, so the load balance remains constant in the following blocks. The communication efficiency is 100 % throughout this example.
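As a reminder, these metrics are related as follows (this is the standard POP efficiency hierarchy, not something specific to this kernel):

Parallel efficiency      = Load balance x Communication efficiency
Communication efficiency = Serialization efficiency x Transfer efficiency
Load balance             = average useful computation time / maximum useful computation time

For block 4, for example, the parallel efficiency of 79 % is entirely explained by the load balance of 79 %, since the communication efficiency is 100 %.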

An alternative approach to investigating the variation of the load balance is to instrument the loop as a dynamic region. With this type of instrumentation, statistics are collected for individual loop iterations. With the help of Cube 4.6, POP metrics can then be calculated for individual iterations and for groups of iterations. The code inside the loop needs to be instrumented as follows:

...
SCOREP_USER_REGION_DEFINE(handle_block)
for (icycle = 0; icycle < ncycle; icycle++) {
   SCOREP_USER_REGION_BEGIN(handle_block, "block_loop", SCOREP_USER_REGION_TYPE_DYNAMIC)
   compute();
   SCOREP_USER_REGION_END(handle_block)
}
...
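Note that each execution of a dynamic region is recorded as a separate instance, so the amount of measurement data grows with the number of loop iterations. For loops with very many iterations, the blocked variant shown above keeps the data volume smaller while still revealing the temporal trend.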

General remarks

A performance analysis of this pattern reveals a significant amount of time spent in wait states and an imbalance of the computational load across MPI ranks. It is important to verify that the load imbalance is actually present and that the wait time is not caused by the limited speed of the network, as in the pattern Communication imbalance in MPI.

Recommended best-practice(s):

Related program(s):
  • RankDLB Master
  • RankDLB V1
  • Sam(oa)² (OpenMP tasking)
  • Sam(oa)² (work-sharing)