In some codes, the problem space is divided into a multi-dimensional grid on which the computations are performed. This pattern typically results in codes such as
jCFD_Genesis, with multiple nested loops iterating over the grid, as shown in the following code snippet.
DO K = 1, Nk
  DO J = 1, Nj
    DO I = 1, Ni
      !! work to do
    END DO
  END DO
END DO
In the loop above, the work is taken from a real case: elements of several three-dimensional matrices are multiplied and added, and the results are stored in an output three-dimensional matrix of the same size.
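As a rough illustration of the kind of work described above, the following sketch fills an output array from element-wise products and sums of four input arrays. The array names, the grid sizes, and the helper `fused_update` are hypothetical and chosen only for this example; the actual expression used in the real code is not given here.

```fortran
MODULE grid_work_sketch
  IMPLICIT NONE
  INTEGER, PARAMETER :: dp = KIND(1.0D0)
CONTAINS
  ! Hypothetical stand-in for "work to do": one multiply-add per grid point.
  ELEMENTAL FUNCTION fused_update(a, b, c, d) RESULT(r)
    REAL(dp), INTENT(IN) :: a, b, c, d
    REAL(dp) :: r
    r = a * b + c * d
  END FUNCTION fused_update
END MODULE grid_work_sketch

PROGRAM grid_work_demo
  USE grid_work_sketch
  IMPLICIT NONE
  INTEGER, PARAMETER :: Ni = 8, Nj = 6, Nk = 4   ! illustrative sizes only
  REAL(dp), ALLOCATABLE :: a(:,:,:), b(:,:,:), c(:,:,:), d(:,:,:), res(:,:,:)
  INTEGER :: I, J, K

  ALLOCATE(a(Ni,Nj,Nk), b(Ni,Nj,Nk), c(Ni,Nj,Nk), d(Ni,Nj,Nk), res(Ni,Nj,Nk))
  a = 1.5_dp; b = 2.0_dp; c = 0.5_dp; d = 4.0_dp   ! assumed test data

  ! Same loop nest as above; I is the innermost index so that memory
  ! is traversed contiguously (Fortran arrays are column-major).
  DO K = 1, Nk
    DO J = 1, Nj
      DO I = 1, Ni
        res(I,J,K) = fused_update(a(I,J,K), b(I,J,K), c(I,J,K), d(I,J,K))
      END DO
    END DO
  END DO

  PRINT *, 'res(1,1,1) =', res(1,1,1)
END PROGRAM grid_work_demo
```

Each grid point is written exactly once and depends only on input values at the same index, which is the property that makes the iterations independent.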
When the computations on the grid are independent of one another, the values can be computed in parallel. In addition, there is often an outer loop iterating over multiple time steps. When implementing this pattern in parallel with OpenMP, it is common to parallelize the loops as in the following code snippet, using a regular
!$OMP PARALLEL DO directive.
DO iter = 1, Niter
  !$OMP PARALLEL DO DEFAULT(NONE) SHARED(...)
  DO K = 1, Nk
    DO J = 1, Nj
      DO I = 1, Ni
        !! work to do
      END DO
    END DO
  END DO
  !$OMP END PARALLEL DO
END DO
However, poor load balance can occur when the iterations of the parallelized K loop cannot be distributed evenly over the threads. Consider, for instance, the case where there are 22
K-loop iterations (
Nk=22), and 21 threads. Once every thread has completed one iteration, a single iteration remains; while it executes, 20 of the 21 threads are idle.
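The arithmetic behind this example can be sketched with a small model. It assumes that every K iteration costs the same and that the iterations are split as evenly as possible over the threads; the module and function names are hypothetical and not part of any library. Under these assumptions the loop takes as long as the most heavily loaded thread, so the fraction of thread time spent on useful work is Nk / (threads * ceiling(Nk / threads)).

```fortran
MODULE static_balance_model
  IMPLICIT NONE
CONTAINS
  ! Largest number of iterations assigned to any single thread when
  ! n_iters iterations are split as evenly as possible over n_threads.
  PURE INTEGER FUNCTION max_iters_per_thread(n_iters, n_threads) RESULT(m)
    INTEGER, INTENT(IN) :: n_iters, n_threads
    m = (n_iters + n_threads - 1) / n_threads   ! integer ceiling division
  END FUNCTION max_iters_per_thread

  ! Fraction of the total thread time spent on useful work, assuming
  ! all iterations have equal cost (a modelling assumption).
  PURE REAL FUNCTION balance_efficiency(n_iters, n_threads) RESULT(eff)
    INTEGER, INTENT(IN) :: n_iters, n_threads
    eff = REAL(n_iters) / &
          REAL(n_threads * max_iters_per_thread(n_iters, n_threads))
  END FUNCTION balance_efficiency
END MODULE static_balance_model

PROGRAM balance_demo
  USE static_balance_model
  IMPLICIT NONE
  INTEGER, PARAMETER :: Nk = 22   ! K-loop iterations, as in the example above
  INTEGER :: t

  DO t = 20, 23
    PRINT '(A,I3,A,I2,A,F6.3)', 'threads =', t, &
          '  max iterations per thread =', max_iters_per_thread(Nk, t), &
          '  efficiency =', balance_efficiency(Nk, t)
  END DO
END PROGRAM balance_demo
```

For Nk=22 and 21 threads the model gives two iterations on the busiest thread and an efficiency of 22/42, roughly 0.52, whereas 11 or 22 threads divide the work evenly. Once the thread count exceeds Nk, the busiest thread still runs one iteration, so adding threads no longer shortens the loop.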
Figure 1 shows an Extrae timeline window for three outer loop iterations with 10, 18 and 48 threads. The code executed inside the inner loop is the same for all values of
I. As the number of threads increases, the distribution of work becomes significantly imbalanced.
Timeline view showing three outer iterations for a regular
!$OMP PARALLEL DO nested loop with
Figure 2 shows the speedup plot for
Nj=42, illustrating the impact of the imbalance when the number of threads exceeds the number of iterations in the OpenMP loop.
Figure 2: speedup plot for
It can be observed that the speedup stalls when the number of threads exceeds the number of parallelized iterations (i.e.: