In some algorithms the whole computation can be divided into smaller, independent subparts, which may then be executed in parallel by different workers. Even if the data processed in each subpart is well balanced in terms of memory requirements, the workloads themselves may still be imbalanced. Such an imbalance can occur when the computational complexity of each subpart depends on the actual data and cannot be estimated prior to execution.

Required condition: a work imbalance is present and the number of work packages is greater than the number of threads or workers.

There are multiple approaches to tackling load imbalances in an application. This best practice shows how over-decomposition, e.g. with OpenMP tasking, can help reduce load imbalances, as an alternative to classic loop-based solutions. In the initial high-level example presented in the pattern description, the workload of each iteration may differ or depend on i, but is not known beforehand. Distributing the iterations statically across threads can then lead to load imbalances. A conventional loop-based approach could use a dynamic schedule, which mitigates the imbalance by assigning chunks of iterations to idle threads at run time.

// Imbalanced scenario: load distribution not known a priori
#pragma omp parallel for schedule(static)
for (i = 0; i < N; i++) {
    do_work(i);    // actual amount of work depends on i but is not known a priori
}

// Conventional parallelization with OpenMP for and dynamic schedule with chunk size C
#pragma omp parallel for schedule(dynamic, C)
for (i = 0; i < N; i++) {
    do_work(i);    // actual amount of work depends on i but is not known a priori
}


// Over-decomposition using OpenMP taskloop (a convenience construct of OpenMP)
#pragma omp parallel
{
    #pragma omp single
    {
        #pragma omp taskloop
        for (i = 0; i < N; i++) {
            do_work(i);    // actual amount of work depends on i but is not known a priori
        }
        // by default, taskloop waits for its tasks in an implicit taskgroup
    }
}

// Over-decomposition using the explicit task construct
#pragma omp parallel
{
    #pragma omp for schedule(static)
    for (i = 0; i < N; i++) {
        #pragma omp task
        do_work(i);    // actual amount of work depends on i but is not known a priori
    }
    // tasks will be executed in the implicit barrier of the for construct
}


It should be noted that creating tasks and their corresponding data environments comes with some overhead. However, as long as the tasks are not too fine-grained, the performance is comparable to a version that uses a dynamic schedule. The following results illustrate the performance on an imbalanced sample workload.

| Version | Execution time |
| --- | --- |
| for with schedule(static) | 4.310 sec |
| for with schedule(dynamic) | 2.456 sec |
| taskloop | 2.443 sec |
| for with explicit task construct | 2.406 sec |

Furthermore, tasking can have several advantages over conventional loop-based parallelization, which are detailed in the following:

#### Lifting the restriction to simple for/do loops

Constructs like #pragma omp for are restricted to code regions that have the form of a simple for loop. Other loops, such as while loops or loops whose bounds are not known in advance, cannot be parallelized with that construct. In contrast, OpenMP tasking can be used with much more irregular and also recursive codes, where work packages (tasks) can be created in different places in the code and the computational hot spot is not just a simple for loop. The following code snippet illustrates an irregular scenario using recursion:

long fib(long n) {
    long x, y;
    if (n < 2) {
        return n;
    } else if (n < 30) {
        // manual cut-off: serial version that does not create more tasks (implementation omitted)
        return ser_fib(n);
    }

    #pragma omp task shared(x)
    x = fib(n - 1);

    #pragma omp task shared(y)
    y = fib(n - 2);

    // wait until the child tasks have completed
    #pragma omp taskwait
    return x + y;
}

// Scenario where the load is not known a priori and the structure is not a simple for loop
long fibonacci;
#pragma omp parallel
#pragma omp single
{
    fibonacci = fib(50);
}


To prevent the high overhead caused by overly fine-grained tasks, a manual cut-off is used that switches to serial execution below a certain threshold. The following results illustrate the performance scaling of such a recursive code with a varying number of threads.

| Number of Threads | Execution time | Speedup | Efficiency |
| --- | --- | --- | --- |
| 1 (serial) | 30.53 sec | 1.00 | 100.00 % |
| 2 | 15.22 sec | 1.99 | 99.70 % |
| 4 | 7.73 sec | 3.92 | 98.16 % |
| 8 | 4.09 sec | 7.42 | 92.76 % |
| 12 | 2.78 sec | 10.91 | 90.98 % |
| 16 | 2.09 sec | 14.52 | 90.76 % |

#### Support for dependencies between tasks

Sometimes a certain work item must be completed before another can start. Many tasking approaches, including OpenMP tasking, provide means to specify dependencies between tasks that are honored during scheduling. This allows more freedom in expressing the desired workload and dataflow while still exploiting as much parallelism as possible. A simplified example is illustrated in the following code snippet.

double x1, x2, y1, y2, z;
#pragma omp parallel
{
    #pragma omp single
    {
        #pragma omp task depend(out: x1)
        x1 = do_long_computation(100.0);

        #pragma omp task depend(out: x2)
        x2 = do_long_computation(150.0);

        #pragma omp task depend(in: x1) depend(out: y1)
        y1 = do_long_computation(x1);

        #pragma omp task depend(in: x2) depend(out: y2)
        y2 = do_long_computation(x2);

        // final task waits for both chains; the combination is illustrative
        #pragma omp task depend(in: y1, y2)
        z = y1 + y2;
    }
}


#### Expressing data affinity for tasks

One disadvantage of a classic dynamic schedule on #pragma omp for is that it may degrade performance on NUMA architectures when threads execute iterations that access data on a remote NUMA domain. In contrast, several tasking frameworks provide means to express data affinity for tasks, indicating which data a task will access. For example, OpenMP 5.0 introduced the affinity clause for the #pragma omp task construct. The runtime system can leverage this information to guide task scheduling toward better data locality while maintaining load-balancing capabilities. There is also active research and development on extending the taskloop construct with options to express dependencies between, and affinity for, the tasks it creates.