# Conditional nested parallel region within an unbalanced phase

In cases where the number of iterations (N) is equal to the number of workers (W) nested parallelism inside the straggling workers can be used to decrease the load imbalance. The idea is to utilize the unused computational resources of workers already finished and thus idling.

Assuming sufficient granularity in do_work, i.e. k » W, the best-practice can be applied to the following serially executed do_work:

do_work(int i) {
for ( k ) {
// do parallelizable internal work
}
}


by applying the following code modifcations:

int freeWorkers;
#pragma omp atomic write
freeWorkers = 0;

#pragma omp parallel for
for (i=0; i<N; i++) {
do_work(i, &freeWorkers);    // actual amount of work depends on i but not known a priori
#pragma omp atomic update
freeWorkers++;
}

do_work(int i, int freeWorkers) {

#pragma omp atomic capture
{
freeWorkers = 0
}

if (addWorkers <= 0.75 * nWorkers) {
#pragma omp atomic update

}

for( k ) {
// do internal work in parallel
}

#pragma omp atomic update
}
}


Before performing the internal work inside do_work it is checked whether other workers have already finished their workload (i.e. freeWorkers > 0). The condition if (addWorkers <= 0.75 * nWorkers) ensures that nested parallelism is only used when most workers (more than 75% in this case) are already finished with their workload. In other words nested parallelism is only used inside the unbalanced phase. If there are free resources the current worker will use these resources to speed up its own computation because the parallel region enclosing the internal work will then be executed by a number of 1 + addWorkers workers.

Using this best-practice in the CalculiX application, which showed the related pattern, a single call of the solver could be speed up by a factor of approximately 1.16.

The best-practice as described above only works in case there is only one straggling worker. In case there are multiple stragglers the best-practice may be implemented as follows:

do_work(int i, int freeWorkers) {
int stragglingWorkers = 0;

#pragma omp atomic capture
{
freeWorkers = 0
}

if (addWorkers <= 0.75 * nWorkers) {
#pragma omp atomic update

}
else {

#pragma omp atomic update
}