Conditional nested parallel region within an unbalanced phase

In cases where the number of iterations (N) is equal to the number of workers (W), nested parallelism inside the straggling workers can be used to decrease the load imbalance. The idea is to utilize the computational resources of workers that have already finished their iteration and would otherwise idle.

Assuming sufficient granularity in do_work, i.e. k » W iterations of parallelizable internal work, the best-practice can be applied to the following serially executed do_work:

void do_work(int i) {
    for ( k ) {
        // do parallelizable internal work
    }
}

by applying the following code modifications:

int freeWorkers = 0;

#pragma omp parallel for
for (i = 0; i < N; i++) {
    do_work(i, &freeWorkers);    // actual amount of work depends on i but is not known a priori
    #pragma omp atomic update
    freeWorkers++;
}
void do_work(int i, int *freeWorkers) {
    int addWorkers = 0;
    int nWorkers = omp_get_num_threads();

    #pragma omp atomic capture
    {
        addWorkers = *freeWorkers;
        *freeWorkers = 0;
    }

    if (addWorkers <= 0.75 * nWorkers) {
        // not enough workers are free yet: return them and work serially
        #pragma omp atomic update
        *freeWorkers += addWorkers;

        addWorkers = 0;
    }

    #pragma omp parallel for num_threads(1 + addWorkers)
    for ( k ) {
        // do internal work in parallel
    }

    if (addWorkers > 0) {
        // return the borrowed workers
        #pragma omp atomic update
        *freeWorkers += addWorkers;
    }
}

Before performing the internal work inside do_work it is checked whether other workers have already finished their workload (i.e. freeWorkers > 0). The condition if (addWorkers <= 0.75 * nWorkers) ensures that nested parallelism is only used when most workers (more than 75% in this case) have already finished their workload. In other words, nested parallelism is only used inside the unbalanced phase. If there are free resources the current worker will use them to speed up its own computation, because the parallel region enclosing the internal work will then be executed by 1 + addWorkers workers. Note that nested parallelism must be enabled for this to take effect, e.g. via omp_set_max_active_levels() or the OMP_MAX_ACTIVE_LEVELS environment variable.

Using this best-practice in the CalculiX application, which showed the related pattern, a single call of the solver could be sped up by a factor of approximately 1.16.

The best-practice as described above only works correctly if there is a single straggling worker. If there are multiple stragglers, the best-practice may be implemented as follows:

void do_work(int i, int *freeWorkers) {
    int addWorkers = 0;
    int nWorkers = omp_get_num_threads();
    int stragglingWorkers = 0;
    int remainingAddWorkers = 0;

    #pragma omp atomic capture
    {
        addWorkers = *freeWorkers;
        *freeWorkers = 0;
    }

    if (addWorkers <= 0.75 * nWorkers) {
        // not enough workers are free yet: return them and work serially
        #pragma omp atomic update
        *freeWorkers += addWorkers;

        addWorkers = 0;
    }
    else {
        // take only a fair share and return the rest for other stragglers
        stragglingWorkers = nWorkers - addWorkers;
        remainingAddWorkers = addWorkers - (addWorkers / stragglingWorkers);
        addWorkers = addWorkers - remainingAddWorkers;

        #pragma omp atomic update
        *freeWorkers += remainingAddWorkers;
    }

    #pragma omp parallel for num_threads(1 + addWorkers)
    for ( k ) {
        // do internal work in parallel
    }

    if (addWorkers > 0) {
        // return the borrowed workers
        #pragma omp atomic update
        *freeWorkers += addWorkers;
    }
}

In this implementation a straggling worker computes how many workers are currently still straggling. This information is then used to take only a fair share of the free resources, so that every straggling worker can use an equal number of free workers to speed up its computation.