Leveraging OpenMP worksharing constructs

Pattern addressed: Loop iterations manually distributed

Some OpenMP application developers manually distribute the iterations of loops in parallel regions instead of using OpenMP’s worksharing constructs. This can prevent the runtime from using optimized schedules.

Required condition: Unequal distribution of work in OpenMP loops

When the individual iterations of a for loop differ significantly in runtime, a manual distribution or a naive OpenMP for loop may lead to sub-optimal results. However, OpenMP provides several advanced worksharing features that help the developer mitigate this situation: a set of scheduling options to distribute the loop iterations among the threads and, since OpenMP 4.5, the taskloop construct, which creates tasks for chunks of loop iterations that are then scheduled by the OpenMP runtime.

The code used in the related pattern could be rewritten (and simplified) as:

#pragma omp parallel for default(shared) num_threads(nThreads) schedule(????)
for (int p = 0; p <= nSize; ++p)
{
  //do work
}

Programmers may leverage advanced scheduling options by means of the schedule() clause. OpenMP provides three scheduling kinds to distribute loop iterations among threads (a short sketch follows the list):

  • Static schedule: The schedule(static, chunk-size) clause of the loop construct specifies that the for loop has the static scheduling type. OpenMP divides the iterations into chunks of size chunk-size and distributes the chunks to threads in a circular order. When no chunk-size is specified, OpenMP divides the iterations into chunks that are approximately equal in size and distributes at most one chunk to each thread.

  • Dynamic schedule: The schedule(dynamic, chunk-size) clause of the loop construct specifies that the for loop has the dynamic scheduling type. OpenMP divides the iterations into chunks of size chunk-size. Each thread executes a chunk of iterations and then requests another chunk until there are no more chunks available. There is no particular order in which the chunks are distributed to the threads. The order can change each time the loop is executed. If no chunk-size is specified, it defaults to one.

  • Guided schedule: The schedule(guided, chunk-size) clause is similar to the dynamic scheduling type. OpenMP again divides the iterations into chunks. Each thread executes a chunk of iterations and then requests another chunk until no chunks remain. The difference from the dynamic scheduling type is the size of the chunks: the size of a chunk is proportional to the number of unassigned iterations divided by the number of threads, so the chunk size decreases as the loop executes. The minimum chunk size is set by chunk-size, although the chunk containing the last iterations may be smaller than chunk-size. Again, if chunk-size is not specified, it defaults to one.
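
For instance, when iteration costs vary unpredictably, a dynamic schedule with a moderate chunk size often improves load balance. The following is a minimal sketch; the chunk size of 4 and the per-iteration function do_work() are illustrative assumptions, not part of the original code:

#pragma omp parallel for default(shared) num_threads(nThreads) schedule(dynamic, 4)
for (int p = 0; p <= nSize; ++p)
{
  // chunks of 4 iterations are handed to threads as they become idle
  do_work(p); // hypothetical per-iteration work function
}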

Programmers may also use the OpenMP taskloop construct. The loop is cut into chunks and the OpenMP runtime creates a task for each chunk. To transform our code, the loop must be placed inside a single construct so that the tasks are created only once, by a single thread:

#pragma omp parallel default(shared) num_threads(nThreads)
{
  #pragma omp single
  #pragma omp taskloop [grainsize(grain-size) | num_tasks(num-tasks)]
  for (int p = 0; p <= nSize; ++p) 
  {
    //do work
  }
}

Then, the size of the chunks can be controlled with the grainsize(grain-size) clause, where each chunk has at least grain-size and fewer than 2*grain-size iterations; or with the num_tasks(num-tasks) clause, where the runtime creates as many tasks as the minimum of the num-tasks parameter and the number of loop iterations.
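
As a minimal sketch, assuming an illustrative grain size of 32 and a hypothetical per-iteration function do_work(), a taskloop that bounds the size of the created tasks could look like:

#pragma omp parallel default(shared) num_threads(nThreads)
{
  #pragma omp single
  #pragma omp taskloop grainsize(32)
  for (int p = 0; p <= nSize; ++p)
  {
    // each created task covers at least 32 and fewer than 64 iterations
    do_work(p); // hypothetical per-iteration work function
  }
}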

Leveraging OpenMP worksharing constructs has multiple advantages over manual workload distribution. Even with perfectly balanced iterations, OpenMP worksharing constructs offer more flexibility and are more maintainable and less error-prone than manual distribution. With unbalanced iterations, a smarter distribution of iterations across the threads can improve load balance and yield higher performance.