Some OpenMP application developers manually distribute the iterations of loops in parallel regions instead of using OpenMP’s worksharing constructs. This can prevent the runtime from using optimized schedules.
Required condition: Unequal distribution of work in OpenMP loops
When the individual iterations of a for loop differ significantly in runtime, a manual distribution or the naive OpenMP for loop might lead to sub-optimal results. However, OpenMP provides several advanced worksharing constructs that help the developer mitigate this situation: a set of scheduling options to distribute the loop iterations among the threads and, since OpenMP 4.5, the taskloop construct, which creates tasks for chunks of loop iterations that are then scheduled by the OpenMP runtime.
The code used in the related pattern could be rewritten (and simplified) as:
#pragma omp parallel for default(shared) num_threads(nThreads) schedule(????)
for (int p = 0; p <= nSize; ++p)
{
    //do work
}
Programmers may leverage advanced scheduling options by means of the schedule() clause. OpenMP provides three commonly used options to distribute loop iterations among threads:
Static schedule: The schedule(static, chunk-size) clause of the loop construct specifies that the for loop has the static scheduling type. OpenMP divides the iterations into chunks of size chunk-size and it distributes the chunks to threads in a circular order. When no chunk-size is specified, OpenMP divides iterations into chunks that are approximately equal in size and it distributes at most one chunk to each thread.
Dynamic schedule: The schedule(dynamic, chunk-size) clause of the loop construct specifies that the for loop has the dynamic scheduling type. OpenMP divides the iterations into chunks of size chunk-size. Each thread executes a chunk of iterations and then requests another chunk until there are no more chunks available. There is no particular order in which the chunks are distributed to the threads. The order can change each time the loop is executed. If no chunk-size is specified, it defaults to one.
Guided schedule: The schedule(guided, chunk-size) clause is similar to the dynamic scheduling type. OpenMP again divides the iterations into chunks. Each thread executes a chunk of iterations and then requests another chunk until there are no more chunks available. The difference with the dynamic scheduling type is in the size of chunks. The size of a chunk is proportional to the number of unassigned iterations divided by the number of threads. Therefore the size of the chunks decreases over time. The minimum size of a chunk is set by chunk-size. However, the chunk which contains the last iterations may have a smaller size than chunk-size. Again, if chunk-size is not specified, it defaults to one.
Programmers also have the option of using the OpenMP taskloop construct. The loop is cut into chunks and the OpenMP runtime creates a task for each chunk. To transform our code, we wrap the loop in a single region so that only one thread encounters the taskloop and creates the tasks:
#pragma omp parallel default(shared) num_threads(nThreads)
{
    #pragma omp single
    #pragma omp taskloop [grainsize(grain-size) | num_tasks(num-tasks)]
    for (int p = 0; p <= nSize; ++p)
    {
        //do work
    }
}
Then, the size of the chunks can be controlled with the grainsize(grain-size) clause, where each chunk has at least grain-size and fewer than two times grain-size iterations; or with the num_tasks(num-tasks) clause, where the runtime will create as many tasks as the minimum of the num-tasks parameter and the number of loop iterations.
Leveraging OpenMP worksharing constructs has multiple advantages over manual workload distribution. Even with perfectly balanced iterations, OpenMP worksharing constructs offer more flexibility and are more maintainable and less error-prone than manual distribution. In the case of unbalanced iterations, a smarter distribution of iterations across the threads can improve load balancing and yield higher performance.