The essential purpose of parallel programming is to reduce the runtime by dividing time-demanding computations among available resources. The granularity is an important aspect that is often overlooked. The amount of work assigned to a computational core/thread has to have a reasonable size so that the overhead caused by thread/core control does not degrade the performance. When the tasks or chunks are too small, this overhead can cost the same or even more time than the computation itself. In such a case, running a program in parallel may not bring any benefit or even can consume more time and resources than running as a single thread application. (more...)
Required condition: A large number of iterations - number of iteration is significantly larger (many times) than the number of threads.
A trade-off between parallelism and overhead in terms of the POP metrics corresponds to balancing the Load Balance and Communication efficiency. In general, we aim to create enough chunks (parallelism) in order to utilize all available computational resources (threads). At the same time, we need to keep the related overhead low. Less overhead is reflected by higher Communication efficiency. On the other hand, scheduling of threads at the runtime is often needed in order to balance the workload effectively. This leads to better Load Balance at the cost of the lower Communication Efficiency.
It is important to ensure large enough chunks/tasks in order to decrease the effect of the overhead. A reasonable amount of the overhead should be less than 10% of the total runtime. Less is better. Having the overhead under 5% of the total runtime is more favorable. For example: if a computational chunk/task with the duration of 50 us is followed by a runtime scheduling that takes around 1-3 us, then the overhead is of 2-6 % only.
We may control the granularity directly by specifying the size of chunks/tasks or secondarily by specifying the total number of chunks/tasks. The size of individual chunks for the OpenMP loop construct is set using
chunk_size within the
schedule clause and for taskloop construct using the clause
grainsize. For the taskloop construct, one can directly specify the total number of tasks using the
num_tasks clause. While for the loop construct, to specify the total number of chunks, one has to compute the appropriate chunk_size number based on
chunk_size = (nIters / nChunks) + 1. These two variants are described in more detail below.
The number of iterations within a chunk/task should be specified based on the duration of individual iterations such that the duration of the chunk/task is at least 50 us, which ensures a reasonable overhead. This approach requires a knowledge of the iteration duration, which probably has to be obtained from a measurement provided by the application itself or using some external performance tool. Then the
chunk_size is computed as
chunk_size = 50 / t_it, where
t_it is the duration of an iteration (average value) and
50 corresponds to the required duration of a chunk.
Changing the size of chunks/tasks may affect also other performance aspects, for example, the IPC. Therefore, it is better to repeat the measurement after setting the chunk_size and recompute the value again. This procedure may be repeated a few times.
t_chunkand compute avg time per one iteration,
t_it = t_chunk/chunk_size
chunk_size = 50 / t_it
t_chunkis really close to 50 us.
The actual value for an optimal chunk duration depends on the hardware. Iteration runtimes distribution is a key factor that may make finding an optimal chunk_size difficult. Especially in case of large differences or a special distribution with increasing/decreasing iteration runtimes.
Another approach is to focus on the total number of chunks/tasks. Then the number of iterations within a chunk/task is computed in order to obtain a certain number of chunks/tasks:
chunk_size = (nIters / nChunks) + 1. As we aim to distribute the workload equally among threads, it is desirable for the number of chunks to be divisible by the number of threads. Thus a fixed number of chunks is assigned to each tread (
nChunksPerThread). The total number of chunks is then expressed as
nChunks = nChunksPerThread * nThreads and the number of iterations within a chunk/task may be obtained as
chunk_size = (nIters / (nChunksPerThread * nThreads)) + 1.
This approach is suitable for loops with identical or well-balanced iterations.
Assuming an equal workload distribution across iterations and an ideal machine without noise, this approach leads to a good Load Balance since all threads process the same number of chunks and therefore the same number of tasks/iterations. On real machines, the Load Balance is more sensitive to noise. Based on the granularity, we may choose the number of chunks per thread (
nChunksPerThread) such that the duration of a chunk is longer than 50 us. Using this approach we typically assign a lower number of chunks of larger sizes to individual threads and therefore the overhead is low.
chunk_size = (nIters / (nChunksPerThread * nThreads)) + 1
nChunksPerThreadand recompute the
nChunksPerThreadand recompute the
Applying the formula above, the
chunk_size was set to 4654 iterations. Then the total runtime of the loop is 80 us. Respective
State view is presented in the figure below. Taking only one chunk per thread is not optimal regarding the Load balance - we can see the value gets lower to 92%, but the total runtime of the loop is significantly shorter.
The duration of the long chunks is around 54 us. the average Scheduling and Fork/Join state duration is 1.9 us. Therefore, the overhead takes around 4% of the total runtime. Additional parallel efficiency degradation is caused by the synchronization (implicit barrier) at the end of the loop. Due to the short duration of computations in the loop, this synchronization is noticeable. IPC value is equal to 1.58.
|- Load Balance||0.92|
|- Communication eff.||0.84|
For the next example, we assign five chunks to each thread, therefore the
chunk_size is set to 931 iterations. The total runtime of the loop is 105 us. Respective
State view is presented in the figure below. The Load Balance is 94%, while the Communication efficiency is 78% (with the synchronization state included).
The average chunk duration is 12 us, the average Scheduling and Fork/Join state duration is 1.9 us. Therefore, the overhead takes around 15% of the total runtime. Additional degradation of the parallel efficiency is caused by the synchronization (implicit barrier) at the end of the loop.
|- Load Balance||0.94|
|- Communication eff.||0.78|
Recommended in program(s): BEM4I miniApp ·