The OpenMP standard provides a critical section construct, which allows only one thread at a time to execute the block of code within the construct. This protects blocks of code from race conditions, for example write accesses into a shared array or increments of a shared counter. However, using this construct, especially within parallel loops, can severely reduce performance. The execution is serialised, so threads "queue" to enter the critical region, and managing the lock that guards the region adds significant overhead.
This best practice recommends removing the critical statement from the parallel region. This can be achieved by moving the critical region outside of the loop, often by caching per-thread results in some way and finalising the results in a single or master-only region. This trades the serialisation and lock-management overheads for some additional serial execution, but will often yield an overall performance improvement because of the gains in the parallel region.
The pattern’s related pseudo-code could be refactored to:
TYPE result = INIT;
TYPE my_array[NUM_THREADS] = {INIT, INIT,..., INIT};  // one partial result per thread

#pragma omp parallel
{
    int id = omp_get_thread_num();

    #pragma omp for
    for ( int i = 0; i < Ni; i++ ) {
        // work on arrays; each thread updates only its own slot
        my_array[id] = f(my_array[id], <parameters>);
    }
}

// Combine the partial results serially, outside the parallel region
for ( int i = 0; i < omp_get_max_threads(); i++ ) {
    result = f(result, my_array[i]);  // formerly the critical block
}
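As a concrete instance of the pseudo-code, here is a minimal sketch in which f is assumed to be integer addition and INIT is 0. The function name sum_with_partials, its parameters, and the serial fallback stubs are assumptions made for this example only.

```c
#include <assert.h>
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch also builds without OpenMP enabled. */
static int omp_get_max_threads(void) { return 1; }
static int omp_get_thread_num(void) { return 0; }
#endif

/* Hypothetical helper: sums data[0..n-1] into *out without a critical
 * section. Returns 0 on success, -1 if the allocation fails. */
int sum_with_partials(const int *data, int n, long *out)
{
    assert(data != NULL && out != NULL && n >= 0);

    int nthreads = omp_get_max_threads();
    long *partial = calloc((size_t)nthreads, sizeof *partial); /* INIT = 0 */
    if (partial == NULL)
        return -1;

    #pragma omp parallel
    {
        int id = omp_get_thread_num();

        #pragma omp for
        for (int i = 0; i < n; i++)
            partial[id] += data[i];   /* each thread touches only its slot */
    }

    /* Serial finalisation replaces the former critical block. */
    long result = 0;
    for (int t = 0; t < nthreads; t++)
        result += partial[t];

    free(partial);
    *out = result;
    return 0;
}
```

The array is sized with omp_get_max_threads() before the parallel region, so every thread id seen inside the region has a valid slot as long as the region does not request more threads explicitly.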
The previous solution can introduce false sharing when threads update the array of partial results, because neighbouring elements may lie on the same cache line. This additional performance problem can be avoided by padding the array declaration (e.g., TYPE my_array[NUM_THREADS][PADDING];, where PADDING is large enough for each row to span at least one cache line). Accesses to the padded array then use one extra index at an arbitrary fixed position, for instance the first position (i.e., my_array[id] -> my_array[id][0]).
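The padded variant can be sketched as follows, again only as an illustration. The 64-byte cache-line size is an assumption that should be checked for the target machine, and count_above_padded, CACHE_LINE_BYTES and PAD are hypothetical names introduced for this example.

```c
#include <assert.h>
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch also builds without OpenMP enabled. */
static int omp_get_max_threads(void) { return 1; }
static int omp_get_thread_num(void) { return 0; }
#endif

#define CACHE_LINE_BYTES 64                        /* assumed line size */
#define PAD (CACHE_LINE_BYTES / sizeof(long))      /* longs per row     */

/* Hypothetical helper: counts elements greater than threshold using
 * padded per-thread counters. Returns 0 on success, -1 on failure. */
int count_above_padded(const int *data, int n, int threshold, long *out)
{
    assert(data != NULL && out != NULL && n >= 0);

    int nthreads = omp_get_max_threads();
    /* One row of PAD longs per thread; only element [id][0] is used. */
    long (*partial)[PAD] = calloc((size_t)nthreads, sizeof *partial);
    if (partial == NULL)
        return -1;

    #pragma omp parallel
    {
        int id = omp_get_thread_num();

        #pragma omp for
        for (int i = 0; i < n; i++) {
            if (data[i] > threshold)
                partial[id][0]++;     /* rows are a cache line apart */
        }
    }

    /* Serial finalisation over the first element of each row. */
    long result = 0;
    for (int t = 0; t < nthreads; t++)
        result += partial[t][0];

    free(partial);
    *out = result;
    return 0;
}
```

Because consecutive counters are CACHE_LINE_BYTES apart, two threads never update the same line when the assumed size matches the hardware. The cost is the extra memory of the unused padding elements.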
Implemented in program(s): OpenMP Critical (critical section replaced by master-only execution)