A pattern that appears in many codes computing on a grid of topologically connected nodes is the reduction of values on the nodes, with possibly multiple contributions from neighboring nodes. This pattern typically results in code that performs reductions through indirection on a large array, using atomic or critical constructs to avoid potential data races when updating it.
This pattern occurs in finite element codes (such as Alya), in n-body codes, and in graph processing codes where mutual exclusion is needed when accessing neighbors (such as SPECFEM3D).
The parallelization of this pattern traditionally consists of loop parallelism using OpenMP, for example with the taskloop construct, as depicted in the next code snippet:
#pragma omp parallel masked
#pragma omp taskloop
for (int i = 0; i < SIZE; i++) {
    int NN = n_neighs[i];
    for (int n = 0; n < NN; n++) {
        type_t contribution = f(current_state, i, n);
        #pragma omp atomic // or critical
        next_state[get_index(i, n)] += contribution;
    }
}
The array next_state is accessed (and updated) concurrently by multiple threads through an array of indices: because of the indirection, different iterations may end up updating the same position of the array, so there is a potential for data races and the code must include an appropriate synchronization mechanism, such as the atomic construct depicted in the code.
This pattern therefore encloses the update operation in an atomic or critical OpenMP construct to ensure that races on the updates are avoided.
This approach has two potential issues: the overhead that the atomic (or critical) construct adds to every single update, even when no conflict actually occurs, and the serialization of the updates whenever two or more threads do contend for the same element.
In practice, if the array is sufficiently large, the first issue may be the more important source of performance degradation, causing a low IPC value that prevents good Computation Scalability. An example of these symptoms can be seen in the following images. These results were obtained on Nord III using the alya-asm program:
Other strategies have been proposed in the literature to implement this loop parallelism, such as coloring or substructuring techniques that circumvent the race condition. The main drawback of the first technique is an IPC decrease due to poor spatial locality; the second avoids this issue but requires extensive changes to the implementation. An alternative based on task parallelism is to use multidependencies, an extension to the OpenMP programming model that solves both of the aforementioned problems.
Recommended best-practice(s):