Parallelization of indirect reductions on large data structures

Sub-pattern(s):
  • Indirect reductions with atomic or critical constructs on large data structures: A pattern that appears in many codes computing on a grid of topologically connected nodes is the reduction of values on the nodes, with possible multiple contributions from neighboring nodes. This pattern typically results in codes with reduction operations through indirection on a large array that use atomic or critical constructs to avoid the potential data race when updating it.

A pattern that appears in many codes computing on a grid of topologically connected nodes is the reduction of values on the nodes, with possible multiple contributions from neighboring nodes.

This pattern typically results in codes with reduction operations through indirection on a large array, as depicted in the following code skeleton:

for (int i = 0; i < SIZE; i++) {
   int NN = n_neighs[i];                              /* number of neighbors of node i */
   for (int n = 0; n < NN; n++) {
      type_t contribution = f(current_state, i, n);   /* contribution of neighbor n */
      next_state[get_index(i, n)] += contribution;    /* indirect update of the large array */
   }
}

This pattern occurs in finite element codes (such as Alya), in n-body codes, and in graph processing codes where mutual exclusion is needed when accessing neighbors (such as SPECFEM3D).

Traditionally, the parallelization of this pattern relies on loop parallelism with OpenMP. With such strategies, however, the array next_state is accessed (and updated) concurrently by multiple threads through an array of indexes. Since different iterations may end up updating the same positions of the array, there is a potential data race and the code must include the appropriate synchronization mechanisms.

Several strategies have been proposed in the literature to implement this loop parallelism, such as coloring or substructuring techniques that circumvent this race condition. The main drawback of the first technique is the decrease in IPC caused by poor spatial locality. The second technique avoids this issue but requires extensive changes in the implementation. An alternative, based on task parallelism, is to use multidependencies, an extension to the OpenMP programming model that solves both aforementioned problems.

In the following paragraphs, a more in-depth discussion of the different approaches to solving this issue is presented.

The first approach is to enclose the update operation in an atomic or critical OpenMP construct, so that races on the updates are avoided.
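A minimal sketch of this approach on the skeleton above, assuming type_t is a scalar type supported by OpenMP atomic updates, could look as follows:

#pragma omp parallel for schedule(static)
for (int i = 0; i < SIZE; i++) {
   int NN = n_neighs[i];
   for (int n = 0; n < NN; n++) {
      type_t contribution = f(current_state, i, n);
      #pragma omp atomic update
      next_state[get_index(i, n)] += contribution;   /* protected indirect update */
   }
}

The atomic construct only protects this single update; if the update were more complex than one supported operation, a critical construct (with its higher overhead) would be needed instead.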

The second option is to allocate a private instance of the array for each thread/task and to perform a final reduction at the end of the computation. This brings the overhead of allocating the replicated arrays (and the memory footprint they require), of initializing them, and of reducing their values into the original array at the end of the computation, which is an additional overhead in terms of computation. Even if that final reduction can be parallelized by making each thread responsible for a part of the final array, all threads must first complete their assigned part of the loop iteration space before the reduction can start. This approach is very frequent and can be found, for example, in LAMMPS.
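A sketch of manual privatization, assuming one full (hypothetical) copy of next_state per thread and that type_t can be zero-initialized with memset, could be:

/* Requires <omp.h>, <stdlib.h>, <string.h>. */
int nthreads = omp_get_max_threads();
type_t *replicas = malloc((size_t)nthreads * SIZE * sizeof(type_t));   /* replicated storage */
memset(replicas, 0, (size_t)nthreads * SIZE * sizeof(type_t));         /* initialization overhead */

#pragma omp parallel
{
   type_t *mine = &replicas[(size_t)omp_get_thread_num() * SIZE];
   #pragma omp for schedule(static)
   for (int i = 0; i < SIZE; i++) {
      int NN = n_neighs[i];
      for (int n = 0; n < NN; n++)
         mine[get_index(i, n)] += f(current_state, i, n);   /* race-free: private copy */
   }
   /* implicit barrier: every thread finishes its iterations before the reduction */
   #pragma omp for schedule(static)
   for (int j = 0; j < SIZE; j++)
      for (int t = 0; t < nthreads; t++)
         next_state[j] += replicas[(size_t)t * SIZE + j];   /* final reduction of the replicas */
}
free(replicas);

From OpenMP 4.5 onwards, a reduction clause over an array section (reduction(+: next_state[0:SIZE])) lets the runtime perform this privatization and final reduction automatically, at the cost of a similar memory footprint.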

Another option is to analyze the topology of the problem and to split the set of nodes into disjoint subsets (colors) that guarantee that two nodes in the same color do not share a neighbor. Each color can then be parallelized independently without needing the atomic constructs. One issue with this approach is the need for a fork-join parallel region for each color in sequence; if any color has few elements, the fork-join synchronization and overhead may reduce performance. The second issue is that all nodes within a color are necessarily topologically separated, resulting in very scattered memory access patterns and thus poor locality.
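A sketch assuming a precomputed coloring, with hypothetical names n_colors, nodes_in_color[c] and nodes_of_color[c][k] for the number of colors, the size of each color and the k-th node of color c:

for (int c = 0; c < n_colors; c++) {
   #pragma omp parallel for schedule(static)
   for (int k = 0; k < nodes_in_color[c]; k++) {
      int i = nodes_of_color[c][k];   /* nodes of one color are topologically separated: scattered accesses */
      int NN = n_neighs[i];
      for (int n = 0; n < NN; n++)
         next_state[get_index(i, n)] += f(current_state, i, n);   /* no race within a color */
   }
   /* implicit fork-join at the end of each color before the next one starts */
}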

The fourth approach consists of a task-based serialization of the reduction. The computation of the contributions can be taskified at the appropriate granularity and its result stored in a temporary data structure that is passed as input to a reduction task, which is serialized (with an inout or mutexinoutset dependence on the final array). In a sense, this is an intermediate approach between the first two options explained above (the atomic and full privatization approaches). It needs to allocate and manage the storage for the temporary results, although the refactoring is potentially not huge. An example of this approach can be found in DMRG.
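A sketch of this idea, assuming the iteration space is split into NB blocks of BS iterations and using hypothetical helpers compute_contributions() and apply_contributions() plus per-block temporary buffers tmp_buffers[b] (names introduced here for illustration, not taken from the codes mentioned above):

/* To be executed inside a parallel region by a single task-creating thread
   (e.g. #pragma omp parallel followed by #pragma omp single). */
for (int b = 0; b < NB; b++) {
   int first = b * BS;
   int last  = (first + BS < SIZE) ? first + BS : SIZE;

   #pragma omp task depend(out: tmp_buffers[b])
   compute_contributions(current_state, first, last, tmp_buffers[b]);   /* no access to next_state */

   #pragma omp task depend(in: tmp_buffers[b]) depend(mutexinoutset: next_state[0])
   apply_contributions(tmp_buffers[b], next_state);   /* reduction tasks are mutually exclusive */
}
#pragma omp taskwait

With inout instead of mutexinoutset the reduction tasks would additionally be forced to execute in their creation order.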

Last, but not least, one may consider the use of multidependencies: the idea is to split the iteration space into tasks, precompute which of those tasks are “incompatible” (i.e., modify at least one shared node), and leverage the multidependencies feature in OpenMP to achieve at runtime a scheduling effect comparable to coloring, but at a coarse granularity (tasks) and dynamic. Precomputing the incompatibilities requires additional code and may represent an overhead; in any case, if the topology is relatively invariant, this cost can be amortized over many iterations. We show an example of how it can be done using OmpSs in the alya-asm program.
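As a rough sketch in OpenMP 5.0 style (the alya-asm example itself uses OmpSs syntax), assuming the iteration space is split into NB blocks, that inc[b][0..n_inc[b]-1] lists the blocks sharing at least one node with block b (including b itself), that sentinel[] is a dummy array used only to express the incompatibilities, and that the runtime supports iterator modifiers combined with mutexinoutset dependences (all hypothetical names):

/* To be executed inside a parallel region by a single task-creating thread. */
for (int b = 0; b < NB; b++) {
   #pragma omp task depend(iterator(k = 0 : n_inc[b]), mutexinoutset: sentinel[inc[b][k]])
   {
      for (int i = block_start[b]; i < block_end[b]; i++) {   /* contiguous block: good locality */
         int NN = n_neighs[i];
         for (int n = 0; n < NN; n++)
            next_state[get_index(i, n)] += f(current_state, i, n);
      }
   }
}
#pragma omp taskwait

Incompatible blocks share at least one sentinel element and therefore execute mutually exclusively, while compatible blocks can run concurrently, reproducing the effect of coloring dynamically and at task granularity.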