To achieve good performance in scientific and industrial software it is essential to review how data structures are designed inside the software and stored/loaded by the underlying programming language or machine. Although such design decisions might be beneficial in terms of readability and maintenance, it could significantly degrade performance if the way data is accessed and processed in the application does not match the data layout. There are multiple examples for a mismatch between data layout and access that might harm spatial locality of the access pattern including but not limited to the following: Arrays of Structure (AoS) or Memory layout of multi-dimensional arrays depending on the programming language.
Structures or classes are a suitable way for grouping a number of parameters that belong to an entity. Although it provides a good overview of how data is structured, performance problems can arise when processing a larger number of these entities. The following scenario defines a struct with multiple member variables. A processing function then iterates over a larger array of entities accessing and modifying just a single member variable.
struct point {
double a;
double b;
double c;
double d;
double e;
double f;
double g;
double h;
};
struct point array_points[N]; // N is large
void process_a(double val) {
for(int i=0; i<N; i++) {
array_points[i].a += val;
}
}
Data of member variables for the entities are placed in memory in a consecutive way like illustrated in the following figure.
As just a single member variable will be accessed and modified this generates a number of performance problems:
a
) could in theory be performed at the same time for multiple points. However, the strided access pattern can prevent compilers from properly vectorizing the code.The memory layout of multi-dimensional arrays differs depending on the programming language.
Some languages (e.g., C, C++, Python, Pascal) employ a row-major order whereas others (e.g., Fortran and Julia) use a column-major order that prescribes how data is stored in memory as illustrated in the following figure.
The underlying layout also affects how data should be accessed or processed. A typical flaw is that developers iterate multi-dimensional arrays in a non-optimal way like illustrated in the following Fortran code example. As previously already explained this results in a strided access pattern. Switching the loops can already help to improve the performance.
real :: numbers(5,10) ! 2D array with 5 rows and 10 columns
integer :: i , j ! iteration variables
! Not efficient: strided access pattern as access pattern not matching memory layout
do i=1,5
do j=1,10
numbers(i,j) = i * 2.0 + j
end do
end do
! Better: layout and access pattern aligned => leverage spatial locality
do j=1,10
do i=1,5
numbers(i,j) = i * 2.0 + j
end do
end do
Analysing an application code with tools like Score-P and Extrae can help to detect issues with spatial locality. Even better is investigating Hardware Performance Counters (HWC) using Likwid or PAPI (Extrae interally uses PAPI). Metrics derived from those counter values can act as indicators whether such an issue is present in an application. The following metrics are the most promising: