Spatial locality poor performance

Usual symptom(s):
  • IPC Scaling: The IPC Scaling (IPCS) compares IPC to the reference case. (more...)

To achieve good performance in scientific and industrial software it is essential to review how data structures are designed inside the software and stored/loaded by the underlying programming language or machine. Although such design decisions might be beneficial in terms of readability and maintenance, it could significantly degrade performance if the way data is accessed and processed in the application does not match the data layout. There are multiple examples for a mismatch between data layout and access that might harm spatial locality of the access pattern including but not limited to the following: Arrays of Structure (AoS) or Memory layout of multi-dimensional arrays depending on the programming language.

Arrays of Structure (AoS)

Structures or classes are a suitable way for grouping a number of parameters that belong to an entity. Although it provides a good overview of how data is structured, performance problems can arise when processing a larger number of these entities. The following scenario defines a struct with multiple member variables. A processing function then iterates over a larger array of entities accessing and modifying just a single member variable.

struct point {
    double a;
    double b;
    double c;
    double d;
    double e;
    double f;
    double g;
    double h;
};

struct point array_points[N]; // N is large

void process_a(double val) {
    for(int i=0; i<N; i++) {
        array_points[i].a += val;
    }
}

Data of member variables for the entities are placed in memory in a consecutive way like illustrated in the following figure. AoS_data_layout

As just a single member variable will be accessed and modified this generates a number of performance problems:

  • Typical hardware architectures nowadays use CPU caches to speed up access to data items that are frequently accessed (temporal locality) or to data items that are close to items that have been previously accessed (spatial locality). If an item is not yet available in the cache (cache miss) it has to be loaded from memory. This usually happens in form of cache lines consisting of 64 Bytes (equivalent to 8 doubles). Thus, in theory when loading a single double the next 7 doubles in memory could be accessed faster. However, due to the chosen data layout the complete struct is loaded to cache and just a single double is used. This strided access pattern avoids taking advantage of spatial locality.
  • Several architectures nowadays support vectorization (SIMD) that allows executing an operation on a number of data items simultaneously. In this scenario the same operation (adding a value to member variable a) could in theory be performed at the same time for multiple points. However, the strided access pattern can prevent compilers from properly vectorizing the code.

Memory layout of multi-dimensional arrays

The memory layout of multi-dimensional arrays differs depending on the programming language. Some languages (e.g., C, C++, Python, Pascal) employ a row-major order whereas others (e.g., Fortran and Julia) use a column-major order that prescribes how data is stored in memory as illustrated in the following figure. Row_vs_Column_Major

The underlying layout also affects how data should be accessed or processed. A typical flaw is that developers iterate multi-dimensional arrays in a non-optimal way like illustrated in the following Fortran code example. As previously already explained this results in a strided access pattern. Switching the loops can already help to improve the performance.

real :: numbers(5,10) ! 2D array with 5 rows and 10 columns
integer :: i , j      ! iteration variables

! Not efficient: strided access pattern as access pattern not matching memory layout
do i=1,5
    do j=1,10
        numbers(i,j) = i * 2.0 + j
    end do
end do

! Better: layout and access pattern aligned => leverage spatial locality
do j=1,10
    do i=1,5
        numbers(i,j) = i * 2.0 + j
    end do
end do

How to detect issues with spatial locality

Analysing an application code with tools like Score-P and Extrae can help to detect issues with spatial locality. Even better is investigating Hardware Performance Counters (HWC) using Likwid or PAPI (Extrae interally uses PAPI). Metrics derived from those counter values can act as indicators whether such an issue is present in an application. The following metrics are the most promising:

  • Instructions per Cycle (IPC): Low values for IPC is in general a good indicator that an application has potential for optimization. However, low IPC alone cannot act as a sole reference as reasons might come from very different directions.
  • Stalls caused by L1D misses [%] and L2 request rate: Execution stalls that are caused by misses in L1 data cache or the rate in which the L2 cache is requested can point to a problem with spatial locality. The rate is typically the number of occurrences devided by the number of retired instructions.
  • L2 miss rate and L2 miss ratio [%]: Similar, L2 cache miss rate and ratio provide a good overview about the cache usage and spatial locality access patterns.
  • Vectorization and FLOP/s: Another set of metrics that are worth comparing are general FLOP/s achieved and FLOP/s achieved using vector registers to determine whether computations have been properly vectorized. However, as this can also be the result of missing compile flags or other issues in the code it is recommended to use this comparison in combination with other metrics like L2 miss rate or L2 miss ratio.
Recommended best-practice(s): Related program(s):
  • Access Pattern Bench (inefficient)