List of Patterns and Behaviors

In this page we describe typical behavioural patterns that we have identified in the analysis of applications from different domains. By behavioural pattern we mean a typical sequence of operations, memory accesses, communications and/or synchronizations that performs a general algorithmic step appearing in many different programs. These patterns may result in potential performance degradation.

The objective is to identify such patterns in generic terms, provide links to applications that expose them, and link to best practices describing how we consider they should be addressed. Although we have tried to group the patterns by the relationship between the issues, the list is somewhat unstructured. We suggest looking at the global list and the introductory description of each pattern to identify the topics that may be relevant for your co-design target.

Communication imbalance in MPI

By communication imbalance we refer to the situation where the number of MPI calls or the message sizes differ between processes. Of course, the time taken by these communications depends on many factors, of which the number of calls and the message sizes are important, but so are the type of MPI call (blocking or not) and the location of the processes involved in the communication (local within the node or remote).

Best-Practice(s): Re-consider domain decomposition ·

Dynamic load imbalance in MPI

This pattern focuses on the temporal evolution of load imbalance (LI). One possible scenario is the growth of LI over time, i.e., a gradual reduction of the computational performance. This reduction might happen at a slow rate, so that it does not immediately catch the attention of the programmer.

Best-Practice(s): Solving dynamic load balance problems · Task migration among processes ·

Sequence of fine grain parallel loops

This pattern applies to parallel programming models based on a shared memory environment and the fork-join execution model (e.g., OpenMP). The execution of these applications is initially sequential (i.e., only one thread starts the execution of the whole program); when the execution reaches a region of code containing potential parallelism, a parallel region is created and multiple threads start executing the associated code. The parallel execution will usually distribute the work among the participating threads by means of work-sharing directives (e.g., a loop directive distributes the loop iteration space among all threads participating in that region).
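
A minimal OpenMP sketch of the pattern (array names and sizes are illustrative): each small loop opens and joins its own parallel region, so the fork-join and work-distribution overhead is paid repeatedly for very little work.

    #include <stdio.h>
    #define N 256                          /* small trip count: each loop is fine grained */

    int main(void)
    {
        double a[N], b[N], c[N];

        #pragma omp parallel for           /* fork-join overhead paid here ...           */
        for (int i = 0; i < N; i++) a[i] = i;

        #pragma omp parallel for           /* ... here ...                               */
        for (int i = 0; i < N; i++) b[i] = 2.0 * a[i];

        #pragma omp parallel for           /* ... and here, for tiny amounts of work     */
        for (int i = 0; i < N; i++) c[i] = a[i] + b[i];

        printf("%f\n", c[N - 1]);
        return 0;
    }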

Best-Practice(s): Collapse consecutive parallel regions · Upper level parallelisation ·

GPU branch divergence

This pattern applies to GPU computing, where the execution of a thread block is divided into warps with a constant number of threads per warp. When threads in the same warp follow different paths of control flow, these threads diverge in their execution, which serializes the execution.

Best-Practice(s): Align branch granularity to warps ·

GPU uncoalesced memory transfer

For CPU-based applications, stride-1 access to memory by each thread is very efficient. However, for effective utilization of memory bandwidth on GPUs, adjacent threads must access adjacent data elements in global memory.

Best-Practice(s): GPU align memory accesses ·

High weighted communication in between ranks

This pattern can be observed in parallel algorithms where a large fraction of the runtime is spent calculating data on individual processes and then communicating that data between processes. An example of such a parallel algorithm is molecular dynamics, where N particles interact with each other. The equation of motion to be solved for each particle i is Newton's second law, m_i d^2 r_i / dt^2 = F_i = sum_{j != i} F_ij, so the force on each particle gathers contributions from many (potentially all) other particles, which have to be communicated between the processes.

Best-Practice(s): Replicating computation to avoid communication (gemm) ·

Load imbalance due to computational complexity (unknown a priori)

In some algorithms it is possible to divide the whole computation into smaller, independent subparts, which may then be executed in parallel by different workers. Even though the data worked on in each subpart might be well balanced in terms of memory requirements, the workloads themselves may be imbalanced. This imbalance occurs when the computational complexity of each subpart depends on the actual data and cannot be estimated prior to execution.
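
A minimal OpenMP sketch of the situation (the convergence-style inner loop is an illustrative stand-in for any data-dependent amount of work): the cost of each iteration is unknown a priori, so a static distribution can leave some threads idle while others are still computing.

    #include <math.h>

    void process(double *x, int n)
    {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; i++) {
            double v = x[i];
            while (fabs(v) > 1e-12)        /* data-dependent trip count */
                v *= 0.5;
            x[i] = v;
        }
    }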

Best-Practice(s): Conditional nested tasks within an unbalanced phase · Dynamic loop scheduling · Over-decomposition using tasking · Task migration among processes ·

Sequences of independent computations with expensive global communication

It is common to find codes where multiple independent operations are executed in sequence. If each operation has communication phases constituting a high percentage of its total execution time, the overall performance will be low compared with the real potential of the application.
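
A minimal MPI sketch of the pattern (local_compute is an illustrative placeholder): every independent operation finishes with its own blocking global communication before the next one starts, so the expensive collectives are never overlapped with the computation of the other, independent operations.

    #include <mpi.h>

    double local_compute(int op);          /* placeholder for the per-operation work */

    void run_operations(int nops, double *results)
    {
        for (int op = 0; op < nops; op++) {
            double local = local_compute(op);          /* independent computation */
            MPI_Allreduce(&local, &results[op], 1,     /* blocking global comm.   */
                          MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        }
    }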

Best-Practice(s): Coarse grain (comp + comm) taskyfication with dependencies ·

Parallelization of indirect reductions on large data structures

A pattern that appears in many codes computing on a grid of topologically connected nodes is the reduction of values on the nodes, with possibly multiple contributions from neighbouring nodes.
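
A minimal C sketch of such an indirect reduction (array names are illustrative): the indirection through the connectivity array means that different elements may update the same node, which is what makes a straightforward parallelisation of the element loop racy.

    /* Every element adds its contribution to the nodes it touches through the
     * connectivity array; several elements may update the same node. */
    void accumulate(int nelem, int nodes_per_elem, const int *conn,
                    const double *contrib, double *node_val)
    {
        for (int e = 0; e < nelem; e++)
            for (int k = 0; k < nodes_per_elem; k++)
                node_val[conn[e * nodes_per_elem + k]] += contrib[e];
    }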

Sub-pattern(s): Indirect reductions with atomic or critical constructs on large data structures ·

Inefficient user implementation of well-known math problem

In the early stages of scientific code development, developers (possibly scientists) tend to create a naive and easy-to-read implementation of a required algorithm. Such an implementation has a good chance of being inefficient in some performance aspect.
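
A classic instance is a hand-written, triple-loop matrix-matrix multiplication (a minimal sketch; matrix names are illustrative), which is typically far slower than an equivalent call to an optimised BLAS gemm routine.

    /* Naive O(n^3) matrix-matrix multiplication: easy to read, but it ignores
     * the cache blocking, vectorisation and threading a tuned BLAS exploits. */
    void matmul_naive(int n, const double *A, const double *B, double *C)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += A[i * n + k] * B[k * n + j];
                C[i * n + j] = sum;
            }
    }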

Best-Practice(s): Leverage optimized math libraries ·

Loop iterations manually distributed

Some OpenMP application developers manually distribute the iterations of loops in parallel regions instead of using OpenMP’s worksharing constructs. This can prevent the runtime from using optimized schedules.
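
A minimal OpenMP sketch of such a manual distribution (assuming a block partition computed by hand from the thread id); an omp for work-sharing construct would express the same loop while leaving the schedule to the runtime.

    #include <omp.h>

    void scale(double *x, int n, double alpha)
    {
        #pragma omp parallel
        {
            /* Iterations are partitioned by hand instead of using '#pragma omp for',
             * so the runtime cannot apply its optimized (or dynamic) schedules. */
            int nth   = omp_get_num_threads();
            int tid   = omp_get_thread_num();
            int chunk = (n + nth - 1) / nth;
            int lo    = tid * chunk;
            int hi    = (lo + chunk > n) ? n : lo + chunk;

            for (int i = lo; i < hi; i++)
                x[i] *= alpha;
        }
    }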

Best-Practice(s): Leveraging OpenMP worksharing constructs ·

Low computational performance calling BLAS routines (gemm)

In many scientific applications, parallel matrix-matrix multiplications represent the computational bottleneck. Consider a matrix-matrix multiplication of the form C = A*B, the operation implemented by the BLAS gemm routine (in its general form C <- alpha*A*B + beta*C).

Best-Practice(s): Tuning BLAS routines invocation (gemm) ·

Low transfer efficiency following large computation

Many parallel algorithms on distributed memory systems contain a pattern in which a computational phase is followed serially by a collective communication that shares the computed results. Moreover, this often happens inside an iterative process, so the pattern repeats many times during the run.
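
A minimal MPI sketch of the pattern (compute_block and the buffers are illustrative placeholders): each iteration first finishes its local computation and only then starts a blocking collective, so communication and computation never overlap.

    #include <mpi.h>

    void compute_block(double *block, int n, int iter);   /* placeholder */

    void iterate(double *block, double *global, int n, int niter)
    {
        for (int it = 0; it < niter; it++) {
            compute_block(block, n, it);                   /* long computation    */
            MPI_Allgather(block, n, MPI_DOUBLE,            /* blocking collective */
                          global, n, MPI_DOUBLE, MPI_COMM_WORLD);
        }
    }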

Best-Practice(s): Overlapping computation and communication ·

Inefficient file I/O due to many unbuffered write operations

In a naive implementation, I/O operations are most likely implemented in serial: data is read from and written to disk on demand, whenever required. However, this can lead to a significant performance decrease if the amount of data transferred to or from the file in a single operation is very small and many such operations take place.
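
A minimal C sketch of the pattern (file name and data are illustrative): one unbuffered system call is issued per value, so the per-call overhead dominates the tiny transfers.

    #include <fcntl.h>
    #include <unistd.h>

    void dump(const double *data, long n)
    {
        int fd = open("out.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return;
        /* One write() per element: each transfer is only a few bytes long. */
        for (long i = 0; i < n; i++)
            write(fd, &data[i], sizeof(double));
        close(fd);
    }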

Best-Practice(s): Using buffered write operations ·

MPI endpoint contention

MPI processes often have to communicate with a list of neighbours. Depending on the order of the send and receive calls, it may happen that many processes get “synchronized”, in the sense that all of them try to send to the same destination at the same time, so that the limited incoming bandwidth at that destination becomes the bottleneck for the overall communication performance.
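
A minimal MPI sketch of how the contention can arise (neighbour lists and the tag are illustrative): if every rank walks its neighbour list in the same global order, many ranks target the same destination at the same moment.

    #include <mpi.h>

    void exchange(const int *neigh, int nneigh, const double *buf, int count)
    {
        /* All ranks iterate over their neighbours in the same (e.g. ascending-rank)
         * order, so at any given time many senders hit the same receiver. */
        for (int i = 0; i < nneigh; i++)
            MPI_Send(buf, count, MPI_DOUBLE, neigh[i], 0, MPI_COMM_WORLD);
        /* ... matching receives omitted for brevity ... */
    }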

Best-Practice(s): Re-schedule communications ·

Identifying suitable MPI programs for execution on GPU

This text outlines criteria that can be used to identify MPI programs that are good candidates for implementation on GPUs. The first criterion is the structure of the code, where an iterative structure is favourable. The second criterion is the algorithm itself, which should have a high level of inherent parallelism. The third criterion is the size of the data set, which should not exceed the memory available on the GPUs.

Best-Practice(s): Porting code to GPU (iterative kernel execution) ·

Wait for non-blocking send operations preventing computational progress

MPI programmers often use non-blocking calls to overlap communication and computation. In such codes, an MPI process communicates with its neighbours through a sequence of stages: 1) an optional packing of data (if needed); 2) a set of non-blocking send/receive operations, which can potentially overlap with each other; 3) waiting for the communications (potentially split into send and receive requests); and 4) the computational phase.
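
A minimal MPI sketch of this structure (pack_halo and compute are illustrative placeholders); note that the single MPI_Waitall also blocks on the send requests, although the following computation only needs the received data.

    #include <mpi.h>

    void pack_halo(double *sendbuf);                       /* placeholder */
    void compute(const double *recvbuf);                   /* placeholder */

    void step(double *sendbuf, double *recvbuf, int count, int left, int right)
    {
        MPI_Request req[4];

        pack_halo(sendbuf);                                /* 1) pack                 */

        MPI_Irecv(recvbuf,         count, MPI_DOUBLE, left,  0,
                  MPI_COMM_WORLD, &req[0]);                /* 2) non-blocking comms   */
        MPI_Irecv(recvbuf + count, count, MPI_DOUBLE, right, 0,
                  MPI_COMM_WORLD, &req[1]);
        MPI_Isend(sendbuf,         count, MPI_DOUBLE, left,  0,
                  MPI_COMM_WORLD, &req[2]);
        MPI_Isend(sendbuf + count, count, MPI_DOUBLE, right, 0,
                  MPI_COMM_WORLD, &req[3]);

        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);          /* 3) wait (send + recv)   */

        compute(recvbuf);                                  /* 4) computational phase  */
    }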

Best-Practice(s): Postpone the execution of non-blocking send waits operations ·

Lack of iterations on an OpenMP parallel loop

In some codes, the problem space is divided into a multi-dimensional grid on which the computations are performed. This pattern typically results in codes such as jCFD_Genesis with multiple nested loops iterating over the grid, of the shape sketched below.
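
An illustrative sketch of such a loop nest (not the actual jCFD_Genesis code; the loop bounds and arrays are assumptions), where the parallel loop directive is applied to an outer loop whose trip count can be smaller than the number of threads.

    /* Only the outermost loop is parallelised: if nk is small (e.g. a handful of
     * grid planes), there are not enough iterations to keep all threads busy. */
    void update(double *u, const double *v, int nk, int nj, int ni)
    {
        #pragma omp parallel for
        for (int k = 0; k < nk; k++)
            for (int j = 0; j < nj; j++)
                for (int i = 0; i < ni; i++)
                    u[(k * nj + j) * ni + i] = 0.5 * v[(k * nj + j) * ni + i];
    }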

Best-Practice(s): Collapsing OpenMP parallel loops ·

OpenMP critical section

The OpenMP standard provides a critical section construct, which only allows one thread to execute the block of code within the construct. This feature allows blocks of code to be protected from race conditions, for example with write accesses into a shared array or incrementing a shared counter. However, usage of this construct, especially within parallel loops, can severely reduce performance. This is due to serialisation of the execution causing threads to “queue” to enter the critical region, as well as introducing large lock-management overheads required to manage the critical region.
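
A minimal OpenMP sketch of the pattern (the histogram update is an illustrative example of a shared write): the critical construct inside the parallel loop forces the threads to queue on every iteration.

    /* Every iteration enters the critical section, so the loop effectively runs
     * serially, plus the overhead of acquiring and releasing the lock. */
    void histogram(const int *bin_of, int n, long *hist)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            #pragma omp critical
            hist[bin_of[i]]++;
        }
    }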

Best-Practice(s): Replace OpenMP critical section with master-only execution · Replacing critical section with reduction ·

Sequential communications in hybrid programming

A frequent practice in hybrid programming is to parallelise only the main computational regions with OpenMP. The communication phases are left as in the original MPI program and thus execute in order in the main thread while the other threads are idling. This may limit the scalability of hybrid programs and often results in the hybrid code being slower than an equivalent pure MPI code using the same total number of cores.
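
A minimal MPI+OpenMP sketch of the pattern (the compute loop, packing loop and neighbour rank are illustrative): only the computational loop is threaded, while packing and communication run on the main thread with the rest of the team idle.

    #include <mpi.h>

    void hybrid_step(double *field, double *halo, int n, int nh, int neighbour)
    {
        /* Only the computational region is parallelised with OpenMP. */
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            field[i] *= 0.5;

        /* Packing and MPI communication stay sequential on the main thread,
         * while the other threads of the team are idle. */
        for (int i = 0; i < nh; i++)
            halo[i] = field[i];
        MPI_Sendrecv_replace(halo, nh, MPI_DOUBLE, neighbour, 0,
                             neighbour, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }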

Best-Practice(s): Parallelize packing and unpacking regions ·

Sequential loops

Most of the program execution time is spent in loops. One complication when parallelising a sequential loop is that the body of the loop may depend on values computed in previous iterations of the same loop (a loop-carried dependency).
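
A minimal C sketch of such a loop-carried dependency (the prefix-sum style recurrence is illustrative): each iteration reads the result of the previous one, so the loop cannot be naively parallelised or vectorised as written.

    /* Iteration i needs the value produced by iteration i-1. */
    void recurrence(double *x, int n)
    {
        for (int i = 1; i < n; i++)
            x[i] = x[i] + x[i - 1];
    }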

Best-Practice(s): Effective auto vectorization ·

Inefficient Python loops

With Python it is very easy to unconsciously produce extremely inefficient code, since it is an interpreted language. One needs to pay special attention to the data types and statements used in order to mitigate the interpreter's overhead, since operating on generic Python objects is several orders of magnitude slower than the alternatives. Therefore, after the prototyping phase of developing Python software, users need to identify the most compute-intensive functions and apply the most suitable optimizations to them.

Best-Practice(s): Usage of Numba and Numpy to improve Python's serial efficiency ·

Sequential ASCII file I/O

In this pattern, data held on all processes is read from or written to an ASCII file by a single process. This is inefficient for several reasons: all data has to be gathered on (or scattered from) a single process, the file access itself is serialised, and the conversion between binary and ASCII representations adds overhead and inflates the file size.

Best-Practice(s): Parallel library file I/O · Parallel multi-file I/O ·

Spatial locality poor performance

To achieve good performance in scientific and industrial software it is essential to review how data structures are designed inside the software and stored/loaded by the underlying programming language or machine. Although such design decisions might be beneficial in terms of readability and maintenance, they can significantly degrade performance if the way data is accessed and processed in the application does not match the data layout. There are multiple examples of a mismatch between data layout and access that can harm the spatial locality of the access pattern, including but not limited to the following: Arrays of Structures (AoS), or the memory layout of multi-dimensional arrays, which depends on the programming language.
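
A minimal C sketch of the Array-of-Structures case (struct and field names are illustrative): a loop that only needs one field still strides over the whole structure, wasting most of each cache line.

    typedef struct {
        double x, y, z;          /* coordinates              */
        double mass, charge;     /* other per-particle data  */
    } particle_t;

    /* Only 'x' is touched, but consecutive accesses are sizeof(particle_t)
     * bytes apart, so spatial locality (and vectorisation) suffers. */
    double sum_x(const particle_t *p, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += p[i].x;
        return s;
    }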

Best-Practice(s): Spatial locality performance improvement ·

Very fine-grained tasks/chunks

The essential purpose of parallel programming is to reduce the runtime by dividing time-demanding computations among the available resources. Granularity is an important aspect that is often overlooked. The amount of work assigned to a computational core/thread has to have a reasonable size, so that the overhead caused by thread/task management does not degrade the performance. When the tasks or chunks are too small, this overhead can cost as much as, or even more than, the computation itself. In such a case, running the program in parallel may not bring any benefit, or may even consume more time and resources than running it as a single-threaded application.
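
A minimal OpenMP sketch of the problem (the per-element update is illustrative): creating one task per array element makes the task-management overhead comparable to, or larger than, the useful work.

    void scale_tasks(double *x, int n, double alpha)
    {
        #pragma omp parallel
        #pragma omp single
        for (int i = 0; i < n; i++) {
            /* One task per element: each task performs a single multiplication,
             * so task creation and scheduling dominate the runtime. */
            #pragma omp task firstprivate(i)
            x[i] *= alpha;
        }
    }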

Best-Practice(s): Chunk/task grain-size trade-off (parallelism/overhead) ·

Writing unstructured data to linear file formats

Many applications deal with unstructured data, which results in an unbalanced data distribution over processes. A simple example is a particle simulation in which particles move between processes over time: when the simulation has to write the global state at the end - or in between, e.g., for a checkpoint - each process holds a different number of particles. This makes it hard to write the data efficiently to a file in a contiguous way. The same pattern can also be found, e.g., in applications using unstructured meshes with a different number of elements per process.
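
A minimal MPI sketch of the core difficulty (variable names are illustrative, and the exclusive-scan offset computation is one possible approach along the lines of the linked best practice): before a rank can write its particles contiguously into a shared file, it needs to know how many elements all lower-ranked processes will write.

    #include <mpi.h>

    void write_particles(const double *particles, long nlocal, MPI_File fh)
    {
        int  rank;
        long start = 0;

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        /* Sum of the counts of all lower ranks = this rank's starting position. */
        MPI_Exscan(&nlocal, &start, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
        if (rank == 0)
            start = 0;           /* MPI_Exscan leaves rank 0's result undefined */

        MPI_File_write_at(fh, (MPI_Offset)(start * sizeof(double)),
                          particles, (int)nlocal, MPI_DOUBLE, MPI_STATUS_IGNORE);
    }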

Best-Practice(s): MPI-I/O with pre-computed offsets for each processes ·