Resources for Co-design at POP CoE

Patterns and behaviors

On this page we describe typical behavioural patterns that we have identified in the analysis of applications from different domains. By behavioural pattern we mean a typical sequence of operations, memory accesses, communications and/or synchronizations that performs a general algorithmic step appearing in many different programs. These patterns may degrade performance.

The objective is to identify such patterns in generic terms and to provide links to applications that exhibit them and to best practices describing how we consider they should be addressed. Although we have tried to group related patterns together, the list is somewhat unstructured. We suggest looking at the global list and its introductory descriptions to identify the topics that may be relevant for your co-design target.

Communication imbalance in MPI

By communication imbalance we refer to the situation where the number of MPI calls or the message sizes differ between processes. Of course, the time taken by these communications depends on many factors: the number of calls and the message sizes are important, but so are the type of MPI call (blocking or non-blocking) and the location of the processes involved in the communication (local within the node or remote).
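
As a rough illustration, the imbalance in call counts and message volume can be summarised with a simple ratio. The sketch below is an assumption on our side, not a metric defined on this page: it uses average/maximum over hypothetical per-rank counters (for instance, extracted from a trace), where 1.0 means perfectly balanced.

```python
from statistics import mean

def balance(values):
    """Return avg/max of per-rank values: 1.0 is balanced, lower is worse."""
    peak = max(values)
    return mean(values) / peak if peak else 1.0

# Hypothetical per-rank counters: (number of MPI calls, bytes sent).
per_rank = [(100, 4_000_000), (100, 4_000_000), (100, 4_000_000), (400, 16_000_000)]

calls_balance = balance([calls for calls, _ in per_rank])
bytes_balance = balance([nbytes for _, nbytes in per_rank])
print(f"calls balance: {calls_balance:.2f}, bytes balance: {bytes_balance:.2f}")
```

Such a ratio only captures counts and sizes; it says nothing about call type or process placement, which also affect the time spent.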

Best-Practices: Re-consider domain decomposition

Load imbalance due to computational complexity (unknown a priori)

In some algorithms it is possible to divide the whole computation into smaller, independent subparts. These subparts may then be executed in parallel by different workers. Even if the data processed in each subpart is well balanced in terms of memory requirements, the workloads themselves may still be imbalanced. This imbalance may occur if the computational complexity of each subpart depends on the actual data and cannot be estimated prior to execution.
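
The effect can be sketched with a small model that compares a static distribution of work items with a dynamic one, in which each idle worker takes the next pending item. The cost values and function names are hypothetical, and the model ignores scheduling overhead; it is an illustration, not a measurement.

```python
import heapq

def static_makespan(costs, workers):
    """Contiguous equal-sized blocks, one per worker (like a static loop schedule)."""
    block = -(-len(costs) // workers)  # ceiling division
    return max(sum(costs[i:i + block]) for i in range(0, len(costs), block))

def dynamic_makespan(costs, workers):
    """Each idle worker takes the next pending item (like a dynamic schedule, chunk size 1)."""
    finish = [0] * workers  # min-heap of per-worker finish times
    for cost in costs:
        earliest = heapq.heappop(finish)
        heapq.heappush(finish, earliest + cost)
    return max(finish)

# Hypothetical costs: every item has the same data size, but the first four are expensive.
costs = [8, 8, 8, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(static_makespan(costs, 4), dynamic_makespan(costs, 4))  # prints "32 11"
```

In this model the static split leaves three workers idle while one processes all the expensive items, whereas the dynamic distribution spreads them out.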

Best-Practices: Conditional nested tasks within an unbalanced phase · Dynamic loop scheduling

Sequences of independent computations with expensive global communication

It is common to find codes where multiple independent operations are executed in sequence. If the communication phases of each operation account for a high percentage of its total execution time, the overall performance will be low compared with the real potential of the application.

Best-Practices: Coarse grain (comp + comm) taskyfication with dependencies

Low transfer efficiency following large computation

Many parallel algorithms on distributed-memory systems follow a pattern in which a computational phase is followed serially by a collective communication that shares the computed results. This often happens inside an iterative process, so the pattern repeats many times during the algorithm.

Best-Practices: Overlapping computation and communication

Inefficient file I/O due to many unbuffered write operations

In a naive implementation, I/O operations are most likely implemented serially. Data is read from and written to disk on demand, whenever it is needed. However, this can significantly degrade performance if each operation transfers only a very small amount of data and many such operations occur.
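
A minimal sketch of the difference, using Python's standard file objects as a stand-in for whatever I/O layer an application uses. The record format, file names and the 1 MiB buffer size are arbitrary choices for illustration.

```python
import os
import tempfile

records = [f"{i} {i * i}\n".encode() for i in range(10_000)]  # many tiny records

with tempfile.TemporaryDirectory() as tmp:
    unbuffered_path = os.path.join(tmp, "unbuffered.dat")
    buffered_path = os.path.join(tmp, "buffered.dat")

    # Pattern: buffering=0 turns every small write into its own system call.
    with open(unbuffered_path, "wb", buffering=0) as f:
        for rec in records:
            f.write(rec)

    # Sketch of the fix: records accumulate in a 1 MiB buffer and reach
    # the operating system in a few large writes.
    with open(buffered_path, "wb", buffering=1024 * 1024) as f:
        for rec in records:
            f.write(rec)

    # Both variants produce the same file content.
    with open(unbuffered_path, "rb") as a, open(buffered_path, "rb") as b:
        same = a.read() == b.read()

print(same)
```

Only the number of requests issued to the operating system changes, not the data written.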

Best-Practices: Using buffered write operations

MPI endpoint contention

MPI processes often have to communicate with a list of neighbours. Depending on the order of the send and receive calls, many processes may become “synchronized”, in that all of them try to send to the same destination at the same time. The limited incoming bandwidth at that destination then becomes the bottleneck for the overall communication performance.
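
The ordering effect can be illustrated without MPI by counting how many ranks target the same destination in each step. The sketch below assumes, for simplicity, that every rank sends to every other rank and that all ranks advance in lockstep; the function names are hypothetical.

```python
from collections import Counter

def naive_order(rank, size):
    """Every rank walks the destinations in the same order: 0, 1, 2, ..."""
    return [dst for dst in range(size) if dst != rank]

def staggered_order(rank, size):
    """Rank r starts at r + 1 and wraps around, so each step hits distinct destinations."""
    return [(rank + step) % size for step in range(1, size)]

def worst_fan_in(order, size):
    """Largest number of ranks sending to one destination in the same step."""
    worst = 0
    for step in range(size - 1):
        hits = Counter(order(rank, size)[step] for rank in range(size))
        worst = max(worst, max(hits.values()))
    return worst

print(worst_fan_in(naive_order, 8), worst_fan_in(staggered_order, 8))  # prints "7 1"
```

With the naive order, seven of the eight ranks send to rank 0 in the first step; with the staggered order, no destination receives more than one message per step.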

Best-Practices: Re-schedule communications

Wait for non-blocking send operations preventing computational progress

MPI programmers often use non-blocking calls to overlap communication and computation. In such codes, the MPI process communicates with its neighbours through a sequence of stages: 1) an optional packing of data (if needed); 2) a set of non-blocking send/receive operations, which could potentially overlap with one another; 3) a wait for the communications (potentially split into separate waits for send and receive requests); and 4) the computational phase.

Best-Practices: Postpone the execution of non-blocking send waits operations

OpenMP critical section

The OpenMP standard provides a critical section construct, which allows only one thread at a time to execute the block of code within the construct. This feature protects blocks of code from race conditions, for example write accesses to a shared array or increments of a shared counter. However, using this construct, especially within parallel loops, can severely reduce performance. Execution is serialised, so threads “queue” to enter the critical region, and managing the lock that guards the region adds significant overhead.
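
The structure of the problem and of a reduction-style fix can be sketched with Python threads. This is only a structural analogue, not OpenMP code: a lock stands in for the critical section, per-thread partial sums stand in for a reduction, and the function names are hypothetical. Python threads do not run such loops in parallel, so the sketch shows the code shape rather than the speed-up.

```python
import threading

def sum_with_lock(data, nthreads):
    """Analogue of a critical section: every update of the shared total takes the lock."""
    total = 0
    lock = threading.Lock()

    def work(chunk):
        nonlocal total
        for x in chunk:
            with lock:  # threads queue here on every iteration
                total += x

    threads = [threading.Thread(target=work, args=(data[i::nthreads],))
               for i in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total

def sum_with_reduction(data, nthreads):
    """Analogue of a reduction: private partial sums, combined once at the end."""
    partial = [0] * nthreads

    def work(tid):
        local = 0
        for x in data[tid::nthreads]:
            local += x
        partial[tid] = local  # one write per thread, to its own slot

    threads = [threading.Thread(target=work, args=(tid,)) for tid in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partial)

data = list(range(10_000))
print(sum_with_lock(data, 4) == sum_with_reduction(data, 4))
```

The first version synchronises once per element, the second once per thread.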

Best-Practices: Replace OpenMP critical section with master-only execution · Replacing critical section with reduction

Sequential communications in hybrid programming

A frequent practice in hybrid programming is to parallelize only the main computational regions with OpenMP. The communication phases are left as in the original MPI program and thus execute sequentially in the main thread while the other threads are idle. This may limit the scalability of hybrid programs and often results in the hybrid code being slower than an equivalent pure MPI code using the same total number of cores.

Best-Practices: Parallelize packing and unpacking regions · Taskifying communications

Sequential ASCII file I/O

In this pattern, data held on all processes is read from or written to an ASCII file by a single process. This is inefficient for several reasons:

Best-Practices: Parallel library file I/O · Parallel multi-file I/O