These recommendations address the holistic co-design space, from application refactoring to using or proposing new features in the system software or hardware architecture. Our holistic vision of co-design aims to address issues at the level (or split across levels) that maximizes the gain at minimal cost.
The following entries will suggest basic directions and provide links to code or raw data that can be used to further explore the options and quantify the benefits.
To mitigate the problem of GPU branch divergence, it is recommended to align the branch granularity to the warp size, so that all threads within a warp follow the same execution path.
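As an illustration, the following hypothetical OpenCL C kernel (assuming a warp size of 32, as on NVIDIA GPUs) branches on a value that is uniform within each warp, so no warp is split across both paths:

```c
/* Hypothetical kernel: branching on the warp index rather than the thread
 * index keeps all 32 threads of a warp on the same execution path. */
#define WARP_SIZE 32

__kernel void process(__global const float *in, __global float *out)
{
    const int gid  = get_global_id(0);
    const int warp = gid / WARP_SIZE;   /* identical for all threads of a warp */

    if (warp % 2 == 0)                  /* warp-aligned branch: no divergence  */
        out[gid] = in[gid] * 2.0f;
    else
        out[gid] = in[gid] + 1.0f;

    /* Branching on (gid % 2 == 0) instead would split every warp into two
     * serialized execution paths. */
}
```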
A number of methods for the automatic generation of vector instructions have been developed for optimizing compilers. They allow vectorizing code without control branches, or with certain template branch patterns. Consider the automatic vectorization of loops without control branches. The basic idea of automatic vectorization is to unroll the loop to create copies of the original scalar expression and then replace groups of scalar instructions with vector instructions. Automatic vectorization is a special case of automatic parallelization, where the program code is converted from a scalar implementation, which processes one pair of operands at a time, to a vector implementation, which processes one operation on several operand pairs at once.
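As a sketch (hypothetical loop, plain C), the transformation a vectorizing compiler performs on a branch-free loop looks as follows; GCC, for example, applies it automatically at -O3:

```c
/* Branch-free scalar loop: one multiply-add per iteration. */
void axpy(int n, double a, const double *restrict x, double *restrict y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Conceptually, the compiler unrolls the loop by the vector width (e.g. 4
 * doubles for AVX2) and replaces each group of scalar operations with one
 * vector instruction:
 *
 *     for (i = 0; i + 3 < n; i += 4)
 *         y[i..i+3] = a * x[i..i+3] + y[i..i+3];   // one SIMD operation
 *     // a remainder loop handles the last n % 4 iterations
 */
```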
Programs: For loops full auto-vectorization

Modern HPC file systems are designed to handle writing large amounts of data to a file. However, if the application performs many write operations that each write data in very small chunks, this leads to inefficient use of the file system's capabilities. This becomes even more apparent if the file system is connected to the HPC system via a network. In this case each write operation initiates a separate data transfer over the network, so every time this happens one also pays the latency of establishing the connection to the file system. This effect can easily add up if a large number of small write operations occurs.
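A minimal sketch of the buffering idea in C (hypothetical code; record sizes are assumed to be much smaller than the staging buffer): many small records are collected in memory and written with a single large call.

```c
#include <stdio.h>
#include <string.h>

enum { BUF_SIZE = 8 * 1024 * 1024 };   /* 8 MiB staging buffer */

static char   staging[BUF_SIZE];
static size_t used = 0;

/* Collect a small record in memory instead of writing it immediately. */
static void buffered_write(FILE *f, const void *data, size_t len)
{
    if (used + len > BUF_SIZE) {       /* flush only when the buffer is full */
        fwrite(staging, 1, used, f);
        used = 0;
    }
    memcpy(staging + used, data, len);
    used += len;
}

/* Write out whatever is left, e.g. before closing the file. */
static void flush_buffer(FILE *f)
{
    if (used > 0) {
        fwrite(staging, 1, used, f);
        used = 0;
    }
}
```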
Programs: CalculiX I/O (buffered)

A trade-off between parallelism and overhead, in terms of the POP metrics, corresponds to balancing Load Balance and Communication Efficiency. In general, we aim to create enough chunks (parallelism) to utilize all available computational resources (threads). At the same time, we need to keep the related overhead low; lower overhead is reflected in higher Communication Efficiency. On the other hand, runtime scheduling of chunks onto threads is often needed to balance the workload effectively. This leads to better Load Balance at the cost of lower Communication Efficiency.
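A minimal OpenMP sketch of the two extremes referenced by the program variants below (hypothetical loop and work routine):

```c
/* Many small chunks, scheduled at runtime: good Load Balance, but every chunk
 * assignment adds runtime overhead (lower Communication Efficiency). */
#pragma omp parallel for schedule(dynamic, 500)
for (int i = 0; i < n; ++i)
    process_element(i);

/* One chunk per thread: minimal scheduling overhead, but any difference in
 * iteration cost translates directly into load imbalance. */
#pragma omp parallel for schedule(static)
for (int i = 0; i < n; ++i)
    process_element(i);
```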
Programs: BEM4I miniApp (Chunksize 500) · BEM4I miniApp (One chunk per thread)

This best practice involves the creation of coarse-grained tasks (including communication and computation), using task dependencies to drive the workflow execution. Parallelism is then exposed at the coarser grain level whenever possible, using the following pattern as a starting point:
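A minimal sketch of such a pattern (hypothetical block data structure and routines, OpenMP task syntax; the OmpSs variant is analogous):

```c
#pragma omp parallel
#pragma omp single
{
    for (int b = 0; b < nblocks; ++b) {
        /* Coarse-grained computation task for one block. */
        #pragma omp task depend(out: block[b])
        compute_block(&block[b]);

        /* Communication task; runs as soon as its block is ready. */
        #pragma omp task depend(in: block[b])
        exchange_halo(&block[b]);
    }
    #pragma omp taskwait
}
```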
Programs: FFTXlib (ompss)

The main idea behind collapsing parallel regions is to reduce the overhead of the fork-join phases. This technique consists of substituting a sequence of parallel work-sharing regions with a single parallel region containing multiple inner work-sharing constructs.
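A minimal before/after sketch (hypothetical loops and functions f and g):

```c
/* Before: two parallel regions, i.e. two fork-join phases. */
#pragma omp parallel for
for (int i = 0; i < n; ++i) a[i] = f(i);

#pragma omp parallel for
for (int i = 0; i < n; ++i) b[i] = g(a[i]);

/* After: one parallel region with two inner work-sharing constructs; the
 * thread team is created only once. */
#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < n; ++i) a[i] = f(i);

    #pragma omp for            /* implicit barrier preserves the ordering */
    for (int i = 0; i < n; ++i) b[i] = g(a[i]);
}
```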
In other cases N may be small (close or equal to the number of cores) and the granularity of do_work may be large. This happens, for example, in the CalculiX application, from which the kernel in the figure below shows an excerpt. In that case, the individual do_work computations correspond to the solution of independent systems of equations with iterative methods that may have different convergence properties for each of the systems.
This best practice recommends removing the critical statement from the parallel region. This can be achieved by moving the critical region outside of the loop, often by caching per-thread results in some way and finalising them in a single or master-only region. This trades the serialisation and lock-management overheads for some additional serial execution, but it will often lead to an overall performance improvement due to the gains in the parallel region.
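A minimal sketch (hypothetical compute routine and a global accumulator): each thread caches its partial result, and a single region finalises the shared value once.

```c
#include <omp.h>

/* Before: the critical section inside the loop serializes every update. */
#pragma omp parallel for
for (int i = 0; i < n; ++i) {
    double v = compute(i);
    #pragma omp critical
    global_sum += v;
}

/* After: per-thread partial results, finalised once in a single region. */
double partial[MAX_THREADS] = { 0.0 };   /* MAX_THREADS: hypothetical upper bound */

#pragma omp parallel
{
    const int tid = omp_get_thread_num();

    #pragma omp for                      /* implicit barrier before the single */
    for (int i = 0; i < n; ++i)
        partial[tid] += compute(i);

    #pragma omp single
    for (int t = 0; t < omp_get_num_threads(); ++t)
        global_sum += partial[t];
}
```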
Programs: OpenMP Critical (critical section replaced by master-only execution)

Changing the domain decomposition may improve the communication imbalance of the application. Traditionally, domain decomposition algorithms only take into consideration the number of elements to be computed within each rank. A potentially interesting approach would be to modify the domain decomposition algorithm such that the cost function accounts for the number of elements within a domain as well as its number of neighbours (appropriately weighted) and the communications the resulting domain must establish with them.
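A sketch of such a cost function (the weights are hypothetical and would have to be calibrated for the application and system):

```c
/* Hypothetical cost of a candidate domain: local computation plus weighted
 * terms for the number of neighbours (message latency) and the halo size
 * (message volume). */
double domain_cost(long n_elements, int n_neighbours, long n_halo_elements)
{
    const double W_NEIGH = 50.0;   /* per-neighbour weight    */
    const double W_HALO  = 0.1;    /* per-halo-element weight */

    return (double)n_elements + W_NEIGH * n_neighbours + W_HALO * n_halo_elements;
}
```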
Programs: Communication Imbalance (rebalance)

When parallelizing a loop whose independent iterations may have different execution times, the number of iterations may be large (compared to the number of cores) and the granularity of each iteration may be small. A parallelization like the following:
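A minimal sketch (hypothetical do_work routine), where the default static schedule assigns equal contiguous chunks of iterations to the threads:

```c
#pragma omp parallel for   /* schedule(static) is typically the default */
for (int i = 0; i < N; ++i)
    do_work(i);            /* iterations may have very different runtimes */
```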
To achieve maximum memory bandwidth the developer needs to align memory accesses to 128-byte boundaries. The ideal situation is sequential access by all the threads in a warp, as shown in the following figure, where 32 threads in a warp access 32 consecutive words of memory. However, sequential access is no longer required (it was for early generations of GPGPUs); for example, a pattern in which the threads of a warp access the same 128-byte segments in a permuted order will still result in fully coalesced memory accesses.
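A minimal sketch (hypothetical OpenCL C kernel): consecutive threads access consecutive words, so each warp touches exactly one aligned 128-byte segment per load and per store.

```c
/* Thread gid accesses word gid: the 32 threads of a warp read and write 32
 * consecutive 4-byte words, i.e. one aligned 128-byte segment each. */
__kernel void scale(__global const float *in, __global float *out, const float a)
{
    const int gid = get_global_id(0);
    out[gid] = a * in[gid];            /* fully coalesced load and store */

    /* A strided access such as in[2 * gid] would spread the warp's requests
     * over several 128-byte segments and waste bandwidth. */
}
```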
Programs: GPU Kernel Optimized

When the individual iterations of a for loop have significant differences in runtime, a manual distribution or the naive OpenMP for loop might lead to sub-optimal results. However, OpenMP provides several advanced worksharing features that help the developer to mitigate this situation: several scheduling options to distribute the loop iterations among the threads and, since OpenMP 4.5, the taskloop construct, which creates tasks for the loop iterations that are then scheduled by the OpenMP runtime.
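A minimal sketch of both options (hypothetical loop body):

```c
/* Option 1: a runtime schedule hands out iterations (or chunks) to threads as
 * they become idle. */
#pragma omp parallel for schedule(guided)
for (int i = 0; i < n; ++i)
    irregular_work(i);

/* Option 2 (OpenMP >= 4.5): taskloop turns chunks of iterations into tasks
 * that the OpenMP runtime schedules. */
#pragma omp parallel
#pragma omp single
{
    #pragma omp taskloop grainsize(32)
    for (int i = 0; i < n; ++i)
        irregular_work(i);
}
```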
The fundamentals of this best practice lie in the rule: do not reinvent the wheel. If a developer recognizes a well-known math routine that possibly causes a bottleneck in the program, it is recommended to do a short survey of available libraries that implement the recognized routine and suit the program.
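For example, a hand-written Cholesky factorization can often be replaced by a single LAPACK call (sketch, assuming the LAPACKE C interface and a symmetric positive definite matrix stored in row-major order):

```c
#include <lapacke.h>

/* A is an n x n symmetric positive definite matrix (row-major). On success the
 * lower triangle of A is overwritten with its Cholesky factor L. */
int factorize(double *A, int n)
{
    lapack_int info = LAPACKE_dpotrf(LAPACK_ROW_MAJOR, 'L', n, A, n);
    return (int)info;   /* 0: success, > 0: matrix not positive definite */
}
```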
Programs: SIFEL kernel (factorization by math library)

Writing unstructured data with differing amounts of data per process to a contiguous file is challenging. If the ordering of the data in the file is not important, one approach to achieve good performance with MPI I/O is to pre-compute per-process offsets into the file so that each process can write its local data starting from that position without interfering with the other processes.
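A minimal sketch in C with MPI (hypothetical local buffers): an exclusive prefix sum of the local element counts gives each process its file offset, and all processes then write collectively without overlapping.

```c
#include <mpi.h>

void write_unstructured(const char *fname, const double *local_data, long local_n)
{
    /* Exclusive prefix sum of the element counts over all ranks. */
    long offset_elems = 0;
    MPI_Exscan(&local_n, &offset_elems, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        offset_elems = 0;   /* MPI_Exscan leaves rank 0's result undefined */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, fname,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Offset byte_offset = (MPI_Offset)offset_elems * (MPI_Offset)sizeof(double);
    MPI_File_write_at_all(fh, byte_offset, local_data, (int)local_n,
                          MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
}
```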
Programs: Writing unstructured data to a contiguous file using MPI I/O with offsets

The idea behind the use of multidependencies is to split the iteration space into tasks, precompute which of those tasks are "incompatible" (i.e., modify at least one shared node), and leverage the multidependencies feature of OpenMP to achieve at runtime a scheduling effect comparable to coloring, but with a higher level of parallelism and more dynamic behavior.
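A minimal sketch using the iterator-based dependency syntax of OpenMP 5.0 (hypothetical connectivity arrays n_nodes and nodes_of_task):

```c
/* One task per chunk of elements. Each task declares an inout dependence on
 * every node it may update; tasks sharing no node run concurrently, while
 * conflicting tasks are serialized by the runtime (a dynamic alternative to
 * coloring). */
#pragma omp parallel
#pragma omp single
for (int t = 0; t < ntasks; ++t) {
    #pragma omp task depend(iterator(k = 0 : n_nodes[t]), \
                            inout: node_data[nodes_of_task[t][k]])
    assemble_chunk(t);
}
```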
Programs: Alya assembly (multidependencies)

When Python applications have compute-heavy functions operating on generic data types, it is possible to drastically increase sequential performance by using Numba or NumPy. Both packages can be used together or separately, depending on the code's needs.
Programs: Python loops (numba+numpy) · Python loops (numba) · Python loops (numpy)

In some codes, the problem space is divided into a multi-dimensional grid on which the computations are performed. This pattern typically results in codes with multiple nested loops iterating over the grid, as shown in the following code snippet.
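A minimal hypothetical example of such nested grid loops, together with the OpenMP collapse clause that merges them into a single iteration space:

```c
/* The pattern: nested loops over a 3-D grid (hypothetical stencil kernel). */
for (int k = 0; k < nz; ++k)
    for (int j = 0; j < ny; ++j)
        for (int i = 0; i < nx; ++i)
            out[k][j][i] = stencil(in, i, j, k);

/* collapse(3) exposes nz*ny*nx iterations to the worksharing construct instead
 * of only the nz outer ones. */
#pragma omp parallel for collapse(3)
for (int k = 0; k < nz; ++k)
    for (int j = 0; j < ny; ++j)
        for (int i = 0; i < nx; ++i)
            out[k][j][i] = stencil(in, i, j, k);
```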
Programs: OpenMP nested loop collapse

There are multiple approaches to tackling load imbalances in an application. This best practice shows how over-decomposition with, e.g., OpenMP tasking can help to reduce load imbalances as an alternative to classic loop-based solutions. In the initial high-level example presented in the pattern description, the workload of each iteration might differ or depend on i, but it is not known beforehand. Distributing the iterations across threads can then lead to load imbalances. A conventional loop-based approach could use a dynamic schedule to mitigate the imbalance by dynamically assigning chunks of iterations to idling threads.
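A minimal sketch contrasting the loop-based mitigation with task-based over-decomposition (hypothetical work routine; CHUNK is chosen so that there are many more chunks than threads):

```c
/* Loop-based: a dynamic schedule hands out chunks of iterations to idle threads. */
#pragma omp parallel for schedule(dynamic, CHUNK)
for (int i = 0; i < n; ++i)
    work(i);                       /* runtime may depend on i */

/* Over-decomposition with tasking: one task per chunk, scheduled by the runtime. */
#pragma omp parallel
#pragma omp single
for (int start = 0; start < n; start += CHUNK) {
    #pragma omp task firstprivate(start)
    {
        const int end = (start + CHUNK < n) ? start + CHUNK : n;
        for (int i = start; i < end; ++i)
            work(i);
    }
}
```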
Programs: Sam(oa)² (Chameleon) · Sam(oa)² (OpenMP tasking)

An easy way to implement communication and computation overlap consists of leveraging the existing communication and computational tasks, relying on task dependencies to handle their synchronization, and linking the application with the TAMPI library.
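A minimal sketch (hypothetical data and routines; it assumes the application is linked against TAMPI so that MPI calls issued from inside tasks do not block the executing thread):

```c
#pragma omp parallel
#pragma omp single
{
    /* Computation task. */
    #pragma omp task depend(out: data_A)
    compute(data_A);

    /* Communication task: with TAMPI, the blocking receive inside the task
     * does not stall the thread; other ready tasks keep running. */
    #pragma omp task depend(out: data_B)
    MPI_Recv(data_B, count, MPI_DOUBLE, src, tag, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);

    /* Consumer task: starts as soon as both producers have finished. */
    #pragma omp task depend(in: data_A, data_B)
    combine(data_A, data_B);

    #pragma omp taskwait
}
```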
Parallel applications often exhibit a pattern where computation and communication are serialized. For example, the following pseudo code performs computation on some data_A and, after this computation is finished, communicates some data_B between processes:
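A minimal sketch of such a serialized structure (hypothetical routines):

```c
for (int step = 0; step < nsteps; ++step) {
    compute(data_A);      /* computation phase                         */
    exchange(data_B);     /* communication phase (MPI calls on data_B) */
}                         /* the two phases never overlap              */
```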
We consider it good practice to taskify these operations, allowing them to execute concurrently and well before the actual MPI call in the case of packs, or well after it in the case of unpacks. These operations are typically not very large and are strongly memory-bandwidth bound. For that reason we think it is good practice not to parallelize each of them with a fork-join parallel loop, as the granularity would be very fine and the overhead would have a significant impact. The size typically varies a lot between the packs/unpacks to/from different processes. Using a sufficiently large grain size may allow some of these operations to be parallelized when the message is large, while executing as a single task for small buffers (see the dependence flow between the pack and MPI_Isend() operations in the following code).
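A minimal sketch of the taskified packs (hypothetical pack routine, buffers and tag; the dependence between each pack and its MPI_Isend is carried by the corresponding send buffer, and MPI_THREAD_MULTIPLE support is assumed since the MPI calls are issued from tasks):

```c
/* Assumed to run inside a parallel/single (task-generating) region. */
for (int n = 0; n < nneigh; ++n) {

    /* Pack task: can run far ahead of the actual send, concurrently with
     * other packs and with computation. */
    #pragma omp task depend(out: sendbuf[n])
    pack(&domain, neighbour[n], sendbuf[n]);

    /* Send task: released as soon as "its" buffer has been packed. */
    #pragma omp task depend(in: sendbuf[n]) depend(out: req[n])
    MPI_Isend(sendbuf[n], count[n], MPI_DOUBLE, neighbour[n],
              TAG_HALO, MPI_COMM_WORLD, &req[n]);
}
/* The requests in req[] must be completed later, e.g. with MPI_Waitall(). */
```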
This best practice uses a parallel library for file I/O to write to a single binary file. A parallel library can make optimal use of any underlying parallel file system and will give better performance than serial file I/O. Additionally, reading and writing binary data is more efficient than handling the equivalent ASCII data.
Programs: Parallel File I/O (parallel-library-mpiio-collective-access) · Parallel File I/O (parallel-library-mpiio-independent-access) · Parallel File I/O (parallel-library-netcdf-collective-access) · Parallel File I/O (parallel-library-netcdf-independent-access) · Parallel File I/O (serial-ascii)

This best practice uses multiple files for reading and writing, e.g. one file per process. This approach may be appropriate when a single file isn't required, e.g. when writing checkpoint data for restarting on the same number of processes, or where it is optimal to aggregate the multiple files at the end.
Programs: Parallel File I/O (parallel-ascii-multifile) · Parallel File I/O (parallel-binary-multifile) · Parallel File I/O (serial-ascii)

This text describes a programming pattern for GPUs where the computation is performed iteratively. Between iterations, data exchange among MPI ranks takes place. The code implements a classical domain decomposition where each patch is processed by a dedicated MPI rank using a single GPU. In the ideal case, the existing domain decomposition can be reused while the computation is performed by the GPU instead of the CPU.
Programs: CPU to GPU, OpenCL version

The recommended best practice in this case consists of postponing the wait on the send buffer as much as possible, in order to potentially overlap the send operation with the computational phase. There are several ways to delay this operation.
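A minimal sketch (hypothetical buffers and compute routine):

```c
/* Before: the wait directly follows the Isend, so nothing overlaps. */
MPI_Isend(sendbuf, count, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD, &req);
MPI_Wait(&req, MPI_STATUS_IGNORE);
compute();

/* After: the wait is postponed until the send buffer is needed again, so the
 * transfer can proceed in the background during compute(). */
MPI_Isend(sendbuf, count, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD, &req);
compute();
MPI_Wait(&req, MPI_STATUS_IGNORE);   /* only now may sendbuf be reused */
```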
Programs: False communication-computation overlap (postpone-wait)

A simple way to address the issue would be to sort the list in a way that avoids such endpoint contention. Optimal communication schedules can be computed, but in practice, simply starting each list with the first neighbor whose rank is higher than the sender's, and proceeding circularly to the lower-ranked neighbors once the highest rank in the communicator is passed, will probably reduce the endpoint-contention effect.
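A minimal sketch of that reordering (hypothetical neighbor list, assumed to be sorted by rank in ascending order):

```c
/* Rotate the sorted neighbor list so that each sender starts with its first
 * neighbor of higher rank and wraps around to the lower-ranked ones. Different
 * senders thus start by targeting different receivers, which reduces endpoint
 * contention. */
void reorder_neighbors(int my_rank, const int *sorted, int n, int *ordered)
{
    int start = 0;
    while (start < n && sorted[start] < my_rank)
        ++start;                              /* first neighbor with rank > my_rank */

    for (int i = 0; i < n; ++i)
        ordered[i] = sorted[(start + i) % n]; /* proceed circularly */
}
```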
This best practice recommends that, if the critical block is performing a reduction operation, it be replaced by the OpenMP reduction clause, which has a much lower overhead than a critical section.
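A minimal sketch (hypothetical accumulation loop):

```c
/* Before: every update is serialized by the critical section. */
#pragma omp parallel for
for (int i = 0; i < n; ++i) {
    #pragma omp critical
    sum += x[i];
}

/* After: each thread gets a private copy of sum, combined once at the end. */
#pragma omp parallel for reduction(+: sum)
for (int i = 0; i < n; ++i)
    sum += x[i];
```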
Programs: OpenMP Critical (master) · OpenMP Critical (critical section replaced by a reduction)

The best practice presented here can be used in parallel algorithms where a large fraction of the runtime is spent calculating data on individual processes and then communicating that data between processes. For such applications, the performance can be improved by replicating computation across processes to avoid communication. A simple example can be found in molecular dynamics, where we have the computation of the interactions between N particles. The equations of motion to be solved are:
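For classical molecular dynamics these are Newton's equations of motion with pairwise inter-particle forces; in a standard notation (the exact symbols used in the original are an assumption here):

```latex
m_i \frac{\mathrm{d}^2 \mathbf{r}_i}{\mathrm{d}t^2}
   = \mathbf{F}_i
   = \sum_{\substack{j = 1 \\ j \neq i}}^{N} \mathbf{F}_{ij},
   \qquad i = 1, \dots, N,
```

where r_i and m_i are the position and mass of particle i, and F_ij is the force exerted on particle i by particle j.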
In order to dynamically balance the computational work among MPI ranks, a small set of routines must be implemented that perform the following tasks:
This best practice recommends aligning the data layout and the data access pattern in order to use the available resources efficiently and take advantage of, e.g., the caching behavior and vectorization units of the underlying architecture.
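A minimal sketch in C (row-major storage): traversing the matrix with the unit-stride index innermost matches the data layout, uses whole cache lines and enables vectorization of the inner loop.

```c
/* Cache-unfriendly: column-wise traversal of a row-major array causes
 * stride-N accesses, so only one element per fetched cache line is used. */
for (int j = 0; j < N; ++j)
    for (int i = 0; i < N; ++i)
        sum += a[i][j];

/* Cache-friendly: row-wise traversal gives contiguous, unit-stride accesses. */
for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
        sum += a[i][j];
```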
Programs: Access Pattern Bench (optimized)

There are multiple approaches to tackling load imbalances between processes in an application that might run on physically different compute nodes. Especially if the workload is not known a priori and might also change dynamically over time, it is hard to apply a proper domain decomposition or rebalancing steps during the application run. Furthermore, a complete global redistribution of data and workload after, e.g., every iteration of a simulation might be too expensive. This best practice presents an approach using over-decomposition with tasks and task migration to mitigate the load imbalances between processes.
Programs: Sam(oa)² (Chameleon)

In many scientific applications, parallel matrix-matrix multiplications represent the computational bottleneck. Consider the following matrix-matrix multiplication:
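The concrete operation is assumed here to be of the generic form C = A * B; a naive triple loop and the corresponding call to an optimized BLAS routine (CBLAS interface, square n x n matrices in row-major order) look as follows:

```c
#include <cblas.h>

/* Naive triple loop: correct, but with poor cache reuse and vectorization. */
void matmul_naive(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double s = 0.0;
            for (int k = 0; k < n; ++k)
                s += A[i * n + k] * B[k * n + j];
            C[i * n + j] = s;
        }
}

/* Library call: delegates the same computation to a tuned BLAS implementation. */
void matmul_blas(int n, const double *A, const double *B, double *C)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
}
```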
Programs: Tuning BLAS routines invocation

When thinking about parallelizing an application, one should always try to apply the parallelization at an upper level of the call-tree hierarchy. The higher the level of parallelization, the higher the degree of parallelism that can be exploited by the application. This best practice shows a scenario where moving the parallelization to an upper level improves the performance significantly.
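A minimal sketch (hypothetical call tree with an outer loop over nk items calling an inner routine with m-way parallelism):

```c
/* Before: only the low-level routine is parallel; the outer loop is serial and
 * a fork-join is paid for every call. */
void inner(int k)
{
    #pragma omp parallel for
    for (int i = 0; i < m; ++i)
        work(k, i);
}

void driver(void)
{
    for (int k = 0; k < nk; ++k)
        inner(k);
}

/* After: the parallelization is moved up the call tree; threads process whole
 * outer iterations, the inner routine stays serial, and only one parallel
 * region is needed. */
void driver_parallel(void)
{
    #pragma omp parallel for
    for (int k = 0; k < nk; ++k)
        inner_serial(k);   /* same computation as inner(), without the pragma */
}
```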
Programs: JuKKR kloop (openmp)