These recommendations address the whole design space, from application refactoring to using or proposing new features in the system software or hardware architecture. Our holistic vision of co-design aims at addressing each issue at the level (or split between levels) that maximizes the gain at minimal cost.
The following entries will suggest basic directions and provide links to code or raw data that can be used to further explore the options and quantify the benefits.
A number of methods for the automatic generation of vector instructions have been developed for optimizing compilers. They allow vectorizing code without control branches, or with certain templated branches. Here we consider automatic vectorization of loops that contain no control branches. The basic idea of automatic vectorization is to unroll the loop to create copies of the original scalar expression, and then to replace groups of scalar instructions with vector instructions. Automatic vectorization is a special case of automatic parallelization, in which the program code is converted from a scalar implementation, which processes one pair of operands at a time, to a vector implementation, which processes one operation on several operand pairs at once.
Programs: For loops full auto-vectorization ·
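As an illustration, the following hypothetical kernel (a standard saxpy-style loop, not taken from any of the linked programs) has the branch-free, independent-iteration shape that auto-vectorizers handle well; with `gcc -O3` the compiler unrolls it and replaces groups of scalar multiply-adds with SIMD instructions.

```c
#include <stddef.h>

/* Branch-free loop over independent iterations: a prime candidate for
 * auto-vectorization.  Conceptually the compiler unrolls the loop and
 * replaces groups of scalar multiply-adds with vector instructions that
 * apply one operation to several operand pairs at once.  The `restrict`
 * qualifier tells the compiler x and y do not alias, which is often
 * required for vectorization to be legal. */
void saxpy(size_t n, float a, const float *x, float *restrict y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Whether vectorization actually happens can be checked with compiler reports such as `gcc -fopt-info-vec` or `clang -Rpass=loop-vectorize`.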
Modern HPC file systems are designed to handle writing large amounts of data to a file. However, if the application performs many write operations that each write data in very small chunks, this leads to inefficient use of the file system's capabilities. This becomes even more apparent if the file system is connected to the HPC system via a network: in that case each write operation initiates a separate data transfer over the network, so every time this happens one also pays the latency of establishing the connection to the file system. These costs quickly add up when a large number of small write operations occurs.
Programs: CalculiX I/O (buffered) ·
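A minimal sketch of the remedy, using only standard C stdio (the function name and record format are illustrative, not CalculiX's actual routines): give the stream a large user-space buffer via `setvbuf()`, so that many small logical writes are coalesced into few large physical writes to the file system.

```c
#include <stdio.h>

/* Instead of issuing one file-system operation per small record,
 * accumulate the records in a user-space buffer and flush it in large
 * chunks.  setvbuf() gives stdio a 1 MiB buffer, so the many small
 * fprintf() calls below are coalesced into few large writes.
 * Note: the static buffer makes this sketch non-reentrant. */
int write_records_buffered(const char *path, const double *vals, size_t n)
{
    FILE *f = fopen(path, "w");
    if (!f) return -1;
    static char iobuf[1 << 20];                 /* 1 MiB stdio buffer   */
    setvbuf(f, iobuf, _IOFBF, sizeof iobuf);    /* must precede other I/O */
    for (size_t i = 0; i < n; ++i)
        fprintf(f, "%g\n", vals[i]);            /* small logical writes */
    return fclose(f);                           /* flushes the buffer   */
}
```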
A trade-off between parallelism and overhead in terms of the POP metrics corresponds to balancing Load Balance against Communication Efficiency. In general, we aim to create enough chunks (parallelism) to utilize all available computational resources (threads). At the same time, we need to keep the related overhead low; lower overhead is reflected in higher Communication Efficiency. On the other hand, runtime scheduling of threads is often needed to balance the workload effectively. This leads to better Load Balance at the cost of lower Communication Efficiency.
Programs: BEM4I miniApp (Chunksize 500) · BEM4I miniApp (One chunk per thread) ·
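The trade-off is exposed by the chunk size of a dynamic schedule, as in this hypothetical kernel (the value 500 mirrors the "Chunksize 500" variant above, but the kernel itself is only a stand-in): smaller chunks give the runtime more pieces to balance, larger chunks reduce scheduling overhead.

```c
#include <stddef.h>

/* CHUNK is the tuning knob: small values favour Load Balance (many
 * schedulable pieces), large values favour low overhead (fewer
 * scheduling events).  One chunk per thread is the other extreme. */
#define CHUNK 500

double sum_of_squares(const double *x, size_t n)
{
    double sum = 0.0;
    /* Idle threads grab the next CHUNK iterations at runtime. */
    #pragma omp parallel for schedule(dynamic, CHUNK) reduction(+: sum)
    for (ptrdiff_t i = 0; i < (ptrdiff_t)n; ++i)
        sum += x[i] * x[i];
    return sum;
}
```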
This best-practice involves the creation of coarse-grain tasks (including communication and computation), using task dependencies to drive the work-flow execution. Parallelism is then exposed at the coarser grain level whenever possible, using the following pattern as a starting point:
Programs: FFTXlib (ompss) ·
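A minimal sketch of the pattern in OpenMP syntax (a hypothetical three-task work-flow, not the FFTXlib code itself; OmpSs uses equivalent `depend`-style annotations): the `depend` clauses express the data flow, so task B waits for A while the independent task C may run concurrently with both.

```c
/* Coarse-grain tasks whose dependencies drive the work-flow: the
 * runtime schedules B only after A has produced `a`, while the
 * independent task C can overlap with either. */
int pipeline(void)
{
    int a = 0, b = 0, c = 0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task shared(a) depend(out: a)              /* task A */
        a = 1;
        #pragma omp task shared(a, b) depend(in: a) depend(out: b) /* B  */
        b = a + 1;
        #pragma omp task shared(c) depend(out: c)              /* task C */
        c = 10;
        #pragma omp taskwait
    }
    return a + b + c;   /* 1 + 2 + 10 */
}
```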
In other cases N may be small (close or equal to the number of cores) and the do_work may be large. This happens, for example, in the CalculiX application, from which the kernel in the figure below shows an excerpt. In that case, the individual do_work computations correspond to the solution of independent systems of equations with iterative methods that may have different convergence properties for each of the systems.
This best practice recommends removing the critical statement from the parallel region. This can be achieved by moving the critical region outside of the loop, often by caching per-thread results in some way and finalising the results in a single or master-only region. This trades the serialisation and lock-management overheads for some additional serial execution, but will often lead to an overall performance improvement due to the gains in the parallel region.
Programs: OpenMP Critical (critical section replaced by master-only execution) ·
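A sketch of the per-thread-caching variant on a simple accumulation (the kernel is hypothetical; `omp_get_max_threads()` and `omp_get_thread_num()` are the standard OpenMP runtime calls, with serial fallbacks so the sketch also builds without OpenMP):

```c
#include <stddef.h>
#include <stdlib.h>

#ifdef _OPENMP
#include <omp.h>
#else
static int omp_get_max_threads(void) { return 1; }  /* serial fallback */
static int omp_get_thread_num(void)  { return 0; }
#endif

/* Each thread accumulates into its own slot instead of entering a
 * critical section on every iteration; the short serial finalization
 * runs after the parallel region. */
double sum_no_critical(const double *x, ptrdiff_t n)
{
    int nthreads = omp_get_max_threads();
    double *partial = calloc((size_t)nthreads, sizeof *partial);
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        #pragma omp for
        for (ptrdiff_t i = 0; i < n; ++i)
            partial[tid] += x[i];        /* no lock, no serialization */
    }
    double sum = 0.0;                    /* serial finalization */
    for (int t = 0; t < nthreads; ++t)
        sum += partial[t];
    free(partial);
    return sum;
}
```

In production code the per-thread slots would be padded to cache-line size to avoid false sharing.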
Changing the domain decomposition may improve the communication imbalance of the application. Traditionally, domain decomposition algorithms only take into consideration the number of elements to be computed within the rank. A potentially interesting approach would be to modify the domain decomposition algorithm such that the cost function accounts for the number of elements within a domain, as well as its number of neighbours (appropriately weighted) and the communications the resulting domain must establish with them.
Programs: Communication Imbalance (rebalance) ·
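The modified cost function could take a shape like the following (entirely hypothetical: the names and the linear combination are an illustration of the idea, not the algorithm used by the linked program; the weights would need to be calibrated per application):

```c
/* Per-domain cost that the decomposition would balance: computation
 * (elements) plus weighted communication terms (neighbour count and
 * communication volume).  w_neigh and w_comm are tuning parameters. */
double domain_cost(double elements, double neighbours, double comm_volume,
                   double w_neigh, double w_comm)
{
    return elements + w_neigh * neighbours + w_comm * comm_volume;
}
```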
Consider parallelizing a loop whose independent iterations may have different execution times, where the number of iterations is large (compared to the number of cores) and the granularity of each iteration may be small. A parallelization like:
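A sketch of such a parallelization (`work_item()` is a hypothetical stand-in for the variable-cost iteration body): a plain parallel for with the default static schedule, which assigns each thread a fixed block of iterations and can therefore leave threads idle when iteration costs differ.

```c
#include <stddef.h>

static double work_item(size_t i) { return (double)(i % 7); } /* stand-in */

/* Naive parallelization: default (static) schedule divides the
 * iterations into equal fixed blocks, one per thread, regardless of
 * how long each iteration actually takes. */
double process_all(size_t n)
{
    double total = 0.0;
    #pragma omp parallel for reduction(+: total)
    for (ptrdiff_t i = 0; i < (ptrdiff_t)n; ++i)
        total += work_item((size_t)i);
    return total;
}
```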
When the individual iterations of a for loop have significant differences in runtime, a manual distribution or the naive OpenMP for loop might lead to sub-optimal results. However, OpenMP provides several advanced worksharing constructs that help the developer mitigate this situation: several scheduling options to distribute the loop iterations among the threads and, since OpenMP 4.5, the taskloop construct, which creates tasks for chunks of the loop iterations that are then scheduled by the OpenMP runtime.
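Both remedies can be sketched on a hypothetical kernel (`iter_cost()` stands in for the variable-cost iteration; chunk and grain sizes are illustrative tuning values):

```c
#include <stddef.h>

static double iter_cost(size_t i) { return (double)(i % 7); } /* stand-in */

/* schedule(dynamic, 4): idle threads grab the next 4 iterations at
 * runtime instead of receiving one fixed block up front. */
double sum_dynamic(ptrdiff_t n)
{
    double total = 0.0;
    #pragma omp parallel for schedule(dynamic, 4) reduction(+: total)
    for (ptrdiff_t i = 0; i < n; ++i)
        total += iter_cost((size_t)i);
    return total;
}

/* taskloop (OpenMP 4.5): the loop is split into tasks of roughly
 * grainsize iterations each, which the OpenMP runtime schedules, so
 * threads that finish cheap iterations early pick up further tasks. */
void fill_taskloop(double *out, ptrdiff_t n)
{
    #pragma omp parallel
    #pragma omp single
    #pragma omp taskloop grainsize(4)
    for (ptrdiff_t i = 0; i < n; ++i)
        out[i] = iter_cost((size_t)i);
}
```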
When Python applications have computation-heavy functions using generic data types, it is possible to drastically increase sequential performance by using Numba or NumPy. Both packages can be used together or separately, depending on the code's needs.
Programs: Python loops (numba+numpy) · Python loops (numba) · Python loops (numpy) ·
In some codes, the problem space is divided into a multi-dimensional grid on which the computations are performed. This pattern typically results in codes with multiple nested loops iterating over the grid, as shown in the following code snippet.
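A hypothetical nest of this shape (a simple 2-D grid update, not taken from any of the linked programs) also shows a common companion technique: `collapse(2)` merges the two loops into a single iteration space for worksharing, which helps when a single grid dimension alone has too few iterations to occupy all threads.

```c
#include <stddef.h>

/* Multiple nested loops iterating over a 2-D grid stored row-major.
 * collapse(2) lets OpenMP distribute the combined nx*ny iteration
 * space rather than only the nx outer iterations. */
void scale_grid(double *g, ptrdiff_t nx, ptrdiff_t ny, double s)
{
    #pragma omp parallel for collapse(2)
    for (ptrdiff_t i = 0; i < nx; ++i)
        for (ptrdiff_t j = 0; j < ny; ++j)
            g[i * ny + j] *= s;
}
```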
Parallel applications often exhibit a pattern where computation and communication are serialized. For example, the following pseudo code performs computation on some data_A and, after this computation is finished, some data_B is communicated between processes:
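A hypothetical sketch of the serialized pattern and a taskified variant that lets the two phases overlap (assuming data_B is independent of the computation on data_A; `exchange_data()` is a placeholder for the actual MPI communication):

```c
#include <stddef.h>

/* Placeholder for the MPI transfer of data_B between processes. */
static void exchange_data(double *b, size_t n)
{
    for (size_t i = 0; i < n; ++i) b[i] += 1.0;   /* pretend transfer */
}

/* Serialized pattern: communication starts only after the computation
 * has completely finished. */
void step_serialized(double *data_A, double *data_B, size_t n)
{
    for (size_t i = 0; i < n; ++i) data_A[i] *= 2.0;  /* compute data_A */
    exchange_data(data_B, n);                          /* then move data_B */
}

/* Taskified variant: if data_B does not depend on the computation,
 * both phases become tasks and may run concurrently. */
void step_taskified(double *data_A, double *data_B, size_t n)
{
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task                  /* computation on data_A */
        for (size_t i = 0; i < n; ++i) data_A[i] *= 2.0;
        #pragma omp task                  /* communication of data_B */
        exchange_data(data_B, n);
        #pragma omp taskwait
    }
}
```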
We consider it a good practice to taskify these operations, allowing them to execute concurrently and well before the actual MPI call in the case of packs, or well after it in the case of unpacks. These operations are typically not very large and are strongly memory-bandwidth bound. For that reason we think it is good practice not to parallelize each of them with a fork-join parallel do, as the granularity would be very fine and the overhead would have a significant impact. The size typically varies a lot between the packs/unpacks to/from different processes. Using a sufficiently large grain size may allow some of these operations to be parallelized when the message is large, while executing as a single task for small buffers (see the dependence flow between MPI_Isend() operations in the following code).
This best practice uses a parallel library for file I/O to write to a single binary file. A parallel library can make optimal use of any underlying parallel file system, and will give better performance than serial file I/O. Additionally, reading and writing binary data is more efficient than reading and writing the equivalent ASCII data.
Programs: Parallel File I/O (parallel-library-mpiio-collective-access) · Parallel File I/O (parallel-library-mpiio-independent-access) · Parallel File I/O (parallel-library-netcdf-collective-access) · Parallel File I/O (parallel-library-netcdf-independent-access) · Parallel File I/O (serial-ascii) ·
This best practice uses multiple files for reading and writing, e.g. one file per process. This approach may be appropriate when a single file isn't required, e.g. when writing checkpoint data for restarting on the same number of processes, or where it is optimal to aggregate multiple files at the end.
Programs: Parallel File I/O (parallel-ascii-multifile) · Parallel File I/O (parallel-binary-multifile) · Parallel File I/O (serial-ascii) ·
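A minimal file-per-process sketch (the function and file-name scheme are hypothetical; in an MPI code `rank` would come from `MPI_Comm_rank()`): each process writes its own binary checkpoint file, so no coordination on a shared file is needed.

```c
#include <stdio.h>

/* One binary file per "rank".  Binary fwrite of the raw array is both
 * compact and cheaper than formatting the equivalent ASCII. */
int write_checkpoint(int rank, const double *data, size_t n)
{
    char name[64];
    snprintf(name, sizeof name, "checkpoint.%04d.bin", rank);
    FILE *f = fopen(name, "wb");
    if (!f) return -1;
    size_t written = fwrite(data, sizeof *data, n, f);
    fclose(f);
    return written == n ? 0 : -1;
}
```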
The recommended best-practice in this case consists of postponing the wait on the send buffer as much as possible, in order to potentially overlap the send operation with the computational phase. There are several ways to delay this operation.
Programs: False communication-computation overlap (postpone-wait) ·
A simple way to address the issue would be to sort the lists in ways that avoid such endpoint contention. Optimal communication schedules can be computed but, in practice, simply starting each list at the first neighbor whose rank is higher than the sender's, and wrapping around circularly to the lower-ranked neighbors once the highest rank in the communicator is reached, will probably reduce the endpoint-contention effect.
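The reordering itself is a simple rotation, sketched here for a neighbor list sorted by ascending rank (function name and signature are illustrative): each sender then starts its communication sequence at a different endpoint.

```c
/* Rotate a rank-sorted neighbor list so the sender first targets the
 * first neighbor with rank higher than its own, then wraps around to
 * the lower-ranked ones.  Different senders thus hit different
 * endpoints first, spreading the load over the endpoints. */
void rotate_neighbors(int sender, const int *neighbors, int n, int *out)
{
    int start = 0;
    while (start < n && neighbors[start] <= sender)  /* first rank > sender */
        ++start;
    for (int k = 0; k < n; ++k)
        out[k] = neighbors[(start + k) % n];
}
```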
This best practice recommends that, if the critical block is performing a reduction operation, it be replaced by the OpenMP reduction clause, which has a much lower overhead than a critical section.
Programs: OpenMP Critical (master) · OpenMP Critical (critical section replaced by a reduction) ·
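On a hypothetical dot-product kernel, the pattern looks as follows: each thread keeps a private partial sum and the runtime combines them once at the end, instead of taking a lock on every iteration.

```c
#include <stddef.h>

/* The critical-section accumulation replaced by the OpenMP reduction
 * clause: per-thread private copies of `sum` are combined once after
 * the loop, avoiding per-iteration locking entirely. */
double dot(const double *x, const double *y, ptrdiff_t n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+: sum)
    for (ptrdiff_t i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```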
In order to dynamically balance the computational work among MPI ranks, a small set of routines must be implemented that perform the following tasks:
This is one of several alternatives for parallelizing the packing and unpacking operations when using the Message Passing Interface. The main idea consists of encapsulating the send and receive calls within an unstructured task, giving the opportunity to overlap those communications with computation or with other communications.