These recommendations will address the holistic space form application refactoring to using or proposing new features in the system software or hardware architecture. Our holistic vision of co-design aims at addressing the issues at the level (or split between levels) that maximizes the gain at the minimal cost.
The following entries will suggest basic directions and provide links to code or raw data that can be used to further explore the options and quantify the benefits.Using buffered write operations
Modern HPC file systems are designed to handle writing a large amount of data to a file. However, if the application performs a lot of write operations that write data in very small chunks this leads to an inefficient use of the file system’s capabilities. This becomes even more apparent if the file system is connected to the HPC system via a network. In this case each write operation initiates a separate data transfer over the network. So every time this happens one also pays the latency to establish the connection to the file system. This effect can easily sum up if a large number of small write operations happens.List of programs: Coarse grain (comp + comm) taskyfication with dependencies
This best-practice involves the creation of coarse grain tasks (including communication and computation), using task dependencies to drive work-flow execution. Parallelism is then exposed at the coarser grain level, whenever it is possible. Using as the starting point the following pattern:List of programs: Conditional nested tasks within an unbalanced phase
In other cases N may be small (close or equal to the number of cores) and the
do_work may be large. This happens for example in the Calculix
application from which the kernel in the figure below shows an
excerpt. In that case, the individual
do_work computations correspond to the
solution of independent systems of equations with iterative methods that may
have different convergence properties for each of the systems.
This best practice recommends removing the critical statement from the parallel region. This can be achieved by moving the critical region outside of the loop, often by caching per-thread results in some way, and finalising the results in a single or master only region. This trades the serialisation and lock-management overheads for some additional serial execution but will often lead to overall performance improvement due to performance gains in the parallel region.List of programs: Re-consider domain decomposition
Changing the domain decomposition may improve the communication imbalance of the application. Traditionally, domain decomposition algorithms only take into consideration the number of elements to be computed within the rank. A potentially interesting approach would be to modify the domain decomposition algorithm such that the cost function accounts for the number of elements within a domain, as well as its number of neighbours (appropriately weighted) and the communications the resulting domain must establish with them.List of programs: Dynamic loop scheduling
When parallelizing a loop which independent iterations may have different execution time, the number of iterations is large (compared to the number of cores), and the granularity of each iteration may be small. A parallelization like:List of programs: Parallelize packing and unpacking regions
We consider it is a good practice to taskify these operations allowing them to
execute concurrently and far before the actual MPI call in the case of packs or
far after in the case of unpacks. These operations are typically not very large
and very memory bandwidth bound. For that reason we think it is a good practice
not to parallelize each of them with fork join parallel do for each of them as
granularity will be very fine and the overhead will have a significant impact.
The size typically varies a lot between different pack/unpacks to/from
different processes. Using a sufficiently large grain size may allow for some
of these operations to be parallelized if the message is large while executing
as a single task for small buffers (see dependence flow between
MPI_Isend() operations in the following code).
This best practice uses a parallel library for file I/O to write to a single binary file. A parallel library can make optimal use of any underlying parallel file system, and will give better performance than serial file I/O. Additionally, reading and writing binary is more efficiency than writing the equivalent ASCII data.List of programs: Parallel multi-file I/O
This best practice uses multiple files for reading and writing, e.g. one file per process. This approach may be appropriate when a single file isn’t required, e.g. when writing checkpoint data for restarting on the same number of processes, or where it is optimal to aggregate multiple files at the end.List of programs: Postpone the execution of non-blocking send waits operations
The recommended best-practice in this case will consist on postponing the execution of wait on send buffer as much as possible, in order to potentially overlap the send operation with the Computational phase. There are several ways we can delay this operation.List of programs: Re-schedule communications
A simple way to address the issue would be to sort the list in ways that avoid such endpoint contention. Optimal communication schedules can be computed, but in practice, just starting each list by the first neighbor with rang higher that the sender and proceeding circularly to the lower ranked neighbor when the number of processes in the communicator is reached will probably reduce the endpoint contention effect.List of programs: Replacing critical section with reduction
This best practice recommends that if the critical block is performing a reduction operation, this be replaced by the OpenMP reduction clause which has a much lower overhead than a critical section.List of programs: Taskifying communications
This is one of the several alternatives to parallelize the packing and unpacking operations when using a Message Passing Interface. The main idea consist on encapsulating send and receives calls within an unstructured task, giving the opportunity to overlap those communication with computation or with other communication.List of programs: