A Centre of Excellence in HPC

Access Pattern Bench is a set of programs that simulate various memory access patterns which can arise in applications and affect performance and efficiency. The kernel does not use a parallel programming model such as MPI or OpenMP; it focuses on serial access patterns in order to investigate cache and vectorization behavior. Nevertheless, these access patterns can also appear in the individual processes or threads of parallel workloads.
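
As a minimal sketch of what "access pattern" means here, the following Python snippet visits the same flat, row-major array in two different orders. The function names and the tiny array are illustrative assumptions, not part of the benchmark. Python lists do not expose cache effects the way C or Fortran arrays do, so the sketch only shows the index order (unit stride versus a stride of `cols`), not the performance difference.

```python
def sum_row_major(data, rows, cols):
    """Visit a flat row-major array with unit stride (contiguous access)."""
    total = 0.0
    for i in range(rows):
        for j in range(cols):
            total += data[i * cols + j]
    return total


def sum_column_major(data, rows, cols):
    """Visit the same array column by column, i.e. with a stride of `cols`."""
    total = 0.0
    for j in range(cols):
        for i in range(rows):
            total += data[i * cols + j]
    return total


rows, cols = 4, 3
data = [float(k) for k in range(rows * cols)]
# Both traversals give the same sum; only the order of memory accesses differs.
assert sum_row_major(data, rows, cols) == sum_column_major(data, rows, cols)
```

In a compiled language the second traversal would typically touch a new cache line on almost every access once `cols` is large enough, which is the kind of behavior such a benchmark is meant to expose.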

Alya is a simulation code for high performance computational mechanics. Alya solves coupled multiphysics problems using high performance computing techniques for distributed and shared memory supercomputers, together with vectorization and optimization at the node level.

BEM4I is a library of parallel boundary element based solvers developed at IT4Innovations National Supercomputing Center. It supports solutions of the Laplace, Helmholtz, Lamé, and wave equations. The library implements OpenMP and hybrid OpenMP/MPI parallelization. The development is focused on an efficient implementation for multi- and many-core architectures. System matrices assembled within the BEM are generally dense, and the library uses the Adaptive Cross Approximation technique to approximate them. The resulting linear system is solved by an iterative solver chosen according to the properties of the system matrix. For the Helmholtz and wave equations the solver is the GMRES method; for Laplace and Lamé it can be the CG method.

A code for timing parallel matrix multiplication, where all processes have access to the data in A and B, and the parallelisation is achieved by splitting B over the columns, i.e.:
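
As a rough illustration of the column-splitting idea (not the kernel's own source), the sketch below lets each hypothetical process compute a contiguous block of columns of C = A·B. The helper names `matmul_columns` and `split_columns` and the small matrices are assumptions made for this example; the "processes" are simulated by a plain loop rather than by MPI ranks.

```python
def matmul_columns(A, B, col_start, col_end):
    """Compute columns [col_start, col_end) of C = A @ B."""
    n_rows, n_inner = len(A), len(B)
    return [[sum(A[i][k] * B[k][j] for k in range(n_inner))
             for j in range(col_start, col_end)]
            for i in range(n_rows)]


def split_columns(n_cols, n_procs):
    """Return contiguous [start, end) column ranges, one per process."""
    base, extra = divmod(n_cols, n_procs)
    ranges, start = [], 0
    for p in range(n_procs):
        end = start + base + (1 if p < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


A = [[1, 2], [3, 4]]
B = [[5, 6, 7], [8, 9, 10]]
# Each simulated process computes its own block of columns of C.
blocks = [matmul_columns(A, B, s, e) for s, e in split_columns(3, 2)]
# Concatenate the column blocks row by row to assemble the full product.
C = [sum((blk[i] for blk in blocks), []) for i in range(len(A))]
```

Because every process reads all of A but only its own columns of B, no data needs to be exchanged during the multiplication itself; only the resulting column blocks have to be gathered.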

CalculiX is a free three dimensional structural finite element analysis program. It supports linear and non-linear calculations of static, dynamic and thermal problems. The code is written in C and Fortran. Parallelization is achieved using the pthread programming model.

This is a simple molecular dynamics code with a large amount of communication relative to the computation. The code solves Newton’s second law of motion for every atom: \({\bf F}=m{\bf a}\).
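
To make the role of \({\bf F}=m{\bf a}\) concrete, here is a minimal single-particle sketch of one common time integrator, velocity Verlet. The choice of integrator, the one-dimensional setting, and the harmonic `spring_force` are assumptions for illustration only; the description above does not say which scheme or force field the kernel uses.

```python
def velocity_verlet_step(x, v, force, mass, dt):
    """Advance one particle by one time step using a = F / m."""
    a_old = force(x) / mass
    x_new = x + v * dt + 0.5 * a_old * dt * dt
    a_new = force(x_new) / mass
    v_new = v + 0.5 * (a_old + a_new) * dt
    return x_new, v_new


def spring_force(x, k=1.0):
    """Hypothetical 1D harmonic force F = -k x, used only for illustration."""
    return -k * x


x, v = 1.0, 0.0
for _ in range(1000):
    x, v = velocity_verlet_step(x, v, spring_force, mass=1.0, dt=0.001)
# Total energy (kinetic + potential) stays close to its initial value of 0.5.
energy = 0.5 * v * v + 0.5 * x * x
```

In a parallel molecular dynamics code the force evaluation usually needs positions of atoms owned by other processes, which is where the communication cost mentioned above comes from.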

The *Communication Imbalance* kernel is a synthetic program which reproduces a
communication pattern between several MPI processes. Initially it computes a
connectivity matrix that records which ranks communicate with each other,
and it also preassigns a given number of elements to each rank.

This kernel code implements the solution of the 3D diffusion equation. There are currently three different implementations: *cpu_diffusion*, which uses a single CPU core; *cpu_openmp_diffusion*, which uses multiple CPU cores via OpenMP; and *opencl_diffusion*, where the iterations are computed on the GPU while the CPU launches kernels and manages the data transfer between MPI ranks.
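
The sketch below shows what a single iteration of such a solver can look like, assuming an explicit time step with a 7-point stencil on a uniform grid and fixed boundary values. These choices, the function name `diffusion_step`, and the parameter `alpha` (standing for D·dt/h²) are assumptions for illustration; the kernel's actual discretization is not described here.

```python
def diffusion_step(u, alpha):
    """One explicit step on a 3D grid; alpha = D*dt/h**2, boundaries held fixed."""
    nx, ny, nz = len(u), len(u[0]), len(u[0][0])
    new = [[row[:] for row in plane] for plane in u]  # copy, keeps boundary values
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            for k in range(1, nz - 1):
                neighbours = (u[i - 1][j][k] + u[i + 1][j][k] +
                              u[i][j - 1][k] + u[i][j + 1][k] +
                              u[i][j][k - 1] + u[i][j][k + 1])
                new[i][j][k] = u[i][j][k] + alpha * (neighbours - 6.0 * u[i][j][k])
    return new


n = 5
u = [[[0.0] * n for _ in range(n)] for _ in range(n)]
u[2][2][2] = 1.0  # single hot cell in the centre
# For this explicit scheme alpha must not exceed 1/6 to remain stable.
u = diffusion_step(u, alpha=0.1)
```

Each interior cell depends only on values from the previous step, so the triple loop can be distributed over cores or GPU threads; only the cells on subdomain boundaries have to be exchanged between ranks.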

DuMuX DUNE is a free and open-source simulator for flow and transport processes in porous media, written in C++. This is the DuMuX DUNE kernel, which implements one of the communication and computation patterns found in DuMuX DUNE. The kernel implements a sparse alltoallv communication pattern where computation is performed on the individual communicated buffers.

The *False communication-computation overlap* kernel is a synthetic program which reproduces a communication/computation
pattern between several MPI processes.

FFTXlib is the stand-alone kernel that represents the *Fast Fourier
Transformation* (FFT) algorithm used in the *Quantum ESPRESSO* application, one
of the most widely used plane-wave *Density Functional Theory* (DFT) codes in the
materials science community. The FFT kernel implements a layered MPI
communication with FFT task groups to split the cost of collective
communication operations to balance the impact on the performance.

*For loops auto-vectorization* covers the essentials of optimizing the utilization of vector instructions to compute a given data-parallel workload. In this context, the compiler can provide valuable information about the limitations of the program and also hints on how to modify the code to fully optimize it.

The GPU kernel implements a basic matrix multiplication.

The Juelich KKR code family (JuKKR) is a collection of codes developed at Forschungszentrum Juelich implementing the Korringa-Kohn-Rostoker (KKR) Green’s function method to perform density functional theory calculations. Since the KKR method is based on a multiple scattering formalism it allows for highly accurate all-electron calculations.

JuPedSim is an open source framework for simulating, analyzing and visualizing pedestrian dynamics in complex geometries, with the possibility for several exits and obstacles.

In many parallel codes, the problem space is divided into a multi-dimensional grid on which the computations are performed. When the computations on the grid are independent of one another, the work can be divided among multiple resources (e.g. CPUs). This pattern typically results in codes with multiply nested loops iterating over the grid space.

An oil & gas code exhibited the openmp-critical-section pattern, and the computational aspects of the original code are recreated here. The original application solves the 3D wave equation: \(\frac{\partial^{2}u}{\partial t^{2}} = c^{2}\nabla^{2}u\) using the pseudospectral method. For the WP7 kernel, however, the finite difference method was selected for simplicity; it re-creates the computational profile of the original code. The code contains a critical section within a parallel OpenMP loop, which greatly reduces its performance.
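
For orientation, one standard finite difference discretization of the wave equation uses second-order central differences in time and space. The uniform grid spacing \(h\), the time step \(\Delta t\), and the index notation \(u^{n}_{i,j,k}\) (time level \(n\), grid point \(i,j,k\)) are assumptions introduced here; the order of accuracy used in the kernel is not stated above.

\[
u^{n+1}_{i,j,k} = 2u^{n}_{i,j,k} - u^{n-1}_{i,j,k}
+ \frac{c^{2}\,\Delta t^{2}}{h^{2}}
\left( u^{n}_{i+1,j,k} + u^{n}_{i-1,j,k} + u^{n}_{i,j+1,k} + u^{n}_{i,j-1,k} + u^{n}_{i,j,k+1} + u^{n}_{i,j,k-1} - 6\,u^{n}_{i,j,k} \right)
\]

Each grid point is updated from values at the two previous time levels only, so the updates within one time step are independent of each other and, in principle, need no mutual exclusion.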

A naive approach to file I/O in parallel software is for one process to sequentially read/write ASCII data to/from a single file (e.g. using the C fscanf and fprintf commands) with point to point communications to share the data with all other processes.

The kernel *Python loops* is a synthetic program based on a real world HPC Python script that reproduces an inefficient way
to write loop-compute algorithms in Python.
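
The contrast can be sketched with a dot product, written once as an explicit index loop and once with built-ins. Both functions are hypothetical examples and are not taken from the kernel or from the script it is based on.

```python
def dot_index_loop(a, b):
    """Loop-heavy style: explicit indexing and accumulation in the interpreter."""
    total = 0.0
    for i in range(len(a)):
        total = total + a[i] * b[i]
    return total


def dot_builtin(a, b):
    """Same result, with the iteration pushed into the built-ins sum() and zip()."""
    return sum(x * y for x, y in zip(a, b))


a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
assert dot_index_loop(a, b) == dot_builtin(a, b) == 32.0
```

Array libraries such as NumPy usually go one step further by executing the whole loop in compiled code, which is typically where the largest gains are found.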

RankDLB demonstrates performance issues arising in programs where the computational load per MPI rank evolves over time and therefore creates a load imbalance among MPI ranks. The computational problem must contain a coupling between MPI ranks where data is exchanged between ranks after the computation of a single iteration has completed.
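
One simple way to quantify such an imbalance is the ratio of the average to the maximum per-rank compute time. The function below is a sketch under that assumption, and the timings are made-up values for illustration; they are not measurements from RankDLB.

```python
def load_balance(compute_times):
    """Ratio of average to maximum per-rank compute time (1.0 = perfectly balanced)."""
    return sum(compute_times) / len(compute_times) / max(compute_times)


# Hypothetical per-rank compute times (seconds) at two different iterations.
early = [1.0, 1.0, 1.0, 1.0]
late = [1.0, 1.0, 1.0, 3.0]   # rank 3 has accumulated extra work

early_balance = load_balance(early)
late_balance = load_balance(late)
```

Because the ranks exchange data after every iteration, the faster ranks wait for the slowest one, so a ratio of 0.5 means that on average half of the available compute time is spent waiting.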

Sam(oa)² stands for *Space-Filling Curves and Adaptive Meshes for Oceanic And Other Applications* and has been developed at TU Munich. Sam(oa)² is a PDE framework that can, for example, simulate wave propagation during tsunamis using adaptive meshes and space-filling curves. It can either simulate built-in sample scenarios like *radial_dam_break* (perfectly balanced) and *oscillating lake* (imbalanced), or be used with datasets from real tsunamis by providing corresponding bathymetric and displacement data files.

SIFEL (SImple Finite ELements) is an open source finite element (FE) code that has been under development since 2001 at the Department of Mechanics of the Faculty of Civil Engineering of the Czech Technical University in Prague. It is a C/C++ code parallelized with MPI. This is the SIFEL kernel, which implements the LDL matrix factorization of a sparse matrix in symmetric skyline format and the computation of the Schur complement matrix.
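
The factorization itself can be sketched on a small dense matrix. This is a simplified illustration under stated assumptions: it uses dense storage instead of the skyline format, performs no pivoting, and the function name `ldl_decompose` is hypothetical rather than taken from SIFEL.

```python
def ldl_decompose(A):
    """Return (L, d) with A = L * diag(d) * L^T for a symmetric matrix A."""
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    d = [0.0] * n
    for j in range(n):
        # Diagonal entry: remove contributions of the columns already factorized.
        d[j] = A[j][j] - sum(L[j][k] ** 2 * d[k] for k in range(j))
        for i in range(j + 1, n):
            s = sum(L[i][k] * L[j][k] * d[k] for k in range(j))
            L[i][j] = (A[i][j] - s) / d[j]
    return L, d


A = [[4.0, 2.0, 2.0],
     [2.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]
L, d = ldl_decompose(A)
```

A skyline implementation follows the same recurrences but stores and visits only the entries between the first nonzero of each column and the diagonal.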