List of Programs

Access Pattern Bench

Access Pattern Bench is a set of programs to simulate various memory access patterns that can arise in applications and have an impact on performance and efficiency. This kernel does not use any parallel programming model like MPI or OpenMP but is mainly focusing on serial access patterns investigating cache and vectorization behavior. Nevertheless, these access patterns can also appear for single processes/threads in parallel workloads.

Alya assembly

Alya is a simulation code for high performance computational mechanics. Alya solves coupled multiphysics problems using high performance computing techniques for distributed and shared memory supercomputers, together with vectorization and optimization at the node level.

BEM4I miniApp

BEM4I is a library of parallel boundary element based solvers developed at IT4Innovations National Supercomputing Center. It supports solutions of the Laplace, Helmholtz, Lame, and wave equations. The library implements OpenMP and hybrid OpenMP/MPI parallelization. The development is focused on an efficient implementation utilizing multi- and many-core architecture. System matrices assembled within the BEM are generally dense and the library uses Adaptive Cross Approximation technique to approximate them. The resulted linear system is solved by the appropriate iterative solver based on the quality of the system matrix. For Helmholtz and wave equations, the solver is the GMRES method, for Laplace and Lame it can be the CG method.

BLAS Tuning

A code for timing parallel matrix multiplication, where all processes have access to the data in A and B, and the parallelisation is achieved by splitting B over the columns, i.e.:

CalculiX IO

CalculiX is a free three dimensional structural finite element analysis program. It supports linear and non-linear calculations of static, dynamic and thermal problems. The code is written in C and Fortran. Parallelization is achieved using the pthread programming model.

CalculiX solver

CalculiX is a free three dimensional structural finite element analysis program. It supports linear and non-linear calculations of static, dynamic and thermal problems. The code is written in C and Fortran. Parallelization is achieved using the pthread programming model.

Communication computation trade-off

This is a simple molecular dynamics code, with a relatively large amount of communication relative to the computation. The code solves Newton’s second law of motion for every atom: \({\bf F}=m{\bf a}\).

Communication Imbalance

The Communication Imbalance kernel is a synthetic program which reproduces a communication pattern in between several MPI processes. Initially it computes a connectivity matrix which represents from/to which ranks will comunicate to one each other, and it also preassigns a given number of elements to each rank.


This kernel code implements the solution of the 3D diffusion equation. There are currently three different implementations: cpu_diffusion which uses a single CPU core, cpu_openmp_diffusion which useses multiple CPU cores via OpenMP and opencl_diffusion where the iterations are computed on the GPU while the CPU launches kernels and manages the date transfer between MPI ranks.

DuMuX DUNE kernel

DuMuX DUNE is a free and open-source simulator for flow and transport processes in porous media written in C++. This is the DuMuX DUNE kernel, which implement one of the communication and computation patterns found in DuMuX DUNE. The kernel implements a sparse alltoallv communication pattern where computation is performed on the individual communicated buffers.

False communication-computation overlap

The False communication-computation overlap kernel is a synthetic program which reproduces a communication/computation pattern between several MPI processes.


FFTXlib is the stand-alone kernel that represents the Fast Fourier Transformation (FFT) algorithm used in the Quantum ESPRESSO application, one of the most used plane-wave Density Functional Theory (DFT) codes in the community of material science. The FFT kernel implements a layered MPI communication with FFT task groups to split the cost of collective communication operations to balance the impact on the performance.

For loops auto-vectorization

For loops auto-vectorization covers the essentials of optimizing the utilization of vector instructions to compute a given data-parallel workload. In this context, the compiler can provide valuable information about the limitations of the program and also hints on how to modify the code to fully optimize it.


The GPU kernel implemts a basic matrix multiplication.

juKKR kloop

The Juelich KKR code family (JuKKR) is a collection of codes developed at Forschungszentrum Juelich implementing the Korringa-Kohn-Rostoker (KKR) Green’s function method to perform density functional theory calculations. Since the KKR method is based on a multiple scattering formalism it allows for highly accurate all-electron calculations.


JuPedSim is an open source framework for simulating, analyzing and visualizing pedestrian dynamics in complex geometries, with the possibility for several exits and obstacles.

OMP Collapse

In many parallel codes, the problem space is divided in a multi-dimensional grid, where the computations are performed. When the computations on the grid are independent one from another, it is possible to divide the work among multiple resources (e.g. CPUs). This pattern typically results in codes with multi-nested loops, iterating over the grid space.

OpenMP Critical

An oil & gas code had the openmp-critical-section pattern and the computational aspects of the original code are recreated here. This application solves the 3D wave equation: \(\frac{\partial^{2}u}{\partial t^{2}} = c^{2}\nabla^{2}u\) using the pseudospectral method. However, for the WP7 kernel the finite difference method was selected for the purpose of simplicity, which re-creates the computational profile of the original code. The code contains a critical section within a parallel OpenMP loop, which greatly slows its performance.

Parallel File I/O

A naive approach to file I/O in parallel software is for one process to sequentially read/write ASCII data to/from a single file (e.g. using the C fscanf and fprintf commands) with point to point communications to share the data with all other processes.

Python loops

The kernel Python loops is a synthetic program based on a real world HPC Python script that reproduces an inefficient way to write loop-compute algorithms in Python.


RankDLB demonstrates performance issues arising in programs where the computational load per MPI rank evolves over time and therefore creates a load imbalance among MPI ranks. The computational problem must contain a coupling between MPI ranks where data is exchanged between ranks after the computation of a single iteration has completed.


Sam(oa)² stands for Space-Filling Curves and Adaptive Meshes for Oceanic And Other Applications and has been developed at TU Munich. Sam(oa)² is a PDE framework that can, for example, simulate wave propagations during tsunamis using adaptive meshes and space filling curves. It can either simulate built-in sample scenarios like radial_dam_break (perfectly balanced) and oscillating lake (imbalanced), or be used with datasets from real tsunamis by providing corresponding bathymetric and displacement data files.

SIFEL kernel

SIFEL (SImple Finite ELements) is an open source computer finite element (FE) code that has been developing since 2001 at the Department of Mechanics of the Faculty of Civil Engineering of the Czech Technical University in Prague. It is C/C++ code parallelized with MPI. This is the SIFEL kernel, which implement the LDL matrix factorization from a sparse matrix in symmetric skyline format and the computation of the Schur complement matrix.