A Centre of Excellence in HPC

Repository: [home] and version downloads: [.zip] [.tar.gz] [.tar.bz2] [.tar]

Patterns and behaviours: Load imbalance due to computational complexity (unknown a priori) ·

Implemented best practices: Conditional nested tasks within an unbalanced phase ·

This program represents the behavior of the *GMRES* solver found in the CalculiX application. However, in contrast to the original program this version now uses the *OpenMP* programming model instead of the pthread one.
It solves the first non-symmetric linear system occuring in the very first timestep of simulation of airflow through a bend pipe.

The structure of the program is given by the following (pseudo) code:

```
gmres_serial(int i)
{
// get data chunk based on i
// perform GMRES solver on data chunk in serial
}
```

```
#pragma omp parallel num_threads(NUM_CPUS)
{
int thread_num = omp_get_thread_num();
gmres_serial(thread_num);
}
```

In order to solve a large sparse linear system of size `1,536,000 x 1,536,000`

with more than 10 million non zero elements `#T`

smaller subsystems are created, where `#T`

denotes the number of worker threads used.
Each thread solves one of these small subsystems on its own using a serial GMRES implementation.
So basically, each thread computes a subset of rows of the final solution vector of the whole system.

By having a close look to the original pthreads version of the program one may recognize that the structure of the thread creation is very similar to an OpenMP parallel region.
Hence, in this version of the program the threads are created using the *OpenMP* programming model.
The program creates as many threads as given by the value of `NUM_CPUS`

which can be set using the `OMP_NUM_THREADS`

environment variable.

Additionally, this version of the program also implements a best-practice in order to improve the related pattern Load imbalance due to computational complexity (unknown a prior).
The best practice conditionally uses *nested tasks* within an unbalanced phase.
For the nested approach consider the structure of one GMRES iteration. The most important subroutines are the following:

`matvec`

- performs a matrix vector multiplication to compute a new Krylov subspace vector`dorth`

- orthogonalization of a set of Krylov subspace vectors`daxpy`

- scaled vector dot product used inside`dorth`

`msolve`

- applies preconditioning matrix to a vector`drlcal`

- calculates scaled residual vector

All of these subroutines perform some kind of vector operation. These operations are performed by looping over the individual vector elements and applying some kind of scalar operation to them. Thus, they can quite easily be parallelized by using nested tasks that perform a subset of these scalar operations depending on some grainsize as described by the best-practices “Conditional nested tasks within an unbalanced phase”.

Instead of using nested tasks one could also implement *nested parallel regions* in a similar manner.
For demonstration purposes only a similar approach called “Conditional nested parallel region within an unbalanced phase” is implemented as well.

Along with the source code of the program comes a *Makefile*.
This *Makefile* offers three different targets

`pattern-omp`

`solution-nested`

`solution-taskloop`

The `pattern-omp`

target will build the program representing the original solver behavior, however, using *OpenMP* instead of *pthreads*.
Executing this program will show a behavior related to “Load imbalance due to computational complexity (unknown a priori)”.

The target `solution-taskloop`

will create a binary with an implementation of the best-practice “Conditional nested tasks within an unbalanced phase”.

Similarly, the `solution-nested`

target will create a binary of the program implementing “Conditional nested parallel region within an unbalanced phase” for comparison with the above best-practice.

Moreover, this *Makefile* offers the possibility to instrument the code using *scorep*.
Therefor put a comment to line 2 and line 3 and uncomment lines 6,7 and 8 such that the beginning of the *Makefile* looks like this:

```
# run without scorep instrumentation
#SCOREP =
#SCOREP_FLAG =
# run with scorep instrumentation
SCOREP = scorep --user --nocompiler
SCOREP_NOOMP = $(SCOREP) --noopenmp
SCOREP_FLAG = -DSCOREP
###############################################
```