CalculiX solver (OpenMP)

Version name: CalculiX solver (OpenMP); a version of the CalculiX solver program.
Repository: [home] and version downloads: [.zip] [.tar.gz] [.tar.bz2] [.tar]
Patterns and behaviours: Implemented best practices: Conditional nested tasks within an unbalanced phase

This program reproduces the behavior of the GMRES solver found in the CalculiX application. In contrast to the original program, however, this version uses the OpenMP programming model instead of pthreads. It solves the first non-symmetric linear system occurring in the very first timestep of a simulation of airflow through a bent pipe.

The structure of the program is given by the following (pseudo) code:

gmres_serial(int i)
    // get data chunk based on i
    // perform GMRES solver on data chunk in serial

#pragma omp parallel num_threads(NUM_CPUS)
{
    int thread_num = omp_get_thread_num();
    gmres_serial(thread_num);
}

In order to solve a large sparse linear system of size 1,536,000 x 1,536,000 with more than 10 million non-zero elements, #T smaller subsystems are created, where #T denotes the number of worker threads used. Each thread solves one of these smaller subsystems on its own using a serial GMRES implementation. In effect, each thread computes a subset of the rows of the final solution vector of the whole system.

A close look at the original pthreads version of the program reveals that the structure of its thread creation is very similar to an OpenMP parallel region. Hence, in this version of the program the threads are created using the OpenMP programming model. The program creates as many threads as given by the value of NUM_CPUS, which can be set using the OMP_NUM_THREADS environment variable.

Additionally, this version of the program implements a best practice in order to mitigate the related pattern Load imbalance due to computational complexity (unknown a priori). The best practice conditionally uses nested tasks within an unbalanced phase. For the nested approach, consider the structure of one GMRES iteration. The most important subroutines are the following:

  1. matvec - performs a matrix vector multiplication to compute a new Krylov subspace vector
  2. dorth - orthogonalization of a set of Krylov subspace vectors
  3. daxpy - scaled vector addition (y = a*x + y) used inside dorth
  4. msolve - applies preconditioning matrix to a vector
  5. drlcal - calculates scaled residual vector

All of these subroutines perform some kind of vector operation. These operations loop over the individual vector elements and apply some scalar operation to each of them. Thus, they can quite easily be parallelized using nested tasks that each perform a subset of these scalar operations, depending on some grainsize, as described by the best practice “Conditional nested tasks within an unbalanced phase”.
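A minimal sketch of what such a conditionally tasked vector operation might look like, using daxpy as the example (the GRAINSIZE value and the function name are illustrative, not taken from the actual implementation):

```c
/* Illustrative threshold; the real grainsize is a tuning parameter. */
#define GRAINSIZE 10000

/* Sketch of daxpy (y = a*x + y) with conditional nested tasks.
 * Meant to be called from inside the unbalanced outer phase,
 * i.e. from within an existing OpenMP parallel region. */
void daxpy_tasks(int n, double a, const double *x, double *y) {
    if (n > GRAINSIZE) {
        /* large vector: spawn nested tasks of ~GRAINSIZE iterations each */
        #pragma omp taskloop grainsize(GRAINSIZE)
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];
    } else {
        /* small vector: task overhead would dominate, so stay serial */
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];
    }
}
```

The conditional matters because each worker already owns one subsystem: spawning nested tasks is only worthwhile when a vector is long enough for the extra work to outweigh the task-creation overhead.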

Instead of using nested tasks, one could also implement nested parallel regions in a similar manner. For demonstration purposes, a similar approach called “Conditional nested parallel region within an unbalanced phase” is implemented as well.
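Applied to the same daxpy operation, the nested-parallel-region variant might look as follows (again an illustrative sketch, with a hypothetical threshold): the `if()` clause activates the inner region only when the vector is large enough.

```c
#define GRAINSIZE 10000   /* illustrative threshold, as above */

/* Sketch of daxpy (y = a*x + y) with a conditional nested parallel
 * region: for short vectors the if() clause disables the region and
 * the loop runs on the encountering thread alone. */
void daxpy_nested(int n, double a, const double *x, double *y) {
    #pragma omp parallel for if(n > GRAINSIZE)
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

Note that when this is called from inside the outer parallel region, nested parallelism has to be enabled (e.g. via `omp_set_max_active_levels(2)`); otherwise the inner region executes with a single thread.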

Building the program

Along with the source code of the program comes a Makefile. This Makefile offers three different targets:

  • pattern-omp
  • solution-nested
  • solution-taskloop

The pattern-omp target will build the program representing the original solver behavior, however, using OpenMP instead of pthreads. Executing this program will show a behavior related to “Load imbalance due to computational complexity (unknown a priori)”.

The target solution-taskloop will create a binary with an implementation of the best-practice “Conditional nested tasks within an unbalanced phase”.

Similarly, the solution-nested target will create a binary of the program implementing “Conditional nested parallel region within an unbalanced phase” for comparison with the above best-practice.

Moreover, this Makefile offers the possibility to instrument the code using Score-P. To do so, comment out lines 2 and 3 and uncomment lines 6, 7, and 8, so that the beginning of the Makefile looks like this:

# run without scorep instrumentation

# run with scorep instrumentation
SCOREP = scorep --user --nocompiler
SCOREP_NOOMP = $(SCOREP) --noopenmp
The following experiments have been registered: