BEM4I miniApp

Version's name: BEM4I miniApp ; a version of the BEM4I miniApp program.
Repository: [home] and version downloads: [.zip] [.tar.gz] [.tar.bz2] [.tar]
Patterns and behaviours:

Very fine-grained tasks/chunks

Recommended best-practices:

Chunk/task grain-size trade-off (parallelism/overhead)

- Available version(s):

BEM4I miniApp (Chunksize 500)

BEM4I miniApp (One chunk per thread)

Overlapping computation and communication

This program represents the original behaviour of the matrix-vector multiplication found within the GMRES solver in the BEM4I application. The Helmholtz equation on a given domain is transformed into a system of linear equations. This global system represented by a matrix is compound from four different matrices, namely $K$, $K'$, $V$, and $D$. A parallel approach implemented in the library constructs these matrices on different processes related to different parts of the original domain, and proceed the GMRES solver with distributed global matrix and vectors.

The global system matrix is block-diagonal, compound from $N$ blocks (number of domains) of size 2x2, $A = \begin{bmatrix} A_1 & & \\ & \ddots & \\ & & A_N \end{bmatrix}, \; A_i = \begin{bmatrix} -K_i & V_i \\ D_i &K_i' \end{bmatrix}.$

Each individual matrix is decomposed and distributed among $P$ MPI processes in such a way, that each process contains some rows and columns of the original matrix. The individual matrices are further approximated using hierarchical clustering where pairs of well-seprated clusters are approximated by a product of two low-rank matrices and non-admissible clusters are assembled as dense matrices.

Matrix-vector multiplication on a global system matrix is implemented by local multiplications of four individual block-matrices ($K$, $K'$, $V$, and $D$) followed by one vector multiplication over the all global indices. The computed results are subsequently incorporated into the global system using MPI collective reduction operation MPI_Allreduce.

...
#pragma omp parallel
{
   // K
#pragma omp for schedule(dynamic, 1)
   for(int i = 0; i < k_blocks; i++)
   { ... } // apply block K_i

   // K'
#pragma omp for schedule(dynamic, 1)
   for(int i = 0; i < kt_blocks; i++)
   { ... } // apply block K'_i

   // V
#pragma omp for schedule(dynamic, 1)
   for(int i = 0; i < v_blocks; i++)
   { ... } //apply block V_i

   // D
#pragma omp for schedule(dynamic, 1)
   for(int i = 0; i < d_blocks; i++)
   { ... } // apply block D_i

   // loop over all degrees of freedom
#pragma omp for schedule(dynamic, 1)
   for(int j = 0; j < nDOFs; j++)
   { ... } // apply vector-vector

} // end of parallel region

MPI_Allreduce(...);

In this code above, the first four parallel loops are related to individual sub-matrices consisting of some number of blocks based on the problem size (in our test case it is several thousand blocks). Each iteration inside these loops performs a lot of computation. While the last parallel loop has the same number of iterations as there are degrees of freedom for the overall compound linear system (nDOFs is around 10^7). This iteration contains much less computation.

The following experiments have been registered: