BEM4I miniApp - One chunk per thread

The last loop of the compound matrix-vector multiplication in the BEM4I kernel has a very large number of iterations, each containing only a small amount of computation. The original code divided this loop into chunks of a single iteration and distributed these chunks dynamically among the threads. Such a setting leads to an immense overhead. The following images show the application states measured for the implemented best practice, with the chunk size computed so that each thread receives only one chunk. Blue stands for the computation state, yellow for OpenMP scheduling and fork-join, red for the synchronization state, and orange for MPI communication.
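The chunk size computation described above can be sketched as follows. This is a minimal illustration, not the BEM4I source: the names `chunk_size_one_per_thread` and `scale_add` are hypothetical, the loop body is a simple stand-in for the real kernel, and keeping the `dynamic` schedule while only changing the chunk size is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Smallest chunk size that covers n iterations with at most one chunk
// per thread, i.e. ceil(n / num_threads). (Hypothetical helper.)
inline std::size_t chunk_size_one_per_thread(std::size_t n, int num_threads) {
    assert(num_threads > 0);
    const std::size_t t = static_cast<std::size_t>(num_threads);
    const std::size_t chunk = (n + t - 1) / t;
    return chunk > 0 ? chunk : 1;  // OpenMP requires a positive chunk size
}

// Stand-in for the final loop: many iterations, very little work in each.
// (Assumed loop body; the real kernel differs.)
inline void scale_add(std::vector<double>& y, const std::vector<double>& x,
                      double alpha, int num_threads) {
    assert(x.size() == y.size());
    const long n = static_cast<long>(y.size());
    const long chunk =
        static_cast<long>(chunk_size_one_per_thread(y.size(), num_threads));

    // With schedule(dynamic, 1) every iteration would be scheduled on its own.
    // Here the loop is split into at most num_threads chunks instead.
    #pragma omp parallel for num_threads(num_threads) schedule(dynamic, chunk)
    for (long i = 0; i < n; ++i) {
        y[i] += alpha * x[i];
    }
}
```

For example, 10 iterations on 4 threads give a chunk size of 3, so the loop is split into chunks of 3, 3, 3 and 1 iterations. If the code is compiled without OpenMP support, the pragma is ignored and the loop simply runs serially.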

In the first image, we present a timeline of one matrix-vector multiplication, where the region of interest is the last small non-blue part just before the MPI communication.

OneIteration

The second image zooms in on this region of interest and gives a detailed view of the overhead when using one chunk per thread. There is only one yellow region at the beginning of the loop and one at the end. The execution within the loop is well balanced due to the equal distribution of the iterations and their small, constant size.

One Iteration - detail to the final loop

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreements No 676553 (POP1) and 824080 (POP2).

Currently, the project receives funding from the European High-Performance Computing Joint Undertaking (JU) under grant agreement No 101143931 (POP3). The JU receives support from the European Union's Horizon Europe research and innovation programme and Spain, Germany, France, Portugal and the Czech Republic.