The last loop of the presented compound matrix-vector multiplication in the BEM4I kernel has a very large number of iterations, each containing only a small amount of computation. The original code divided this loop into chunks of a single iteration, which were distributed dynamically among the threads. Because the runtime has to schedule every iteration separately, such a setting leads to an immense overhead.
In the following images, we show the application states measured for the implemented best practice, with the chunk size computed so that each thread receives only one chunk. Blue stands for the computation state, yellow for OpenMP scheduling and fork-join, red for the synchronization state, and orange for MPI communication.
In the first image, we present a timeline of one matrix-vector multiplication. The region of interest is the last small non-blue part just before the MPI communication.
The second image zooms in on this region of interest and gives a detailed view of the overhead when using one chunk per thread. There is only one yellow region at the beginning of the loop and one at the end. The execution within the loop is well balanced due to the equal distribution of the iterations and their constant, small size.
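
The chunk size choice discussed above can be illustrated with a minimal sketch. It is not the BEM4I code: the function names `one_chunk_per_thread` and `scaled_add`, and the loop body itself, are hypothetical stand-ins for a loop with many cheap iterations. Keeping the dynamic schedule while enlarging the chunk is also an assumption made for illustration.

```c
#include <assert.h>

#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: chunk size such that each thread receives at most
   one chunk, i.e. ceil(n_iters / n_threads), and at least 1 so that the
   value is valid in a schedule clause. */
int one_chunk_per_thread(int n_iters, int n_threads)
{
    assert(n_threads > 0);
    int chunk = (n_iters + n_threads - 1) / n_threads;
    return chunk > 0 ? chunk : 1;
}

/* Hypothetical stand-in for a loop with many cheap iterations:
   y[i] += alpha * x[i]. */
void scaled_add(int n, double alpha, const double *x, double *y)
{
    int n_threads = 1; /* serial fallback when compiled without OpenMP */
#ifdef _OPENMP
    n_threads = omp_get_max_threads();
#endif
    const int chunk = one_chunk_per_thread(n, n_threads);

    /* The setting described as the original one would correspond to
       schedule(dynamic, 1): one scheduling decision per iteration.
       Here the loop is split into at most n_threads chunks instead. */
    #pragma omp parallel for schedule(dynamic, chunk)
    for (int i = 0; i < n; ++i) {
        y[i] += alpha * x[i];
    }
}
```

With 100 iterations and 4 threads, the helper yields a chunk size of 25, so the runtime hands out four chunks in total rather than 100. When compiled without OpenMP support, the pragma is ignored and the loop runs serially.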