The last loop of the presented compound matrix-vector multiplication in the BEM4I kernel has a very large number of iterations, each of which performs only a small amount of computation. The original code divided this loop into chunks of a single iteration, and these chunks were distributed dynamically among the threads. Such a setting leads to an immense scheduling overhead.
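The difference between the two settings can be sketched with the OpenMP `schedule` clause. The snippet below is only an illustration under stated assumptions: the function name `axpy_dynamic` and the loop body `y[i] += alpha * x[i]` are hypothetical stand-ins for a cheap per-iteration update and are not taken from the BEM4I source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a loop with many cheap iterations.
 * The actual BEM4I loop body is not reproduced here.
 * `chunk` is the number of consecutive iterations a thread
 * receives each time it asks the runtime for more work. */
void axpy_dynamic(size_t n, double alpha, const double *x, double *y, int chunk)
{
    assert(chunk > 0); /* OpenMP requires a positive chunk size */

    /* chunk == 1   : one scheduling request per iteration (original setting)
     * chunk == 500 : one scheduling request per 500 iterations */
    #pragma omp parallel for schedule(dynamic, chunk)
    for (long i = 0; i < (long)n; ++i) {
        y[i] += alpha * x[i];
    }
}
```

Calling `axpy_dynamic(n, alpha, x, y, 1)` mimics the original behavior, while `axpy_dynamic(n, alpha, x, y, 500)` mimics the setting measured here. Both calls produce the same result; only the number of requests to the OpenMP runtime changes. When compiled without OpenMP support the pragma is ignored and the loop runs serially.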
In the following images, we show the application states measured for the implemented best practice with the chunksize set to 500. Blue stands for the computation state, yellow for OpenMP scheduling and fork-join, red for the synchronization state, and orange for MPI communication.
In the first image, we present a timeline of one matrix-vector multiplication, where the region of interest is the last small non-blue part just before the MPI communication.
The second image zooms in on this region of interest and gives a detailed view of the overhead when the chunksize is set to 500. Almost the whole loop is yellow, the color representing the Scheduling and Fork/Join state. This means that the efficiency of this loop is still poor and that the chunksize should be enlarged further.
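To reason about how far the chunksize might be enlarged, it can help to count how many chunks the runtime has to hand out. The sketch below is a minimal illustration, not a measured recommendation: `chunk_count` computes the number of chunks for a given chunksize, and `suggest_chunk` is a hypothetical heuristic (an assumption, not part of the original analysis) that aims for a fixed small number of chunks per thread.

```c
#include <assert.h>
#include <stddef.h>

/* Number of chunks (and hence dynamic scheduling requests)
 * needed to cover n iterations: ceil(n / chunk). */
size_t chunk_count(size_t n, size_t chunk)
{
    assert(chunk > 0);
    return (n + chunk - 1) / chunk;
}

/* Hypothetical heuristic: pick a chunksize so that each thread
 * receives roughly `chunks_per_thread` chunks. A few chunks per
 * thread keep some load balancing while limiting the overhead. */
size_t suggest_chunk(size_t n, size_t threads, size_t chunks_per_thread)
{
    assert(threads > 0 && chunks_per_thread > 0);
    size_t target = threads * chunks_per_thread;
    size_t chunk = (n + target - 1) / target; /* ceil(n / target) */
    return chunk > 0 ? chunk : 1;             /* never return 0 */
}
```

For an illustrative loop of 1,000,000 iterations (a made-up size), a chunksize of 1 yields 1,000,000 scheduling requests and a chunksize of 500 still yields 2,000. With 24 threads and 4 chunks per thread, the heuristic would suggest a chunksize of 10,417. Any such value should be confirmed by measurement.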