BEM4I miniApp - Chunk size trade-off (500)

The last loop of the compound matrix-vector multiplication in the BEM4I kernel has a very large number of iterations, each of which performs only a small amount of computation. The original code divided this loop into chunks of a single iteration, and these chunks were distributed dynamically among the threads. Such a setting leads to an immense scheduling overhead. In the following images, we show the application states measured for the implemented best practice with the chunk size set to 500. Blue stands for the computation state, yellow for OpenMP scheduling and fork-join, red for the synchronization state, and orange for MPI communication.
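To make the role of the chunk size concrete, here is a minimal sketch of a dynamically scheduled OpenMP loop with a configurable chunk size. It is not the BEM4I code: the function `scale_add`, its arguments, and the loop body are hypothetical stand-ins for a loop with many cheap iterations.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the final loop of the kernel: many iterations,
// each doing very little work. Not taken from BEM4I.
// chunk_size is the number of consecutive iterations a thread receives
// per scheduling request.
void scale_add(const std::vector<double>& x, std::vector<double>& y,
               double alpha, int chunk_size) {
    assert(x.size() == y.size());
    assert(chunk_size > 0);
    const long n = static_cast<long>(x.size());

    // With chunk_size == 1 this corresponds to the original setting, where
    // every iteration is a separate scheduling request. A larger value lets
    // each request cover more work.
    #pragma omp parallel for schedule(dynamic, chunk_size)
    for (long i = 0; i < n; ++i) {
        y[i] += alpha * x[i];
    }
}
```

Calling `scale_add(x, y, alpha, 500)` would mirror the chunk size measured here, assuming the code is compiled with OpenMP enabled (for example `-fopenmp` with GCC or Clang). Without that flag the pragma is ignored and the loop runs serially.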

In the first image, we present a timeline of one matrix-vector multiplication, where the region of interest is the last small non-blue part just before the MPI communication.

One Iteration

The second image zooms in on this region of interest and gives a detailed view of the overhead when the chunk size is set to 500. Almost the whole loop is yellow, the color representing the Scheduling and Fork/Join state. This means that the efficiency of this loop is still poor and the chunk size should be enlarged further.
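A rough way to reason about enlarging the chunk size is to count how many chunks the runtime has to hand out, since each chunk costs one scheduling request under dynamic scheduling. The helper below is an illustrative sketch; `chunk_count` and the iteration count used in the example are assumptions, not values measured in BEM4I.

```cpp
#include <cassert>
#include <cstddef>

// Number of chunks handed out for n_iterations loop iterations under
// schedule(dynamic, chunk_size): the ceiling of n_iterations / chunk_size.
// Hypothetical helper for illustration only.
std::size_t chunk_count(std::size_t n_iterations, std::size_t chunk_size) {
    assert(chunk_size > 0);
    return (n_iterations + chunk_size - 1) / chunk_size;
}
```

For a hypothetical loop of 1,000,000 iterations, a chunk size of 1 gives 1,000,000 chunks, a chunk size of 500 gives 2,000, and a chunk size of 5,000 gives 200. The trade-off is that once the number of chunks approaches the number of threads, dynamic scheduling can no longer balance the load, so the chunk size cannot be increased indefinitely.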

One Iteration - detail to the final loop