Another issue in the BEM4I kernel is a very long computation phase followed by a fairly long collective communication (the MPI_Allreduce function). During this MPI communication, all threads are idle. Since this code pattern occurs multiple times within each iteration of the GMRES solver, we may expect it to be repeated about a thousand times over a run.
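To give a feel for how the idle time accumulates, the following is a minimal back-of-the-envelope sketch, not a measurement. The class `MatVecPhase`, both helper functions, and all numeric values are hypothetical and chosen only for illustration. The sketch assumes that no thread does useful work while the blocking collective is in progress.

```python
from dataclasses import dataclass


@dataclass
class MatVecPhase:
    """Timing of one computation + MPI_Allreduce pair on one MPI process.

    All values are hypothetical; they are not taken from BEM4I traces.
    """
    compute_s: float    # threaded computation, all threads busy
    allreduce_s: float  # blocking MPI_Allreduce, threads assumed idle
    threads: int        # threads per MPI process


def idle_thread_seconds(phase: MatVecPhase, repetitions: int) -> float:
    """Thread-seconds spent idle in the collective over all repetitions."""
    return phase.threads * phase.allreduce_s * repetitions


def useful_fraction(phase: MatVecPhase) -> float:
    """Fraction of wall time in which the threads do useful work."""
    return phase.compute_s / (phase.compute_s + phase.allreduce_s)


# Illustrative numbers only (assumed, not measured).
phase = MatVecPhase(compute_s=0.75, allreduce_s=0.25, threads=24)
print(idle_thread_seconds(phase, repetitions=1000))
print(useful_fraction(phase))
```

Even a collective that is short relative to the computation adds up once it is multiplied by the thread count and the number of repetitions, which is why the pattern is worth examining.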
Here we present the useful computation and the MPI communication during one matrix-vector multiplication.