This is a version of the original BEM4I kernel code in which the work-sharing loop over all degrees of freedom in the global system is scheduled dynamically with a chunk size of 500.
...
#pragma omp parallel
{
  // apply K, K', V and D
  { ... }
  // loop over all degrees of freedom
  #pragma omp for schedule(dynamic, 500)
  for (int j = 0; j < nDOFs; j++)
  { ... }
} // end of parallel region
MPI_Allreduce(...);
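The scheduling pattern in the listing above can be tried in isolation with a minimal sketch. The names `dofWork` and `processDOFs` are hypothetical and do not come from BEM4I. The loop body is a placeholder for the elided kernel work, and the MPI reduction is left out so that the example stays self-contained. With `schedule(dynamic, chunk)`, each thread takes `chunk` consecutive iterations at a time and requests the next chunk once it has finished.

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-DOF work; a placeholder for the elided loop body.
static double dofWork(int j) {
    return 2.0 * j + 1.0;
}

// Processes all DOFs in a dynamically scheduled work-sharing loop.
// Assumption: iterations are independent, so each result[j] is written
// by exactly one thread and no synchronization is needed.
std::vector<double> processDOFs(int nDOFs, int chunk) {
    assert(nDOFs >= 0);
    assert(chunk > 0);  // OpenMP requires a positive chunk size

    std::vector<double> result(nDOFs, 0.0);

    #pragma omp parallel
    {
        // Threads grab `chunk` consecutive iterations at a time.
        #pragma omp for schedule(dynamic, chunk)
        for (int j = 0; j < nDOFs; j++) {
            result[j] = dofWork(j);
        }
    } // end of parallel region

    return result;
}
```

Calling `processDOFs(nDOFs, 500)` mirrors the chunk size used in the listing. Compiled without OpenMP support, the pragmas are ignored and the loop runs serially with the same result, which makes it easy to compare against the parallel version.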