BEM4I miniApp - Corrected chunksize with computation and communication overlap

This experiment was obtained with the branch chunksize_overlap. The branch fixes the scheduling of the last loop, which was originally badly scheduled in the chunksize branch, and overlaps computation with communication by splitting the first four computationally large loops, with matrices K, K', V and D, into segments.
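As a rough illustration of the segment idea, the sketch below splits a row range into contiguous segments and, when compiled with MPI, starts a non-blocking reduction for each segment as soon as its local work is done. It is not taken from the chunksize_overlap branch: the names segment_bounds, apply_overlapped, compute_segment and the USE_MPI macro are hypothetical, and it assumes that the reduction acts on segments of a result vector of doubles. Only MPI_Iallreduce and MPI_Waitall are calls named in the measurements.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not from BEM4I): split the row range [0, n_rows)
// into n_segments contiguous pieces whose sizes differ by at most one row.
std::vector<std::pair<std::size_t, std::size_t>>
segment_bounds(std::size_t n_rows, std::size_t n_segments) {
    std::vector<std::pair<std::size_t, std::size_t>> bounds;
    if (n_segments == 0) return bounds;
    const std::size_t base = n_rows / n_segments;
    const std::size_t extra = n_rows % n_segments;  // first `extra` segments get one more row
    std::size_t begin = 0;
    for (std::size_t s = 0; s < n_segments; ++s) {
        const std::size_t len = base + (s < extra ? 1 : 0);
        bounds.emplace_back(begin, begin + len);
        begin += len;
    }
    return bounds;
}

#ifdef USE_MPI  // assumption: only compiled when an MPI library is available
#include <mpi.h>

// Sketch of the overlap pattern: compute one segment locally, start its
// reduction without blocking, and continue with the next segment while the
// reduction is in flight. All requests are completed by one MPI_Waitall.
// compute_segment(begin, end) is a hypothetical callback that fills y[begin, end).
template <class ComputeSegment>
void apply_overlapped(double* y, std::size_t n_rows, std::size_t n_segments,
                      ComputeSegment compute_segment, MPI_Comm comm) {
    const auto bounds = segment_bounds(n_rows, n_segments);
    std::vector<MPI_Request> requests(bounds.size());
    for (std::size_t s = 0; s < bounds.size(); ++s) {
        const std::size_t begin = bounds[s].first;
        const std::size_t end = bounds[s].second;
        compute_segment(begin, end);
        MPI_Iallreduce(MPI_IN_PLACE, y + begin, static_cast<int>(end - begin),
                       MPI_DOUBLE, MPI_SUM, comm, &requests[s]);
    }
    MPI_Waitall(static_cast<int>(requests.size()), requests.data(),
                MPI_STATUSES_IGNORE);
}
#endif
```

With a single segment the reduction can only start after the whole loop has finished, so there is nothing left to overlap with. More segments let communication start earlier, at the cost of more and smaller MPI_Iallreduce calls.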

.. todo images .. Images of one iteration with 1, 4, 8 and 16 segments

Table with MPI data - timings for MPI_Waitall, MPI_Iallreduce, and time spent outside of MPI

Table with efficiencies, instructions and IPC for all different settings