With the given input file, the application run can be split into two phases. The initialisation phase starts MPI, reads the input data from disk, and builds the mesh. The second and final phase is the main computational phase. No intermediate or final output is written in this configuration.
The initialisation phase is short (roughly one minute with 192 ranks) compared to the duration of the main computational phase (roughly 6-8 hours for production runs). We have therefore excluded the initialisation phase from further detailed analysis.
In the given configuration, the main computational phase consists of 1024 iterations; in production runs, this number can be arbitrarily large. The iterations are further structured into blocks of 8 (see Fig. 1). Within each 8-iteration block, only non-blocking point-to-point MPI communication takes place. Consecutive blocks are separated by a blocking collective MPI_Allreduce, which synchronises all MPI ranks globally. For simplicity, and in order to capture multiple collective communication events between blocks, we extend our analysis to the whole main computational phase, i.e. all 1024 iterations, and refer to this region as the focus of analysis, or simply FoA.
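The block structure described above can be illustrated with a minimal Python sketch. It only models the order of communication events and performs no MPI calls. The function name `iteration_schedule` and the event labels are hypothetical, chosen for illustration. The sketch also assumes that an MPI_Allreduce occurs only between consecutive blocks and not after the last one, which the text does not state explicitly.

```python
from collections import Counter


def iteration_schedule(n_iterations=1024, block_size=8):
    """Yield (event, index) tuples modelling the communication pattern.

    Each iteration is represented by a "p2p_iteration" event (non-blocking
    point-to-point communication only). Consecutive blocks are separated by
    an "allreduce" event (blocking global synchronisation).
    """
    if n_iterations % block_size != 0:
        raise ValueError("n_iterations must be a multiple of block_size")

    n_blocks = n_iterations // block_size
    for block in range(n_blocks):
        for i in range(block_size):
            yield ("p2p_iteration", block * block_size + i)
        # Assumption: no collective after the final block.
        if block < n_blocks - 1:
            yield ("allreduce", block)


schedule = list(iteration_schedule())
counts = Counter(event for event, _ in schedule)
print(counts["p2p_iteration"], counts["allreduce"])  # prints "1024 127"
```

Under these assumptions, the FoA for the given configuration spans 128 blocks and 127 global synchronisation points, whereas an analysis restricted to a single block would contain none.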