The following experiment presents an execution of the Sam(oa)² work-sharing version. The folder scripts/calix/ contains scripts that run the application automatically and set most of the configuration. Besides measuring execution time, the scripts provide flags for creating traces with Extrae or Intel Trace Analyzer. Detailed information on how to run, reproduce and evaluate this experiment can be found in the README file located in the root folder.
This experiment was conducted on one compute node of the CLAIX-2018 cluster partition at RWTH Aachen University. Each node is a two-socket system featuring Intel Skylake Platinum 8160 CPUs with 24 cores each, running at a base frequency of 2.1 GHz. Hyper-Threading is disabled and Sub-NUMA-Clustering (SNC) is enabled. Process pinning and thread binding were applied to ensure that each thread of the application runs on a separate physical core.
Although the application can scale to a much higher number of compute nodes, processes and threads, we chose this setup to better visualize the effects of the different versions.
For these results we simulated 60 seconds of the 2011 Tohoku tsunami, using 2 processes per node with 11 OpenMP threads each. This small configuration limits the size of the trace while still showing the desired effects. The domain decomposition used 16 sections per thread to yield a sufficient degree of over-decomposition.
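The arithmetic behind this configuration can be sketched as follows. This is a minimal illustration that assumes every thread holds exactly 16 sections; the actual count may deviate at run time. The function name `section_counts` is hypothetical and not part of Sam(oa)².

```python
# Configuration values taken from the experiment description.
PROCESSES_PER_NODE = 2
THREADS_PER_PROCESS = 11
SECTIONS_PER_THREAD = 16


def section_counts(processes, threads, sections_per_thread):
    """Return (sections per process, sections per node).

    Assumes each thread owns exactly `sections_per_thread` sections.
    """
    per_process = threads * sections_per_thread
    per_node = processes * per_process
    return per_process, per_node


per_process, per_node = section_counts(
    PROCESSES_PER_NODE, THREADS_PER_PROCESS, SECTIONS_PER_THREAD
)
# Each thread is bound to its own physical core.
cores_used = PROCESSES_PER_NODE * THREADS_PER_PROCESS

print(per_process, per_node, cores_used)  # 176 352 22
```

Under these assumptions the run occupies 22 of the 48 physical cores of the node and traverses 352 sections per time step.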
The following trace shows 3 time steps of the main time-stepping loop for the work-sharing version, which assigns sections to threads in a fixed manner, similar to an OpenMP static schedule.
Figure: Trace (work-sharing)
The trace shows load imbalances between the threads within each process as well as between processes. They arise because the time to traverse a section varies, while the assignment of sections to processes and threads is static in this version.
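How a fixed assignment turns varying section costs into idle time can be illustrated with a small model. This is a sketch only, not the Sam(oa)² implementation: the helper names (`static_assignment`, `thread_times`, `imbalance`) and the per-section costs are hypothetical, and contiguous equal-sized blocks are assumed to mimic an OpenMP static schedule.

```python
def static_assignment(num_sections, num_threads):
    """Split sections into contiguous, near-equal blocks, one per thread."""
    base, extra = divmod(num_sections, num_threads)
    blocks, start = [], 0
    for t in range(num_threads):
        size = base + (1 if t < extra else 0)  # first `extra` threads get one more
        blocks.append(range(start, start + size))
        start += size
    return blocks


def thread_times(costs, blocks):
    """Sum the traversal cost of the sections owned by each thread."""
    return [sum(costs[i] for i in block) for block in blocks]


def imbalance(times):
    """Return max/mean - 1; a value of 0 means perfectly balanced."""
    mean = sum(times) / len(times)
    return max(times) / mean - 1.0


# Hypothetical costs: 4 threads with 4 sections each, where the first
# four sections are twice as expensive to traverse as the others.
costs = [2.0] * 4 + [1.0] * 12
blocks = static_assignment(len(costs), 4)
times = thread_times(costs, blocks)

print(times)                        # [8.0, 4.0, 4.0, 4.0]
print(round(imbalance(times), 2))   # 0.6
```

In this toy case the time step lasts as long as the slowest thread (8.0), so the other three threads wait for half of the step even though every thread owns the same number of sections.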