The following experiment presents an execution with the Sam(oa)² OpenMP tasking version, as well as a comparison with the work-sharing version. Folder
scripts/calix/ contains scripts to automatically run the application by specifying most of the configuration. Additionally to execution time, it has flags for creating traces using Extrae or Intel Trace Analyzer. Detailed information on how to run, reproduce and evaluate this experiment can be found in the README file located in the root folder.
This experiment has been conducted on one compute node of the CLAIX-2018 cluster partition at RWTH Aachen University. Each node is a two-socket system featuring Intel Skylake Platinum 8160 CPUs with 24 cores running at a base frequency of 2.1 GHz. Hyper-Threading is disabled and Sub-NUMA-Clustering (SNC) is enabled. Process pinning and thread binding have been applied to ensure that each thread of the application runs on a separate physical core.
Although the application can scale to a much higher number of compute nodes, processes and threads, we chose this setup to better visualize the effects of the different versions.
For these results we simulated 60 seconds of the Tohoku tsunami in 2011, using 2 processes per node, each with 11 OpenMP threads to limit the size of the trace and to be able to show the desired effects. The domain decomposition used 16 sections per thread to yield a sufficient degree of over-decomposition.
The following trace shows 3 time steps of the main time stepping loop for the work-sharing version that does fixed assignment of sections to threads similar to an OpenMP static schedule.
As shown, there are load imbalances observable between threads in each process as well as between processes, as the time to traverse a section might vary and the assignment of sections to processes and threads is static in this version.
The following trace shows the same time steps and the same time scale but here OpenMP tasks have been used to implement section traversals, to allow for load balancing between threads in a process.
|Trace (OpenMP tasking)|
As illustrated, the load of each process is much more balanced when using OpenMP tasks and the execution time has also been reduced. However, the OpenMP tasking is not able to tackle the imbalance between processes that is still present.
To compare the execution time and to measure the speedup, the same experiment has been conducted simulating 10 minutes of the tsunami scenario and using 4 processes per node with 11 OpenMP threads each (one process per NUMA domain to avoid runtime variation due to remote memory accesses).
The following table shows the execution times, speedups and the hybrid load balance efficiency (HLBE) for the tested versions.
|Sam(oa)² Version||Execution Time||Speedup (compared to baseline)||Hybrid Load Balance Efficiency (HLBE)|
|Work-sharing (baseline)||160.23 sec||1.0||78.95 %|
|OpenMP tasking||150.10 sec||1.0674||81.93 %|
This shows that tasking in this case was able to achieve a speedup of 1.067x compared to the baseline. Additionally, there is a slight increase in the hybrid load balance efficiency. As the work imbalance between sections might increase with longer simulation times, it is also expected to gain even higher speedups with this approach.