Effect of dynamic load balancing

In this experiment, the performance of the various implementations is evaluated. The CPU variants were executed on a dual-socket node with Intel Xeon Gold 6248 CPUs. The memory of each CPU is organized as a single NUMA node. The GPU variant was executed on a node with NVIDIA V100 GPUs.

The graph below shows the number of elements processed per second and per MPI rank. The performance of the CPU implementations decreases with increasing size of the data set due to the limited cache size. In contrast, the performance of the GPU implementation (OpenCL_V100) increases with increasing data size, because the GPU needs a certain amount of work to hide memory latency efficiently. Beyond 120 elements in each direction, the GPU performance decreases as well, due to the limited cache size of the GPU. At 200 elements in each direction, the performance of the GPU implementation is about 2.5 times higher than that of the OpenMP version.
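The metric used above can be made concrete with a small sketch. The text does not say how the metric was computed or how many dimensions the data set has, so the code below assumes a regular grid with the same number of elements in each direction, and the number of dimensions is a parameter. The function names `throughput_per_rank` and `relative_performance` and all input values are hypothetical and serve only as illustration. They are not measurements from this experiment.

```python
def throughput_per_rank(n_per_direction, n_dims, elapsed_seconds, n_ranks):
    """Return elements processed per second and per MPI rank.

    Assumption: the data set is a regular grid with n_per_direction
    elements along each of n_dims directions.
    """
    if elapsed_seconds <= 0 or n_ranks <= 0:
        raise ValueError("elapsed_seconds and n_ranks must be positive")
    total_elements = n_per_direction ** n_dims
    return total_elements / (elapsed_seconds * n_ranks)


def relative_performance(throughput_a, throughput_b):
    """Return how many times higher throughput_a is than throughput_b."""
    if throughput_b <= 0:
        raise ValueError("throughput_b must be positive")
    return throughput_a / throughput_b


if __name__ == "__main__":
    # Made-up inputs: a 3D grid, 2 s run time, 4 MPI ranks.
    rate = throughput_per_rank(200, 3, 2.0, 4)
    print(f"{rate:.3e} elements per second and rank")
```

Normalizing by the number of MPI ranks makes runs with different rank counts comparable, and a ratio such as the factor of about 2.5 quoted above corresponds to `relative_performance` applied to two such rates.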


This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreements No 676553 (POP1) and 824080 (POP2).

Currently, the project receives funding from the European High-Performance Computing Joint Undertaking (JU) under grant agreement No 101143931 (POP3). The JU receives support from the European Union's Horizon Europe research and innovation programme and Spain, Germany, France, Portugal and the Czech Republic.