This experiment evaluates the performance of the various implementations. The CPU variants were executed on a dual-socket node with Intel Xeon Gold 6248 CPUs. The memory of each CPU is organized as a single NUMA node. The GPU variant was executed on a node with NVIDIA V100 GPUs.
The graph below shows the number of processed elements per second and MPI rank. The performance of the CPU implementations decreases with increasing data set size due to the limited cache size. In contrast, the performance of the GPU implementation (OpenCL_V100) increases with increasing data size, as the GPU needs a certain amount of work to hide memory latency efficiently. Beyond 120 elements in each direction, the GPU performance decreases due to the limited cache size of the GPU. At 200 elements in each direction, the performance of the GPU implementation is about 2.5 times higher than that of the OpenMP version.
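To make the reported metric concrete, the following minimal sketch shows one way the throughput "processed elements per second and MPI rank" and the size of the data set could be computed. It rests on several assumptions that the text does not state: that the data set is a regular grid with the same number of elements in each direction, that it is three-dimensional, that each element is an 8-byte double, and that the measured time covers a known number of passes over the data. The function and parameter names are hypothetical and are not taken from the benchmarked code.

```python
def elements_per_second_per_rank(n_per_direction, elapsed_seconds, num_ranks,
                                 dims=3, iterations=1):
    """Throughput normalized by wall-clock time and number of MPI ranks.

    Assumption: the grid has n_per_direction elements in each of `dims`
    directions and is processed `iterations` times in `elapsed_seconds`.
    """
    if elapsed_seconds <= 0 or num_ranks <= 0:
        raise ValueError("elapsed_seconds and num_ranks must be positive")
    total_elements = (n_per_direction ** dims) * iterations
    return total_elements / (elapsed_seconds * num_ranks)


def working_set_bytes(n_per_direction, dims=3, bytes_per_element=8, arrays=1):
    """Approximate memory footprint of the data set.

    Assumption: `arrays` equally sized arrays of fixed-size elements
    (8 bytes corresponds to double precision).
    """
    return (n_per_direction ** dims) * bytes_per_element * arrays


if __name__ == "__main__":
    # Illustrative inputs only; these are not measured values.
    for n in (40, 120, 200):
        size_mib = working_set_bytes(n) / 2**20
        print(f"n = {n:3d}: working set of about {size_mib:.1f} MiB per array")
```

Under these assumptions, the footprint grows with the third power of the number of elements per direction, so a modest increase in grid size can move the working set from cache-resident to memory-bound. This is consistent with the cache-related performance drop described above.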