The omp-seq-reduction and omp-par-reduction codes were run on a single 48-core node of MareNostrum-4 at Barcelona Supercomputing Center for a range of thread counts from 1 to 48 and the results compared to those of the original omp-master code. The problem size in all cases was specified by Ni=500, Nj=500, Nk=500 and Nt = 20. The run times and the average IPC (Instructions Per Cycle) values for all three versions are shown here.
![]() |
![]() |
Clearly, the new versions of the code show much better performance and have much better avaerage IPC values, with those for the omp-seq-reduction version being marginally better. In order to compare the run times of the omp-req-reduction and omp-par-reduction kernels, these times are plotted below without the much slower master kernel. The omp-seq-reduction version can now be seen to be slightly faster at the lower thread counts. This may of course vary with details of the simulation.