The matrix (resp. vector) size used in our experimentation campaign was around 58 MB (resp. 4 MB). With such sizes, neither the matrix nor the vector can be entirely stored in the L1 or L2 cache. The vector fits in the L3 cache (allowing temporal locality), while the matrix exceeds the L3 capacity of most of our systems under test, except SPR and Grace, whose L3 caches are large enough to hold both the whole matrix and the vector. With respect to cache behaviour, every matrix element is accessed with stride 1 (perfect spatial locality) but used only once (no temporal locality), so the impact of cache size remains limited.
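As a rough working-set check, assuming a CSR-style layout with 8-byte double values and 4-byte column indices (the exact storage format is not stated here), the footprints are approximately:

\[
\text{matrix} \approx nnz \times (8 + 4)\,\text{B}, \qquad \text{vector} = n \times 8\,\text{B}.
\]

A 4 MB vector of doubles thus corresponds to roughly half a million rows and comfortably fits in a multi-megabyte L3, whereas the ~58 MB of matrix data is streamed once and exceeds the L3 capacity of most of the systems tested.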
The figure below presents the performance numbers obtained on a single core for our various systems and compilers under test.
*Figure: Performance in GFLOPS, unicore configuration. Higher is better.*
These results should be considered carefully because, first, they correspond to an unrealistic use of the system and, second, they introduce a strong bias in the performance analysis: a single core is using the full system memory hierarchy. However, such tests allow us to detect differences in compiler behaviour and in the quality of the generated code. All in all, both Neoverse V2 systems (Grace and G4) achieve similar performance with GCC and ACFL (Arm Compiler for Linux), with GCC benefiting from the `-Ofast` option. Neoverse V1 is slightly behind but faster than Sapphire Rapids. The older systems (Skylake and Neoverse N1) are lagging behind. On the compiler front, GCC with `-Ofast` systematically provides a performance gain over GCC with `-O3` (except on SPR). On all systems, GCC with `-Ofast` matches the performance of the native compilers: ICX on Intel and ACFL on Ampere.
In multicore measurements, the Grace and Sapphire Rapids systems have two clear advantages: first, their larger core counts and, second, their memory systems, which scale well. Sapphire Rapids reveals some interesting multicore characteristics. First, hyperthreading provides a performance boost, hinting that memory latency is one of the major performance bottlenecks of SPMXV. Second, GCC achieves significantly lower performance than ICPX, whereas on a single core both compilers achieved similar performance: a more detailed analysis indicated that GCC's OpenMP library (GOMP) was adding a large overhead. The three older systems (SKL, Ampere and G3) are clearly lagging behind in performance, mainly due to their lower core counts. Interestingly, there is a major difference in compiler behaviour between Grace and G4: on Grace, GCC performance (with both `-O3` and `-Ofast`) is lower than ACFL, while on G4 all compilers achieve similar performance (as in the unicore case).
*Figure: GFLOPS (histogram) and efficiency (red curve) for SPMXV multicore runs. Higher is better.*
The efficiency curve shown in the figure above confirms the very good SPR scalability: efficiency is above 90 percent, with a remarkable peak at 107 percent when hyperthreading is enabled. G4 has better efficiency (60 percent) than Grace (40 percent), essentially due to its smaller number of cores. Skylake exhibits good scalability, with efficiencies around 60 percent, while G3 and Ampere exhibit poor efficiency (around 30 percent).
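Although the exact definition is not spelled out here, the efficiency plotted is presumably the usual parallel efficiency:

\[
\text{efficiency} = \frac{\text{multicore GFLOPS}}{N_{\text{cores}} \times \text{unicore GFLOPS}},
\]

which would explain how SPR can exceed 100 percent: with hyperthreading, two threads per physical core keep the memory system busier than the single-core baseline, while the denominator still counts physical cores.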
Refining the analysis by normalizing by the number of cores and the frequency (see figure below) places Grace below G4, which ends up leading the pack. Finally, the most interesting case is again Sapphire Rapids, which turns out to be the fastest once normalized. Furthermore, parallelizing at the outermost loop level did not introduce any major overhead, except for GCC on Sapphire Rapids.
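For reference, the normalization used in the figure below is simply the one stated in its caption:

\[
\text{normalized GFLOPS} = \frac{\text{GFLOPS}}{N_{\text{cores}} \times f},
\]

with \(f\) the core frequency; this factors out core count and clock speed and leaves a per-core, per-cycle view of the memory hierarchy and of the generated code.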
*Figure: Multicore-run GFLOPS normalized (divided) by number of cores and frequency.*
Although the SPMXV loop seems extremely simple, it offers a unique combination of challenges for the compilers. The outermost loop (on \(i\)) is fully parallel and has a very large iteration count (the number of elements in array \(x\)). The innermost loop on \(nz\) is much more complex to optimize: first, its iteration count is variable and, in general, small (less than 15 in our test case); second, the loop corresponds to a reduction (a scalar dot product); and finally, the access to array \(x\) is indirect. The only simple characteristic is the access to the matrix values, which has perfect spatial locality (stride-1 access), but even there every value is used exactly once.
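For concreteness, here is a minimal sketch of the loop nest being discussed, in CSR form with an OpenMP-parallelized outer loop; the array names (`row_ptr`, `col_idx`, `val`) are illustrative and not necessarily those of the actual benchmark:

```c
#include <omp.h>

/* y = A * x, with A stored in CSR format (illustrative sketch). */
void spmxv_csr(int n_rows, const int *row_ptr, const int *col_idx,
               const double *val, const double *x, double *y)
{
    /* Outermost loop on i: fully parallel, very large iteration count. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n_rows; i++) {
        double sum = 0.0;
        /* Innermost loop on nz: short and variable trip count, a scalar
         * reduction, and an indirect access to x through col_idx. */
        for (int nz = row_ptr[i]; nz < row_ptr[i + 1]; nz++)
            sum += val[nz] * x[col_idx[nz]];
        y[i] = sum;
    }
}
```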
Compilers have a large panel of code-generation possibilities, and our systematic testing showed that the various compilers actually exercise several of them. First, a compiler can keep a scalar loop, possibly unrolled two or four times; unrolling, however, then requires a final reduction step. Second, it can vectorize the loop, which requires gather instructions (which are costly) for the indirect accesses, peel/tail handling, and a final reduction stage. Third, it can perform partial vectorization, vectorizing only the loads and/or the floating-point multiplies. Finally, the peel/tail loop can be eliminated by using the masked instructions available in SVE and AVX-512. Even for this simple loop nest, compilers therefore have many degrees of freedom, which must be carefully evaluated and selected through appropriate cost models.
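As an illustration of the second and fourth strategies (a hand-written sketch, not the code any particular compiler emitted), one row's dot product in AVX-512 could use gathers for the indirect accesses to \(x\) and a masked tail instead of a remainder loop:

```c
#include <immintrin.h>

/* Dot product of one CSR row with AVX-512: gathers + masked tail (sketch). */
static double row_dot_avx512(const double *val, const int *col,
                             const double *x, int nz)
{
    __m512d acc = _mm512_setzero_pd();
    int k = 0;
    for (; k + 8 <= nz; k += 8) {
        __m256i idx = _mm256_loadu_si256((const __m256i *)(col + k));
        __m512d xv  = _mm512_i32gather_pd(idx, x, 8);   /* indirect loads of x */
        __m512d av  = _mm512_loadu_pd(val + k);         /* stride-1 value loads */
        acc = _mm512_fmadd_pd(av, xv, acc);
    }
    int rem = nz - k;                                   /* masked tail, no peel loop */
    if (rem > 0) {
        __mmask8 m  = (__mmask8)((1u << rem) - 1u);
        __m256i idx = _mm256_maskz_loadu_epi32(m, col + k);
        __m512d xv  = _mm512_mask_i32gather_pd(_mm512_setzero_pd(), m, idx, x, 8);
        __m512d av  = _mm512_maskz_loadu_pd(m, val + k);
        acc = _mm512_fmadd_pd(av, xv, acc);
    }
    return _mm512_reduce_add_pd(acc);                   /* final reduction */
}
```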
All in all, it is interesting to see that shorter vectors (NEON) combined with partial vectorization ultimately gave the best performance on both Grace and G4.
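A rough idea of what such partial vectorization can look like with NEON (again an illustrative sketch, not the actual compiler output): the contiguous `val` loads and the multiply-adds are vectorized, while the indirect accesses to \(x\) remain scalar because NEON has no gather instruction:

```c
#include <arm_neon.h>

/* Dot product of one CSR row, partially vectorized with 128-bit NEON (sketch). */
static double row_dot_neon(const double *val, const int *col,
                           const double *x, int nz)
{
    float64x2_t acc = vdupq_n_f64(0.0);
    int k = 0;
    for (; k + 2 <= nz; k += 2) {
        float64x2_t av = vld1q_f64(val + k);                    /* contiguous values */
        float64x2_t xv = vcombine_f64(vld1_f64(x + col[k]),     /* scalar indirect   */
                                      vld1_f64(x + col[k + 1]));/* loads, packed     */
        acc = vfmaq_f64(acc, av, xv);
    }
    double sum = vaddvq_f64(acc);                               /* horizontal add    */
    if (k < nz)                                                 /* scalar tail       */
        sum += val[k] * x[col[k]];
    return sum;
}
```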
*Figure: Analysis of the code generated by different compilers for SPMXV.*