SPMxV on SPR and Grace

SPR and Grace

The SPMXV kernel was mainly tested on two systems:

A dual-socket Intel Xeon Platinum 8468 (Sapphire Rapids) system (2x48 cores, 2.1 GHz)
A single-socket NVIDIA Grace CPU (1x72 cores, 3.0 GHz)

Both systems reached a similar fraction of their theoretical peak performance, approximately 5 percent. The theoretical peak performance of each system was calculated from its available FMA streams: P_peak = cores × 64-bit FMA instructions per cycle × frequency. The maximum achieved performance does not depend on the compiler, except on the Intel CPU, where GCC lags behind the Intel compiler (explanations are given in the detailed analysis).
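The peak formula above can be sketched as a small helper. This is only an illustration: the core counts and frequencies are the ones listed for the two systems, but the per-core FMA throughput (16 for Sapphire Rapids, 8 for Grace) is an assumption based on two 512-bit and four 128-bit FMA pipes respectively, and is not stated in the text. The `flops_per_fma` parameter is also hypothetical; its default of 1 follows the formula literally, while 2 would count the multiply and the add separately.

```python
def peak_gflops(cores: int, fma_per_cycle: int, freq_ghz: float,
                flops_per_fma: int = 1) -> float:
    """Theoretical peak: cores x 64-bit FMAs per cycle x frequency (GHz).

    With flops_per_fma=1 this is the formula from the text; pass 2 to
    count each fused multiply-add as two floating-point operations.
    """
    return cores * fma_per_cycle * flops_per_fma * freq_ghz


def fraction_of_peak(achieved_gflops: float, peak: float) -> float:
    """Achieved performance as a fraction of the theoretical peak."""
    return achieved_gflops / peak


# Core counts and frequencies from the system list; the FMA-per-cycle
# values are ASSUMPTIONS, not figures taken from the text.
spr_peak = peak_gflops(cores=2 * 48, fma_per_cycle=16, freq_ghz=2.1)
grace_peak = peak_gflops(cores=72, fma_per_cycle=8, freq_ghz=3.0)
```

Dividing a measured rate by the value returned here gives the fraction of peak plotted for each system, provided the measured rate counts operations the same way as `flops_per_fma`.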


GFlops performance of SPMXV kernel


Fraction of peak performance for SPMXV kernel


Impact of data size on SPMXV performance

Further, the code was compared on the Grace CPU against BLAS-like libraries that implement SPMXV. The tested libraries, ARM PL and Nvidia PL, reached around 3% of the peak performance on the Grace CPU with the given matrices. That the developed SPMXV kernel outperforms these architecture-specific libraries suggests that highly optimized SPMXV libraries for the ARM Neoverse V2 architecture are still lacking. However, other libraries that implement SPMXV have not yet been evaluated.

Finally, the behaviour of the kernel was tested as the problem size increased. The similar cache sizes of the two machines lead to similar scaling behaviour, with a performance drop at the line marking the L3 cache size. However, for very large matrices that do not fit into the L3 cache, the Grace CPU provides better performance than the Sapphire Rapids system.
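To relate a matrix to the L3 cache size, one can estimate the working set of a single SPMXV. The sketch below assumes CSR storage with 8-byte values and 4-byte indices, which the text does not state; the function names and the 100 MiB cache capacity in the example are hypothetical placeholders, not the actual cache sizes of the tested machines.

```python
def csr_working_set_bytes(n_rows: int, n_cols: int, nnz: int,
                          value_bytes: int = 8, index_bytes: int = 4) -> int:
    """Approximate bytes touched by one y = A @ x with A in CSR format.

    ASSUMPTION: CSR layout (values, column indices, row pointers) plus
    the dense input vector x and output vector y.
    """
    values_and_columns = nnz * (value_bytes + index_bytes)
    row_pointers = (n_rows + 1) * index_bytes
    vectors = (n_cols + n_rows) * value_bytes
    return values_and_columns + row_pointers + vectors


def fits_in_cache(working_set: int, cache_bytes: int) -> bool:
    """True if the whole working set fits into the given cache capacity."""
    return working_set <= cache_bytes


# Hypothetical example: 10^6 x 10^6 matrix with 10^7 nonzeros,
# checked against a placeholder 100 MiB last-level cache.
ws = csr_working_set_bytes(1_000_000, 1_000_000, 10_000_000)
in_cache = fits_in_cache(ws, 100 * 1024 * 1024)
```

Under these assumptions the example matrix needs about 140 MB and so would fall in the out-of-cache regime, where memory bandwidth rather than cache capacity determines performance.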