In this section, we summarize the conclusions of the study conducted on the target RISC-V platforms. For additional details, the reader is referred to the following document:
Leonel Sousa et al.: “D4.4 - Second report on codesign”, Version 1.0, Deliverable of the Performance Optimization and Productivity 3 (POP3) Centre of Excellence, December 2025.
A substantial portion of the algorithm consists of matrix multiplication, for which RAVE reports a vectorization intensity of approximately 20%. The results indicate that compiler options, particularly those affecting mathematical routines, exert a notable influence on throughput.
Furthermore, the choice of numerical precision has measurable performance implications: casting to double precision reduces throughput and increases execution time relative to a purely single-precision implementation.
Performance comparisons across platforms reveal additional architectural effects. For example, the Banana Pi platform sustains approximately half the throughput achieved on MareNostrum when operating at comparable vector lengths. Moreover, no performance advantage was observed when using AVX-512 on MareNostrum 5 compared to AVX2, which warrants further investigation to understand underlying bottlenecks or frequency scaling effects. Together, these observations underscore that raw vector width alone does not guarantee higher performance; the effectiveness of vector execution depends strongly on the interplay between hardware characteristics, compiler optimizations, and numerical precision choices.
To deepen this analysis, several lines of inquiry remain. Future work should examine the impact of vectorizing mathematical functions and their interaction with compiler optimizations. It would also be beneficial to explore tiling and blocking strategies in the matrix multiplication kernel, with the goal of determining whether the compiler can generate code comparable to hand-written intrinsics. Additional investigation into loop unrolling in the innermost multiplication loop may clarify its influence on vectorization efficiency. Further experiments with larger input sizes should be conducted to assess how vector instruction patterns and throughput evolve. Finally, a more detailed study of AVX behaviour on Intel architectures, particularly in relation to frequency scaling, is recommended.