Applications executed on RISC-V platforms may exhibit lower-than-expected computational performance, independently of the hierarchy of parallel efficiency metrics (i.e., load balance or communication efficiencies). In such situations, the application may show low instruction throughput, poor vector unit utilization, or significantly reduced floating-point performance compared with the theoretical peak capabilities. This behavior may occur in both single-core and multi-core executions and therefore directly impacts computational scalability.
These symptoms may be accompanied by several additional observations, which are neither always present nor mutually exclusive:
• Low achieved FLOPS or instructions per cycle compared with the architectural peak.
• Reduced benefits from vectorization despite the presence of vector instructions.
• Limited performance improvement when enabling compiler optimization flags.
• High sensitivity to memory latency or irregular memory access patterns.
• Significant differences between scalar and vector code performance that cannot be explained solely by algorithmic complexity.
These observations indicate that the application does not effectively exploit the architectural capabilities of the processor. In many cases, improving computational performance on RISC-V platforms requires coordinated efforts at multiple levels, including compiler improvements, algorithmic adjustments, and architectural refinements.
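As a rough way to quantify the first symptom, an achieved rate can be expressed as a fraction of the architectural peak. The sketch below is illustrative only: the function names and all numeric values (clock frequency, FLOPs per cycle, counter readings) are hypothetical and do not describe any specific RISC-V platform.

```python
def peak_gflops(freq_ghz, cores, flops_per_cycle_per_core):
    """Theoretical peak in GFLOP/s: frequency x cores x FLOPs per cycle per core."""
    return freq_ghz * cores * flops_per_cycle_per_core


def fraction_of_peak(achieved, peak):
    """Ratio of an achieved rate to its peak (both in the same units)."""
    if peak <= 0:
        raise ValueError("peak must be positive")
    return achieved / peak


def ipc(instructions, cycles):
    """Instructions per cycle from two hardware-counter readings."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles


# Hypothetical platform and measurements, for illustration only.
peak = peak_gflops(freq_ghz=1.0, cores=1, flops_per_cycle_per_core=16)
print(f"peak = {peak} GFLOP/s")
print(f"fraction of peak = {fraction_of_peak(2.0, peak):.3f}")
print(f"IPC = {ipc(instructions=5_000_000, cycles=10_000_000):.2f}")
```

A low fraction of peak on its own does not identify the cause; it only indicates that one of the causes discussed in this section may be worth investigating.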
Memory subsystem limitations. Since many HPC kernels have relatively low arithmetic intensity, performance is frequently limited by memory bandwidth or latency. On early RISC-V platforms or prototypes, the memory subsystem may exhibit higher latency or reduced bandwidth, which directly affects computational throughput.
HW Co-design (memory): Memory configurations in HPC systems should be dimensioned to match the bandwidth and latency demands of the workloads. In the presence of Vector Processing Units (VPUs), the speed at which the memory subsystem serves vector register loads and stores becomes a fundamental component of overall performance.
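The link between low arithmetic intensity and memory-bound behavior can be estimated with a roofline-style bound. The source does not name this model, so it is used here only as an illustrative aid. The peak and bandwidth values below are assumptions, not measurements of any RISC-V system.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte moved to or from memory."""
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    return flops / bytes_moved


def attainable_gflops(peak, bandwidth_gbs, intensity):
    """Roofline bound: the lower of the compute peak and bandwidth x intensity."""
    return min(peak, bandwidth_gbs * intensity)


def is_memory_bound(peak, bandwidth_gbs, intensity):
    """True when the bandwidth term, not the compute peak, sets the bound."""
    return bandwidth_gbs * intensity < peak


# DAXPY (y[i] = a * x[i] + y[i]) in double precision:
# 2 FLOPs per element and 24 bytes moved (load x, load y, store y).
ai = arithmetic_intensity(flops=2, bytes_moved=24)

# Assumed platform figures, for illustration only.
PEAK_GFLOPS = 16.0
BANDWIDTH_GBS = 10.0

bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, ai)
print(f"arithmetic intensity = {ai:.4f} FLOP/byte")
print(f"attainable = {bound:.3f} GFLOP/s, "
      f"memory bound = {is_memory_bound(PEAK_GFLOPS, BANDWIDTH_GBS, ai)}")
```

Under these assumed figures the kernel is limited by bandwidth well below the compute peak, which is the situation in which memory dimensioning matters most.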
Compiler and toolchain limitations. The RISC-V ecosystem is still evolving, and compiler support for some architectural features (especially vector extensions) is not yet as mature as for established architectures.
SW Co-design (compiler toolchain): Compilers should improve the implementation of cost-model algorithms used in auto-vectorization analysis. Loop transformation techniques may also enable the vectorization of loops that are currently not vectorized.
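One example of a loop transformation that can enable vectorization is loop interchange, which can turn a strided innermost access into a unit-stride one. The sketch below only models the memory offsets touched by the innermost loop over a row-major 2D array. The helper names are hypothetical, and the sketch does not represent how any particular compiler performs its analysis.

```python
def inner_loop_offsets(n_rows, n_cols, inner, fixed=0):
    """Linear offsets of A[i][j] (row-major) touched by the innermost loop.

    inner="j": the innermost loop walks along a row (i is fixed).
    inner="i": the innermost loop walks down a column (j is fixed).
    """
    if inner == "j":
        return [fixed * n_cols + j for j in range(n_cols)]
    if inner == "i":
        return [i * n_cols + fixed for i in range(n_rows)]
    raise ValueError("inner must be 'i' or 'j'")


def strides(offsets):
    """Differences between consecutive offsets."""
    return [b - a for a, b in zip(offsets, offsets[1:])]


# Before interchange: innermost loop over i, so each step jumps a whole row.
print(strides(inner_loop_offsets(4, 8, inner="i")))
# After interchange: innermost loop over j, so accesses are contiguous.
print(strides(inner_loop_offsets(4, 8, inner="j")))
```

Unit-stride accesses are generally easier to vectorize profitably than strided ones, which is one reason a cost model may reject the loop before interchange and accept it afterwards.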
Inefficient vectorization or insufficient vector length. Many RISC-V processors (e.g., EPAC 1.5) rely heavily on the RISC-V Vector Extension (RVV) to deliver computational throughput. However, some algorithms expose only short vector lengths or irregular loop structures. When the effective vector length is significantly smaller than the architectural maximum, the vector unit remains underutilized.
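The effect of short loops on vector unit utilization can be illustrated by simulating a strip-mined loop. The sketch assumes the simplest vector-length policy, vl = min(remaining, VLMAX); real RVV implementations are allowed to choose vl differently in some cases. The value VLMAX = 256 is an illustrative assumption, and the function names are hypothetical.

```python
def strip_lengths(n, vlmax):
    """Vector lengths used by a strip-mined loop over n elements."""
    if n < 0 or vlmax <= 0:
        raise ValueError("n must be >= 0 and vlmax must be > 0")
    lengths = []
    remaining = n
    while remaining > 0:
        vl = min(remaining, vlmax)  # assumed policy, see lead-in
        lengths.append(vl)
        remaining -= vl
    return lengths


def vector_utilization(n, vlmax):
    """Elements processed divided by element slots issued (strips x vlmax)."""
    lengths = strip_lengths(n, vlmax)
    if not lengths:
        return 0.0
    return sum(lengths) / (len(lengths) * vlmax)


VLMAX = 256  # illustrative assumption, not a documented platform value

for n in (20, 256, 1000):
    print(n, strip_lengths(n, VLMAX), f"{vector_utilization(n, VLMAX):.3f}")
```

With these assumptions, a loop of 20 iterations fills less than a tenth of each vector operation, while a loop of 1000 iterations uses nearly all available element slots.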
Irregular memory access patterns. Applications such as sparse linear algebra kernels often rely on indirect memory accesses or gather operations. These patterns reduce spatial and temporal locality and may increase cache miss rates.
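The indirect accesses mentioned above can be seen in a sparse matrix-vector product over the compressed sparse row (CSR) format. This is a minimal pure-Python sketch intended only to expose the access pattern, not a performance-oriented implementation. The function name and the example matrix are illustrative.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR format."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # x is read through col_idx[k]: an indirect (gather) access
            # whose location depends on the sparsity pattern.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Example matrix:
# [[4, 0, 1],
#  [0, 3, 0],
#  [2, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
x = [1.0, 2.0, 3.0]

print(spmv_csr(row_ptr, col_idx, values, x))  # [7.0, 6.0, 17.0]
```

The arrays values and col_idx are traversed contiguously, but the reads from x follow col_idx, so consecutive iterations may touch distant memory locations. In vector code this read typically maps to an indexed (gather) load rather than a unit-stride load.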