For this experiment, we summarize the conclusions derived from the study conducted on the target RISC-V platforms. For additional details, see the following document:
Leonel Sousa et al.: “D4.4 - Second report on codesign”, Version 1.0, Deliverable of the Performance Optimization and Productivity 3 (POP3) Centre of Excellence, December 2025.
From a co-design perspective, several observations arise from that experiment.
First, the role of vector masks warrants closer examination. It is important to determine whether masking could be reduced or avoided entirely, for example by loading all strided elements into vector registers unconditionally and applying the mask only during the subsequent arithmetic operations.
If masks contain many zero entries, it is also relevant to investigate whether the hardware can avoid unnecessary cache-line fetches. Equally important is the question of whether analysis tools can expose the actual runtime mask values, which would greatly facilitate more detailed performance assessments.
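The two masking strategies can be contrasted with a minimal scalar sketch in C. The function names, the byte-per-element mask layout, and the summation kernel are hypothetical and are not taken from the studied code. The second variant additionally assumes that every strided element is safe to read, including those whose mask entry is zero.

```c
#include <assert.h>
#include <stddef.h>

/* Variant A: the mask guards the strided load itself.
 * Elements with mask[i] == 0 are never read from memory. */
double sum_masked_load(const double *x, const unsigned char *mask,
                       size_t n, size_t stride)
{
    assert(stride > 0);
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i) {
        if (mask[i]) {
            acc += x[i * stride];
        }
    }
    return acc;
}

/* Variant B: every strided element is loaded unconditionally and the
 * mask is applied only in the arithmetic, as a select.
 * ASSUMPTION: x[i * stride] is readable for all i < n. */
double sum_unconditional_load(const double *x, const unsigned char *mask,
                              size_t n, size_t stride)
{
    assert(stride > 0);
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double v = x[i * stride];      /* always loaded */
        acc += mask[i] ? v : 0.0;      /* mask applied after the load */
    }
    return acc;
}
```

Variant B removes the mask from the memory access but touches every strided element, and therefore every corresponding cache line, even when most mask entries are zero. Which variant is preferable is likely to depend on the mask density observed at runtime, which is why visibility of the actual mask values in analysis tools matters.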
Loop transformations constitute another area with potential for optimization. Applying loop interchange may reduce or eliminate reduction operations; whether the compiler can reliably detect and apply such transformations automatically remains an open question.
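As an illustration of how loop interchange can remove a reduction, the following sketch computes column sums of a row-major matrix in two ways. The kernel and the function names are hypothetical and serve only to show the transformation; they are not the loops analyzed in the study.

```c
#include <assert.h>
#include <stddef.h>

/* Original order: the inner loop over rows accumulates into a scalar,
 * so a vectorized version needs a reduction, and the accesses to a
 * are strided by cols. */
void colsum_reduction(const double *a, size_t rows, size_t cols,
                      double *out)
{
    assert(a != NULL && out != NULL);
    for (size_t j = 0; j < cols; ++j) {
        double acc = 0.0;
        for (size_t i = 0; i < rows; ++i) {
            acc += a[i * cols + j];
        }
        out[j] = acc;
    }
}

/* Interchanged order: the inner loop over columns is an element-wise
 * update of out[j] with unit-stride accesses, so no reduction is
 * needed inside the vectorized loop. */
void colsum_interchanged(const double *a, size_t rows, size_t cols,
                         double *out)
{
    assert(a != NULL && out != NULL);
    for (size_t j = 0; j < cols; ++j) {
        out[j] = 0.0;
    }
    for (size_t i = 0; i < rows; ++i) {
        for (size_t j = 0; j < cols; ++j) {
            out[j] += a[i * cols + j];
        }
    }
}
```

Both versions add the elements of each column in the same order, so they produce identical results. In other kernels the interchange may change the memory access pattern unfavorably, so the benefit has to be checked case by case.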
Instruction-level overlap represents an additional opportunity: scalar instructions that follow a vectorized block within the same iteration could potentially be overlapped with the vectorized portion of subsequent iterations, provided that no data dependencies prevent this reordering.
With respect to memory access patterns, indirect indexed loads and stores should be avoided, as they generally inhibit the compiler’s ability to generate efficient vectorized code.
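One possible way to avoid an indirect indexed access, sketched below in C, applies only when the index array happens to describe a regular pattern. The kernel, the function names, and the runtime check are hypothetical; the sketch assumes a non-decreasing index array with constant spacing, in which case the gather can be replaced by a strided access that the compiler can analyze.

```c
#include <assert.h>
#include <stddef.h>

/* Indirect indexed version: the address of each load depends on
 * idx[i], which the compiler generally cannot analyze. */
void scale_indirect(double *y, const double *x, const size_t *idx,
                    size_t n, double alpha)
{
    assert(y != NULL && x != NULL && idx != NULL);
    for (size_t i = 0; i < n; ++i) {
        y[i] = alpha * x[idx[i]];
    }
}

/* Strided version: addresses are an affine function of i. */
void scale_strided(double *y, const double *x, size_t base,
                   size_t stride, size_t n, double alpha)
{
    assert(y != NULL && x != NULL);
    for (size_t i = 0; i < n; ++i) {
        y[i] = alpha * x[base + i * stride];
    }
}

/* Hypothetical helper: returns 1 and sets base and stride if
 * idx[i] == base + i * stride for all i < n, and 0 otherwise.
 * Only non-decreasing index sequences with n >= 2 are accepted. */
int index_is_affine(const size_t *idx, size_t n,
                    size_t *base, size_t *stride)
{
    if (n < 2 || idx[1] < idx[0]) {
        return 0;
    }
    size_t s = idx[1] - idx[0];
    for (size_t i = 2; i < n; ++i) {
        if (idx[i] < idx[i - 1] || idx[i] - idx[i - 1] != s) {
            return 0;
        }
    }
    *base = idx[0];
    *stride = s;
    return 1;
}
```

A caller could run the check once, outside the hot loop, and fall back to scale_indirect when the indices are irregular. Whether such a check pays off depends on how often the index array is reused.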