This version implements a set of best practices to improve the execution efficiency of for loops that are automatically vectorized by the compiler.
As in the For loops poor auto-vectorization pattern, we explore two canonical examples (vector addition and matrix multiplication) and show, based on the information reported by the compiler, how to vectorize the kernels efficiently.
Let's look at the vector addition example:
void vadd_v2(double * restrict c, double * restrict a, double * restrict b, int n)
{
    #pragma vector aligned
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
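Note that #pragma vector aligned (explained below) is a promise to the compiler, not a request: the arrays passed to vadd_v2 must actually be allocated on an aligned boundary. As an illustration, a minimal hypothetical driver, not part of the pattern sources, could allocate 64-byte aligned buffers with posix_memalign before calling the kernel:

#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <stdlib.h>

void vadd_v2(double * restrict c, double * restrict a, double * restrict b, int n);

int main(void)
{
    const int n = 1000000;   /* hypothetical problem size */
    double *a, *b, *c;

    /* 64-byte alignment is sufficient for SSE, AVX and AVX-512 loads/stores */
    if (posix_memalign((void **)&a, 64, n * sizeof(double)) != 0 ||
        posix_memalign((void **)&b, 64, n * sizeof(double)) != 0 ||
        posix_memalign((void **)&c, 64, n * sizeof(double)) != 0)
        return EXIT_FAILURE;

    for (int i = 0; i < n; i++) {
        a[i] = (double)i;
        b[i] = 2.0 * (double)i;
    }

    vadd_v2(c, a, b, n);   /* the alignment promised by #pragma vector aligned now holds */

    printf("c[10] = %f\n", c[10]);
    free(a);
    free(b);
    free(c);
    return EXIT_SUCCESS;
}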
Given the information provided by the compiler when we executed the pattern, we applied the following modifications to the vadd_v2 version:
- The restrict pointer keyword was added to let the compiler know that the variables are independent and do not overlap in memory.
- #pragma vector aligned was added to inform the compiler that the respective data variables are aligned in memory.
If we compile this new version, the compiler spells out the following report:
Begin optimization report for: vadd_v2(double *, double *, double *, int)
Report from: Vector optimizations [vec]
LOOP BEGIN at src/vadd_v2.c(5,2)
remark #15388: vectorization support: reference c[i] has aligned access [ src/vadd_v2.c(5,25) ]
remark #15388: vectorization support: reference a[i] has aligned access [ src/vadd_v2.c(5,30) ]
remark #15388: vectorization support: reference b[i] has aligned access [ src/vadd_v2.c(5,35) ]
remark #15305: vectorization support: vector length 2
remark #15300: LOOP WAS VECTORIZED
remark #15448: unmasked aligned unit stride loads: 2
remark #15449: unmasked aligned unit stride stores: 1
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 8
remark #15477: vector cost: 2.500
remark #15478: estimated potential speedup: 3.200
remark #15488: --- end vector cost summary ---
LOOP END
LOOP BEGIN at src/vadd_v2.c(5,2)
<Remainder loop for vectorization>
LOOP END
NOTE: This report was generated by the Intel C compiler version 17.0.4 with the flags -qopt-report-phase=vec, -qopt-report=5 and -qopt-report-file=stderr.
From the information provided by the compiler we can underline the following:
- all three references (c[i], a[i] and b[i]) now have aligned access;
- the loop was vectorized (remark #15300), using unmasked aligned unit stride loads and stores;
- the estimated potential speedup over the scalar loop is 3.2x (scalar cost 8 versus vector cost 2.5).
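The same two best practices carry over to the matrix multiplication kernel. The actual matmul_v2 source is not reproduced in this pattern, so the following is only a sketch of how restrict and #pragma vector aligned might be applied to a naive triple loop; the i-k-j loop order is used so that the innermost loop has unit-stride accesses, and the assumptions are listed in the leading comment:

/* Sketch only: the real matmul_v2 kernel is not shown here and may differ in
 * loop order, blocking and data layout. Assumes row-major n x n matrices,
 * c pre-initialized to zero, 64-byte aligned base pointers, and n a multiple
 * of the vector length so that every row starts on an aligned boundary
 * (otherwise the aligned pragma must be dropped). */
void matmul_v2_sketch(double * restrict c, const double * restrict a,
                      const double * restrict b, int n)
{
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < n; k++) {
            const double aik = a[i * n + k];
            /* unit-stride inner loop: contiguous accesses to b and c vectorize well */
            #pragma vector aligned
            for (int j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
    }
}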
This application can be compiled with the provided Makefile. Just type in a console:
make all
- Compiles all versions of the vadd and matmul kernels and creates the executable (in the bin directory): bp.all
make vadd
- Compiles all versions of the vadd kernel and creates the executable (in the bin directory): bp.vadd
make vadd_v1
- Compiles the vadd_v1 kernel and creates the executable (in the bin directory): bp.vadd_v1
make vadd_v2
- Compiles the vadd_v2 kernel and creates the executable (in the bin directory): bp.vadd_v2
make matmul
- Compiles all versions of the matmul kernel and creates the executable (in the bin directory): bp.matmul
make matmul_v1
- Compiles the matmul_v1 kernel and creates the executable (in the bin directory): bp.matmul_v1
make matmul_v2
- Compiles the matmul_v2 kernel and creates the executable (in the bin directory): bp.matmul_v2
make matmul_v3
- Compiles the matmul_v3 kernel and creates the executable (in the bin directory): bp.matmul_v3
To run the application, navigate to the bin directory and run the desired executable, e.g. ./bp.all
bp.all
- The application executes both kernels and their respective implemented versions (vadd and matmul);
bp.vadd
- The application executes all versions of the vadd kernel;
bp.vadd_v1
- The application executes the vadd_v1 kernel;
bp.vadd_v2
- The application executes the vadd_v2 kernel;
bp.matmul
- The application executes all versions of the matmul kernel;
bp.matmul_v1
- The application executes the matmul_v1 kernel;
bp.matmul_v2
- The application executes the matmul_v2 kernel;
bp.matmul_v3
- The application executes the matmul_v3 kernel.