This version demonstrates the pattern of poor vectorization of for loops by the compiler. Whenever a code contains for loops, the compiler may optimize them automatically and use, for instance, vector instructions to execute them. This process depends heavily on the nature of the loop, the capabilities of the compiler, and the hints that the programmer gives to the compiler.
Here, we explore two canonical examples (vector addition and matrix multiplication) and show that simple implementations of them can make it difficult for the compiler to automatically generate code that uses vector instructions efficiently.
The information reported by the compiler is valuable: it can be used to assess how effectively the compiler translated a given code into vector instructions.
For instance, the vector addition example:
void vadd( double *c, double *a, double *b, int n){
for(int i=0; i<n; i++) c[i]=a[i]+b[i];
}
can be expressed as a simple for loop that iterates over arrays a and b and stores the element-wise sum in array c. Even though this appears to be a straightforward loop for the compiler to vectorize automatically, the compiler takes various nuances into account when applying this kind of transformation. Let's look at the compiler message to get more insight:
Begin optimization report for: vadd(double *, double *, double *, int)
Report from: Vector optimizations [vec]
LOOP BEGIN at src/vadd.c(4,2)
<Peeled loop for vectorization, Multiversioned v1>
LOOP END
LOOP BEGIN at src/vadd.c(4,2)
<Multiversioned v1>
remark #15388: vectorization support: reference c[i] has aligned access [ src/vadd.c(4,25) ]
remark #15389: vectorization support: reference a[i] has unaligned access [ src/vadd.c(4,30) ]
remark #15388: vectorization support: reference b[i] has aligned access [ src/vadd.c(4,35) ]
remark #15381: vectorization support: unaligned access used inside loop body
remark #15305: vectorization support: vector length 2
remark #15309: vectorization support: normalized vectorization overhead 2.833
remark #15300: LOOP WAS VECTORIZED
remark #15442: entire loop may be executed in remainder
remark #15448: unmasked aligned unit stride loads: 1
remark #15449: unmasked aligned unit stride stores: 1
remark #15450: unmasked unaligned unit stride loads: 1
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 8
remark #15477: vector cost: 3.000
remark #15478: estimated potential speedup: 2.580
remark #15488: --- end vector cost summary ---
LOOP END
LOOP BEGIN at src/vadd.c(4,2)
<Alternate Alignment Vectorized Loop, Multiversioned v1>
LOOP END
LOOP BEGIN at src/vadd.c(4,2)
<Remainder loop for vectorization, Multiversioned v1>
LOOP END
LOOP BEGIN at src/vadd.c(4,2)
<Multiversioned v2>
remark #15304: loop was not vectorized: non-vectorizable loop instance from multiversioning
LOOP END
NOTE: This report was generated by the Intel C compiler version 17.0.4 with the flags -qopt-report-phase=vec, -qopt-report=5 and -qopt-report-file=stderr.
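From the information provided by the compiler, we can underline the following:
- The loop was multiversioned: version v1 was vectorized (remark #15300), while version v2 was not (remark #15304).
- Around the vectorized loop, the compiler generated a peeled loop, an alternate-alignment loop and a remainder loop, and it warns that the entire loop may be executed in the remainder (remark #15442).
- The references c[i] and b[i] are reported as aligned accesses, while a[i] is an unaligned access.
- The vector length is 2 and the estimated potential speedup is 2.580 (scalar cost 8, vector cost 3.000).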
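A common reason for a compiler to produce several versions of a loop like this one is that it cannot prove that c does not overlap a or b. We assume that is the case here; the report does not state it explicitly. The sketch below illustrates the issue and one possible hint. The names vadd_restrict and overlap_demo are hypothetical and not part of the repository. Whether the hint removes the multiversioning should be confirmed by regenerating the optimization report.

```c
#include <assert.h>

/* Original kernel from the text: the compiler cannot assume that
   c, a and b point to distinct memory. */
void vadd(double *c, double *a, double *b, int n) {
    for (int i = 0; i < n; i++) c[i] = a[i] + b[i];
}

/* Hypothetical variant (the name is ours): `restrict` (C99) is a promise
   from the programmer that the three arrays never overlap. Calling this
   function with overlapping arrays is undefined behavior. */
void vadd_restrict(double *restrict c, const double *restrict a,
                   const double *restrict b, int n) {
    assert(n >= 0); /* precondition; compiled out when NDEBUG is defined */
    for (int i = 0; i < n; i++) c[i] = a[i] + b[i];
}

/* Hypothetical demo of why overlap matters: with c == x + 1 the loop
   computes x[i+1] = x[i] + 1, so each iteration depends on the previous
   one and the elements cannot simply be processed in independent groups. */
double overlap_demo(void) {
    double x[5] = {1.0, 1.0, 1.0, 1.0, 1.0};
    double ones[4] = {1.0, 1.0, 1.0, 1.0};
    vadd(x + 1, x, ones, 4);
    return x[4]; /* x is now {1, 2, 3, 4, 5} */
}
```

Because vadd must remain correct for calls such as the one in overlap_demo, the compiler has to keep a safe path for overlapping arguments. vadd_restrict trades that generality for a stronger guarantee to the compiler.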
This application can be compiled with the provided Makefile. Just type in a console:
make all
- Compiles both kernels and creates the executable (in the bin directory): pattern.all
make vadd
- Compiles the vector addition kernel and creates the executable (in the bin directory): pattern.vadd
make matmul
- Compiles the matrix multiplication kernel and creates the executable (in the bin directory): pattern.matmul
To run the application, navigate to the bin directory and run the desired executable, e.g. ./pattern.all.
pattern.all
- The application executes both kernels (vadd and matmul);
pattern.vadd
- The application executes the vadd kernel;
pattern.matmul
- The application executes the matmul kernel.