For loops poor auto-vectorization

Version's name: For loops poor auto-vectorization; a version of the For loops auto-vectorization program.
Patterns and behaviours:

Sequential loops

Recommended best-practices:

Effective auto vectorization

- Available version(s):

For loops full auto-vectorization

This version demonstrates poor auto-vectorization of for loops by the compiler. Whenever a code contains for loops, the compiler may optimize them automatically, for instance by executing them with vector instructions. This process depends heavily on the nature of the loop, the capabilities of the compiler, and the hints that the programmer gives to the compiler.

Here, we explore two canonical examples (vector addition and matrix multiplication) and show that simple implementations of these examples can make it difficult for the compiler to automatically generate code that uses vector instructions efficiently.
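The matrix multiplication kernel is not listed in this section. As a rough illustration of what a simple implementation might look like, here is a minimal sketch; the function name matmul_naive, the square row-major layout and the signature are assumptions, not the code shipped with this version.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical naive kernel: c = a * b for square n x n matrices stored
   in row-major order. Name, signature and layout are assumptions made
   for illustration only. */
void matmul_naive(double *c, const double *a, const double *b, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            /* The innermost loop accumulates into a scalar and reads b
               with a stride of n elements (column-wise access). */
            for (int k = 0; k < n; k++) {
                sum += a[(size_t)i * n + k] * b[(size_t)k * n + j];
            }
            c[(size_t)i * n + j] = sum;
        }
    }
}
```

In this form the innermost loop is a sum reduction that reads b with a stride of n elements, two properties that tend to make efficient vectorization harder than in a unit-stride loop.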

The information provided by the compiler is valuable: it can be used to assess how effectively the compiler has turned a given code into efficient vector instructions.

For instance, the vector addition example:

void vadd( double *c, double *a, double *b, int n){
	for(int i=0; i<n; i++) c[i]=a[i]+b[i];
}

can be expressed as a simple for loop that iterates over arrays a and b and stores the element-wise sum in array c. Even though this appears to be a straightforward loop for the compiler to vectorize automatically, the compiler takes several nuances into account when performing this kind of transformation. Let's look at the compiler report to get more insight:

Begin optimization report for: vadd(double *, double *, double *, int)

    Report from: Vector optimizations [vec]


		LOOP BEGIN at src/vadd.c(4,2)
		<Peeled loop for vectorization, Multiversioned v1>
		LOOP END

		LOOP BEGIN at src/vadd.c(4,2)
		<Multiversioned v1>
		   remark #15388: vectorization support: reference c[i] has aligned access   [ src/vadd.c(4,25) ]
			 remark #15389: vectorization support: reference a[i] has unaligned access   [ src/vadd.c(4,30) ]
			 remark #15388: vectorization support: reference b[i] has aligned access   [ src/vadd.c(4,35) ]
			 remark #15381: vectorization support: unaligned access used inside loop body
			 remark #15305: vectorization support: vector length 2
			 remark #15309: vectorization support: normalized vectorization overhead 2.833
			 remark #15300: LOOP WAS VECTORIZED
			 remark #15442: entire loop may be executed in remainder
			 remark #15448: unmasked aligned unit stride loads: 1 
			 remark #15449: unmasked aligned unit stride stores: 1 
			 remark #15450: unmasked unaligned unit stride loads: 1 
			 remark #15475: --- begin vector cost summary ---
			 remark #15476: scalar cost: 8 
			 remark #15477: vector cost: 3.000 
			 remark #15478: estimated potential speedup: 2.580 
			 remark #15488: --- end vector cost summary ---
		LOOP END

		LOOP BEGIN at src/vadd.c(4,2)
		<Alternate Alignment Vectorized Loop, Multiversioned v1>
		LOOP END

		LOOP BEGIN at src/vadd.c(4,2)
		<Remainder loop for vectorization, Multiversioned v1>
		LOOP END

		LOOP BEGIN at src/vadd.c(4,2)
		<Multiversioned v2>
		remark #15304: loop was not vectorized: non-vectorizable loop instance from multiversioning
		LOOP END

NOTE: This report was generated with the Intel C compiler version 17.0.4 using the flags -qopt-report-phase=vec, -qopt-report=5 and -qopt-report-file=stderr.

From the information provided by the compiler, we can highlight the following:

A peeled loop version was created because the compiler cannot determine the alignment of the arrays at compile time.
A remainder loop version was created because the compiler cannot determine the size of the arrays at compile time, so the trip count n may not be a multiple of the vector length.
An additional non-vectorized version of the loop was created because the compiler cannot determine, at compile time, whether the arrays involved overlap in memory.
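The observations above correspond to information that the programmer can sometimes supply. The following is a minimal sketch, not the code of the full auto-vectorization version: restrict is standard C99, aligned_alloc is standard C11, and __builtin_assume_aligned is a GCC/Clang builtin (the guard assumes that compilers defining __GNUC__ provide it). The names vadd_hinted and alloc_aligned_doubles and the 64-byte alignment are hypothetical choices for illustration.

```c
#include <assert.h>
#include <stdlib.h>

#define VADD_ALIGN 64 /* assumed alignment in bytes; adjust to the target */

/* Allocate n doubles aligned to VADD_ALIGN bytes. C11 aligned_alloc
   expects the size to be a multiple of the alignment, so round up. */
double *alloc_aligned_doubles(size_t n) {
    size_t bytes = n * sizeof(double);
    size_t rounded = (bytes + VADD_ALIGN - 1) / VADD_ALIGN * VADD_ALIGN;
    if (rounded == 0) rounded = VADD_ALIGN;
    return aligned_alloc(VADD_ALIGN, rounded);
}

/* restrict tells the compiler that c, a and b do not overlap; the
   alignment hint tells it that each pointer is VADD_ALIGN-byte aligned. */
void vadd_hinted(double *restrict c, const double *restrict a,
                 const double *restrict b, int n) {
    assert(n >= 0);
#if defined(__GNUC__)
    c = __builtin_assume_aligned(c, VADD_ALIGN);
    a = __builtin_assume_aligned(a, VADD_ALIGN);
    b = __builtin_assume_aligned(b, VADD_ALIGN);
#endif
    for (int i = 0; i < n; i++) c[i] = a[i] + b[i];
}
```

Both hints are promises made by the caller: passing overlapping or misaligned arrays to vadd_hinted is undefined behaviour. A remainder loop is still to be expected, because n remains a run-time value.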

How to build

This application can be compiled with the provided Makefile. Just type in a console:

make all - Compiles both kernels and creates the executable (in the bin directory): pattern.all
make vadd - Compiles the vector addition kernel and creates the executable (in the bin directory): pattern.vadd
make matmul - Compiles the matrix multiplication kernel and creates the executable (in the bin directory): pattern.matmul

How to run

To run the application, navigate to the bin directory and run the desired executable, e.g. ./pattern.all.

pattern.all - The application executes both kernels (vadd and matmul)
pattern.vadd - The application executes the vadd kernel
pattern.matmul - The application executes the matmul kernel

The following experiments have been registered:

For loops poor auto-vectorization

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreements No 676553 (POP1) and 824080 (POP2).

Currently, the project receives funding from the European High-Performance Computing Joint Undertaking (JU) under grant agreement No 101143931 (POP3). The JU receives support from the European Union's Horizon Europe research and innovation programme and Spain, Germany, France, Portugal and the Czech Republic.