This version implements a set of best practices to improve the execution efficiency of for loops that are automatically vectorized by the compiler.
As in the For loops poor auto-vectorization pattern, we explore two canonical examples (vector addition and matrix multiplication) and show, based on the information reported by the compiler, how to vectorize the kernels efficiently.
Let's look at the vector addition example:
void vadd_v2(double * restrict c, double * restrict a, double * restrict b, int n)
{
    #pragma vector aligned
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
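Note that #pragma vector aligned (explained below) is a promise to the compiler, not a request: the arrays passed to vadd_v2 must actually be allocated on an aligned boundary. As an illustration, a minimal hypothetical driver, not part of the pattern sources, could allocate 64-byte aligned buffers with posix_memalign before calling the kernel:

#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <stdlib.h>

void vadd_v2(double * restrict c, double * restrict a, double * restrict b, int n);

int main(void)
{
    const int n = 1000000;   /* hypothetical problem size */
    double *a, *b, *c;

    /* 64-byte alignment is sufficient for SSE, AVX and AVX-512 loads/stores */
    if (posix_memalign((void **)&a, 64, n * sizeof(double)) != 0 ||
        posix_memalign((void **)&b, 64, n * sizeof(double)) != 0 ||
        posix_memalign((void **)&c, 64, n * sizeof(double)) != 0)
        return EXIT_FAILURE;

    for (int i = 0; i < n; i++) {
        a[i] = (double)i;
        b[i] = 2.0 * (double)i;
    }

    vadd_v2(c, a, b, n);   /* the alignment promised by #pragma vector aligned now holds */

    printf("c[10] = %f\n", c[10]);
    free(a);
    free(b);
    free(c);
    return EXIT_SUCCESS;
}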
Given the information provided by the compiler when we executed the pattern, we applied the following modifications to the vadd_v2 version:
- The restrict pointer keyword was added to let the compiler know that the variables are independent and do not overlap in memory.
- #pragma vector aligned was added to inform the compiler that the respective data variables are aligned in memory.
If we compile this new version, the compiler spells out the following report:
Begin optimization report for: vadd_v2(double *, double *, double *, int)
Report from: Vector optimizations [vec]
LOOP BEGIN at src/vadd_v2.c(5,2)
remark #15388: vectorization support: reference c[i] has aligned access [ src/vadd_v2.c(5,25) ]
remark #15388: vectorization support: reference a[i] has aligned access [ src/vadd_v2.c(5,30) ]
remark #15388: vectorization support: reference b[i] has aligned access [ src/vadd_v2.c(5,35) ]
remark #15305: vectorization support: vector length 2
remark #15300: LOOP WAS VECTORIZED
remark #15448: unmasked aligned unit stride loads: 2
remark #15449: unmasked aligned unit stride stores: 1
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 8
remark #15477: vector cost: 2.500
remark #15478: estimated potential speedup: 3.200
remark #15488: --- end vector cost summary ---
LOOP END
LOOP BEGIN at src/vadd_v2.c(5,2)
<Remainder loop for vectorization>
LOOP END
NOTE: This report was generated by the Intel C compiler version 17.0.4 with the flags -qopt-report-phase=vec, -qopt-report=5 and -qopt-report-file=stderr.
From the information provided by the compiler we can underline the following:
- all three references (c[i], a[i] and b[i]) now have aligned access;
- the loop was vectorized (remark #15300), using unmasked aligned unit stride loads and stores;
- the estimated potential speedup over the scalar loop is 3.2x (scalar cost 8 versus vector cost 2.5).
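The same two best practices carry over to the matrix multiplication kernel. The actual matmul_v2 source is not reproduced in this pattern, so the following is only a sketch of how restrict and #pragma vector aligned might be applied to a naive triple loop; the i-k-j loop order is used so that the innermost loop has unit-stride accesses, and the assumptions are listed in the leading comment:

/* Sketch only: the real matmul_v2 kernel is not shown here and may differ in
 * loop order, blocking and data layout. Assumes row-major n x n matrices,
 * c pre-initialized to zero, 64-byte aligned base pointers, and n a multiple
 * of the vector length so that every row starts on an aligned boundary
 * (otherwise the aligned pragma must be dropped). */
void matmul_v2_sketch(double * restrict c, const double * restrict a,
                      const double * restrict b, int n)
{
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < n; k++) {
            const double aik = a[i * n + k];
            /* unit-stride inner loop: contiguous accesses to b and c vectorize well */
            #pragma vector aligned
            for (int j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
    }
}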
This application can be compiled with the provided Makefile. Just type in a console:
make all
- Compiles all versions of the vadd and matmul kernels and creates the executable (in the bin directory): bp.all
make vadd
- Compiles all versions of the vadd kernel and creates the executable (in the bin directory): bp.vadd
make vadd_v1
- Compiles the vadd_v1 kernel and creates the executable (in the bin directory): bp.vadd_v1
make vadd_v2
- Compiles the vadd_v2 kernel and creates the executable (in the bin directory): bp.vadd_v2
make matmul
- Compiles all versions of the matmul kernel and creates the executable (in the bin directory): bp.matmul
make matmul_v1
- Compiles the matmul_v1 kernel and creates the executable (in the bin directory): bp.matmul_v1
make matmul_v2
- Compiles the matmul_v2 kernel and creates the executable (in the bin directory): bp.matmul_v2
make matmul_v3
- Compiles the matmul_v3 kernel and creates the executable (in the bin directory): bp.matmul_v3
To run the application, navigate to the bin directory and run the desired executable, e.g. ./bp.all
bp.all
- The application executes both kernels and their respective implemented versions (vadd and matmul);
bp.vadd
- The application executes all versions of the vadd kernel;
bp.vadd_v1
- The application executes the vadd_v1 kernel;
bp.vadd_v2
- The application executes the vadd_v2 kernel;
bp.matmul
- The application executes all versions of the matmul kernel;
bp.matmul_v1
- The application executes the matmul_v1 kernel;
bp.matmul_v2
- The application executes the matmul_v2 kernel;
bp.matmul_v3
- The application executes the matmul_v3 kernel.