Sequential loops

Most of the program execution time is spent on cycles. One complication with parallelizing the sequential loop is that the body of the loop may also depend on the previous invocations of its self.

Let’s review a sequential loop. The following code performs matrix multiplication. It receives two input matrices A and B, and generates matrix C. The C matrix is assumed to be initialized to zero before calling the algorithm:

// Assuming C zero initialized: C(lxn) := A(lxm)·B(mxn)
void matmul_v0( const double* A, const double* B, double* C, const int l, const int m, const int n)
{
  for (int i=0;i<l;i++ ) {
	for (int j=0;j<n;j++ ) {
	  for (int k=0;k<m;k++ ) {
		C[i*n+j] += A[i*m+k]*B[k*n+j];
	  }
	}
  }
}

This initial version of the matrix multiplication code presents a loop carried dependence. This, two consecutive iterations of the inner loop update the very same element in the C matrix (reported as vector). That will prevent the vectorization at this level. Let’s look at the optimization report of the compiler generated with -opt-report5.

LOOP BEGIN at matmul_v0.c(14,7)
   remark #15344: loop was not vectorized: vector dependence prevents vectorization
   remark #15346: vector dependence: assumed FLOW dependence between C[i*n+j] (15:9) and C[i*n+j] (15:9)
   remark #15346: vector dependence: assumed ANTI dependence between C[i*n+j] (15:9) and C[i*n+j] (15:9)
LOOP END

The report indicates the compiler has vectorized the loop. The whole compiler report is: Intel(R) Advisor can now assist with vectorization and show optimization report messages with your source code. See “https://software.intel.com/en-us/intel-advisor-xe” for details.

Intel(R) C Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 17.0.4.196 Build 20170411

Compiler options: -c -std=c99 -unroll0 -qopt-report-phase=vec -qopt-report=5 -qopt-report-file=stderr

Begin optimization report for: matmul_v0(const double *, const double *, double *, const int, const int, const int)

	Report from: Vector optimizations [vec]


LOOP BEGIN at matmul_v0.c(12,3)
   remark #15541: outer loop was not auto-vectorized: consider using SIMD directive

   LOOP BEGIN at matmul_v0.c(13,5)
	  remark #15541: outer loop was not auto-vectorized: consider using SIMD directive

	  LOOP BEGIN at matmul_v0.c(14,7)
		 remark #15344: loop was not vectorized: vector dependence prevents vectorization
		 remark #15346: vector dependence: assumed FLOW dependence between C[i*n+j] (15:9) and C[i*n+j] (15:9)
		 remark #15346: vector dependence: assumed ANTI dependence between C[i*n+j] (15:9) and C[i*n+j] (15:9)
	  LOOP END
   LOOP END
LOOP END
===========================================================================

The following code computes a vector addition. It uses two nput matrices (A and B) and computes a third one (C) by adding element per element of the former ones.

void vadd ( double *C, double *A, double *B, int n)
{
  for (int i=0; i<n; i++) {
	C[i] = A[i] + B[i];
  }
}

Although vectorization is posible, the compiler is forced to create pre- and post- loops to be able to vectorize the inner set of iterations. The compiler is forced to create a peeled version (pre) due to it is not able to determine the alignment of the arrays involved in the loop. It is reported as:

LOOP BEGIN at vadd_v0.c(10,3)
<Peeled loop for vectorization, Multiversioned v1>
LOOP END

The compiler is forced to create a remainder version (post) due to it is not able to determine the size of the arrays involved in the loop. It is reported as:

LOOP BEGIN at vadd_v0.c(10,3)
<Remainder loop for vectorization, Multiversioned v1>
LOOP END

And finally, the compiler is forced to create an additional non-vectorized version of the loop due it is not able to determine, at compile time, if the involved arrays overlap in memory. Memory overlapping (i.e., aliasing) could create a data dependence imposible to detect by means of static analysis. The compiler reports this situation as ‘Multiversioned v2’:

LOOP BEGIN at vadd_v0.c(10,3)
<Multiversioned v2>
   remark #15304: loop was not vectorized: non-vectorizable loop instance from multiversioning
LOOP END

The whole compiler report is: Intel(R) Advisor can now assist with vectorization and show optimization report messages with your source code. See “https://software.intel.com/en-us/intel-advisor-xe” for details.

Intel(R) C Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 17.0.4.196 Build 20170411

Compiler options: -c -std=c99 -unroll0 -qopt-report-phase=vec -qopt-report=5 -qopt-report-file=stderr

Begin optimization report for: vadd(double *, double *, double *, int)

	Report from: Vector optimizations [vec]


LOOP BEGIN at vadd_v0.c(10,3)
<Peeled loop for vectorization, Multiversioned v1>
LOOP END

LOOP BEGIN at vadd_v0.c(10,3)
<Multiversioned v1>
   remark #15388: vectorization support: reference C[i] has aligned access   [ vadd_v0.c(11,5) ]
   remark #15389: vectorization support: reference A[i] has unaligned access   [ vadd_v0.c(11,12) ]
   remark #15388: vectorization support: reference B[i] has aligned access   [ vadd_v0.c(11,19) ]
   remark #15381: vectorization support: unaligned access used inside loop body
   remark #15305: vectorization support: vector length 2
   remark #15309: vectorization support: normalized vectorization overhead 2.833
   remark #15300: LOOP WAS VECTORIZED
   remark #15442: entire loop may be executed in remainder
   remark #15448: unmasked aligned unit stride loads: 1
   remark #15449: unmasked aligned unit stride stores: 1
   remark #15450: unmasked unaligned unit stride loads: 1
   remark #15475: --- begin vector cost summary ---
   remark #15476: scalar cost: 8
   remark #15477: vector cost: 3.000
   remark #15478: estimated potential speedup: 2.580
   remark #15488: --- end vector cost summary ---
LOOP END

LOOP BEGIN at vadd_v0.c(10,3)
<Alternate Alignment Vectorized Loop, Multiversioned v1>
LOOP END

LOOP BEGIN at vadd_v0.c(10,3)
<Remainder loop for vectorization, Multiversioned v1>
LOOP END

LOOP BEGIN at vadd_v0.c(10,3)
<Multiversioned v2>
   remark #15304: loop was not vectorized: non-vectorizable loop instance from multiversioning
LOOP END
===========================================================================
Recommended best-practice(s): Related program(s):
  • For loops poor auto-vectorization