Most of a program's execution time is typically spent in loops. One complication when parallelizing a sequential loop is that the loop body may depend on results produced by previous iterations of itself.
Let’s review a sequential loop. The following code performs matrix multiplication. It receives two input matrices A and B, and generates matrix C. The C matrix is assumed to be initialized to zero before calling the algorithm:
// Assuming C zero initialized: C(lxn) := A(lxm)·B(mxn)
void matmul_v0(const double* A, const double* B, double* C, const int l, const int m, const int n)
{
  for (int i = 0; i < l; i++) {
    for (int j = 0; j < n; j++) {
      for (int k = 0; k < m; k++) {
        C[i*n+j] += A[i*m+k]*B[k*n+j];
      }
    }
  }
}
This initial version of the matrix multiplication code presents a loop-carried dependence. That is, consecutive iterations of the innermost loop update the very same element of the C matrix (stored as a one-dimensional array). That prevents vectorization at this level. Let's look at the optimization report the compiler generates with -qopt-report=5.
LOOP BEGIN at matmul_v0.c(14,7)
remark #15344: loop was not vectorized: vector dependence prevents vectorization
remark #15346: vector dependence: assumed FLOW dependence between C[i*n+j] (15:9) and C[i*n+j] (15:9)
remark #15346: vector dependence: assumed ANTI dependence between C[i*n+j] (15:9) and C[i*n+j] (15:9)
LOOP END
The report indicates that the compiler has not vectorized the loop. The whole compiler report is:
Intel(R) Advisor can now assist with vectorization and show optimization report messages with your source code. See “https://software.intel.com/en-us/intel-advisor-xe” for details.
Intel(R) C Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 17.0.4.196 Build 20170411
Compiler options: -c -std=c99 -unroll0 -qopt-report-phase=vec -qopt-report=5 -qopt-report-file=stderr
Begin optimization report for: matmul_v0(const double *, const double *, double *, const int, const int, const int)
Report from: Vector optimizations [vec]
LOOP BEGIN at matmul_v0.c(12,3)
   remark #15541: outer loop was not auto-vectorized: consider using SIMD directive
   LOOP BEGIN at matmul_v0.c(13,5)
      remark #15541: outer loop was not auto-vectorized: consider using SIMD directive
      LOOP BEGIN at matmul_v0.c(14,7)
         remark #15344: loop was not vectorized: vector dependence prevents vectorization
         remark #15346: vector dependence: assumed FLOW dependence between C[i*n+j] (15:9) and C[i*n+j] (15:9)
         remark #15346: vector dependence: assumed ANTI dependence between C[i*n+j] (15:9) and C[i*n+j] (15:9)
      LOOP END
   LOOP END
LOOP END
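The dependence reported above lives in the k loop, because every k iteration accumulates into the same C[i*n+j]. One common way to move it out of the innermost loop is to interchange the j and k loops, so that consecutive innermost iterations touch different elements of C. The following is a minimal sketch of that idea, not necessarily the approach this text takes next. The name matmul_ikj is hypothetical, and the restrict qualifiers are an added assumption (the caller promises that C does not overlap A or B). Whether a given compiler then vectorizes the loop should be confirmed in its optimization report.

```c
#include <assert.h>

// Sketch (hypothetical name matmul_ikj): same product as matmul_v0,
// C(lxn) := A(lxm)·B(mxn), with the j and k loops interchanged.
// Assumes C is zero initialized and, via restrict, that C does not
// overlap A or B in memory.
void matmul_ikj(const double* restrict A, const double* restrict B,
                double* restrict C, const int l, const int m, const int n)
{
  assert(l >= 0 && m >= 0 && n >= 0);
  for (int i = 0; i < l; i++) {
    for (int k = 0; k < m; k++) {
      const double a = A[i*m+k];   // invariant in the innermost loop
      for (int j = 0; j < n; j++) {
        // Each j iteration writes a different element of C, and both
        // C and B are accessed with unit stride.
        C[i*n+j] += a * B[k*n+j];
      }
    }
  }
}
```

Each element of C still receives its m contributions in the same order (k = 0, 1, ..., m-1), so the interchange does not reorder the additions for any single element.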
===========================================================================
The following code computes a vector addition. It takes two input vectors (A and B) and computes a third one (C) by adding them element by element.
void vadd(double *C, double *A, double *B, int n)
{
  for (int i=0; i<n; i++) {
    C[i] = A[i] + B[i];
  }
}
Although vectorization is possible, the compiler is forced to create pre- and post-loops around the vectorized set of iterations. It creates a peeled version (pre) because it cannot determine the alignment of the arrays involved in the loop. It is reported as:
LOOP BEGIN at vadd_v0.c(10,3)
<Peeled loop for vectorization, Multiversioned v1>
LOOP END
The compiler is forced to create a remainder version (post) because it cannot determine, at compile time, the size of the arrays involved in the loop, so the number of iterations may not be a multiple of the vector length. It is reported as:
LOOP BEGIN at vadd_v0.c(10,3)
<Remainder loop for vectorization, Multiversioned v1>
LOOP END
And finally, the compiler is forced to create an additional non-vectorized version of the loop because it cannot determine, at compile time, whether the arrays involved overlap in memory. Memory overlapping (i.e., aliasing) could create a data dependence that is impossible to detect by means of static analysis. The compiler reports this situation as ‘Multiversioned v2’:
LOOP BEGIN at vadd_v0.c(10,3)
<Multiversioned v2>
remark #15304: loop was not vectorized: non-vectorizable loop instance from multiversioning
LOOP END
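When the caller can guarantee that the output array never overlaps the inputs, C99 offers the restrict qualifier to state that promise in the function signature. The sketch below is one way to write it. The names vadd_restrict and disjoint are hypothetical, and the debug-only assert is an added illustration of the kind of overlap test that otherwise has to happen at run time. Calling the function with overlapping arrays is undefined behavior, and the effect on multiversioning should be checked in the optimization report rather than assumed.

```c
#include <assert.h>
#include <stdint.h>

// Debug helper (hypothetical): nonzero when the n-element ranges
// starting at p and q share no bytes.
static int disjoint(const double *p, const double *q, int n)
{
  if (n <= 0) return 1;
  uintptr_t a = (uintptr_t)p, b = (uintptr_t)q;
  uintptr_t bytes = (uintptr_t)n * sizeof(double);
  return a + bytes <= b || b + bytes <= a;
}

// Same loop as vadd, but restrict tells the compiler that C is not
// accessed through A or B. A and B are only read, so they may alias
// each other without creating a dependence.
void vadd_restrict(double *restrict C, const double *restrict A,
                   const double *restrict B, int n)
{
  assert(disjoint(C, A, n) && disjoint(C, B, n));  // removed by -DNDEBUG
  for (int i = 0; i < n; i++) {
    C[i] = A[i] + B[i];
  }
}
```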
The whole compiler report is:
Intel(R) Advisor can now assist with vectorization and show optimization report messages with your source code. See “https://software.intel.com/en-us/intel-advisor-xe” for details.
Intel(R) C Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 17.0.4.196 Build 20170411
Compiler options: -c -std=c99 -unroll0 -qopt-report-phase=vec -qopt-report=5 -qopt-report-file=stderr
Begin optimization report for: vadd(double *, double *, double *, int)
Report from: Vector optimizations [vec]
LOOP BEGIN at vadd_v0.c(10,3)
<Peeled loop for vectorization, Multiversioned v1>
LOOP END
LOOP BEGIN at vadd_v0.c(10,3)
<Multiversioned v1>
remark #15388: vectorization support: reference C[i] has aligned access [ vadd_v0.c(11,5) ]
remark #15389: vectorization support: reference A[i] has unaligned access [ vadd_v0.c(11,12) ]
remark #15388: vectorization support: reference B[i] has aligned access [ vadd_v0.c(11,19) ]
remark #15381: vectorization support: unaligned access used inside loop body
remark #15305: vectorization support: vector length 2
remark #15309: vectorization support: normalized vectorization overhead 2.833
remark #15300: LOOP WAS VECTORIZED
remark #15442: entire loop may be executed in remainder
remark #15448: unmasked aligned unit stride loads: 1
remark #15449: unmasked aligned unit stride stores: 1
remark #15450: unmasked unaligned unit stride loads: 1
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 8
remark #15477: vector cost: 3.000
remark #15478: estimated potential speedup: 2.580
remark #15488: --- end vector cost summary ---
LOOP END
LOOP BEGIN at vadd_v0.c(10,3)
<Alternate Alignment Vectorized Loop, Multiversioned v1>
LOOP END
LOOP BEGIN at vadd_v0.c(10,3)
<Remainder loop for vectorization, Multiversioned v1>
LOOP END
LOOP BEGIN at vadd_v0.c(10,3)
<Multiversioned v2>
remark #15304: loop was not vectorized: non-vectorizable loop instance from multiversioning
LOOP END
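To make the peeled, vectorized, and remainder loops of the report more concrete, the following scalar model walks through the same three phases by hand. It is only a sketch under stated assumptions: the vector length of 2 comes from remark #15305, the 16-byte alignment target and the choice of peeling on C are assumptions, and vadd_model, VL, and VBYTES are hypothetical names. The "Alternate Alignment" and "Multiversioned v2" versions are left out.

```c
#include <assert.h>
#include <stdint.h>

enum { VL = 2 };                        // doubles per vector (remark #15305)
enum { VBYTES = VL * sizeof(double) };  // 16 bytes: assumed alignment target

// Scalar model of the loop structure described in the report.
void vadd_model(double *C, double *A, double *B, int n)
{
  int i = 0;

  // Peeled loop (pre): scalar iterations until &C[i] is VBYTES-aligned.
  while (i < n && (uintptr_t)&C[i] % VBYTES != 0) {
    C[i] = A[i] + B[i];
    i++;
  }
  assert(i >= n || (uintptr_t)&C[i] % VBYTES == 0);

  // Main loop: whole groups of VL elements. A real vectorized loop
  // would issue one vector add per group instead of the inner loop.
  for (; i + VL <= n; i += VL) {
    for (int v = 0; v < VL; v++) {
      C[i+v] = A[i+v] + B[i+v];
    }
  }

  // Remainder loop (post): fewer than VL elements are left over.
  for (; i < n; i++) {
    C[i] = A[i] + B[i];
  }
}
```

For a small n, the peeled and remainder loops may cover every iteration, which is consistent with remark #15442 ("entire loop may be executed in remainder").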
===========================================================================