Effective auto vectorization

Pattern addressed: Sequential loops

Most of the program execution time is spent on cycles. One complication with parallelizing the sequential loop is that the body of the loop may also depend on the previous invocations of its self. (more...)

A number of methods of automatic generation of vector instructions have been developed for optimizing compilers. They allow vectorizing code without control branches or with some template branches. Consider the Automatic vectorization of a vectoring allowing to vectorize cycles without management branches. The basic idea of the Automatic vectorization consists of Loop Unroll to create copies of the original scalar expression and further replacement by vector instructions of groups of scalar instructions. Automatic vectorization is a special case of automatic paralleling where the program code is converted from a scalar implementation that processes one operation on several operand pairs at once.

To vectorize a program code, the compiler optimizer must first understand the dependence between statements and align them if necessary. After the dependencies are compared, the optimizer should correctly arrange the instructions implementing changes of corresponding candidates in vector instructions that work on several data elements.

There are several issues that need to be understood for effective optimization:

Data layout in memory aligned. A simple vector load from aligned memory location will give optimal performance for that load.
The extra instructions and data accesses reduce performance.

The compilers still have limitations, here is the list of requirements in order for a loop to potentially vectorized:

If a loop is a part of a loop nest, it must be the inner loop.
The loop must contain straight-line code. There should be no jumps or branches.
The loop must be countable, which means that the number of iterations must be known before the loop starts to execute. There must be no data-dependent exit conditions.
There should be no backward loop-carried dependencies. For example, the loop must not require statement 2 of iteration 1 to be executed before statement 1 of iteration 2.

Because vectorization happens in inner loop consisting of straight-line code, special operators and function or subroutine calls is critical. Intrinsic math functions such as sin(), log() are allowed, since the compiler runtime library contains vectorized versions of these functions. The following math functions may be vectorized by the compiler: sin, cos, tan, asin, acos, atan, log, log2, log10, exp and etc. For further information about vectorized versions of functions see the compiler manual.

Following with the example we have seen in the related pattern, we should notice the compiler report is detecting a loop-carried dependence. To avoid it we can change the order of nested loops in a way that the inner loop updates, at each iteration, different elements of the array C. Changes are reflected in the following code:

void matmul_v1( const double* A, const double* B, double* C, const int l, const int m, const int n)
{
  for (int i=0;i<l;i++ ) {
    for (int k=0;k<m;k++ ) {
      for (int j=0;j<n;j++ ) {
        C[i*n+j] += A[i*m+k]*B[k*n+j];
      }
    }
  }
}

With this change, the compiler is able to vectorize the inner loop. But it raises the two problems already shown in the related pattern: memory alignment (peeled loop: Peeled loop for vectorization) and aliasing (alternative version: Multiversioned v2). The (simplified) report now reads:

LOOP BEGIN at matmul_v1.c(11,3)
   remark #15542: loop was not vectorized: inner loop was already vectorized

   LOOP BEGIN at matmul_v1.c(12,5)
      remark #15542: loop was not vectorized: inner loop was already vectorized

      LOOP BEGIN at matmul_v1.c(13,7)
      <Peeled loop for vectorization, Multiversioned v1>
      LOOP END

      LOOP BEGIN at matmul_v1.c(13,7)
      <Multiversioned v1>
         ...
         remark #15300: LOOP WAS VECTORIZED
         ...
      LOOP END

      LOOP BEGIN at matmul_v1.c(13,7)
      <Alternate Alignment Vectorized Loop, Multiversioned v1>
      LOOP END

      LOOP BEGIN at matmul_v1.c(13,7)
      <Remainder loop for vectorization, Multiversioned v1>
      LOOP END

      LOOP BEGIN at matmul_v1.c(13,7)
      <Multiversioned v2>
         remark #15304: loop was not vectorized: non-vectorizable loop instance from multiversioning
      LOOP END
   LOOP END
LOOP END

It is better to explicitly let the compiler know that variables are independent (i.e., they do not overlap in memory). One way to do it is to use restrict pointer keyword. Another way is to use the ivdep directive. A new version of the matrix multiplication algorithm annotates all the matrices with the restrict attribute:

#if defined(INTEL_COMPILER)
  #define RESTRICT restrict
#else /* non-Intel compiler */
  #define RESTRICT __restrict
#endif
// Assuming C zero initialized: C(lxn) := A(lxm)·B(mxn)
void matmul_v2( const double *RESTRICT A, const double *RESTRICT B, double *RESTRICT C, const int l, const int m, const int n)
{
  for (int i=0;i<l;i++ ) {
    for (int k=0;k<m;k++ ) {
      for (int j=0;j<n;j++ ) {
        C[i*n+j] += A[i*m+k]*B[k*n+j];
      }
    }
  }
}

That provides extra information to the compiler avoiding to detect at runtime if matrices overlap in memory and also avoiding to generate a second non-vectorized version of the algorithm, but still generating a peeled version (because it is not able to detect the matrices are aligned).

It is important for optimization to inform the compiler of the alignment information. For a specific variable, the __assume_aligned macro can be used to inform the compiler that a particular variable is aligned. A more general directive can be put in front of a loop to tell the compiler that all data in the loop is aligned. The resulting code will be:

#if defined(INTEL_COMPILER)
  #define RESTRICT restrict
#else /* non-Intel compiler */
  #define RESTRICT __restrict
#endif

// Assuming C zero initialized: C(lxn) := A(lxm)·B(mxn)
void matmul_v3( const double * RESTRICT A, const double *RESTRICT B, double *RESTRICT C, const int l, const int m, const int n)
{
  for (int i=0;i<l;i++ ) {
    for (int k=0;k<m;k++ ) {
#pragma vector aligned
      for (int j=0;j<n;j++ ) {
        C[i*n+j] += A[i*m+k]*B[k*n+j];
      }
    }
  }
}

In is important to notice that matrices involved in the algorithm must actually fulfill with that restriction, that is, the programmer must guarantee the allocation of these matrices is actually aligned.

After applying these changes to the matrix multiplication code we can verify that peeled loop (i.e., Peeled loop for vectorization) and aliasing (i.e., Multiversioned v2) have disappeared from the report. The simplified report now reads:

LOOP BEGIN at matmul_v3.c(13,3)
   remark #15542: loop was not vectorized: inner loop was already vectorized

   LOOP BEGIN at matmul_v3.c(14,5)
      remark #15542: loop was not vectorized: inner loop was already vectorized

      LOOP BEGIN at matmul_v3.c(16,7)
         remark #15388: vectorization support: reference C[i*n+j] has aligned access   [ matmul_v3.c(17,9) ]
         remark #15388: vectorization support: reference C[i*n+j] has aligned access   [ matmul_v3.c(17,9) ]
         remark #15388: vectorization support: reference B[k*n+j] has aligned access   [ matmul_v3.c(17,30) ]
         remark #15305: vectorization support: vector length 2
         remark #15309: vectorization support: normalized vectorization overhead 0.250
         remark #15300: LOOP WAS VECTORIZED
         remark #15448: unmasked aligned unit stride loads: 2
         remark #15449: unmasked aligned unit stride stores: 1
         remark #15475: --- begin vector cost summary ---
         remark #15476: scalar cost: 11
         remark #15477: vector cost: 4.000
         remark #15478: estimated potential speedup: 2.740
         remark #15488: --- end vector cost summary ---
      LOOP END

      LOOP BEGIN at matmul_v3.c(16,7)
      <Remainder loop for vectorization>
      LOOP END
   LOOP END
LOOP END

Applying these two later changes (i.e., restrict and aligned) on the vadd() function (shown in the related pattern) will produce the following code:

#if defined(INTEL_COMPILER)
  #define RESTRICT restrict
#else /* non-Intel compiler */
  #define RESTRICT __restrict
#endif
void vadd( double * RESTRICT c, double * RESTRICT a, double * RESTRICT b, int n)
{
  int i;
  #pragma vector aligned
  for (i = 0; i < n; i++) {
    c[i] = a[i] + b[i];
  }
}

Recommended in program(s): For loops poor auto-vectorization ·

Implemented in program(s): For loops full auto-vectorization ·