Low computational performance calling BLAS routines (gemm)

Usual symptom(s):
  • IPC Scaling: The IPC Scaling (IPCS) compares IPC to the reference case.

In many scientific applications, parallel matrix-matrix multiplications represent the computational bottleneck. Consider the following matrix-matrix multiplication:

\[\bf{C}= \bf {A} \times \bf {B}\]

where, \(\bf {A}\) is a matrix with \(N_m\) rows and \(N_k\) columns (i.e. \(N_m \times N_k\)), \(\bf {B}\) is a \(N_k \times N_n\) matrix so that \(\bf {C}\) is a \(N_m \times N_n\) matrix.

In the case where each process has access to \(\bf {A}\) and \(\bf {B}\), a strategy to parallelize the work is to divide \(\bf{B}\) into blocks so as to partition the computation over all the processes, and use the BLAS dgemm routine on the individual blocks. A sketch of the aforementioned strategy can be represented as follows, where for simplicity we assume the number of columns splits evenly over the processes:

\[\bf {C}_b = \bf {A} \times \bf {B}_b\]

where \(\bf {C}_b\) and \(\bf{B}_b\) are matrix blocks generated by splitting over the columns. For \(p\) processes, the blocks \(\bf {B}_b\) are \(N_k \times N_w\), where \(N_w=N_n/p\), and the blocks \(\bf{C}_b\) are \(N_m \times N_w\). In this way, the higher the number of processes \(p\), the smaller the number of columns of \(\bf{B}_b\) and \(\bf{C}_b\). A pseudo-code implementation of this algorithm is:

  given:
   N_m, N_n, N_k

  execute:
   get_my_process_id(proc_id)
   get_number_of_processes(n_proc)

   DEFINE first, last, len_n_part AS INTEGER
   DEFINE alpha, beta AS DOUBLE

   first = floor(float(N_n) * proc_id / n_proc)
   last = floor(float(N_n) * (proc_id + 1) / n_proc) - 1

   len_n_part = last - first + 1

   SET alpha TO 1.0
   SET beta TO 0.0
   call dgemm('N', 'N', N_m, len_n_part, N_k, alpha, A, N_m, B[:, first:last], N_k, beta, C[:, first:last], N_m)

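As a concrete illustration, the following is a minimal C sketch of this pattern, assuming MPI and a CBLAS interface to the BLAS library (header names vary between BLAS distributions; with MKL one would typically include mkl.h). Matrix initialisation, error handling and the final gathering of the \(\bf{C}\) blocks are omitted; it is a sketch of the pattern, not a complete implementation.

   #include <mpi.h>
   #include <stdlib.h>
   #include <cblas.h>

   int main(int argc, char **argv)
   {
       int proc_id, n_proc;
       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &proc_id);
       MPI_Comm_size(MPI_COMM_WORLD, &n_proc);

       const int N_m = 3600, N_n = 3600, N_k = 3600;   /* example sizes */

       /* Column range [first, last] of B and C handled by this process. */
       int first      = (int)((double)N_n * proc_id / n_proc);
       int last       = (int)((double)N_n * (proc_id + 1) / n_proc) - 1;
       int len_n_part = last - first + 1;

       /* Column-major storage: A is N_m x N_k, B is N_k x N_n, C is N_m x N_n;
          every process holds A and B and fills its own column block of C. */
       double *A = calloc((size_t)N_m * N_k, sizeof(double));
       double *B = calloc((size_t)N_k * N_n, sizeof(double));
       double *C = calloc((size_t)N_m * N_n, sizeof(double));

       /* B_b and C_b start at column 'first'; with column-major storage this
          is an offset of first * (leading dimension) elements. */
       cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                   N_m, len_n_part, N_k,
                   1.0, A, N_m,
                   B + (size_t)first * N_k, N_k,
                   0.0, C + (size_t)first * N_m, N_m);

       free(A); free(B); free(C);
       MPI_Finalize();
       return 0;
   }
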
In the ideal case, doubling the number of processes halves the computational time. This strategy has been used in software implementing density functional theory (DFT). However, a POP Audit (see the related report linked below) of one such application showed low global efficiency and non-optimal performance. One of the sources of the performance loss was poor Computation Scaling caused by the aforementioned splitting of the matrix-matrix multiplications.

Further work undertaken by POP in a ‘Proof of Concept’ study identified that certain splittings of one (or both) input matrices can result in non-optimal performance. The linked best-practice page shows the importance of undertaking experiments to identify optimal matrix splittings and so ensure good computational performance.

Speedup tests

A simple benchmark code is used to measure the parallel scaling of the pattern by replicating the \(\bf{A} \times \bf{B}_b\) computation on the processes of a single compute node. For timing purposes there is no need to run the computation on all compute nodes, since the computation per node is identical. The three matrices have size \(N_n \times N_n\), where \(N_n = 3600\).

The scaling plot below was obtained using the sequential dgemm subroutine from MKL in the pattern described above. For the block multiplication, the dgemm parameters \(\alpha\) and \(\beta\) are set to \(1.0\) and \(0.0\), respectively.

The data below is for a fully loaded compute node on MareNostrum-4 (48 processes); the timing is measured for each process, and the computation is repeated 100 times to give measurable timings.
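
A minimal sketch of such a benchmark, again assuming MPI and a CBLAS interface, could look as follows; the matrix values and output format are illustrative only.

   #include <mpi.h>
   #include <stdio.h>
   #include <stdlib.h>
   #include <cblas.h>

   int main(int argc, char **argv)
   {
       int proc_id, n_proc;
       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &proc_id);
       MPI_Comm_size(MPI_COMM_WORLD, &n_proc);

       const int N    = 3600;                     /* N_m = N_k = N_n = 3600 */
       const int reps = 100;                      /* repetitions per timing */
       int first = (int)((double)N * proc_id / n_proc);
       int last  = (int)((double)N * (proc_id + 1) / n_proc) - 1;
       int cols  = last - first + 1;              /* columns of B_b, C_b on this rank */

       /* For timing, each process only needs A and its own blocks B_b and C_b. */
       double *A  = malloc(sizeof(double) * (size_t)N * N);
       double *Bb = malloc(sizeof(double) * (size_t)N * cols);
       double *Cb = malloc(sizeof(double) * (size_t)N * cols);
       for (size_t i = 0; i < (size_t)N * N; ++i)    A[i]  = 1.0;
       for (size_t i = 0; i < (size_t)N * cols; ++i) Bb[i] = 1.0;

       /* Repeat the block multiplication and measure the time on each process. */
       double t0 = MPI_Wtime();
       for (int rep = 0; rep < reps; ++rep)
           cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                       N, cols, N, 1.0, A, N, Bb, N, 0.0, Cb, N);
       double t1 = MPI_Wtime();

       printf("rank %3d: %4d columns, %8.3f s for %d calls\n",
              proc_id, cols, t1 - t0, reps);

       free(A); free(Bb); free(Cb);
       MPI_Finalize();
       return 0;
   }

Launching one rank per core (e.g. mpirun -np 48 ./benchmark on a 48-core node) and repeating the run for different process counts gives the per-process timings from which a speedup plot such as the one below can be built.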

Figure 1: Speedup plot for the column-split dgemm pattern.

It can be observed that partitioning the work by splitting matrix \(\bf{B}\) over its columns results in a reduced speedup as the number of processes increases.

Recommended best-practice(s):
  • Low computational performance calling BLAS routines

Related program(s):
  • BAND