Low computational performance calling BLAS routines

Version's name: Low computational performance calling BLAS routines; a version of the BLAS Tuning program.
Patterns and behaviours: Low computational performance calling BLAS routines (gemm)
Recommended best practices: Tuning BLAS routines invocation (gemm)

The following pseudo-code shows the implementation of the pattern when the tiles are formed by splitting only the columns of matrix \(\bf{B}\). Each process performs the same amount of computation; the MKL dgemm subroutine is responsible for the tile multiplication \(\bf{A} \times \bf{B}_b\), where \(\bf{B}_b\) is the tile of \(\bf{B}\) generated by splitting over the columns.

  given: 
   N_m, N_n, N_k

  execute:
   get_my_process_id(proc_id)
   get_number_of_processes(n_proc)

   DEFINE first, last, len_n_part AS INTEGER
   DEFINE alpha, beta AS DOUBLE

   first = float(N_n) * proc_id / n_proc
   last = (float(N_n) * (proc_id + 1) / n_proc) - 1

   len_n_part = last - first + 1

   SET alpha TO 1.0
   SET beta TO 0.0
   call dgemm('N', 'N', N_m, len_n_part, N_k, alpha, A, N_m, B[:,first:last], N_k, beta, C, N_m)
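
The partitioning arithmetic in the pseudo-code can be checked in isolation. The following Python sketch is illustrative only and is not taken from dgemm_pattern.f90: the helper names `column_range`, `matmul`, and `tiled_product` are hypothetical, integer floor division stands in for the float-then-truncate step, and a naive pure-Python product stands in for dgemm. Under those assumptions it shows that the per-process column ranges cover \(\bf{B}\) exactly once and that stitching the tile products together reproduces the full product.

```python
def column_range(n_cols, proc_id, n_proc):
    """Return (first, last, length) of the 0-based, inclusive column range owned by proc_id."""
    first = (n_cols * proc_id) // n_proc
    last = (n_cols * (proc_id + 1)) // n_proc - 1
    return first, last, last - first + 1


def matmul(a, b):
    """Naive matrix product on lists of rows; a stand-in for dgemm in this sketch."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def tiled_product(a, b, n_proc):
    """Multiply A by each column tile of B in turn and concatenate the resulting tiles of C."""
    c = [[] for _ in a]
    for proc_id in range(n_proc):
        first, last, _ = column_range(len(b[0]), proc_id, n_proc)
        b_tile = [row[first:last + 1] for row in b]   # B[:, first:last]
        c_tile = matmul(a, b_tile)                    # work done by one process
        for c_row, tile_row in zip(c, c_tile):
            c_row.extend(tile_row)
    return c


if __name__ == "__main__":
    ranges = [column_range(3600, p, 48) for p in range(48)]
    print(ranges[0], ranges[-1])  # (0, 74, 75) (3525, 3599, 75)

    a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    b = [[1.0, 0.0, 2.0, 1.0, 3.0], [0.0, 1.0, 1.0, 2.0, 0.0]]
    print(tiled_product(a, b, 3) == matmul(a, b))  # True
```

For N_n = 3600 and 48 processes every range is 75 columns wide, which matches the `length of n part` value in the sample output.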

Code purpose:

dgemm_pattern.f90 can be used to measure parallel scaling when matrices \(\bf{B}\) and \(\bf{C}\) are split over their columns. The code runs the computation for one compute node only, which is all that timing requires, because the computation and its duration would be identical on the other nodes. The parallel scaling is computed by simulating the computation on \(nNodes\) nodes relative to one node.

The code prints the information needed to understand the performance of the pattern: the input parameters passed by the user, the kernel wall time, the PAPI counters PAPI_TOT_CYC and PAPI_TOT_INS (the sample output further down also reports PAPI_REF_CYC), the IPC, and the frequency.
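
As a rough cross-check of the derived metrics, the sketch below recomputes them from the counter values in the sample output further down. IPC is, by definition, instructions divided by cycles. How dgemm_pattern.f90 derives the frequency is not stated here; dividing PAPI_REF_CYC by the wall time happens to reproduce the reported figure, but that is an observation about these sample values, not a description of the code. The function names are hypothetical.

```python
def ipc(tot_ins, tot_cyc):
    """Instructions per cycle from PAPI_TOT_INS and PAPI_TOT_CYC."""
    return tot_ins / tot_cyc


def cycles_per_second_ghz(cycles, wall_time_s):
    """Cycle count divided by elapsed wall time, expressed in GHz."""
    return cycles / wall_time_s / 1e9


if __name__ == "__main__":
    # Values copied from the sample screen output.
    tot_cyc = 11240960793
    tot_ins = 21290930932
    ref_cyc = 11803331568
    wall_time_s = 5.81519484519958

    print(f"IPC: {ipc(tot_ins, tot_cyc):.2f}")  # IPC: 1.89
    # Assumption: this ratio is only one way to arrive at the reported 2.03 GHz.
    print(f"REF_CYC / time [GHz]: {cycles_per_second_ghz(ref_cyc, wall_time_s):.2f}")  # 2.03
```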

How to use:

Running make with the provided Makefile generates an executable file named dgemm_pattern.exe using the Intel Fortran compiler. The Makefile might need editing for other BLAS implementations (e.g. OpenBLAS), other PAPI installations, and other compilers (e.g. the GNU compiler).

To simulate the computation on multiple nodes, the total number of nodes has to be provided as an input parameter; the code then splits the matrix accordingly. The code should be run on a single compute node.
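
The effect of the simulated node count on the tile size can be sketched as follows. This is a minimal illustration, assuming the columns are split over n_procs × nNodes processes with the same range formula as in the pseudo-code, and assuming the usual strong-scaling definitions of speedup and efficiency relative to one node. The function names and the timings in the example are hypothetical; the timings are placeholders, not measurements.

```python
def tile_width(n, n_procs, n_nodes, rank=0):
    """Number of columns handled by one process when n columns are split
    over n_procs * n_nodes processes (rank is the global process index)."""
    total = n_procs * n_nodes
    first = (n * rank) // total
    last = (n * (rank + 1)) // total - 1
    return last - first + 1


def speedup(t_one_node, t_n_nodes):
    """Speedup of the simulated multi-node run relative to one node."""
    return t_one_node / t_n_nodes


def efficiency(t_one_node, t_n_nodes, n_nodes):
    """Parallel efficiency: speedup divided by the number of simulated nodes."""
    return speedup(t_one_node, t_n_nodes) / n_nodes


if __name__ == "__main__":
    for n_nodes in (1, 2, 3):
        print(n_nodes, tile_width(3600, 48, n_nodes))

    # Hypothetical kernel wall times in seconds, for illustration only.
    t_one_node, t_two_nodes = 8.0, 5.0
    print(speedup(t_one_node, t_two_nodes), efficiency(t_one_node, t_two_nodes, 2))
```

Narrower tiles give dgemm less work per call, which is why the kernel time per process does not necessarily shrink in proportion to the node count.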

To run the code, first define the number of nodes to simulate (nNodes), the matrix dimensions (n, m, and k), and the number of repetitions of the matrix multiplication (n_rep). The output timing refers to the multiplication \(\bf {A} \times \bf {B}_b\) split over \(n_{procs} \times nNodes\) processes. An example command line to launch the application is the following:

mpirun -np <n_procs> ./dgemm_pattern.exe <nNodes> <n> <m> <k> <n_rep>

If an input value less than 1 is used, the default value is assigned to the corresponding parameter. If one of the parameters is missing, the code terminates with an error message explaining its usage.
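
The input handling described above can be sketched as follows. This is not the Fortran implementation: the function name `parse_parameters` is hypothetical, and the default values are assumptions chosen to mirror the sample output, since the actual defaults are not stated here.

```python
# Hypothetical defaults; the real values used by dgemm_pattern.f90 are not documented here.
DEFAULTS = {"nNodes": 1, "n": 3600, "m": 3600, "k": 3600, "n_rep": 100}
USAGE = "usage: mpirun -np <n_procs> ./dgemm_pattern.exe <nNodes> <n> <m> <k> <n_rep>"


def parse_parameters(args):
    """Map positional arguments to parameters, substituting the default for any value below 1."""
    names = list(DEFAULTS)
    if len(args) < len(names):
        # A missing parameter is an error; report how to call the program.
        raise ValueError(USAGE)
    params = {}
    for name, text in zip(names, args):
        value = int(text)
        params[name] = value if value >= 1 else DEFAULTS[name]
    return params


if __name__ == "__main__":
    # The third argument (m) is 0, so it falls back to its default.
    print(parse_parameters(["1", "3600", "0", "3600", "100"]))
```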

The code prints screen output similar to the following:

>   POP WP7 kernel
>   Version of code: Pattern version with performance bottleneck
>   Implements Pattern: Low computational performance calling BLAS routines
>   48  number of processes per node
>   1  number of simulated nodes
>   100 number of repetitions for the block multiplication: A * B_w
>   length of m:        3600
>   length of n:        3600
>   length of n part:          75
>   number of n part:          48
>   length of k:        3600
>   kernel wall time =    5.81519484519958      seconds
>   PAPI_TOT_CYC             : 11240960793
>   PAPI_TOT_INS             : 21290930932
>   PAPI_REF_CYC             : 11803331568
>   IPC                      :  1.89
>   Frequency [GHz]          :  2.03

The following experiments have been registered: