The following pseudo-code shows the implementation of the best-practice, when splitting of the tiles is performed on the columns of matrix \(\bf{B}\) and rows of matrix \(\bf{A}\).
given:
N_m, N_n, N_k, n_parts_B
execute:
get_my_process_id(proc_id)
get_number_of_processes(n_proc)
DEFINE first_b, last_b, len_n_part AS INTEGER
DEFINE first_a, last_a, n_parts_A, len_m_part AS INTEGER
DEFINE alpha, beta AS DOUBLE
first_b = float(N_n) * proc_id / n_parts_B
last_b = (float(N_n) * (proc_id + 1) / n_parts_B) - 1
n_parts_A = n_proc / n_parts_B
first_a = float(N_m) * proc_id / n_parts_A
last_a = (float(N_m) * (proc_id + 1) / n_parts_A) - 1
len_n_part = last_b - first_b + 1;
len_m_part = last_a - first_a + 1;
SET alpha TO 1.0
SET beta TO 0.0
call dgemm('N', 'N', len_m_part, len_n_part, N_k, alpha, A[:,first_a:last_a], len_m_part, B[:,first_b:last_b], N_k, beta ,C, len_m_part)
Code purpose:
dgemm_bestPractice.f90
can be used to measure parallel scaling to identify the optimal splitting of matrices \(\bf{A}\) and \(\bf{B}\). Since the purpose of the code is to undertake simple timing experiments, it doesn’t need to actually implement the parallel matrix-matrix multiplication, nor does it need to run on multiple compute nodes as the times would be identical. Hence, it simply replicates computation for a single compute node.
To simulate the computation on multiple nodes, the total number of nodes (nNodes) has to be provided as input parameter, the code will then split the matrix accordingly. The code output is a list of information to be used to understand the performance of the pattern, namely, the input variables passed by the user, the kernel wall time, two PAPI counters (PAPI_TOT_CYC, PAPI_TOT_INS), the IPC and frequency.
How to use:
The Makefile command make
generates an executable file named dgemm_bestPractice.exe
using the Intel Fortran compiler. The Makefile might need editing for different versions of BLAS (e.g. OpenBLAS), PAPI, and compilers (GNU compiler). The code should be run on a single compute node.
To run the code, first define the number of nodes to be simulated (nNodes), the integer values for the number of partitions for the columns of \(\bf{B}\) (no_n_parts), the matrices dimensions (n, m, and k) and the number of repatitions for the matrix multiplication (n_rep). The number of partitions for the rows of \(\bf{A}\) is computed from the number of nodes, processors per node and no_n_parts. Finally, launch the application.
mpirun -np <n_procs> ./dgemm_bestPractice.exe <nNodes> <n> <m> <k> <n_rep> <no_n_parts>
If an input value less than 1 is used, then default values is assigned to the corresponding parameter. If one of the parameter is missing then the code terminate with an error message explaining the usage of the code.
Screen output will be generated, similar to the following one:
> POP WP7 kernel
> Version of code: Best-practice version without performance bottleneck
> Implements Best-practice: Tuning BLAS routines invocation
> 48 number of processes per node
> 1 number of simulated nodes
> 100 number of repetitions for the block multiplication: A_v * B_w
> length of m: 3600
> length of m part: 600
> number of m part: 6
> length of n: 3600
> length of n part: 450
> number of n part: 8
> length of k: 3600
> length of k part: 3600
> number of k part: 1
> kernel wall time = 4.40777206420898 seconds
> PAPI_TOT_CYC : 8578256071
> PAPI_TOT_INS : 19642917992
> IPC : 2.29
> Frequency [GHz] : 1.95