GPU kernel analysis of the initial version

The basic version of the matrix multiplication kernel was analyzed with NVIDIA Nsight Compute on the JUWELS Booster system at JSC, using a single NVIDIA A100 GPU.
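For orientation, the launch statistics in the report below (1,024 threads per block, 1,024 blocks, 1,048,576 threads in total) are consistent with a naive one-thread-per-output-element kernel operating on 1024 x 1024 integer matrices. A minimal sketch under those assumptions follows; the kernel body, variable names, and the 32 x 32 block shape are guesses matching the reported signature and launch configuration, not the verified source:

    #include <cuda_runtime.h>

    // Presumed naive kernel: one thread computes one element of C = A * B.
    __global__ void matrixMul(const int *A, const int *B, int *C, int n)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < n && col < n) {
            int sum = 0;
            for (int k = 0; k < n; ++k)
                sum += A[row * n + k] * B[k * n + col];  // no reuse in shared memory or registers
            C[row * n + col] = sum;
        }
    }

    int main()
    {
        const int n = 1024;                          // 1,048,576 threads in total, as reported
        const size_t bytes = (size_t)n * n * sizeof(int);
        int *dA, *dB, *dC;                           // inputs left uninitialized for brevity
        cudaMalloc(&dA, bytes);
        cudaMalloc(&dB, bytes);
        cudaMalloc(&dC, bytes);

        dim3 block(32, 32);                          // 1,024 threads per block
        dim3 grid(n / block.x, n / block.y);         // 32 x 32 = 1,024 blocks
        matrixMul<<<grid, block>>>(dA, dB, dC, n);
        cudaDeviceSynchronize();

        cudaFree(dA);
        cudaFree(dB);
        cudaFree(dC);
        return 0;
    }

Nsight Compute reports: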

==PROF== Connected to process 8711 
==PROF== Profiling "matrixMul" - 1: 0%....50%....100% - 10 passes
COMPLETED SUCCESSFULLY
==PROF== Disconnected from process 8711
[8711] matmul_ori@127.0.0.1
  matrixMul(const int *, const int *, int *, int), 2021-Oct-25 21:13:05, Context 1, Stream 7
    Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                                           cycle/nsecond                           1,21
    SM Frequency                                                             cycle/nsecond                           1,09
    Elapsed Cycles                                                                   cycle                      2.598.549
    Memory [%]                                                                           %                          69,54
    DRAM Throughput                                                                      %                           0,23
    Duration                                                                       msecond                           2,39
    L1/TEX Cache Throughput                                                              %                          52,56
    L2 Cache Throughput                                                                  %                          77,12
    SM Active Cycles                                                                 cycle                   2.441.989,78
    Compute (SM) [%]                                                                     %                          35,92
    ---------------------------------------------------------------------- --------------- ------------------------------
    WRN   Memory is more heavily utilized than Compute: Look at the Memory Workload Analysis report section to see      
          where the memory system bottleneck is. Check memory replay (coalescing) metrics to make sure you're           
          efficiently utilizing the bytes transferred. Also consider whether it is possible to do more work per memory  
          access (kernel fusion) or whether there are values you can (re)compute.                                       

    Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                                                                      1.024
    Function Cache Configuration                                                                  cudaFuncCachePreferNone
    Grid Size                                                                                                       1.024
    Registers Per Thread                                                   register/thread                             28
    Shared Memory Configuration Size                                                 Kbyte                           8,19
    Driver Shared Memory Per Block                                             Kbyte/block                           1,02
    Dynamic Shared Memory Per Block                                             byte/block                              0
    Static Shared Memory Per Block                                              byte/block                              0
    Threads                                                                         thread                      1.048.576
    Waves Per SM                                                                                                     4,74
    ---------------------------------------------------------------------- --------------- ------------------------------

    Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                                                                   block                             32
    Block Limit Registers                                                            block                              2
    Block Limit Shared Mem                                                           block                            164
    Block Limit Warps                                                                block                              2
    Theoretical Active Warps per SM                                                   warp                             64
    Theoretical Occupancy                                                                %                            100
    Achieved Occupancy                                                                   %                          97,67
    Achieved Active Warps Per SM                                                      warp                          62,51
    ---------------------------------------------------------------------- --------------- ------------------------------
    INF   This kernel's theoretical occupancy is not impacted by any block limit.                                       
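The occupancy limits can be reproduced by hand from the launch statistics, assuming the A100's per-SM limits of 64 warps and 65,536 registers, and its 108 SMs:

    warps per block:   1,024 threads / 32 threads per warp    = 32 warps
    warp limit:        64 warps per SM / 32 warps per block   = 2 blocks
    register limit:    65,536 / (28 x 1,024 = 28,672) ~ 2.3   = 2 blocks
    active warps:      2 blocks x 32 warps                    = 64 warps (100 % theoretical)
    waves per SM:      1,024 blocks / (108 SMs x 2 blocks)    = 4.74

Both the warp limit and the register limit allow exactly two of these large blocks per SM, which is just enough to fill the 64-warp maximum; hence theoretical occupancy is unimpacted, as the INF line states.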

The warning clearly indicates a memory problem: the kernel is memory-bound (Memory 69.54 % versus Compute 35.92 %), and the bottleneck sits in the cache hierarchy rather than in DRAM, as the L2 throughput of 77.12 % against a DRAM throughput of only 0.23 % shows.
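A back-of-envelope calculation makes this throughput pattern plausible (still assuming 1024 x 1024 int matrices with one thread per output element):

    one matrix:    1,024 x 1,024 x 4 B = 4 MiB
    working set:   3 matrices          = 12 MiB  (fits entirely in the A100's 40 MB L2 cache)
    per thread:    2 x 1,024 loads     = 8 KiB read to produce a single 4-byte result

Because the whole working set fits in L2, DRAM is nearly idle while the redundant row and column reads keep L2 busy: every element of A and B is fetched from the cache hierarchy over and over instead of being reused. This is exactly what the warning's advice to "do more work per memory access" targets; for matrix multiplication the classic remedy is shared-memory tiling, so that each loaded tile is reused by many threads of a block.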