GPU-kernel analysis of the revised version

The revised version of the matrix multiplication kernel was analyzed with NVIDIA Nsight Compute on the JUWELS Booster system at JSC, using a single NVIDIA A100 GPU.
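
The exact profiler invocation is not part of the report; a minimal command of the following form (the srun options are an assumption for the batch system on JUWELS Booster) produces the default report sections shown below:

    $ srun --ntasks=1 ncu ./a.out

Nsight Compute reports: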

==PROF== Connected to process 8634 
==PROF== Profiling "matrixMul" - 1: 0%....50%....100% - 10 passes
COMPLETED SUCCESSFULLY
==PROF== Disconnected from process 8634
[8634] a.out@127.0.0.1
  matrixMul(const int *, const int *, int *, int), 2021-Oct-25 21:11:36, Context 1, Stream 7
    Section: GPU Speed Of Light Throughput
    ---------------------------------------------------------------------- --------------- ------------------------------
    DRAM Frequency                                                           cycle/nsecond                           1,21
    SM Frequency                                                             cycle/nsecond                           1,09
    Elapsed Cycles                                                                   cycle                        891.447
    Memory [%]                                                                           %                          69,97
    DRAM Throughput                                                                      %                           0,66
    Duration                                                                       usecond                         815,52
    L1/TEX Cache Throughput                                                              %                          73,90
    L2 Cache Throughput                                                                  %                           7,77
    SM Active Cycles                                                                 cycle                     843.579,76
    Compute (SM) [%]                                                                     %                          69,87
    ---------------------------------------------------------------------- --------------- ------------------------------
    INF   Compute and Memory are well-balanced: To reduce runtime, both computation and memory traffic must be reduced. 
          Check both the Compute Workload Analysis and Memory Workload Analysis report sections.                        

    Section: Launch Statistics
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Size                                                                                                      1.024
    Function Cache Configuration                                                                  cudaFuncCachePreferNone
    Grid Size                                                                                                       1.024
    Registers Per Thread                                                   register/thread                             32
    Shared Memory Configuration Size                                                 Kbyte                           8,19
    Driver Shared Memory Per Block                                             Kbyte/block                           1,02
    Dynamic Shared Memory Per Block                                             byte/block                              0
    Static Shared Memory Per Block                                              byte/block                              0
    Threads                                                                         thread                      1.048.576
    Waves Per SM                                                                                                     4,74
    ---------------------------------------------------------------------- --------------- ------------------------------

    Section: Occupancy
    ---------------------------------------------------------------------- --------------- ------------------------------
    Block Limit SM                                                                   block                             32
    Block Limit Registers                                                            block                              2
    Block Limit Shared Mem                                                           block                            164
    Block Limit Warps                                                                block                              2
    Theoretical Active Warps per SM                                                   warp                             64
    Theoretical Occupancy                                                                %                            100
    Achieved Occupancy                                                                   %                          94,93
    Achieved Active Warps Per SM                                                      warp                          60,76
    ---------------------------------------------------------------------- --------------- ------------------------------
    INF   This kernel's theoretical occupancy is not impacted by any block limit.             
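
These block limits can be verified by hand: with 32 registers per thread, a block of 1,024 threads uses 32 × 1,024 = 32,768 registers, so at most two such blocks fit into the 65,536 registers of an A100 SM. Likewise, two blocks of 32 warps exactly reach the hardware maximum of 64 warps per SM, which is why the theoretical occupancy is 100 %. With two blocks on each of the A100's 108 SMs, the grid of 1,024 blocks executes in 1,024 / (108 × 2) ≈ 4.74 waves, matching the Launch Statistics above.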

Compute (69.87 %) and memory (69.97 %) throughput are now well balanced, and the profiler reports no further obvious bottleneck. The DRAM throughput of only 0.66 % also shows that almost all memory traffic is served from the on-chip caches rather than from device memory.
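
For reference, the launch configuration is consistent with one thread per element of a 1,024 × 1,024 product: 1,024 blocks of 1,024 threads give exactly the 1,048,576 threads reported in the Launch Statistics section, and the reported signature is matrixMul(const int *, const int *, int *, int). The following is a minimal sketch of a kernel matching that configuration, with coalesced global-memory accesses and no explicit shared memory (consistent with the zero static and dynamic shared memory above); the 1D indexing scheme and row-major layout are assumptions, not the actual course code:

    __global__ void matrixMul(const int *A, const int *B, int *C, int n)
    {
        // One thread per output element: a 1D launch of
        // n blocks with n threads each covers an n x n matrix.
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        int row = idx / n;   // assumed row-major layout
        int col = idx % n;

        if (row < n && col < n) {
            int sum = 0;
            for (int k = 0; k < n; ++k)
                sum += A[row * n + k] * B[k * n + col];
            // Consecutive threads read consecutive elements of B and
            // write consecutive elements of C, so these accesses are
            // coalesced; the A element is broadcast within a warp.
            C[row * n + col] = sum;
        }
    }

Launched as matrixMul<<<1024, 1024>>>(dA, dB, dC, 1024), with hypothetical device pointers dA, dB, and dC, this sketch reproduces the grid and block sizes of the profile above.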