The revised version of the matrix multiplication kernel was analyzed with NVIDIA Nsight Compute on the JUWELS Booster system at JSC, using a single NVIDIA A100 GPU.
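The report shown below can be collected with an invocation along the lines of ncu --set full ./a.out, where --set full enables the full set of report sections; the exact flags used are an assumption, since only the profiler output was recorded. Nsight Compute reports: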
==PROF== Connected to process 8634
==PROF== Profiling "matrixMul" - 1: 0%....50%....100% - 10 passes
COMPLETED SUCCESSFULLY
==PROF== Disconnected from process 8634
[8634] a.out@127.0.0.1
matrixMul(const int *, const int *, int *, int), 2021-Oct-25 21:11:36, Context 1, Stream 7
Section: GPU Speed Of Light Throughput
---------------------------------------------------------------------- --------------- ------------------------------
DRAM Frequency                                                             cycle/nsecond                           1.21
SM Frequency                                                               cycle/nsecond                           1.09
Elapsed Cycles                                                                     cycle                        891,447
Memory [%]                                                                             %                          69.97
DRAM Throughput                                                                        %                           0.66
Duration                                                                         usecond                         815.52
L1/TEX Cache Throughput                                                                %                          73.90
L2 Cache Throughput                                                                    %                           7.77
SM Active Cycles                                                                   cycle                     843,579.76
Compute (SM) [%]                                                                       %                          69.87
---------------------------------------------------------------------- --------------- ------------------------------
INF Compute and Memory are well-balanced: To reduce runtime, both computation and memory traffic must be reduced.
Check both the Compute Workload Analysis and Memory Workload Analysis report sections.
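As a quick consistency check, the reported duration follows from the cycle count: 891,447 elapsed cycles at an SM frequency of 1.09 cycles/ns correspond to 891,447 / 1.09 ≈ 818 usecond, which matches the reported 815.52 usecond up to the rounding of the frequency.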
Section: Launch Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Block Size                                                                                                         1,024
Function Cache Configuration                                                             cudaFuncCachePreferNone
Grid Size                                                                                                          1,024
Registers Per Thread                                               register/thread                                    32
Shared Memory Configuration Size                                             Kbyte                               8.19
Driver Shared Memory Per Block                                         Kbyte/block                               1.02
Dynamic Shared Memory Per Block                                         byte/block                                  0
Static Shared Memory Per Block                                          byte/block                                  0
Threads                                                                     thread                      1,048,576
Waves Per SM                                                                                                     4.74
---------------------------------------------------------------------- --------------- ------------------------------
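These statistics are consistent with 1,024 blocks of 1,024 threads each, i.e. one thread per element of a 1024 x 1024 result matrix, and the 4.74 waves per SM match 1,024 blocks spread over the A100's 108 SMs at 2 resident blocks each: 1,024 / (108 * 2) ≈ 4.74 (see the Occupancy section below). The following minimal sketch reproduces this launch configuration; since the report does not include the source, the naive kernel body and all host-side names are hypothetical placeholders, although the zero bytes of static and dynamic shared memory at least do not contradict a kernel without explicit shared-memory tiling.

#include <cuda_runtime.h>

// Signature as printed in the report header. The body is a hypothetical
// naive implementation for illustration only; the revised kernel that was
// actually profiled may differ.
__global__ void matrixMul(const int *A, const int *B, int *C, int N)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int sum = 0;
    for (int k = 0; k < N; ++k)
        sum += A[row * N + k] * B[k * N + col];
    C[row * N + col] = sum;
}

int main()
{
    const int N = 1024;                   // assumed: N*N = 1,048,576 threads
    int *dA, *dB, *dC;
    cudaMalloc(&dA, N * N * sizeof(int));
    cudaMalloc(&dB, N * N * sizeof(int));
    cudaMalloc(&dC, N * N * sizeof(int));

    dim3 block(32, 32);                   // 32*32 = 1,024 threads per block
    dim3 grid(N / block.x, N / block.y);  // 32*32 = 1,024 blocks in total
    matrixMul<<<grid, block>>>(dA, dB, dC, N);
    cudaDeviceSynchronize();

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}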
Section: Occupancy
---------------------------------------------------------------------- --------------- ------------------------------
Block Limit SM                                                                     block                             32
Block Limit Registers                                                              block                              2
Block Limit Shared Mem                                                             block                            164
Block Limit Warps                                                                  block                              2
Theoretical Active Warps per SM                                                     warp                             64
Theoretical Occupancy                                                                  %                            100
Achieved Occupancy                                                                     %                          94.93
Achieved Active Warps Per SM                                                        warp                          60.76
---------------------------------------------------------------------- --------------- ------------------------------
INF This kernel's theoretical occupancy is not impacted by any block limit.
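The block limits can be reproduced by hand: at 32 registers per thread and 1,024 threads per block, one block consumes 32 * 1,024 = 32,768 registers, so the 65,536 registers of an A100 SM admit 2 resident blocks (Block Limit Registers); dividing the SM's 64 schedulable warps by the 32 warps of a 1,024-thread block likewise yields Block Limit Warps = 2. Two resident blocks of 32 warps fill all 64 warp slots, hence the 100 % theoretical occupancy. A minimal sketch of how such a limit can be checked at run time with the CUDA occupancy API follows; the kernel body is the same hypothetical placeholder as in the previous sketch, and the returned value reflects that placeholder's register usage rather than the profiled kernel's.

#include <cstdio>
#include <cuda_runtime.h>

// Same hypothetical placeholder kernel as in the previous sketch; the
// occupancy query below reflects its compiled register usage.
__global__ void matrixMul(const int *A, const int *B, int *C, int N)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int sum = 0;
    for (int k = 0; k < N; ++k)
        sum += A[row * N + k] * B[k * N + col];
    C[row * N + col] = sum;
}

int main()
{
    // Resident 1,024-thread blocks per SM for this kernel, with 0 bytes of
    // dynamic shared memory (matching the Launch Statistics above).
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, matrixMul, 1024, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int warpsPerBlock = 1024 / prop.warpSize;                             // 32 on the A100
    int maxWarpsPerSM = prop.maxThreadsPerMultiProcessor / prop.warpSize; // 64 on the A100

    printf("Resident blocks per SM:   %d\n", blocksPerSM);                // expected: 2
    printf("Theoretical active warps: %d of %d\n",
           blocksPerSM * warpsPerBlock, maxWarpsPerSM);                   // expected: 64 of 64
    return 0;
}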
Compute and memory traffic are now well balanced, and the profiler no longer reports any obvious bottleneck.