The analysis of the basic version of the matrix multiplication kernel was performed with NVIDIA Nsight Compute on the JUWELS Booster system at JSC, using one NVIDIA A100 GPU. Nsight Compute reports:
==PROF== Connected to process 8711
==PROF== Profiling "matrixMul" - 1: 0%....50%....100% - 10 passes
COMPLETED SUCCESSFULLY
==PROF== Disconnected from process 8711
[8711] matmul_ori@127.0.0.1
matrixMul(const int *, const int *, int *, int), 2021-Oct-25 21:13:05, Context 1, Stream 7
Section: GPU Speed Of Light Throughput
---------------------------------------------------------------------- --------------- ------------------------------
DRAM Frequency cycle/nsecond 1,21
SM Frequency cycle/nsecond 1,09
Elapsed Cycles cycle 2.598.549
Memory [%] % 69,54
DRAM Throughput % 0,23
Duration msecond 2,39
L1/TEX Cache Throughput % 52,56
L2 Cache Throughput % 77,12
SM Active Cycles cycle 2.441.989,78
Compute (SM) [%] % 35,92
---------------------------------------------------------------------- --------------- ------------------------------
WRN Memory is more heavily utilized than Compute: Look at the Memory Workload Analysis report section to see
where the memory system bottleneck is. Check memory replay (coalescing) metrics to make sure you're
efficiently utilizing the bytes transferred. Also consider whether it is possible to do more work per memory
access (kernel fusion) or whether there are values you can (re)compute.
Section: Launch Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Block Size 1.024
Function Cache Configuration cudaFuncCachePreferNone
Grid Size 1.024
Registers Per Thread register/thread 28
Shared Memory Configuration Size Kbyte 8,19
Driver Shared Memory Per Block Kbyte/block 1,02
Dynamic Shared Memory Per Block byte/block 0
Static Shared Memory Per Block byte/block 0
Threads thread 1.048.576
Waves Per SM 4,74
---------------------------------------------------------------------- --------------- ------------------------------
Section: Occupancy
---------------------------------------------------------------------- --------------- ------------------------------
Block Limit SM block 32
Block Limit Registers block 2
Block Limit Shared Mem block 164
Block Limit Warps block 2
Theoretical Active Warps per SM warp 64
Theoretical Occupancy % 100
Achieved Occupancy % 97,67
Achieved Active Warps Per SM warp 62,51
---------------------------------------------------------------------- --------------- ------------------------------
INF This kernel's theoretical occupancy is not impacted by any block limit.
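The two reported block limits of 2 (registers and warps) can be reproduced from the launch statistics together with the A100's per-SM hardware limits — 65,536 registers, 64 warps, and 108 SMs. Note that these hardware figures come from the A100 specification, not from the report itself, and that the real register allocator rounds allocations to a granularity, so this is only a back-of-the-envelope check:

```cuda
// Back-of-the-envelope check of the reported occupancy limits.
// Hardware numbers (65,536 registers/SM, 64 warps/SM, 108 SMs) are
// assumed from the A100 specification, not taken from the report.
#include <cstdio>

int main()
{
    // From the Launch Statistics section of the report:
    const int regs_per_thread   = 28;
    const int threads_per_block = 1024;
    const int grid_blocks       = 1024;

    // Assumed A100 per-SM limits:
    const int regs_per_sm  = 65536;
    const int warps_per_sm = 64;
    const int num_sms      = 108;

    // 28 * 1024 = 28,672 registers per block; floor(65,536 / 28,672) = 2
    int limit_regs  = regs_per_sm / (regs_per_thread * threads_per_block);

    // 1,024 threads = 32 warps per block; floor(64 / 32) = 2
    int limit_warps = warps_per_sm / (threads_per_block / 32);

    // 1,024 blocks spread over 108 SMs, 2 resident blocks each: ~4.74
    double waves = (double)grid_blocks / (num_sms * limit_warps);

    printf("block limit (registers): %d\n", limit_regs);
    printf("block limit (warps):     %d\n", limit_warps);
    printf("waves per SM:            %.2f\n", waves);
    return 0;
}
```

Both computed limits agree with the report: the large block of 1024 threads exhausts the per-SM register file and warp budget after only two blocks, which is also why the waves-per-SM value of 4.74 follows directly from 1024 / (108 × 2).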
The WRN message in the Speed Of Light section clearly indicates a memory bottleneck: memory utilization (69,54 %) is roughly twice the compute utilization (35,92 %). At the same time, DRAM throughput is almost idle (0,23 %) while the L2 cache is heavily loaded (77,12 %), so the kernel's traffic is served largely from the caches rather than from device memory. The Memory Workload Analysis section is therefore the natural place to investigate further.
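The launch statistics (1024 threads per block, a grid of 1024 blocks, 1,048,576 threads in total, and no static or dynamic shared memory) are consistent with a naive one-thread-per-output-element kernel on 1024×1024 integer matrices. The actual source is not shown here, so the following is only a sketch of what such a basic kernel typically looks like; the 2D block shape, index layout, and parameter names are assumptions:

```cuda
// Hypothetical reconstruction of the profiled kernel's shape:
// one thread computes one element of C = A * B for n x n int matrices.
// A 32x32 thread block (1,024 threads) and a 32x32 grid (1,024 blocks)
// would match the reported launch statistics for n = 1024.
__global__ void matrixMul(const int *A, const int *B, int *C, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        int sum = 0;
        for (int k = 0; k < n; ++k)
            sum += A[row * n + k] * B[k * n + col];  // B is read with stride n
        C[row * n + col] = sum;
    }
}
```

In a kernel of this shape every thread streams a full row of A and a full column of B from the memory system without any explicit data reuse, so the same elements are fetched repeatedly by neighboring threads. That access pattern would explain the profile above: the redundant loads hit in L1/L2 (high cache throughput, negligible DRAM throughput), yet the memory pipelines still dominate over compute.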