The dimensions of the arrays are 4x10x5481x4x48, which are traversed two times (one in compute_step_1
and another in compute_step_2
) via nested for-loops. Since one element of the first dimension is passed, the program ends up executing around 3x10x5481x4x48 iterations. All experiments were run with one core on an Intel i5-8365U CPU @ 1.60GHz processor.
The following Paraver image shows the execution structure of the pattern using 1 thread. 99% of the time is spent in the most expensive function, compute_step_1
.
In the next chart different performance metrics of the kernel are displayed.
Master | |
---|---|
Total elapsed Time [s] | 1545.69 |
compute_step_1 elapsed time [s] | 1544.10 |
compute_step_2 elapsed time [s] | 0.56 |
Total instructions | 9.62e12 |
compute_step_1 instructions | 9.62e12 |
compute_step_2 instructions | 4.87e9 |
Total average IPC | 2.02 |
compute_step_1 average IPC | 2.02 |
compute_step_2 average IPC | 2.63 |
Although IPC is good, the total number of instructions is huge. The naive Python version of this kernel is executing approximately 3.05e5 instructions per iteration in compute_step_1
, a number that will decrease drastically in the next versions.