The arrays have dimensions 4x10x5481x4x48 and are traversed twice (once in compute_step_1 and once in compute_step_2) via nested for-loops. Since one element of the first dimension is skipped, the program ends up executing around 3x10x5481x4x48 iterations. All experiments were run on a single core of an Intel i5-8365U CPU @ 1.60GHz.
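As a rough check of the iteration count quoted above, the sketch below multiplies out the loop bounds. It assumes that exactly one index of the first dimension is not visited, so that dimension contributes 3 instead of 4. The names `SHAPE`, `iteration_count` and `count_by_looping` are hypothetical and are not taken from the kernel itself.

```python
from itertools import product
from math import prod

# Array shape as stated in the text.
SHAPE = (4, 10, 5481, 4, 48)


def iteration_count(shape, skipped_first_dim=1):
    """Innermost-body executions for one traversal of the loop nest.

    Assumption: `skipped_first_dim` indices of the first dimension are
    not visited, so that dimension contributes (first - skipped) steps.
    """
    first, *rest = shape
    return (first - skipped_first_dim) * prod(rest)


def count_by_looping(shape, skipped_first_dim=1):
    """Same count, obtained by actually walking a nested loop.

    Only meant for small shapes; which first-dimension index is skipped
    is an assumption (here: the leading ones).
    """
    first, *rest = shape
    ranges = [range(skipped_first_dim, first)] + [range(n) for n in rest]
    count = 0
    for _ in product(*ranges):
        count += 1
    return count


if __name__ == "__main__":
    n = iteration_count(SHAPE)
    print(f"3x10x5481x4x48 = {n} iterations (about {n:.2e})")
```

Under these assumptions the product comes to 31,570,560, that is, roughly 3.2e7 executions of the innermost loop body.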
The following Paraver image shows the execution structure of the pattern using 1 thread. 99% of the time is spent in the most expensive function,
The next table displays different performance metrics of the kernel:
|Total elapsed time [s]
|compute_step_1 elapsed time [s]
|compute_step_2 elapsed time [s]
|Total average IPC
|compute_step_1 average IPC
|compute_step_2 average IPC
Although the IPC is good, the total number of instructions is huge. The naive Python version of this kernel executes approximately 3.05e5 instructions per iteration in compute_step_1, a number that will decrease drastically in the next versions.
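For reference, the two metrics discussed here can be derived from raw counters as in the minimal sketch below. The function names are hypothetical, and the counter values in the usage example are placeholders chosen for illustration, not measurements from these experiments.

```python
def ipc(instructions, cycles):
    """Instructions per cycle: how efficiently the core is used."""
    return instructions / cycles


def instructions_per_iteration(instructions, iterations):
    """Average instructions needed to complete one loop iteration."""
    return instructions / iterations


if __name__ == "__main__":
    # Placeholder counter values (assumptions, not measured data).
    example_instructions = 6.1e6
    example_cycles = 4.0e6
    example_iterations = 20

    print("IPC:", ipc(example_instructions, example_cycles))
    print(
        "instructions/iteration:",
        instructions_per_iteration(example_instructions, example_iterations),
    )
```

The two metrics are independent: a high IPC only says that the core retires instructions quickly, while instructions per iteration says how much work each iteration costs. A kernel can therefore show a good IPC and still be slow when each iteration executes hundreds of thousands of instructions.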