Python loops (original)

The dimensions of the arrays are 4x10x5481x4x48, which are traversed two times (one in compute_step_1 and another in compute_step_2) via nested for-loops. Since one element of the first dimension is passed, the program ends up executing around 3x10x5481x4x48 iterations. All experiments were run with one core on an Intel i5-8365U CPU @ 1.60GHz processor.

The following Paraver image shows the execution structure of the pattern using 1 thread. 99% of the time is spent in the most expensive function, compute_step_1.

Compute Structure (master)

In the next chart different performance metrics of the kernel are displayed.

	Master
Total elapsed Time [s]	1545.69
compute_step_1 elapsed time [s]	1544.10
compute_step_2 elapsed time [s]	0.56
Total instructions	9.62e12
compute_step_1 instructions	9.62e12
compute_step_2 instructions	4.87e9
Total average IPC	2.02
compute_step_1 average IPC	2.02
compute_step_2 average IPC	2.63

Although IPC is good, the total number of instructions is huge. The naive Python version of this kernel is executing approximately 3.05e5 instructions per iteration in compute_step_1, a number that will decrease drastically in the next versions.