Python loops (Numba+NumPy)

One of the best-practice proposals is to combine the Numba JIT compiler with NumPy routines. The results of this optimization are shown below.
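A minimal sketch of the pattern, assuming a loop-heavy kernel similar in spirit to compute_step_1 (whose actual code is not shown here). The function `pairwise_distances` is hypothetical: its explicit loops are compiled by Numba's `@njit`, and inside them it calls NumPy routines that Numba supports (`np.empty`, `np.sum`, `np.sqrt`). A fully vectorized NumPy variant is included for comparison. The `try/except` fallback only lets the sketch run where Numba is not installed, in which case there is no speedup.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback for environments without Numba: run the plain Python function
    def njit(func):
        return func


@njit
def pairwise_distances(points):
    """Hypothetical kernel: Euclidean distance between every pair of rows."""
    n = points.shape[0]
    out = np.empty((n, n))
    for i in range(n):                                # explicit loops, compiled by Numba
        for j in range(n):
            diff = points[i] - points[j]              # NumPy array arithmetic
            out[i, j] = np.sqrt(np.sum(diff * diff))  # NumPy routines supported by Numba
    return out


def pairwise_distances_numpy(points):
    """Same result with pure NumPy broadcasting and no Python-level loops."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))


pts = np.random.default_rng(0).random((200, 3))
print(np.allclose(pairwise_distances(pts), pairwise_distances_numpy(pts)))
```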

Compute Structure (master)

Metric	Master	Numba	NumPy	Numba+NumPy
Total elapsed time [s]	1545.69	261.70	1.67	15.51
compute_step_1 elapsed time [s]	1544.10	253.29	1.59	3.47
compute_step_2 elapsed time [s]	0.56	2.9e-3	0.02	0.01
Total instructions	9.62e12	1.24e12	1.34e10	9.84e10
compute_step_1 instructions	9.62e12	1.19e12	1.33e10	2.53e10
compute_step_2 instructions	4.87e9	5.25e6	9.19e7	9.19e7
Total average IPC	2.02	1.72	2.60	2.02
compute_step_1 average IPC	2.02	1.71	2.65	2.28
compute_step_2 average IPC	2.63	1.53	1.88	2.38
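Each speedup is the master's total elapsed time divided by the total elapsed time of a given version. A small sketch of that arithmetic, using only the values from the table:

```python
# Total elapsed times in seconds, copied from the table
total_time = {
    "Master": 1545.69,
    "Numba": 261.70,
    "NumPy": 1.67,
    "Numba+NumPy": 15.51,
}


def speedups(times, baseline="Master"):
    """Speedup of every version relative to the baseline's elapsed time."""
    base = times[baseline]
    return {name: base / t for name, t in times.items()}


for name, s in speedups(total_time).items():
    print(f"{name:12s} {s:8.2f}x")
```

The Numba+NumPy entry reproduces the 99.66x figure quoted in the text.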

By combining Numba and NumPy we achieved a speedup of 99.66x over master (from 1545.69 s to 15.51 s). This version is nevertheless slower than the pure NumPy one (1.67 s), for two reasons. First, Numba does not support every NumPy function, so the code cannot be compiled entirely by Numba and parts of it still rely on the Python interpreter. Second, Numba adds JIT compilation overhead.
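Both effects can be illustrated with a small sketch under stated assumptions. Here `preprocess` is a hypothetical stand-in for a step that stays in the interpreter because it relies on NumPy features Numba cannot compile, while `kernel` is a hypothetical JIT-compiled loop. Timing the first and second calls separates the one-off compilation cost from the execution cost; the actual numbers depend on the machine.

```python
import time

import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback for environments without Numba: no compilation, no speedup
    def njit(func):
        return func


def preprocess(data):
    # Runs in the Python interpreter; stands in for a step using NumPy
    # features that Numba does not compile (hypothetical example).
    return np.ascontiguousarray(data, dtype=np.float64)


@njit
def kernel(x):
    # Hypothetical compiled loop using a NumPy ufunc supported by Numba
    total = 0.0
    for i in range(x.shape[0]):
        total += np.sqrt(x[i] * x[i] + 1.0)
    return total


def run(data):
    x = preprocess(data)  # interpreted NumPy code
    return kernel(x)      # compiled loop (compiled on first call with Numba)


def time_call(func, *args):
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


data = np.arange(100_000)
first, t_first = time_call(run, data)    # includes JIT compilation when Numba is present
second, t_second = time_call(run, data)  # reuses the already compiled machine code
print(f"first call: {t_first:.3f} s, second call: {t_second:.3f} s")
```

With Numba installed, the first call is expected to be noticeably slower because it triggers compilation. Numba's `@njit(cache=True)` option can store the compiled code on disk so that later runs of the program skip that step, although the very first run still pays it.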