Usage of Numba and Numpy to improve Python's serial efficiency

Pattern addressed: Inefficient Python loops

With Python it is very easy to unintentionally write extremely inefficient code, because it is an interpreted language. One needs to pay special attention to the data types and statements used in order to mitigate the interpreter's overhead, since generic Python objects are several orders of magnitude slower than other alternatives. Therefore, after the prototyping phase of Python software development, users need to identify the most compute-intensive functions and apply the most suitable optimization to them.

When Python applications have heavy computing functions using generic data types it is possible to drastically increase sequential performance by using Numba or Numpy. Both packages can be used at the same time or separately depending on code’s needs.

Assuming that we have a time-consuming function like this one:

def traverse_and_compute(arr):
    for i in range(len(arr)):
        for j in range(len(arr[i])):
            if (arr[i][j] % 2) == 0:
                arr[i][j] = (arr[i][j] + 1) / 2
            else:
                arr[i][j] = 0

We can try to compile it with Numba to get much better performance. Applying Numba is as easy as adding a @jit decorator (after importing the Numba package):

from numba import jit

@jit(nopython=True)
def traverse_and_compute(arr):
    for i in range(len(arr)):
        for j in range(len(arr[i])):
            if (arr[i][j] % 2) == 0:
                arr[i][j] = (arr[i][j] + 1) / 2
            else:
                arr[i][j] = 0

With just one line of extra code, this function is compiled at runtime and replaced by optimized machine code. The nopython=True argument prevents Numba from falling back to Python objects when the compiler cannot infer the data types; compilation fails with an error instead. Numba offers more performance tuning options like automatic parallelization, fastmath, or Intel’s linear algebra.
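As a rough illustration of the tuning options mentioned above, the sketch below applies automatic parallelization to the same kernel. It is only a sketch under stated assumptions: the input is taken to be a 2-D Numpy integer array, the name traverse_and_compute_parallel is hypothetical, and floor division (//) is used so that all arithmetic stays in integers. The try/except fallback is not part of Numba; it is added here only so the example still runs (as plain Python) where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit, prange  # njit is shorthand for jit(nopython=True)
except ImportError:
    # Assumption for this sketch only: without Numba, run as plain Python.
    def njit(*args, **kwargs):
        def wrap(func):
            return func
        return wrap
    prange = range


@njit(parallel=True)
def traverse_and_compute_parallel(arr):
    # prange marks the outer loop as safe to split across threads:
    # every iteration writes to a different row.
    for i in prange(arr.shape[0]):
        for j in range(arr.shape[1]):
            if arr[i, j] % 2 == 0:
                arr[i, j] = (arr[i, j] + 1) // 2
            else:
                arr[i, j] = 0


data = np.arange(12, dtype=np.int64).reshape(3, 4)
expected = np.where(data % 2 == 0, (data + 1) // 2, 0)

traverse_and_compute_parallel(data)  # first call also pays the compilation cost
assert np.array_equal(data, expected)
```

Whether parallel=True pays off depends on the array size; for small inputs the threading overhead may outweigh the gain, so it is worth measuring both variants.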

Another possible optimization is to remove the loops exploiting Numpy vectorization:

import numpy as np

def traverse_and_compute(arr):
    return (np.where((arr % 2 == 0), (arr + 1) // 2, 0))

The code is not only neater but also much faster, because a single interpreter statement expresses the whole operation and the element-wise work runs inside Numpy's compiled routines.
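Note that the two versions are not written identically: the loop updates the array in place and uses true division (/), while the Numpy version returns a new array and uses floor division (//). The following check is a minimal sketch, assuming the input is a Numpy array of non-negative integers; in that case storing the float result into an integer array truncates it, so both versions agree. The function names with the _loop and _numpy suffixes are hypothetical, chosen only to keep both versions side by side.

```python
import numpy as np


def traverse_and_compute_loop(arr):
    # Loop version from above: updates arr in place.
    for i in range(len(arr)):
        for j in range(len(arr[i])):
            if (arr[i][j] % 2) == 0:
                arr[i][j] = (arr[i][j] + 1) / 2
            else:
                arr[i][j] = 0


def traverse_and_compute_numpy(arr):
    # Vectorized version from above: returns a new array.
    return np.where(arr % 2 == 0, (arr + 1) // 2, 0)


original = np.array([[0, 1, 2], [3, 4, 5]])

in_place = original.copy()
traverse_and_compute_loop(in_place)

vectorized = traverse_and_compute_numpy(original)

# Both give [[0, 0, 1], [0, 2, 0]] for this input.
assert np.array_equal(in_place, vectorized)
assert np.array_equal(original, [[0, 1, 2], [3, 4, 5]])  # input left untouched
```

For nested Python lists or negative values the two versions can differ, so a check like this is worth keeping as a regression test when swapping implementations.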

Finally, it is possible to use Numba and Numpy at the same time to take advantage of both approaches:

from numba import jit
import numpy as np

@jit(nopython=True)
def traverse_and_compute(arr):
    return (np.where((arr % 2 == 0), (arr + 1) // 2, 0))

On the one hand, Numba removes the Python interpreter's overhead by compiling the statements to machine code; on the other, we exploit the neat, fast and efficient linear algebra routines offered by Numpy.

The table below compares the performance of all traverse_and_compute versions, using an array of 1 million elements.

Version	#instructions per iteration	IPC	Elapsed time [ms]	SpeedUp
Generic Python	12,650	2.58	1,228	1
Numba	22	3.41	1.62	758
Numpy	60	0.73	19.53	63
Numba & Numpy	41	2.36	7.19	171

The code ran with Python 3.8.5, Numpy 1.18.4 and Numba 0.51.2 on Intel(R) Core(TM) i5-8365U CPU. Elapsed time corresponds to the minimum measured value over 10 runs. SpeedUp is computed with elapsed time.
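The measurement script itself is not shown here. The sketch below is one possible way to collect comparable numbers: the minimum elapsed time over 10 runs, and the speedup as the ratio of elapsed times. The helper min_elapsed_ms, the function names, and the 100 x 100 test array are assumptions made for illustration; the input is deliberately much smaller than the 1 million elements used for the table so that the pure Python loop finishes quickly.

```python
import time

import numpy as np


def loop_version(arr):
    for i in range(len(arr)):
        for j in range(len(arr[i])):
            if (arr[i][j] % 2) == 0:
                arr[i][j] = (arr[i][j] + 1) / 2
            else:
                arr[i][j] = 0


def numpy_version(arr):
    return np.where(arr % 2 == 0, (arr + 1) // 2, 0)


def min_elapsed_ms(func, make_input, runs=10):
    """Return the minimum wall-clock time of func over `runs` runs, in ms."""
    best = float("inf")
    for _ in range(runs):
        data = make_input()              # fresh input, excluded from timing
        start = time.perf_counter()
        func(data)
        best = min(best, time.perf_counter() - start)
    return best * 1e3


def make_input():
    # 100 x 100 integers: much smaller than the 1 million elements above.
    return np.arange(10_000).reshape(100, 100)


t_loop = min_elapsed_ms(loop_version, make_input)
t_numpy = min_elapsed_ms(numpy_version, make_input)
speedup = t_loop / t_numpy               # SpeedUp = baseline time / optimized time

print(f"loop: {t_loop:.2f} ms, numpy: {t_numpy:.2f} ms, speedup: {speedup:.0f}x")
```

Taking the minimum rather than the mean reduces the influence of other activity on the machine. When timing a Numba function this way, the first run includes compilation, which the minimum also filters out.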

You can see that performance varies drastically from version to version, but in every case the code runs much faster once any of the optimizations is applied. Which version gives the best result depends on the particular algorithm. In this best practice the code performs very trivial operations, so the best time is obtained when using only Numba. When using both (Numba & Numpy), extra instructions are spent invoking the Numpy function. Numpy would be the best choice for complex linear algebra operations, where Numba would not use the best algorithm.

More optimization methods are possible apart from Numba or Numpy:

Writing a dedicated C or Fortran kernel for this function and binding it to the main Python application.
Parallelizing the loop using either the multiprocessing or the mpi4py package.

These two methods have their own trade-offs in terms of performance gains versus maintainability and resource usage. The first method solves the sequential performance problem, but at the cost of Python's advantages, making the application more difficult to program and maintain. The second is not sustainable, because it tries to hide the inefficient code by brute force (using more hardware resources).

Recommended in program(s): Python loops (original)

Implemented in program(s): Python loops (numba+numpy) · Python loops (numba) · Python loops (numpy)

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreements No 676553 (POP1) and 824080 (POP2).

Currently, the project receives funding from the European High-Performance Computing Joint Undertaking (JU) under grant agreement No 101143931 (POP3). The JU receives support from the European Union's Horizon Europe research and innovation programme and Spain, Germany, France, Portugal and the Czech Republic.