Lecture 10: Speeding up Python Functions with Vectorization and Cython

This lecture follows Chapter 3 of Sage for Power Users by William Stein.

As an interpreted scripting language, Python is slow compared to compiled languages such as C. With Cython the compiled libraries available in Sage can be used. We may view Cython as a compiled variant of Python.

The execution efficiency of Python is often very slow because variables are interpreted as expressions and the default number types are often not the data types supported by fast hardware arithmetic, but in slower arithmetic implemented by software. Cython is a variant of Python that allows to add type declarations, so we may build much faster functions without having to change the logic of the original script. The running example is the application of basic numerical integration to approximate \(\pi\). The fastest version applies numpy vectorization, although this approach requires a reformulation of the original script.

A Motivating Example

We consider the computation of a floating-point approximation of a sum. Such sums occur in numerical integration.

One way to approximate \(\pi\)/4 is to compute the area of the unit circle defined by \(x^2 + y^2 - 1 = 0\), for \(x\) ranging from 0 to 1. The corresponding \(y\) coordinate of a point on the unit circle is then \(\sqrt{1-x^2}\).

\[\frac{\pi}{4} = \int_0^1 \sqrt{1-x^2} \approx \frac{1}{n} \sum_{k=1}^n \sqrt{1 - \left(\frac{k}{n}\right)^2}\]

In Python, the formula translates into a one line of code, defined in the function python_sum_symbolic.

def python_sum_symbolic(n):
    return float( sum(sqrt(1-(k/n)^2) for k in range(1, n+1)) )/n
4*python_sum_symbolic(1000)

We see 3.1395554669110264 as an approximation for \(\pi\).

To benchmark the function, we can use timeit. The command timeit in Python is good to measure the execution time of small code snippets.

timeit('python_sum_symbolic(1000)')

The output on an 3.1 GHz Intel Core i7 processor is 5 loops, best of 3: 84.8 ms per loop. For longer execution times, we better use cputime().

t1 = cputime()
python_sum_symbolic(10^4)
ct1 = cputime(t1)
print('time for python_sum_symbolic :', ct1)

and we see time for python_sum_symbolic : 2.694785.

Executing in Pure Python

The first reason that the function is so slow is because we use the symbolic sqrt function. We will use the sqrt function of the math module in Python. The explicit conversion to float is no longer needed as this sqrt returns afloating-point number.

def python_sum(n):
    from math import sqrt
    return sum( sqrt(1-(k/n)^2) for k in range(1, n+1) )/n
4*python_sum(1000)

Now we time the function again with time.

t2 = cputime()
python_sum(10^4)
ct2 = cputime(t2)
print('time for python_sum :', ct2)

We obtain time for python_sum : 0.029598 and because this has now become a small computation, we may as well use timeit.

timeit('python_sum(10^4)')

which confirms with 25 loops, best of 3: 23.3 ms per loop that that time has dropped from seconds to milliseconds.

print('speedup :', ct1/ct2)

shows speedup : 91.0461855531.

Vectorization with numpy

Instead of the sqrt of pure Python, we could also use the sqrt and sum of numpy. These numpy functions allow that we give arrays on input.

The vectorization of code is the replacement of a Python for loop in for k in range(1, n+1) by a loop executed by numpy functions.

def numpy_sum(n):
    from numpy import sqrt, sum, arange
    x = arange(n)/float(n)
    return sum(sqrt(1-x**2))/n
4*numpy_sum(1000)

Applying timeit to benchmark the function.

timeit('numpy_sum(10^4)')

shows 625 loops, best of 3: 108 µs per loop so the time is now expressed in microseconds instead of milliseconds.

To compare against the pure Python sum, we sum one million items.

t3 = cputime()
python_sum(10^6)
ct3 = cputime(t3)
print('time for python_sum :', ct3)
t4 = cputime()
numpy_sum(10^6)
ct4 = cputime(t4)
print('time for numpy_sum :', ct4)

From the Python code we obtain time for python_sum : 2.696452 and with numpy we get time for numpy_sum : 0.019973. Let us compute the speedup.

print('speedup :', ct3/ct4)

which shows speedup : 135.004856556. With numpy vectorization we again obtained a hundredfold improvement.

Cython code

The advantage of Cython over vectorization is that the Cython code is almost identical as the Python code. In this particular example we had to replace the ^ by the ** for the exponentiation.

%%cython
def cython_sum(n):
    from math import sqrt
    return sum( sqrt(1-(k/n)**2) for k in range(1, n+1) )/n

The two links returned when evaluating a cell with Cython code are the generated C code and an html file with the annotated version of the Cython program. To see whether it works we do.

print(4*cython_sum(1000))

and then we benchmark the code with timeit.

timeit('cython_sum(10^4)')

We obtain 25 loops, best of 3: 26.2 ms per loop. Running a larger example with time

t5 = cputime()
cython_sum(10^6)
ct5 = cputime(t5)
print('time for the cython_sum :', ct5)

shows time for the cython_sum : 2.651237 so the Cython code is not really faster than the original Python code, because in both functions python_sum and cython_sum the same functions in the Python C library are executed.

The code can be made faster if we declare the variables to be of C data types and the we use the C version of the sqrt function.

%%cython
cdef extern from "math.h":
    double sqrt(double)

def cython_sum_typed(long n):
    cdef long k
    return sum( sqrt(1-(k/float(n))**2) for k in range(1, n+1) )/n

We first check the correctness.

print(4*cython_sum_typed(1000))

and now we will see improved timing results.

timeit('cython_sum_typed(10^4)')

With timeit we obtain 625 loops, best of 3: 1.09 ms per loop and then we run time again.

t6 = cputime()
cython_sum_typed(10^6)
ct6 = cputime(t6)
print('time for cython_sum_typed :', ct6)

The output is time for cython_sum_typed : 0.112308 and compared to the first Cython code, we compute the speedup.

print('speedup :', ct5/ct6)

which prints speedup : 23.6068401182.

Assignments

  1. Perform all computations on your computer. Make a table with execution times and speedups.

  2. Perform all computations on your Sage notebook account or on SageMathCloud. Indicate which online version you ran. Make a table with execution times and speedups.

  3. Write a Python function python_sum which takes on input a positive integer \(n\) and which returns the floating-point value of

    \[\frac{1}{n} \sum_{k=1}^{n} \ln \left( 1 + \frac{k}{n} \right).\]

    Write a more efficient version numpy_sum using vectorization.

    Time the two versions to illustrate the efficiency.

  4. Take the Python function python_sum of the previous exercise and apply Cython to make the function more efficient. Time your Cython version and compare with python_sum to illustrate the efficiency.