Lecture 10: Speeding up Python Functions with Vectorization and Cython¶
This lecture follows Chapter 3 of Sage for Power Users by William Stein.
As an interpreted scripting language, Python is slow compared to compiled languages such as C. With Cython the compiled libraries available in Sage can be used. We may view Cython as a compiled variant of Python.
The execution efficiency of Python is often very slow because variables are interpreted as expressions and the default number types are often not the data types supported by fast hardware arithmetic, but in slower arithmetic implemented by software. Cython is a variant of Python that allows to add type declarations, so we may build much faster functions without having to change the logic of the original script. The running example is the application of basic numerical integration to approximate \(\pi\). The fastest version applies numpy vectorization, although this approach requires a reformulation of the original script.
A Motivating Example¶
We consider the computation of a floating-point approximation of a sum. Such sums occur in numerical integration.
One way to approximate \(\pi\)/4 is to compute the area of the unit circle defined by \(x^2 + y^2 - 1 = 0\), for \(x\) ranging from 0 to 1. The corresponding \(y\) coordinate of a point on the unit circle is then \(\sqrt{1-x^2}\).
In Python, the formula translates into a one line of code,
defined in the function python_sum_symbolic
.
def python_sum_symbolic(n):
return float( sum(sqrt(1-(k/n)^2) for k in range(1, n+1)) )/n
4*python_sum_symbolic(1000)
We see 3.1395554669110264
as an approximation for \(\pi\).
To benchmark the function, we can use timeit
.
The command timeit
in Python is good to measure
the execution time of small code snippets.
timeit('python_sum_symbolic(1000)')
The output on an 3.1 GHz Intel Core i7 processor is
5 loops, best of 3: 84.8 ms per loop
.
For longer execution times, we better use cputime().
t1 = cputime()
python_sum_symbolic(10^4)
ct1 = cputime(t1)
print('time for python_sum_symbolic :', ct1)
and we see time for python_sum_symbolic : 2.694785
.
Executing in Pure Python¶
The first reason that the function is so slow is because we use
the symbolic sqrt
function.
We will use the sqrt function of the math
module in Python.
The explicit conversion to float is no longer needed
as this sqrt
returns afloating-point number.
def python_sum(n):
from math import sqrt
return sum( sqrt(1-(k/n)^2) for k in range(1, n+1) )/n
4*python_sum(1000)
Now we time the function again with time
.
t2 = cputime()
python_sum(10^4)
ct2 = cputime(t2)
print('time for python_sum :', ct2)
We obtain time for python_sum : 0.029598
and because this has
now become a small computation, we may as well use timeit
.
timeit('python_sum(10^4)')
which confirms with 25 loops, best of 3: 23.3 ms per loop
that that time has dropped from seconds to milliseconds.
print('speedup :', ct1/ct2)
shows speedup : 91.0461855531
.
Vectorization with numpy¶
Instead of the sqrt
of pure Python, we could also use
the sqrt
and sum
of numpy.
These numpy functions allow that we give arrays on input.
The vectorization of code is the replacement of a Python for loop
in for k in range(1, n+1)
by a loop executed by numpy functions.
def numpy_sum(n):
from numpy import sqrt, sum, arange
x = arange(n)/float(n)
return sum(sqrt(1-x**2))/n
4*numpy_sum(1000)
Applying timeit
to benchmark the function.
timeit('numpy_sum(10^4)')
shows 625 loops, best of 3: 108 µs per loop
so the time
is now expressed in microseconds instead of milliseconds.
To compare against the pure Python sum, we sum one million items.
t3 = cputime()
python_sum(10^6)
ct3 = cputime(t3)
print('time for python_sum :', ct3)
t4 = cputime()
numpy_sum(10^6)
ct4 = cputime(t4)
print('time for numpy_sum :', ct4)
From the Python code we obtain time for python_sum : 2.696452
and with numpy we get time for numpy_sum : 0.019973
.
Let us compute the speedup.
print('speedup :', ct3/ct4)
which shows speedup : 135.004856556
.
With numpy vectorization we again obtained a hundredfold improvement.
Cython code¶
The advantage of Cython over vectorization is that the Cython code
is almost identical as the Python code. In this particular example
we had to replace the ^
by the **
for the exponentiation.
%%cython
def cython_sum(n):
from math import sqrt
return sum( sqrt(1-(k/n)**2) for k in range(1, n+1) )/n
The two links returned when evaluating a cell with Cython code are the generated C code and an html file with the annotated version of the Cython program. To see whether it works we do.
print(4*cython_sum(1000))
and then we benchmark the code with timeit
.
timeit('cython_sum(10^4)')
We obtain 25 loops, best of 3: 26.2 ms per loop
.
Running a larger example with time
t5 = cputime()
cython_sum(10^6)
ct5 = cputime(t5)
print('time for the cython_sum :', ct5)
shows time for the cython_sum : 2.651237
so the Cython code is
not really faster than the original Python code,
because in both functions python_sum
and cython_sum
the same functions in the Python C library are executed.
The code can be made faster if we declare the variables to be of C data types and the we use the C version of the sqrt function.
%%cython
cdef extern from "math.h":
double sqrt(double)
def cython_sum_typed(long n):
cdef long k
return sum( sqrt(1-(k/float(n))**2) for k in range(1, n+1) )/n
We first check the correctness.
print(4*cython_sum_typed(1000))
and now we will see improved timing results.
timeit('cython_sum_typed(10^4)')
With timeit we obtain 625 loops, best of 3: 1.09 ms per loop
and then we run time
again.
t6 = cputime()
cython_sum_typed(10^6)
ct6 = cputime(t6)
print('time for cython_sum_typed :', ct6)
The output is time for cython_sum_typed : 0.112308
and compared to the first Cython code, we compute the speedup.
print('speedup :', ct5/ct6)
which prints speedup : 23.6068401182
.
Assignments¶
Perform all computations on your computer. Make a table with execution times and speedups.
Perform all computations on your Sage notebook account or on SageMathCloud. Indicate which online version you ran. Make a table with execution times and speedups.
Write a Python function
python_sum
which takes on input a positive integer \(n\) and which returns the floating-point value of\[\frac{1}{n} \sum_{k=1}^{n} \ln \left( 1 + \frac{k}{n} \right).\]Write a more efficient version
numpy_sum
using vectorization.Time the two versions to illustrate the efficiency.
Take the Python function
python_sum
of the previous exercise and apply Cython to make the function more efficient. Time your Cython version and compare withpython_sum
to illustrate the efficiency.