Running Cython and Vectorization¶
While Python is a great language to prototype applications, its execution speed, efficiency, and performance may not be good enough for more demanding computational problems.
One way to improve the efficiency of a Python script is to extend Python with an extension module which encapsulates code in a compiled language. The computationally intensive portions of the application are then defined by compiled code, which runs more efficiently than interpreted code.
Below we discuss two alternatives to extension modules: cython and vectorization.
Getting Started with Cython¶
Cython is a programming language based on Python:
- with static type declarations to achieve the speed of C,
- to write optimized code and to interface with C libraries.
The proceedings of the 8th Python in Science Conference (SciPy 2009) introduce the language:
- S. Behnel, R.W. Bradshaw, D.S. Seljebotn: Cython tutorial. In SciPy 2009, pages 4-14, 2009.
- D.S. Seljebotn: Fast numerical computations with Cython. In SciPy 2009, pages 15-23, 2009.
Version 0.25.2 (2016-12-08) is available via {tt cython.org}. It installs with {tt pip3}. Demonstrations were done on a MacBook Pro. On Windows, Cython works well in cygwin.
Python is interpreted and dynamically typed. This implies that the instructions are parsed, translated, and executed one-by-one. The types of objects are determined during assignment. Cython code is compiled: a program is first parsed entirely, then translated into machine executable code, eventually optimized, before its execution. Static type declarations allow for translations into very efficient C code, and direct manipulations of objects in external libraries. Cython is a Python compiler: it compiles regular Python code.
We demonstrate the use of Cython on a hello world
example.
We can compile Cython code in two ways:
- using
distutils
to build an extension of Python; or - run the
cython
command-line utility to make a.c
file and then compile this file.
We illustrate the two ways with a simple say_hello
method.
Cython code has the extension .pyx
.
The content of the file hello.pyx
is listed below.
def say_hello(name):
"""
Prints hello followed by the name.
"""
print("hello", name)
To use distutils
we have to define a setup.py
file.
The file hello_setup.py
has content
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext
EXT_MODULES = [Extension("hello", ["hello.pyx"])]
setup(
name = 'hello world' ,
cmdclass = {'build_ext': build_ext},
ext_modules = EXT_MODULES
)
To build the extension module, at the command prompt we type
$ python3 hello_setup.py build_ext --inplace
and this will make the shared object file
hello.cpython-36m-darwin.so
which we can rename
into hello.so
.
$ python3 hello_setup.py build_ext --inplace
running build_ext
cythoning hello.pyx to hello.c
building 'hello' extension
creating build
creating build/temp.macosx-10.6-intel-3.6
/usr/bin/clang -fno-strict-aliasing -Wsign-compare -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -arch i386 -arch x86_64 -g -I/Library/Frameworks/Python.framework/Versions/3.6/include/python3.6m -c hello.c -o build/temp.macosx-10.6-intel-3.6/hello.o
/usr/bin/clang -bundle -undefined dynamic_lookup -arch i386 -arch x86_64 -g build/temp.macosx-10.6-intel-3.6/hello.o -o /Users/jan/Courses/MCS275/Spring17/Lec41/hello.cpython-36m-darwin.so
$
With the ---inplace
option, the shared object file
is placed in the current directory.
To test the extension module,
we import the function say_hello
of the module hello
.
$ python3
Python 3.6.0 (v3.6.0:41df79263a11, Dec 22 2016, 17:23:13)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from hello import say_hello
>>> say_hello("there")
hello there
>>>
The second way to make the extension module is derived from the use of the setup tools. We copy the compilation statements into a makefile.
The first step is to call the cython compiler at the command prompt as follows.
$ which cython
/Library/Frameworks/Python.framework/Versions/3.6/bin/cython
$ cython hello.pyx
$ ls -lt hello.c
-rw-r--r-- 1 jan staff 77731 Apr 20 19:07 hello.c
$
The makefile then contains the entry
hello_cython:
cython hello.pyx
/usr/bin/clang -c hello.c -o hello.o \
-fno-strict-aliasing -fno-common -dynamic -arch i386 -arch x86_64 \
-g -O2 -DNDEBUG -g -O3 \
-I/Library/Frameworks/Python.framework/Versions/3.6/include/python3.6m
/usr/bin/clang -bundle -undefined dynamic_lookup -arch i386 \
-arch x86_64 -g hello.o -o hello.so
Typing hello_cython
executes cython
,
compiles the hello.c
,
and links the object hello.o
into the shared object hello.so
.
Numerical Integration¶
For a computational intensive, yet simple computation, consider
approximated with the composite Trapezoidal rule:
We let \(n = 10^7\) and make 10,000,000 square root function calls.
The formula above is implemented in the Python function integral4pi
listed below.
from math import sqrt # do not import in circle !!!
def circle(xvl):
"""
Returns the y corresponding to xvl
on the upper half of the unit circle.
"""
return sqrt(1-xvl**2)
def integral4pi(nbvals):
"""
Approximates Pi with the trapezoidal
rule with nbvals subintervals of [0,1].
"""
step = 1.0/nbvals
result = (circle(0)+circle(1))/2
for i in range(nbvals):
result += circle(i*step)
return 4*result*step
In the main()
we time the execution
with the clock
function of the time
module.
def main():
"""
Does the timing of integral4pi.
"""
from time import clock
start_time = clock()
approx = integral4pi(10**7)
stop_time = clock()
print 'pi =', approx
elapsed = stop_time - start_time
print 'elapsed time = %.3f seconds' % elapsed
Running this script on a 3.1 GHz Intel Core i7 MacBook Pro:
$ python3 integral4pi.py
pi = 3.1415930535527115
elapsed time = 3.464 seconds
$
We add type declaration in the script integral4pi_typed.pyx
.
from math import sqrt
def circle(double x):
return sqrt(1-x**2)
def integral4pi(int n):
cdef int i
cdef double h, r
h = 1.0/n
r = (circle(0)+circle(1))/2
for i in range(n):
r += circle(i*h)
return 4*r*h
We use distutils
to build the module integral4pi_typed
.
To build, we define integral4pi_typed_setup.py
:
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext
ext_modules = [Extension("integral4pi_typed",
["integral4pi_typed.pyx"])]
setup(
name = 'integral approximation for pi' ,
cmdclass = {'build_ext': build_ext},
ext_modules = ext_modules
)
Building the code happens as follows:
$ python3 integral4pi_typed_setup.py build_ext --inplace
running build_ext
cythoning integral4pi_typed.pyx to integral4pi_typed.c
building 'integral4pi_typed' extension
/usr/bin/clang -fno-strict-aliasing -Wsign-compare -fno-common
-dynamic -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes
-arch i386 -arch x86_64 -g
-I/Library/Frameworks/Python.framework/Versions/3.6/include/python3.6m
-c integral4pi_typed.c
-o build/temp.macosx-10.6-intel-3.6/integral4pi_typed.o
/usr/bin/clang -bundle -undefined dynamic_lookup
-arch i386 -arch x86_64 -g
build/temp.macosx-10.6-intel-3.6/integral4pi_typed.o
-o /Users/jan/Courses/MCS275/Spring17/Lec41/integral4pi_typed.cpython-36m-darwin.so
$
The setup tools show us the compilation instructions.
The timing in integral4pi_type
of integral4pi
happens
as follows.
from time import clock
from integral4pi_typed import integral4pi
START_TIME = clock()
APPROX = integral4pi(10**7)
STOP_TIME = clock()
print('pi =', APPROX)
ELAPSED = STOP_TIME - START_TIME
print('elapsed time = %.3f seconds' % ELAPSED)
Running the script goes as follows.
$ python3 integral4pi_typed_apply.py
pi = 3.1415930535527115
elapsed time = 0.918 seconds
$
The code runs more than three times as fast as the original Python version (3.464 seconds).
To avoid the construction of float objects around function calls, we declare a C-style function:
from math import sqrt
cdef double circle(double x) except *:
return sqrt(1-x**2)
The rest of the script remains the same.
To compile integral4pi_cdeffun.pyx
, we define the
file integral4pi_cdeffun_setup.py
and build with
$ python3 integral4pi_cdeffun_setup.py build_ext --inplace
Similar as with integral4pi_typed_apply.py
we define the script integral4pi_cdeffun_apply.py
.
$ python3 integral4pi_cdeffun_apply.py
pi = 3.14159305355
elapsed time = 0.777 seconds
$
What have we achieved so far is summarized in Table 15.
elapsed seconds | speedup | |
---|---|---|
original Python | 3.464 | 1.00 |
Cython with cdef | 0.918 | 3.77 |
cdef function | 0.583 | 5.94 |
The main cost is calling sqrt
10,000,000 times...
Instead of using the sqrt
of the Python math
module,
we can directly use the sqrt
of the C math library:
cdef extern from "math.h":
double sqrt(double)
The rest of the script remains the same.
To compile integral4pi_extcfun.pyx
, we define the
file integral4pi_extcfun_setup.py
and build with
$ python3 integral4pi_extcfun_setup.py build_ext --inplace
Similar as with integral4pi_typed_apply.py
newline we define the script integral4pi_extcfun_apply.py
.
$ python3 integral4pi_extcfun_apply.py
pi = 3.1415930535527115
elapsed time = 0.041 seconds
$
This gives a nice speedup, summarized below in Table 16.
elapsed seconds | speedup | |
---|---|---|
original Python | 3.464 | 1.00 |
Cython with cdef | 0.918 | 3.77 |
cdef function | 0.583 | 5.94 |
external C function | 0.041 | 84.49 |
Finally, we compare against C, using the code below.
#include <stdio.h>
#include <math.h>
#include <time.h>
double circle ( double x )
{
return sqrt(1-x*x);
}
double integral4pi ( int n )
{
int i;
double h = 1.0/n;
double r = (circle(0)+circle(1))/2;
for(i=0; i<n; i++)
r += circle(i*h);
return 4*r*h;
}
The main function is below.
int main ( void )
{
int n = 10000000;
clock_t start_time,stop_time;
start_time = clock();
double a = integral4pi(n);
stop_time = clock();
printf("pi = %.15f\n",a);
printf("elapsed time = %.3f seconds\n",
(double) (stop_time-start_time)/CLOCKS_PER_SEC);
return 0;
}
The compiling and running of the code is shown below.
$ /usr/bin/clang -O3 integral4pi_native.c -o /tmp/integral4pi_native
$ /tmp/integral4pi_native
pi = 3.141593053552711
elapsed time = 0.023 seconds
$
The final table with the speedups is in Table 17.
elapsed seconds | speedup | |
---|---|---|
original Python | 3.464 | 1.00 |
Cython with cdef | 0.918 | 3.77 |
cdef function | 0.583 | 5.94 |
external C function | 0.041 | 84.49 |
native C code | 0.023 | 150.61 |
We achieved double digits speedups. The Cython code which calls the C sqrt function is faster than the native C code.