Running Cython and Vectorization ================================ While Python is a great language to prototype applications, its execution speed, efficiency, and performance may not be good enough for more demanding computational problems. One way to improve the efficiency of a Python script is to extend Python with an extension module which encapsulates code in a compiled language. The computationally intensive portions of the application are then defined by compiled code, which runs more efficiently than interpreted code. Below we discuss two alternatives to extension modules: cython and vectorization. Getting Started with Cython --------------------------- Cython is a programming language based on Python: * with static type declarations to achieve the speed of C, * to write optimized code and to interface with C libraries. The proceedings of the 8th Python in Science Conference (SciPy 2009) introduce the language: * S. Behnel, R.W. Bradshaw, D.S. Seljebotn: **Cython tutorial.** In SciPy 2009, pages 4-14, 2009. * D.S. Seljebotn: **Fast numerical computations with Cython.** In SciPy 2009, pages 15-23, 2009. Version 0.25.2 (2016-12-08) is available via {\tt cython.org}. It installs with {\tt pip3}. Demonstrations were done on a MacBook Pro. On Windows, Cython works well in cygwin. Python is interpreted and dynamically typed. This implies that the instructions are parsed, translated, and executed one-by-one. The types of objects are determined during assignment. Cython code is compiled: a program is first parsed entirely, then translated into machine executable code, eventually optimized, before its execution. Static type declarations allow for translations into very efficient C code, and direct manipulations of objects in external libraries. Cython is a Python compiler: it compiles regular Python code. We demonstrate the use of Cython on a ``hello world`` example. We can compile Cython code in two ways: 1. using ``distutils`` to build an extension of Python; or 2. run the ``cython`` command-line utility to make a ``.c`` file and then compile this file. We illustrate the two ways with a simple ``say_hello`` method. Cython code has the extension ``.pyx``. The content of the file ``hello.pyx`` is listed below. :: def say_hello(name): """ Prints hello followed by the name. """ print("hello", name) To use ``distutils`` we have to define a ``setup.py`` file. The file ``hello_setup.py`` has content :: from distutils.core import setup from distutils.extension import Extension from Cython.Distutils import build_ext EXT_MODULES = [Extension("hello", ["hello.pyx"])] setup( name = 'hello world' , cmdclass = {'build_ext': build_ext}, ext_modules = EXT_MODULES ) To build the extension module, at the command prompt we type :: $ python3 hello_setup.py build_ext --inplace and this will make the shared object file ``hello.cpython-36m-darwin.so`` which we can rename into ``hello.so``. :: $ python3 hello_setup.py build_ext --inplace running build_ext cythoning hello.pyx to hello.c building 'hello' extension creating build creating build/temp.macosx-10.6-intel-3.6 /usr/bin/clang -fno-strict-aliasing -Wsign-compare -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -arch i386 -arch x86_64 -g -I/Library/Frameworks/Python.framework/Versions/3.6/include/python3.6m -c hello.c -o build/temp.macosx-10.6-intel-3.6/hello.o /usr/bin/clang -bundle -undefined dynamic_lookup -arch i386 -arch x86_64 -g build/temp.macosx-10.6-intel-3.6/hello.o -o /Users/jan/Courses/MCS275/Spring17/Lec41/hello.cpython-36m-darwin.so $ With the ``---inplace`` option, the shared object file is placed in the current directory. To test the extension module, we import the function ``say_hello`` of the module ``hello``. :: $ python3 Python 3.6.0 (v3.6.0:41df79263a11, Dec 22 2016, 17:23:13) [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> from hello import say_hello >>> say_hello("there") hello there >>> The second way to make the extension module is derived from the use of the setup tools. We copy the compilation statements into a makefile. The first step is to call the cython compiler at the command prompt as follows. :: $ which cython /Library/Frameworks/Python.framework/Versions/3.6/bin/cython $ cython hello.pyx $ ls -lt hello.c -rw-r--r-- 1 jan staff 77731 Apr 20 19:07 hello.c $ The makefile then contains the entry :: hello_cython: cython hello.pyx /usr/bin/clang -c hello.c -o hello.o \ -fno-strict-aliasing -fno-common -dynamic -arch i386 -arch x86_64 \ -g -O2 -DNDEBUG -g -O3 \ -I/Library/Frameworks/Python.framework/Versions/3.6/include/python3.6m /usr/bin/clang -bundle -undefined dynamic_lookup -arch i386 \ -arch x86_64 -g hello.o -o hello.so Typing ``hello_cython`` executes ``cython``, compiles the ``hello.c``, and links the object ``hello.o`` into the shared object ``hello.so``. Numerical Integration --------------------- For a computational intensive, yet simple computation, consider .. math:: \frac{\pi}{4} = \int_0^1 \sqrt{1-x^2} ~\! dx approximated with the composite Trapezoidal rule: .. math:: \int_0^1 \sqrt{1-x^2} ~\! dx \approx \frac{1}{n} \left( \frac{1}{2} + \sum_{i=1}^{n-1} \sqrt{1 - \left(\frac{i}{n} \right)^2} \right). We let :math:`n = 10^7` and make 10,000,000 square root function calls. The formula above is implemented in the Python function ``integral4pi`` listed below. :: from math import sqrt # do not import in circle !!! def circle(xvl): """ Returns the y corresponding to xvl on the upper half of the unit circle. """ return sqrt(1-xvl**2) def integral4pi(nbvals): """ Approximates Pi with the trapezoidal rule with nbvals subintervals of [0,1]. """ step = 1.0/nbvals result = (circle(0)+circle(1))/2 for i in range(nbvals): result += circle(i*step) return 4*result*step In the ``main()`` we time the execution with the ``clock`` function of the ``time`` module. :: def main(): """ Does the timing of integral4pi. """ from time import clock start_time = clock() approx = integral4pi(10**7) stop_time = clock() print 'pi =', approx elapsed = stop_time - start_time print 'elapsed time = %.3f seconds' % elapsed Running this script on a 3.1 GHz Intel Core i7 MacBook Pro: :: $ python3 integral4pi.py pi = 3.1415930535527115 elapsed time = 3.464 seconds $ We add type declaration in the script ``integral4pi_typed.pyx``. :: from math import sqrt def circle(double x): return sqrt(1-x**2) def integral4pi(int n): cdef int i cdef double h, r h = 1.0/n r = (circle(0)+circle(1))/2 for i in range(n): r += circle(i*h) return 4*r*h We use ``distutils`` to build the module ``integral4pi_typed``. To build, we define ``integral4pi_typed_setup.py``: :: from distutils.core import setup from distutils.extension import Extension from Cython.Distutils import build_ext ext_modules = [Extension("integral4pi_typed", ["integral4pi_typed.pyx"])] setup( name = 'integral approximation for pi' , cmdclass = {'build_ext': build_ext}, ext_modules = ext_modules ) Building the code happens as follows: :: $ python3 integral4pi_typed_setup.py build_ext --inplace running build_ext cythoning integral4pi_typed.pyx to integral4pi_typed.c building 'integral4pi_typed' extension /usr/bin/clang -fno-strict-aliasing -Wsign-compare -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -arch i386 -arch x86_64 -g -I/Library/Frameworks/Python.framework/Versions/3.6/include/python3.6m -c integral4pi_typed.c -o build/temp.macosx-10.6-intel-3.6/integral4pi_typed.o /usr/bin/clang -bundle -undefined dynamic_lookup -arch i386 -arch x86_64 -g build/temp.macosx-10.6-intel-3.6/integral4pi_typed.o -o /Users/jan/Courses/MCS275/Spring17/Lec41/integral4pi_typed.cpython-36m-darwin.so $ The setup tools show us the compilation instructions. The timing in ``integral4pi_type`` of ``integral4pi`` happens as follows. :: from time import clock from integral4pi_typed import integral4pi START_TIME = clock() APPROX = integral4pi(10**7) STOP_TIME = clock() print('pi =', APPROX) ELAPSED = STOP_TIME - START_TIME print('elapsed time = %.3f seconds' % ELAPSED) Running the script goes as follows. :: $ python3 integral4pi_typed_apply.py pi = 3.1415930535527115 elapsed time = 0.918 seconds $ The code runs more than three times as fast as the original Python version (3.464 seconds). To avoid the construction of float objects around function calls, we declare a C-style function: :: from math import sqrt cdef double circle(double x) except *: return sqrt(1-x**2) The rest of the script remains the same. To compile ``integral4pi_cdeffun.pyx``, we define the file ``integral4pi_cdeffun_setup.py`` and build with :: $ python3 integral4pi_cdeffun_setup.py build_ext --inplace Similar as with ``integral4pi_typed_apply.py`` we define the script ``integral4pi_cdeffun_apply.py``. :: $ python3 integral4pi_cdeffun_apply.py pi = 3.14159305355 elapsed time = 0.777 seconds $ What have we achieved so far is summarized in :numref:`tabspeedups`. .. _tabspeedups: .. table:: Speedups compared to the original Python script +------------------+-----------------+---------+ | | elapsed seconds | speedup | +==================+=================+=========+ | original Python | 3.464 | 1.00 | +------------------+-----------------+---------+ | Cython with cdef | 0.918 | 3.77 | +------------------+-----------------+---------+ | cdef function | 0.583 | 5.94 | +------------------+-----------------+---------+ The main cost is calling ``sqrt`` 10,000,000 times... Instead of using the ``sqrt`` of the Python ``math`` module, we can directly use the ``sqrt`` of the C math library: :: cdef extern from "math.h": double sqrt(double) The rest of the script remains the same. To compile ``integral4pi_extcfun.pyx``, we define the file ``integral4pi_extcfun_setup.py`` and build with :: $ python3 integral4pi_extcfun_setup.py build_ext --inplace Similar as with ``integral4pi_typed_apply.py`` \newline we define the script ``integral4pi_extcfun_apply.py``. :: $ python3 integral4pi_extcfun_apply.py pi = 3.1415930535527115 elapsed time = 0.041 seconds $ This gives a nice speedup, summarized below in :numref:`tabspeedups2`. .. _tabspeedups2: .. table:: Speedups compared to the original Python script +---------------------+-----------------+---------+ | | elapsed seconds | speedup | +=====================+=================+=========+ | original Python | 3.464 | 1.00 | +---------------------+-----------------+---------+ | Cython with cdef | 0.918 | 3.77 | +---------------------+-----------------+---------+ | cdef function | 0.583 | 5.94 | +---------------------+-----------------+---------+ | external C function | 0.041 | 84.49 | +---------------------+-----------------+---------+ Finally, we compare against C, using the code below. :: #include #include #include double circle ( double x ) { return sqrt(1-x*x); } double integral4pi ( int n ) { int i; double h = 1.0/n; double r = (circle(0)+circle(1))/2; for(i=0; i