Running Cython and Vectorization

While Python is a great language to prototype applications, its execution speed, efficiency, and performance may not be good enough for more demanding computational problems.

One way to improve the efficiency of a Python script is to extend Python with an extension module which encapsulates code in a compiled language. The computationally intensive portions of the application are then defined by compiled code, which runs more efficiently than interpreted code.

Below we discuss two alternatives to extension modules: cython and vectorization.

Getting Started with Cython

Cython is a programming language based on Python:

  • with static type declarations to achieve the speed of C,
  • to write optimized code and to interface with C libraries.

The proceedings of the 8th Python in Science Conference (SciPy 2009) introduce the language:

  • S. Behnel, R.W. Bradshaw, D.S. Seljebotn: Cython tutorial. In SciPy 2009, pages 4-14, 2009.
  • D.S. Seljebotn: Fast numerical computations with Cython. In SciPy 2009, pages 15-23, 2009.

Version 0.25.2 (2016-12-08) is available via {tt cython.org}. It installs with {tt pip3}. Demonstrations were done on a MacBook Pro. On Windows, Cython works well in cygwin.

Python is interpreted and dynamically typed. This implies that the instructions are parsed, translated, and executed one-by-one. The types of objects are determined during assignment. Cython code is compiled: a program is first parsed entirely, then translated into machine executable code, eventually optimized, before its execution. Static type declarations allow for translations into very efficient C code, and direct manipulations of objects in external libraries. Cython is a Python compiler: it compiles regular Python code.

We demonstrate the use of Cython on a hello world example. We can compile Cython code in two ways:

  1. using distutils to build an extension of Python; or
  2. run the cython command-line utility to make a .c file and then compile this file.

We illustrate the two ways with a simple say_hello method. Cython code has the extension .pyx. The content of the file hello.pyx is listed below.

def say_hello(name):
    """
    Prints hello followed by the name.
    """
    print("hello", name)

To use distutils we have to define a setup.py file. The file hello_setup.py has content

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

EXT_MODULES = [Extension("hello", ["hello.pyx"])]

setup(
    name = 'hello world' ,
    cmdclass = {'build_ext': build_ext},
    ext_modules = EXT_MODULES
)

To build the extension module, at the command prompt we type

$ python3 hello_setup.py build_ext --inplace

and this will make the shared object file hello.cpython-36m-darwin.so which we can rename into hello.so.

$ python3 hello_setup.py build_ext --inplace
running build_ext
cythoning hello.pyx to hello.c
building 'hello' extension
creating build
creating build/temp.macosx-10.6-intel-3.6
/usr/bin/clang -fno-strict-aliasing -Wsign-compare -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -arch i386 -arch x86_64 -g -I/Library/Frameworks/Python.framework/Versions/3.6/include/python3.6m -c hello.c -o build/temp.macosx-10.6-intel-3.6/hello.o
/usr/bin/clang -bundle -undefined dynamic_lookup -arch i386 -arch x86_64 -g build/temp.macosx-10.6-intel-3.6/hello.o -o /Users/jan/Courses/MCS275/Spring17/Lec41/hello.cpython-36m-darwin.so
$

With the ---inplace option, the shared object file is placed in the current directory.

To test the extension module, we import the function say_hello of the module hello.

$ python3
Python 3.6.0 (v3.6.0:41df79263a11, Dec 22 2016, 17:23:13)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from hello import say_hello
>>> say_hello("there")
hello there
>>>

The second way to make the extension module is derived from the use of the setup tools. We copy the compilation statements into a makefile.

The first step is to call the cython compiler at the command prompt as follows.

$ which cython
/Library/Frameworks/Python.framework/Versions/3.6/bin/cython
$ cython hello.pyx
$ ls -lt hello.c
-rw-r--r--  1 jan  staff  77731 Apr 20 19:07 hello.c
$

The makefile then contains the entry

hello_cython:
        cython hello.pyx
        /usr/bin/clang -c hello.c -o hello.o \
-fno-strict-aliasing -fno-common -dynamic -arch i386 -arch x86_64 \
-g -O2 -DNDEBUG -g -O3 \
-I/Library/Frameworks/Python.framework/Versions/3.6/include/python3.6m
        /usr/bin/clang -bundle -undefined dynamic_lookup -arch i386 \
-arch x86_64 -g hello.o -o hello.so

Typing hello_cython executes cython, compiles the hello.c, and links the object hello.o into the shared object hello.so.

Numerical Integration

For a computational intensive, yet simple computation, consider

\[\frac{\pi}{4} = \int_0^1 \sqrt{1-x^2} ~\! dx\]

approximated with the composite Trapezoidal rule:

\[\int_0^1 \sqrt{1-x^2} ~\! dx \approx \frac{1}{n} \left( \frac{1}{2} + \sum_{i=1}^{n-1} \sqrt{1 - \left(\frac{i}{n} \right)^2} \right).\]

We let \(n = 10^7\) and make 10,000,000 square root function calls. The formula above is implemented in the Python function integral4pi listed below.

from math import sqrt # do not import in circle !!!

def circle(xvl):
    """
    Returns the y corresponding to xvl
    on the upper half of the unit circle.
    """
    return sqrt(1-xvl**2)

def integral4pi(nbvals):
    """
    Approximates Pi with the trapezoidal
    rule with nbvals subintervals of [0,1].
    """
    step = 1.0/nbvals
    result = (circle(0)+circle(1))/2
    for i in range(nbvals):
        result += circle(i*step)
    return 4*result*step

In the main() we time the execution with the clock function of the time module.

def main():
    """
    Does the timing of integral4pi.
    """
    from time import clock
    start_time = clock()
    approx = integral4pi(10**7)
    stop_time = clock()
    print 'pi =', approx
    elapsed = stop_time - start_time
    print 'elapsed time = %.3f seconds' % elapsed

Running this script on a 3.1 GHz Intel Core i7 MacBook Pro:

$ python3 integral4pi.py
pi = 3.1415930535527115
elapsed time = 3.464 seconds
$

We add type declaration in the script integral4pi_typed.pyx.

from math import sqrt

def circle(double x):
    return sqrt(1-x**2)

def integral4pi(int n):
    cdef int i
    cdef double h, r
    h = 1.0/n
    r = (circle(0)+circle(1))/2
    for i in range(n):
        r += circle(i*h)
    return 4*r*h

We use distutils to build the module integral4pi_typed. To build, we define integral4pi_typed_setup.py:

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

ext_modules = [Extension("integral4pi_typed",
                        ["integral4pi_typed.pyx"])]

setup(
  name = 'integral approximation for pi' ,
  cmdclass = {'build_ext': build_ext},
  ext_modules = ext_modules
)

Building the code happens as follows:

$ python3 integral4pi_typed_setup.py build_ext --inplace
running build_ext
cythoning integral4pi_typed.pyx to integral4pi_typed.c
building 'integral4pi_typed' extension
/usr/bin/clang -fno-strict-aliasing -Wsign-compare -fno-common
-dynamic -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes
-arch i386 -arch x86_64 -g
-I/Library/Frameworks/Python.framework/Versions/3.6/include/python3.6m
-c integral4pi_typed.c
-o build/temp.macosx-10.6-intel-3.6/integral4pi_typed.o
/usr/bin/clang -bundle -undefined dynamic_lookup
-arch i386 -arch x86_64 -g
build/temp.macosx-10.6-intel-3.6/integral4pi_typed.o
-o /Users/jan/Courses/MCS275/Spring17/Lec41/integral4pi_typed.cpython-36m-darwin.so
$

The setup tools show us the compilation instructions.

The timing in integral4pi_type of integral4pi happens as follows.

from time import clock
from integral4pi_typed import integral4pi

START_TIME = clock()
APPROX = integral4pi(10**7)
STOP_TIME = clock()
print('pi =', APPROX)
ELAPSED = STOP_TIME - START_TIME
print('elapsed time = %.3f seconds' % ELAPSED)

Running the script goes as follows.

$ python3 integral4pi_typed_apply.py
pi = 3.1415930535527115
elapsed time = 0.918 seconds
$

The code runs more than three times as fast as the original Python version (3.464 seconds).

To avoid the construction of float objects around function calls, we declare a C-style function:

from math import sqrt

cdef double circle(double x) except *:
    return sqrt(1-x**2)

The rest of the script remains the same. To compile integral4pi_cdeffun.pyx, we define the file integral4pi_cdeffun_setup.py and build with

$ python3 integral4pi_cdeffun_setup.py build_ext --inplace

Similar as with integral4pi_typed_apply.py we define the script integral4pi_cdeffun_apply.py.

$ python3 integral4pi_cdeffun_apply.py
pi = 3.14159305355
elapsed time = 0.777 seconds
$

What have we achieved so far is summarized in Table 15.

Table 15 Speedups compared to the original Python script
  elapsed seconds speedup
original Python 3.464 1.00
Cython with cdef 0.918 3.77
cdef function 0.583 5.94

The main cost is calling sqrt 10,000,000 times... Instead of using the sqrt of the Python math module, we can directly use the sqrt of the C math library:

cdef extern from "math.h":
    double sqrt(double)

The rest of the script remains the same. To compile integral4pi_extcfun.pyx, we define the file integral4pi_extcfun_setup.py and build with

$ python3 integral4pi_extcfun_setup.py build_ext --inplace

Similar as with integral4pi_typed_apply.py newline we define the script integral4pi_extcfun_apply.py.

$ python3 integral4pi_extcfun_apply.py
pi = 3.1415930535527115
elapsed time = 0.041 seconds
$

This gives a nice speedup, summarized below in Table 16.

Table 16 Speedups compared to the original Python script
  elapsed seconds speedup
original Python 3.464 1.00
Cython with cdef 0.918 3.77
cdef function 0.583 5.94
external C function 0.041 84.49

Finally, we compare against C, using the code below.

#include <stdio.h>
#include <math.h>
#include <time.h>

double circle ( double x )
{
   return sqrt(1-x*x);
}

double integral4pi ( int n )
{
   int i;
   double h = 1.0/n;
   double r = (circle(0)+circle(1))/2;

   for(i=0; i<n; i++)
      r += circle(i*h);
   return 4*r*h;
}

The main function is below.

int main ( void )
{
   int n = 10000000;
   clock_t start_time,stop_time;

   start_time = clock();
   double a = integral4pi(n);
   stop_time = clock();

   printf("pi = %.15f\n",a);
   printf("elapsed time = %.3f seconds\n",
          (double) (stop_time-start_time)/CLOCKS_PER_SEC);

   return 0;
}

The compiling and running of the code is shown below.

$ /usr/bin/clang -O3 integral4pi_native.c -o /tmp/integral4pi_native
$ /tmp/integral4pi_native
pi = 3.141593053552711
elapsed time = 0.023 seconds
$

The final table with the speedups is in Table 17.

Table 17 Speedups compared to the original Python script
  elapsed seconds speedup
original Python 3.464 1.00
Cython with cdef 0.918 3.77
cdef function 0.583 5.94
external C function 0.041 84.49
native C code 0.023 150.61

We achieved double digits speedups. The Cython code which calls the C sqrt function is faster than the native C code.