Lecture 10: Speeding up Python Functions with Vectorization and Cython
======================================================================

This lecture follows
Chapter 3 of *Sage for Power Users* by William Stein.

As an interpreted scripting language, Python is slow compared to
compiled languages such as C.  With Cython the compiled libraries
available in Sage can be used.  We may view Cython as a compiled
variant of Python.

The execution efficiency of Python is often very slow because variables
are interpreted as expressions and the default number types are often
not the data types supported by fast hardware arithmetic, but in slower
arithmetic implemented by software.  Cython is a variant of Python that
allows to add type declarations, so we may build much faster functions
without having to change the logic of the original script.
The running example is the application of basic numerical integration
to approximate \ :math:`\pi`.  
The fastest version applies numpy vectorization,
although this approach requires a reformulation of the original script.

A Motivating Example
--------------------

We consider the computation of a floating-point approximation of a sum.
Such sums occur in numerical integration.

One way to approximate \ :math:`\pi`/4 is to compute the area
of the unit circle defined by \ :math:`x^2 + y^2 - 1 = 0`,
for \ :math:`x` ranging from 0 to 1.  The corresponding \ :math:`y`
coordinate of a point on the unit circle is then \ :math:`\sqrt{1-x^2}`.


.. math::

   \frac{\pi}{4} = \int_0^1 \sqrt{1-x^2}
   \approx \frac{1}{n} \sum_{k=1}^n \sqrt{1 - \left(\frac{k}{n}\right)^2}

In Python, the formula translates into a one line of code,
defined in the function ``python_sum_symbolic``.

::

   def python_sum_symbolic(n):
       return float( sum(sqrt(1-(k/n)^2) for k in range(1, n+1)) )/n
   4*python_sum_symbolic(1000)

We see ``3.1395554669110264`` as an approximation for \ :math:`\pi`.

To benchmark the function, we can use ``timeit``.
The command ``timeit`` in Python is good to measure
the execution time of small code snippets.

::

   timeit('python_sum_symbolic(1000)')

The output on an 3.1 GHz Intel Core i7 processor is
``5 loops, best of 3: 84.8 ms per loop``.
For longer execution times, we better use cputime().

::

   t1 = cputime()
   python_sum_symbolic(10^4)
   ct1 = cputime(t1)
   print('time for python_sum_symbolic :', ct1)

and we see ``time for python_sum_symbolic : 2.694785``.

Executing in Pure Python
------------------------

The first reason that the function is so slow is because we use
the symbolic ``sqrt`` function. 
We will use the sqrt function of the ``math`` module in Python.
The explicit conversion to float is no longer needed
as this ``sqrt`` returns  afloating-point number.

::

    def python_sum(n):
        from math import sqrt
        return sum( sqrt(1-(k/n)^2) for k in range(1, n+1) )/n
    4*python_sum(1000)

Now we time the function again with ``time``.

::
 
    t2 = cputime()
    python_sum(10^4)
    ct2 = cputime(t2)
    print('time for python_sum :', ct2)

We obtain ``time for python_sum : 0.029598`` and because this has
now become a small computation, we may as well use ``timeit``.

::

    timeit('python_sum(10^4)')

which confirms with ``25 loops, best of 3: 23.3 ms per loop``
that that time has dropped from seconds to milliseconds.

::

    print('speedup :', ct1/ct2)

shows ``speedup : 91.0461855531``.

Vectorization with numpy
------------------------

Instead of the ``sqrt`` of pure Python, we could also use
the ``sqrt`` and ``sum``  of numpy.  
These numpy functions allow that we give arrays on input.

The *vectorization* of code is the replacement of a Python for loop
in ``for k in range(1, n+1)`` by a loop executed by numpy functions.

::

   def numpy_sum(n): 
       from numpy import sqrt, sum, arange
       x = arange(n)/float(n)
       return sum(sqrt(1-x**2))/n
   4*numpy_sum(1000)

Applying ``timeit`` to benchmark the function.

::

   timeit('numpy_sum(10^4)')

shows ``625 loops, best of 3: 108 µs per loop`` so the time
is now expressed in *microseconds* instead of milliseconds.

To compare against the pure Python sum, we sum one million items.

::

   t3 = cputime()
   python_sum(10^6)
   ct3 = cputime(t3)
   print('time for python_sum :', ct3)
   t4 = cputime()
   numpy_sum(10^6)
   ct4 = cputime(t4)
   print('time for numpy_sum :', ct4)

From the Python code we obtain ``time for python_sum : 2.696452``
and with numpy we get ``time for numpy_sum : 0.019973``.
Let us compute the speedup.

::

   print('speedup :', ct3/ct4)

which shows ``speedup : 135.004856556``. 
With numpy vectorization we again obtained a hundredfold improvement.

Cython code
-----------

The advantage of Cython over vectorization is that the Cython code
is almost identical as the Python code.  In this particular example
we had to replace the ``^`` by the ``**`` for the exponentiation.

::

    %%cython
    def cython_sum(n):
        from math import sqrt
        return sum( sqrt(1-(k/n)**2) for k in range(1, n+1) )/n

The two links returned when evaluating a cell with Cython code 
are the generated C code and an html file with the annotated version 
of the Cython program.  To see whether it works we do.

::

   print(4*cython_sum(1000))

and then we benchmark the code with ``timeit``.

::

   timeit('cython_sum(10^4)')

We obtain ``25 loops, best of 3: 26.2 ms per loop``.
Running a larger example with ``time``

::

   t5 = cputime()
   cython_sum(10^6)
   ct5 = cputime(t5)
   print('time for the cython_sum :', ct5)

shows ``time for the cython_sum : 2.651237`` so the Cython code is
not really faster than the original Python code,
because in both functions ``python_sum`` and ``cython_sum``
the same functions in the Python C library are executed.

The code can be made faster if we declare the variables
to be of C data types and the we use the C version of the sqrt function.

::

   %%cython
   cdef extern from "math.h":
       double sqrt(double)

   def cython_sum_typed(long n):
       cdef long k
       return sum( sqrt(1-(k/float(n))**2) for k in range(1, n+1) )/n

We first check the correctness.

::

   print(4*cython_sum_typed(1000))

and now we will see improved timing results.

::
 
   timeit('cython_sum_typed(10^4)')


With timeit we obtain ``625 loops, best of 3: 1.09 ms per loop``
and then we run ``time`` again.

::

   t6 = cputime()
   cython_sum_typed(10^6)
   ct6 = cputime(t6)
   print('time for cython_sum_typed :', ct6)

The output is ``time for cython_sum_typed : 0.112308``
and compared to the first Cython code, we compute the speedup.

::

   print('speedup :', ct5/ct6)

which prints ``speedup : 23.6068401182``.

Assignments
-----------

1. Perform all computations on your computer.
   Make a table with execution times and speedups.

2. Perform all computations on your Sage notebook account
   or on SageMathCloud.  Indicate which online version you ran.
   Make a table with execution times and speedups.

3. Write a Python function ``python_sum`` which takes on input 
   a positive integer \ :math:`n` and which returns the floating-point value of

   .. math::

      \frac{1}{n} \sum_{k=1}^{n} \ln \left( 1 + \frac{k}{n} \right).

   Write a more efficient version ``numpy_sum`` using vectorization.

   Time the two versions to illustrate the efficiency.

4. Take the Python function ``python_sum`` of the previous exercise
   and apply Cython to make the function more efficient.
   Time your Cython version and compare with ``python_sum``
   to illustrate the efficiency.