Lecture 39: Parallel Computing in Julia
=======================================

All computers are parallel. We distinguish between three types of parallel computing.

1. Distributed memory parallel computing.
2. Shared memory parallel computing.
3. Acceleration with Graphics Processing Units.

Parallel programs are evaluated by speedup and performance.

1. Speedup is the serial time divided by the parallel time.
2. Performance is measured by the number of floating-point operations per second (flops).

Julia supports all three types listed above, but in this lecture we focus on multithreading.

Parallel Symbolic Computing
---------------------------

In symbolic computing we compute with expressions. We view the expressions as functions to evaluate at many points. The evaluations at different points happen independently, in parallel. Two of the most commonly used expressions are polynomials and power series. A typical application is interpolation.

1. A polynomial of degree \ :math:`d` in one variable has \ :math:`d+1` coefficients and is determined uniquely by \ :math:`d+1` function values at distinct points.
2. Power series are computed via the Fast Fourier Transform.

Parallel Computing with Julia
-----------------------------

We apply multithreading in a Jupyter notebook, in a kernel installed with the environment variable set to 16 threads.

::

    julia> using IJulia
    julia> installkernel("Julia (16 threads)", env = Dict("JULIA_NUM_THREADS" => "16"))

If the kernel was installed correctly, then in a code cell of a Jupyter notebook running this kernel, we can verify with

::

    using Base.Threads
    nthreads()

that indeed we have ``16`` threads available.

A quick way to do parallel programming is to call software libraries which support multithreading. We consider the matrix-matrix multiplication, as executed by ``mul!()`` of BLAS, where BLAS stands for the Basic Linear Algebra Subprograms. There are two issues we must consider.

1. Choose the size of the matrices large enough.
2. The time should not include the compilation time.

::

    using LinearAlgebra
    n = 8000

The choice of ``8000`` as a sufficiently large problem can be justified by considering the ``peakflops``. For the matrix-matrix multiplication, we generate three random matrices

::

    A = rand(n, n); B = rand(n, n); C = rand(n, n);

suppressing the output with the semicolons, and set the number of threads to 2 with

::

    BLAS.set_num_threads(2)

Then we run with a timer

::

    @time mul!(C, A, B)

which also reports the percentage of time spent on compilation. To avoid this overhead, we run again, then double the number of threads, and run once more:

::

    @time mul!(C, A, B)
    BLAS.set_num_threads(4)
    @time mul!(C, A, B)

With four threads, the time dropped from 9.96 seconds to 6.46 seconds, observed on an Intel i9-9880H 2.30GHz CPU in a laptop running Windows, with Julia 1.7.2.

The command ``peakflops`` computes the peak flop rate of the computer, using double precision ``gemm!``. For the computer used in this experiment, ``peakflops(8000)`` returned ``1.9e11`` (or 190 billion floating-point operations per second), which is more than what ``peakflops(4000)`` or ``peakflops(16000)`` reported. So the dimension ``n = 8000`` is justified.
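To repeat this justification on another machine, one can compare the peak flop rates at several dimensions. Below is a minimal sketch, assuming only the ``LinearAlgebra`` package of the standard library; the rates it prints will of course depend on the hardware.

::

    using LinearAlgebra

    # compare the peak flop rates at three dimensions,
    # to check whether n = 8000 is large enough on this machine
    for n in (4000, 8000, 16000)
        rate = LinearAlgebra.peakflops(n)
        println("peakflops($n) = $rate")
    end

The dimension at which the rate peaks is a good choice for the timing experiment.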
Parallel Numerical Integration
------------------------------

We can estimate \ :math:`\pi` via the area of the unit disk:

.. math::

   \int_0^1 \sqrt{1-x^2} dx = \frac{\pi}{4}.

A Monte Carlo method proceeds as follows.

1. Generate random uniformly distributed points with coordinates \ :math:`(x,y) \in [0,+1] \times [0,+1]`.
2. Count a success when \ :math:`x^2 + y^2 \leq 1`.

By the law of large numbers, the average of the observed successes converges to the expected value or mean, as the number of experiments increases.

The random number generator we apply is very simple:

::

    myrand(x::Int64) = (1103515245x + 12345) % 2^31

and illustrates the ease with which Julia functions are defined.

The function executed by each thread in the experiment is listed below:

::

    """
        function estimatepi(n)

    Runs a simple Monte Carlo method to estimate pi with n samples.
    """
    function estimatepi(n)
        r = threadid()   # a different seed for each thread
        count = 0
        for i=1:n
            r = myrand(r)
            x = r/2^31
            r = myrand(r)
            y = r/2^31
            count += (x^2 + y^2) <= 1
        end
        return 4*count/n
    end

The ``r = threadid()`` initializes the seed for the random number generator with the identification number of each thread. It is very important that every thread generates a different sequence of random numbers.

Each thread writes in a different location in an array, defined next:

::

    nt = nthreads()
    estimates = zeros(nt)

Then the parallel for loop below defines the execution:

::

    import Statistics

    timestart = time()
    @threads for i=1:nt
        estimates[i] = estimatepi(div(10_000_000_000, nt))
    end
    estpi = Statistics.mean(estimates)
    elapsed16 = time() - timestart

The total number of 10 billion random points is divided by the number of threads, with the integer division ``div``, so that each thread receives an integer number of samples. Running this same loop with a smaller number of threads allows us to compute the speedup. The speedup of 16 over 8 threads was observed to be ``1.5``.
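Recall that speedup is the serial time over the parallel time. A baseline for the serial time can be obtained by running ``estimatepi`` once, with all samples, on a single thread. The sketch below assumes the definitions of ``estimatepi`` and ``elapsed16`` from the experiment above; the variable names are illustrative and the reported ratio will depend on the machine.

::

    # time one single-threaded run over all 10 billion samples ...
    timestart = time()
    estserial = estimatepi(10_000_000_000)
    elapsedserial = time() - timestart

    # ... and divide the serial time by the parallel time
    speedup = elapsedserial/elapsed16
    println("serial estimate : $estserial, speedup : $speedup")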