Lecture 39: Parallel Computing in Julia
=======================================

All computers are parallel. We distinguish between three types of parallel computing.

1. Distributed memory parallel computing.
2. Shared memory parallel computing.
3. Acceleration with Graphics Processing Units.

Parallel programs are evaluated by speedup and performance.

1. Speedup is the serial time divided by the parallel time.
2. Performance is measured by the number of floating-point operations per second (flops).

Julia supports all three types listed above, but in this lecture we focus on multithreading.

Parallel Symbolic Computing
---------------------------

In symbolic computing we compute with expressions. We view the expressions as functions to evaluate at many points. The evaluations at different points happen independently, in parallel. Two of the most commonly used expressions are polynomials and power series. A typical application is interpolation.

1. A polynomial of degree \ :math:`d` in one variable has \ :math:`d+1` coefficients and is determined uniquely by \ :math:`d+1` function values at distinct points.
2. Power series are computed via the Fast Fourier Transform.

Parallel Computing with Julia
-----------------------------

We apply multithreading in a Jupyter notebook, in a kernel installed with the environment variable set to 16 threads.

::

    julia> using IJulia
    julia> installkernel("Julia (16 threads)", env = Dict("JULIA_NUM_THREADS" => "16"))

If the kernel was installed correctly, then in a code cell of a Jupyter notebook running this kernel, we can verify with

::

    using Base.Threads
    nthreads()

that indeed we have ``16`` threads available.

A quick way to do parallel programming is to call software libraries which support multithreading. We consider the matrix-matrix multiplication, as executed by ``mul!()`` of BLAS, where BLAS stands for the Basic Linear Algebra Subprograms. There are two issues we must consider.

1. Choose the size of the matrices large enough.
2. The time should not include the compilation time.

::

    using LinearAlgebra
    n = 8000

The choice of ``8000`` as a sufficiently large problem can be justified by considering the ``peakflops``. For the matrix-matrix multiplication, we generate three random matrices

::

    A = rand(n, n); B = rand(n, n); C = rand(n, n);

suppressing the output with the semicolons, and set the number of threads to 2 with

::

    BLAS.set_num_threads(2)

Then we run with a timer

::

    @time mul!(C, A, B)

which also reports the percentage of time spent on compilation. To avoid this overhead, we run again, then double the number of threads, and run once more:

::

    @time mul!(C, A, B)
    BLAS.set_num_threads(4)
    @time mul!(C, A, B)

With four threads, the time dropped from 9.96 seconds to 6.46 seconds, observed on an Intel i9-9880H 2.30GHz CPU in a laptop running Windows, with Julia 1.7.2.

The command ``peakflops`` computes the peak flop rate of the computer, using double precision ``gemm!``. For the computer used in this experiment, ``peakflops(8000)`` returned ``1.9e11`` (or 190 billion floating-point operations per second), which is more than what ``peakflops(4000)`` or ``peakflops(16000)`` reported. So the dimension ``n = 8000`` is justified.
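To repeat this justification on another machine, one can compare the peak flop rates at several dimensions. Below is a minimal sketch, assuming only the ``LinearAlgebra`` package of the standard library; the rates it prints will of course depend on the hardware.

::

    using LinearAlgebra

    # compare the peak flop rates at three dimensions,
    # to check whether n = 8000 is large enough on this machine
    for n in (4000, 8000, 16000)
        rate = LinearAlgebra.peakflops(n)
        println("peakflops($n) = $rate")
    end

The dimension at which the rate peaks is a good choice for the timing experiment.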
Parallel Numerical Integration
------------------------------

We can estimate \ :math:`\pi` via the area of the unit disk:

.. math::

   \int_0^1 \sqrt{1-x^2} dx = \frac{\pi}{4}.

A Monte Carlo method proceeds as follows.

1. Generate random uniformly distributed points with coordinates \ :math:`(x,y) \in [0,+1] \times [0,+1]`.
2. Count a success when \ :math:`x^2 + y^2 \leq 1`.

By the law of large numbers, the average of the observed successes converges to the expected value or mean, as the number of experiments increases.

The random number generator we apply is very simple:

::

    myrand(x::Int64) = (1103515245x + 12345) % 2^31

and illustrates the ease with which Julia functions are defined.

The function executed by each thread in the experiment is listed below:

::

    """
        function estimatepi(n)

    Runs a simple Monte Carlo method to estimate pi with n samples.
    """
    function estimatepi(n)
        r = threadid()   # a different seed for each thread
        count = 0
        for i=1:n
            r = myrand(r)
            x = r/2^31
            r = myrand(r)
            y = r/2^31
            count += (x^2 + y^2) <= 1
        end
        return 4*count/n
    end

The ``r = threadid()`` initializes the seed for the random number generator with the identification number of each thread. It is very important that every thread generates a different sequence of random numbers.

Each thread writes in a different location in an array, defined next:

::

    nt = nthreads()
    estimates = zeros(nt)

Then the parallel for loop below defines the execution:

::

    import Statistics

    timestart = time()
    @threads for i=1:nt
        estimates[i] = estimatepi(div(10_000_000_000, nt))
    end
    estpi = Statistics.mean(estimates)
    elapsed16 = time() - timestart

The total number of 10 billion random points is divided by the number of threads, with the integer division ``div``, so that each thread receives an integer number of samples. Running this same loop with a smaller number of threads allows us to compute the speedup. The speedup of 16 over 8 threads was observed to be ``1.5``.
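Recall that speedup is the serial time over the parallel time. A baseline for the serial time can be obtained by running ``estimatepi`` once, with all samples, on a single thread. The sketch below assumes the definitions of ``estimatepi`` and ``elapsed16`` from the experiment above; the variable names are illustrative and the reported ratio will depend on the machine.

::

    # time one single-threaded run over all 10 billion samples ...
    timestart = time()
    estserial = estimatepi(10_000_000_000)
    elapsedserial = time() - timestart

    # ... and divide the serial time by the parallel time
    speedup = elapsedserial/elapsed16
    println("serial estimate : $estserial, speedup : $speedup")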