Lecture 39: Parallel Computing in Julia

All computers are parallel. We distinguish between three types of parallel computing.

  1. Distributed memory parallel computing.

  2. Shared memory parallel computing.

  3. Acceleration with Graphics Processing Units.

Parallel programs are evaluated by speedup and performance.

  1. Speedup is the serial execution time divided by the parallel execution time.

  2. Performance is measured by the number of floating-point operations per second (flops). Both notions are formalized in the formulas after this list.
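As a small formalization, write \(T(p)\) for the time on \(p\) threads and \(N\) for the number of floating-point operations performed; this notation is introduced here only for clarity and is not used elsewhere in the lecture. Then

\[ S(p) = \frac{T(1)}{T(p)}, \qquad \text{flops} = \frac{N}{T(p)}. \]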

Julia supports all three types listed above, but in this lecture we focus on multithreading.

Parallel Symbolic Computing

In symbolic computing we compute with expressions. We view an expression as a function to evaluate at many points, and the evaluations at different points happen independently, in parallel.

Two of the most commonly used expressions are polynomials and power series. A typical application is interpolation.

  1. A polynomial of degree \(d\) in one variable has \(d+1\) coefficients and is determined uniquely by \(d+1\) function values at distinct points.

  2. Power series are computed via the Fast Fourier Transform.
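To illustrate the independence of the evaluations, the sketch below evaluates one polynomial at many points, one point per loop iteration, with the iterations distributed over the threads. The function name evalatpoints and the random data are made up for this illustration; they do not come from the lecture.

using Base.Threads

# evaluate the polynomial c[1] + c[2]*x + ... + c[d+1]*x^d at all points,
# one point per iteration, with the iterations distributed over the threads
function evalatpoints(c::Vector{Float64}, points::Vector{Float64})
    values = zeros(length(points))
    @threads for i in 1:length(points)
        values[i] = evalpoly(points[i], c)   # Horner's rule, from Base
    end
    return values
end

c = rand(17)      # coefficients of a random polynomial of degree 16
x = rand(1000)    # one thousand random evaluation points
y = evalatpoints(c, x)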

Parallel Computing with Julia

We apply multithreading in a Jupyter notebook, in a kernel installed with the environment variable JULIA_NUM_THREADS set to 16.

julia> using IJulia
julia> installkernel("Julia (16 threads)", env = Dict("JULIA_NUM_THREADS"=>"16"))

If the kernel was installed correctly, then in a code cell of a Jupyter notebook running this kernel, we can verify with

using Base.Threads
nthreads()

that indeed we have 16 threads available.

A quick way to do parallel programming is to call software libraries which support multithreading. We consider matrix-matrix multiplication, as executed by mul!() of the LinearAlgebra package, which calls the BLAS, the Basic Linear Algebra Subprograms.

There are two issues we must consider:

  1. Choose the size of the matrices large enough.

  2. The time should not include the compilation time.

using LinearAlgebra
n = 8000

The choice of 8000 as a sufficiently large problem can be justified by considering the peakflops.

For the matrix-matrix multiplication, we generate 3 random matrices

A = rand(n, n); B = rand(n,n); C = rand(n,n);

suppressing the output with the semicolons, and we set the number of BLAS threads to 2 with

BLAS.set_num_threads(2)

Then we run with a timer

@time mul!(C, A, B)

which also reports the percentage of time spent on compilation. To exclude this overhead from the measurement, we run again, then double the number of threads and run once more:

@time mul!(C, A, B)
BLAS.set_num_threads(4)
@time mul!(C, A, B)

With four threads instead of two, the time dropped from 9.96 seconds to 6.46 seconds, measured on a laptop with an Intel i9-9880H 2.30GHz CPU running Windows, with Julia 1.7.2.
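As an alternative to manual warm-up runs, and assuming the package BenchmarkTools is installed (it is not used in this lecture), the macro @btime runs the expression several times and reports the minimum time, which excludes the compilation overhead:

using LinearAlgebra, BenchmarkTools

n = 8000
A = rand(n, n); B = rand(n, n); C = rand(n, n);
BLAS.set_num_threads(4)
@btime mul!($C, $A, $B);   # the $ interpolates the global variables into the benchmark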

The function peakflops of the LinearAlgebra package computes the peak flop rate of the computer using double precision gemm!. For the computer used in this experiment, peakflops(8000) returned 1.9e11 (or 190 billion floating-point operations per second), which is more than what peakflops(4000) or peakflops(16000) reported. So the dimension n = 8000 is justified.
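A minimal sketch of that comparison follows; the reported rates are machine dependent.

using LinearAlgebra

# peak flop rates measured with double precision matrix-matrix multiplication
LinearAlgebra.peakflops(4000)
LinearAlgebra.peakflops(8000)
LinearAlgebra.peakflops(16000)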

Parallel Numerical Integration

We can estimate \(\pi\) via the area of a quarter of the unit disk:

\[\displaystyle \int_0^1 \sqrt{1-x^2} dx = \frac{\pi}{4}.\]

A Monte Carlo method proceeds as follows:

  1. Generate random uniformly distributed points with coordinates \((x,y) \in [0,+1] \times [0,+1]\).

  2. Count a success when \(x^2 + y^2 \leq 1\).

By the law of large numbers, the average of the observed successes converges to the expected value or mean, as the number of experiments increases.

The random number generator we apply is very simple:

myrand(x::Int64) = (1103515245x + 12345) % 2^31   # a linear congruential generator

and illustrates the ease with which Julia functions are defined. The function executed by each thread in the experiment is listed below:

"""
function estimatepi(n)

Runs a simple Monte Carlo method
to estimate pi with n samples.
"""
function estimatepi(n)
    r = threadid()
    count = 0
    for i=1:n
        r = myrand(r)
        x = r/2^31
        r = myrand(r)
        y = r/2^31
        count += (x^2 + y^2) <= 1
    end
    return 4*count/n
end
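As a quick serial sanity check, assuming the definitions of myrand and estimatepi above have been executed, a single call with a modest number of samples should return a value close to \(\pi\):

estimatepi(10_000_000)   # runs on one thread, expect a value near 3.14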

The r = threadid() initializes the seed of the random number generator with the identification number of each thread. It is very important that every thread generates a different sequence of random numbers. Each thread writes to a different location in the array of estimates, defined next:

nt = nthreads()
estimates = zeros(nt)

Then the parallel for loop below defines the execution:

import Statistics
timestart = time()
@threads for i=1:nt
    estimates[i] = estimatepi(10_000_000_000 ÷ nt)   # an integer number of samples per thread
end
estpi = Statistics.mean(estimates)
elapsed16 = time() - timestart

The total number of 10 billion random points is divided by the number of threads. Running this same loop with a smaller number of threads allows us to compute the speedup. The speedup of 16 over 8 threads was observed to be 1.5.
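In the notation introduced at the beginning of the lecture, with \(T(8)\) and \(T(16)\) the elapsed times of this loop in a kernel with 8 and with 16 threads respectively, this observation reads

\[ \frac{T(8)}{T(16)} \approx 1.5. \]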