Lecture 39: Parallel Computing in Julia¶
All computers are parallel. We distinguish between three types of parallel computing.
Distributed memory parallel computing.
Shared memory parallel computing.
Acceleration with Graphics Processing Units.
Parallel programs are evaluated by speedup and performance.
Speedup is the serial time over parallel time.
The number of floating-point operations per second (flops) measures the performance.
Julia supports all the three types listed above, but in this lecture we focus on multithreading.
Parallel Symbolic Computing¶
In symbolic computing we compute expressions. We view the expressions as functions to evaluate at many points. The evaluation at different points happens independently, in parallel.
Two of the most commonly used expressions are polynomials and power series. A typical application is interpolation.
A polynomial of degree \(d\) in one variable has \(d+1\) coefficients and is determined uniquely by \(d+1\) function values at distinct points.
Power series are computed via the Fast Fourier Transform.
Parallel Computing with Julia¶
We apply multithreading in a Jupyter notebook,
in a kernel installed with the environment variable set to 16 threads.
julia> using IJulia
julia> installkernel("Julia (16 threads)", env = Dict("JULIA_NUM_THREADS"=>"16"))
If the kernel was installed correctly, then in a code cell of a Jupyter notebook running this kernel, we can verify with
using Base.Threads
nthreads()
that indeed we have 16
threads available.
A quick way to do parallel programming is to call software libraries
which support multithreading.
We consider the matrix-matrix multiplication,
as executed by mul!()
of BLAS,
where BLAS stands for the Basic Linear Algebra Subroutines.
Two issues we must consider.
Choose the size of the matrices large enough.
The time should not include the compilation time.
using LinearAlgebra
n = 8000
The choice of 8000
as a sufficiently large problem
can be justified by considering the peakflops
.
For the matrix-matrix multiplication, we generate 3 random matrices
A = rand(n, n); B = rand(n,n); C = rand(n,n);
suppressing the output with the semicolon and set the number of threads to 2 with
BLAS.set_num_threads(2)
Then we run with a timer
@time mul!(C, A, B)
which reports also the percentage of time spent on compilation. To avoid this overhead, we run again and then double the number of threads and then run again:
@time mul!(C, A, B)
BLAS.set_num_threads(4)
@time mul!(C, A, B)
With four threads, the time dropped from 9.96 seconds to 6.46 seconds, done on an Intel i9-9880H 2.30GHz CPU in a laptop running Windows, with Julia 1.7.2.
The command peakflops
computes the peak flop rate
using double precision gemm!
.
For the computer used in this experiment, peakflops(8000)
returned 1.9e11
(or 190 billion flops), which is more
than what the peakflops(4000)
or peakflops(16000)
reported.
So the dimension n = 8000
is justified.
Parallel Numerical Integration¶
We can estimate \(\pi\), via the area of the unit disk:
A Monte Carlo method proceeds as follows:
Generate random uniformly distributed points with coordinates \((x,y) \in [0,+1] \times [0,+1]\).
We count a success when \(x^2 + y^2 \leq 1\).
By the law of large numbers, the average of the observed successes converges to the expected value or mean, as the number of experiments increases.
The random number we apply is very simple:
myrand(x::Int64) = (1103515245x + 12345) % 2^31
and illustrates the easy at which Julia functions are defined. The function executed by each thread in the experiment is listed below:
"""
function estimatepi(n)
Runs a simple Monte Carlo method
to estimate pi with n samples.
"""
function estimatepi(n)
r = threadid()
count = 0
for i=1:n
r = myrand(r)
x = r/2^31
r = myrand(r)
y = r/2^31
count += (x^2 + y^2) <= 1
end
return 4*count/n
end
The r = thread(id)
initializes the seed for the random number generator
with the identification number of each thread. It is very important that
every thread generates a different sequence of random numbers.
Each thread writes in a different location in an array, defined next:
nt = ntreads()
estimates = zeros(nt)
Then the parallel for loop below defines the execution:
import Statistics
timestart = time()
@threads for i=1:nt
estimates[i] = estimatepi(10_000_000_000/nt)
end
estpi = Statistics.mean(estimates)
elapsed16 = time() - timestart
The total number of 10 billion random points is divided by the
number of threads. Running this same loop for a smaller number
of threads allows to compute the speedup.
The speedup of 16 over 8 threads was observed to be 1.5
.