Tasking with Julia¶
Julia is a new programming language for scientific computing designed for performance. The tasking in Julia is inspired by parallel programming systems like Cilk, Intel Threading Building Blocks, and Go.
This lecture is based on a blog post of 23 July 2019,
https://julialang.org/blog/2019/07/multithreading
by Jeff Bezanson, Jameson Nash, and Kiran Pamnany,
written as an early preview of Julia version 1.3.0.
Tasks are units of work, mapped to threads. The next sections mirror the previous lecture on Tasking with OpenMP.
Parallel Recursive Functions¶
The sequence of Fibonacci numbers \(F_n\) is defined by \(F_0 = 0\), \(F_1 = 1\), and \(F_n = F_{n-1} + F_{n-2}\) for \(n \geq 2\).
This leads to a natural recursive function.
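For instance, with the convention that \(F_0 = 0\) and \(F_1 = 1\), the recurrence translates directly into Julia:

```julia
# Direct translation of the Fibonacci recurrence F_n = F_{n-1} + F_{n-2}.
function fib(n::Int)
    if n < 2
        return n   # F_0 = 0 and F_1 = 1
    end
    return fib(n-1) + fib(n-2)
end
```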
The recursion generates many function calls. While an inefficient way to compute \(F_n\), this recursion serves as a pattern for parallel execution.
The Fibonacci function with tasking demonstrates the generation of a large number of tasks with one thread. No parallelism will result from this example.
But it is instructive to introduce basic task constructs.
With
t = @spawn F()
we start a task t to compute F(), for some function F(). The
fetch(t)
waits for t to complete and gets its return value.
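A minimal sketch, where the function square stands in for some F():

```julia
import Base.Threads: @spawn

square(x) = x*x        # stands in for some function F()

t = @spawn square(7)   # start a task t to compute square(7)
println(fetch(t))      # wait for t to complete and get its return value
```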
In the multitasked program to compute the Fibonacci numbers, the number \(n\) for the \(n\)-th Fibonacci number is taken from the command line. The Julia program below prints all command line arguments.
print(PROGRAM_FILE, " has ", length(ARGS))
println(" arguments.")
println("The command line arguments :")
for x in ARGS
    println(x)
end
If the file showthreads.jl
contains
using Base.Threads
nbt = nthreads()
println("The number of threads : ", nbt)
then run it by typing
JULIA_NUM_THREADS=8 julia showthreads.jl
at the command prompt. Alternatively, type
julia -t 8 showthreads.jl
to run the program with 8 threads.
The recursive parallel computation of the Fibonacci numbers is then defined in the program below:
import Base.Threads.@spawn
function fib(n::Int)
    if n < 2
        return n
    end
    t = @spawn fib(n-2)
    return fib(n-1) + fetch(t)
end

if length(ARGS) > 0
    nbr = parse(Int64, ARGS[1])
    println(fib(nbr))
else
    println(fib(10))
end
If the program is saved in the file fibmt.jl, then we run it with 8 threads, typing
JULIA_NUM_THREADS=8 julia fibmt.jl 10
or alternatively
julia -t 8 fibmt.jl 10
at the command prompt to compute the 10-th Fibonacci number with tasks mapped to 8 threads.
The recursive function fib illustrates the starting of a task and the synchronization with the sibling task:
t = @spawn fib(n-2)
starts a task to compute fib(n-2), and
fetch(t)
waits for t to complete and gets its return value.
There cannot be any speedup, because the only computation, the +, happens after the synchronization.
Parallel Recursive Quadrature¶
The recursive parallel computation of the Fibonacci numbers serves as a pattern to compute integrals by recursively dividing the integration interval. The setup is identical to that of the section on Parallel Recursive Quadrature in the Tasking with OpenMP lecture.
The recursive application of the composite Trapezoidal rule is defined in the Julia function rectraprule.
function rectraprule(level::Int64, depth::Int64,
                     f::Function, a::Float64,
                     b::Float64, n::Int64)
    if level == depth
        return traprule(f, a, b, n)
    else
        middle = a + (b-a)/2
        t = @spawn rectraprule(level+1, depth,
                               f, a, middle, n)
        return rectraprule(level+1, depth,
                           f, middle, b, n) + fetch(t)
    end
end
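The helper traprule, applied at the deepest level of the recursion, is not listed above. A minimal sketch of a composite Trapezoidal rule, assuming only the signature matched by the call in rectraprule:

```julia
# A sketch of the composite Trapezoidal rule on [a, b] with n subintervals,
# matching the call traprule(f, a, b, n); the body here is an assumption.
function traprule(f::Function, a::Float64, b::Float64, n::Int64)
    h = (b - a)/n
    result = (f(a) + f(b))/2   # endpoints carry weight 1/2
    for i in 1:n-1
        result += f(a + i*h)   # interior points carry weight 1
    end
    return h*result
end
```

With f = exp on \([0,1]\), the exact value of the integral is \(e - 1 \approx 1.7182818284590451\).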
Using a depth of recursion of 4, the output of a couple of runs is shown below:
$ time JULIA_NUM_THREADS=2 julia traprulerecmt.jl 4
1.7182818284590451e+00
1.7182818292271964e+00 error : 7.68e-10
real 0m5.207s
user 0m9.543s
sys 0m0.734s
$ time JULIA_NUM_THREADS=4 julia traprulerecmt.jl 4
1.7182818284590451e+00
1.7182818292271964e+00 error : 7.68e-10
real 0m3.120s
user 0m9.872s
sys 0m0.727s
$ time JULIA_NUM_THREADS=8 julia traprulerecmt.jl 4
1.7182818284590451e+00
1.7182818292271964e+00 error : 7.68e-10
real 0m1.985s
user 0m10.617s
sys 0m0.735s
$
Observe the decrease of the wall clock time as the number of threads doubles.
On a Windows computer, replace the time by Measure-Command and type
Measure-Command { julia -t 8 traprulerecmt.jl }
at the prompt in a PowerShell window.
Parallel Merge Sort¶
Merge sort works by divide and conquer, recursively as:
If no or one element, then return.
Split in two equal halves.
Sort the first half.
Sort the second half.
Merge the sorted halves.
The two above sort statements are recursive.
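For reference, the five steps above translate into a plain sequential function, sketched below; the parallel psort! that follows works in place instead of returning a new array.

```julia
# A sequential sketch of the merge sort steps above,
# returning a new sorted array.
function mergesort(v::Vector{Int})
    if length(v) <= 1                  # no or one element: return
        return v
    end
    mid = length(v) >>> 1              # split in two halves
    left = mergesort(v[1:mid])         # sort the first half
    right = mergesort(v[mid+1:end])    # sort the second half
    result = Int[]                     # merge the sorted halves
    i, j = 1, 1
    while i <= length(left) && j <= length(right)
        if left[i] <= right[j]
            push!(result, left[i]); i += 1
        else
            push!(result, right[j]); j += 1
        end
    end
    append!(result, left[i:end])       # append what remains
    append!(result, right[j:end])
    return result
end
```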
The sort algorithm works in place, modifying the input without returning a new array. Instead of fetch, we use wait: the wait(t) waits for the task t to finish.
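A minimal sketch, where the helper fill_squares! is a made-up name for a function that modifies its argument in place:

```julia
import Base.Threads: @spawn

# Hypothetical in-place helper: overwrite v with 1, 4, 9, ...
function fill_squares!(v::Vector{Int})
    for i in eachindex(v)
        v[i] = i*i
    end
end

a = zeros(Int, 4)
t = @spawn fill_squares!(a)   # start the task
wait(t)                       # block until t finishes, nothing is fetched
println(a)                    # a was modified in place
```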
The Julia function psort!
is defined below.
"""
Sorts the elements of v in place, from lo to hi.
"""
function psort!(v, lo::Int=1, hi::Int=length(v))
    if lo >= hi
        return v
    end
    if hi - lo < 100000    # small range: no multithreading
        sort!(view(v, lo:hi), alg = MergeSort)
        return v
    end
    mid = (lo+hi)>>>1      # find the midpoint
    # task to sort the first half starts
    half = @spawn psort!(v, lo, mid)
    # runs with the current call below
    psort!(v, mid+1, hi)
    # wait for the lower half to finish
    wait(half)
    temp = v[lo:mid]       # workspace for merging
    i, k, j = 1, lo, mid+1 # merge the two sorted sub-arrays
    @inbounds while k < j <= hi
        if v[j] < temp[i]
            v[k] = v[j]
            j += 1
        else
            v[k] = temp[i]
            i += 1
        end
        k += 1
    end
    @inbounds while k < j
        v[k] = temp[i]     # copy what remains of the first half
        k += 1
        i += 1
    end
    return v
end
The @inbounds macro skips the checking of the index bounds when accessing array elements.
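A small sketch (the function name sumsq is made up): the annotation is safe here because the loop index never leaves eachindex(v).

```julia
# Sum of squares without bounds checks; @inbounds is safe because
# i always stays within eachindex(v).
function sumsq(v::Vector{Float64})
    s = 0.0
    @inbounds for i in eachindex(v)
        s += v[i]*v[i]
    end
    return s
end
```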
The main function of the program, which times the Julia code properly with @time, is listed below.
"""
Calls psort! once before the timing
to avoid measuring compilation overhead.
"""
function main()
    a = rand(100)
    b = copy(a)
    psort!(b)
    a = rand(20000000)
    b = copy(a)
    @time psort!(b)
end
Runs in a Bourne shell are listed below.
$ for n in 1 2 4 8; do JULIA_NUM_THREADS=$n julia mergesortmt.jl; done
2.219275 seconds (3.31 k allocations: 686.950 MiB, 3.34% gc time)
1.439491 seconds (3.59 k allocations: 686.959 MiB, 6.41% gc time)
0.920875 seconds (3.63 k allocations: 686.963 MiB, 3.90% gc time)
0.625733 seconds (3.73 k allocations: 686.969 MiB, 4.45% gc time)
$
Compare to the wall clock time:
$ time JULIA_NUM_THREADS=8 julia mergesortmt.jl
0.618549 seconds (3.72 k allocations: 686.969 MiB, 4.78% gc time)
real 0m1.220s
user 0m3.579s
sys 0m1.015s
$
Basic Linear Algebra Subprograms¶
The builtin LinearAlgebra package of Julia offers access to the BLAS (Basic Linear Algebra Subprograms), which allow for multithreaded computations. This section illustrates the application of multithreading to solving linear algebra problems. The in-place matrix-matrix multiplication is provided via mul!, as illustrated in the interactive session below.
julia> using LinearAlgebra
julia> A=[1.0 2.0; 3.0 4.0]; B=[1.0 1.0; 1.0 1.0];
julia> C = similar(B); mul!(C, A, B)
2×2 Array{Float64,2}:
3.0 3.0
7.0 7.0
Basic Linear Algebra Subprograms (BLAS) specifies common elementary linear algebra operations.
help?> BLAS.set_num_threads
set_num_threads(n)
Set the number of threads the BLAS library should use.
Setting the number of BLAS threads makes the matrix multiplication run in parallel.
Consider the program matmatmulmt.jl
listed below.
using LinearAlgebra

if length(ARGS) < 2
    println("use as")
    print(" julia ", PROGRAM_FILE)
    println(" dimension nthreads")
else
    n = parse(Int, ARGS[1])
    p = parse(Int, ARGS[2])
    BLAS.set_num_threads(p)
    A = rand(n, n)
    B = rand(n, n)
    C = similar(B)
    @time mul!(C, A, B)
end
The output of runs is shown next:
$ julia matmatmulmt.jl 8000 1
20.823673 seconds (2.70 M allocations: 130.252 MiB)
$ julia matmatmulmt.jl 8000 2
11.338446 seconds (2.70 M allocations: 130.252 MiB)
$ julia matmatmulmt.jl 8000 4
6.242092 seconds (2.70 M allocations: 130.252 MiB)
$ julia matmatmulmt.jl 8000 8
3.853406 seconds (2.70 M allocations: 130.252 MiB)
$ julia matmatmulmt.jl 8000 16
2.487637 seconds (2.70 M allocations: 130.252 MiB)
$ julia matmatmulmt.jl 8000 32
1.864454 seconds (2.70 M allocations: 130.252 MiB)
$
The peak flops performance depends on the size of the problem. The function peakflops computes the peak flop rate of the computer by timing a double precision matrix-matrix multiplication with gemm!.
julia> using LinearAlgebra
julia> peakflops(8000)
3.331289611013868e11
julia> peakflops(16000)
3.475269847112081e11
julia> peakflops(4000)
3.130204729573054e11
The size of the problem needs to be large enough to fully occupy the available computing resources.
Exercises¶
Execute the recursive trapezoidal rule for different numbers of evaluations and increasing depths of recursion. For which values do you observe the best speedups?
Run the peakflops on your computer. For which dimension do you see the highest value?
Compute the number of flops and relate this to the specifications of your computer.