Tasking with Julia

Julia is a new programming language for scientific computing designed for performance. The tasking in Julia is inspired by parallel programming systems like Cilk, Intel Threading Building Blocks, and Go.

This lecture is based on the blog post of 23 July 2019, https://julialang.org/blog/2019/07/multithreading, by Jeff Bezanson, Jameson Nash, and Kiran Pamnany, written as an early preview of Julia version 1.3.0.

Tasks are units of work, mapped to threads. The next sections mirror the previous lecture on Tasking with OpenMP.

Parallel Recursive Functions

The sequence of Fibonacci numbers \(F_n\) is defined as

\[F_0 = 0, \quad F_1 = 1, \quad \mbox{and for } n>1: F_n = F_{n-1} + F_{n-2}.\]

This leads to a natural recursive function.
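A direct translation of the definition into a sequential Julia function is a minimal sketch of that recursion (the multitasked version appears later in this lecture):

```julia
# Sequential recursive Fibonacci, straight from the definition.
function fib(n::Int)
    if n < 2
        return n      # F(0) = 0 and F(1) = 1
    end
    return fib(n-1) + fib(n-2)
end

println(fib(10))
```

For example, fib(10) returns 55.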

The recursion generates many function calls. While inefficient as a way to compute \(F_n\), this recursion serves as a pattern for parallel execution.

The Fibonacci function with tasking demonstrates the generation of a large number of tasks with one thread. No parallelism will result from this example.

But it is instructive to introduce basic task constructs.

  • With t = @spawn F() we start a task t to compute F(), for some function F().

  • The fetch(t) waits for t to complete and gets its return value.

In the multitasked program to compute Fibonacci numbers, the index \(n\) of the \(n\)-th Fibonacci number is read from the command line. The Julia program below prints all command line arguments.

print(PROGRAM_FILE, " has ", length(ARGS))
println(" arguments.")
println("The command line arguments :")
for x in ARGS
    println(x)
end

If the file showthreads.jl contains

using Base.Threads

nbt = nthreads()
println("The number of threads : ", nbt)

then run it by typing

JULIA_NUM_THREADS=8 julia showthreads.jl

at the command prompt. Alternatively, type

julia -t 8 showthreads.jl

to run the program with 8 threads.

The recursive parallel computation of the Fibonacci numbers is then defined in the program below:

import Base.Threads.@spawn

function fib(n::Int)
    if n < 2
        return n
    end
    t = @spawn fib(n-2)
    return fib(n-1) + fetch(t)
end

if length(ARGS) > 0
    nbr = parse(Int64, ARGS[1])
    println(fib(nbr))
else
    println(fib(10))
end

If the program is saved in the file fibmt.jl, then we run it with 8 threads, typing

JULIA_NUM_THREADS=8 julia fibmt.jl 10

or alternatively

julia -t 8 fibmt.jl 10

at the command prompt to compute the 10-th Fibonacci number with tasks mapped to 8 threads.

The recursive function fib illustrates the starting of a task and the synchronization with the sibling task:

  • t = @spawn fib(n-2) starts a task to compute fib(n-2)

  • fetch(t) waits for t to complete and gets its return value.

No speedup can result, because the only computation, the +, happens after the synchronization.

Parallel Recursive Quadrature

The recursive parallel computation of the Fibonacci numbers serves as a pattern to compute integrals by recursively dividing the integration interval. The setup is identical to that of the section on Parallel Recursive Quadrature in the Tasking with OpenMP lecture.

The recursive application of the composite Trapezoidal rule is defined in the Julia function rectraprule.

import Base.Threads.@spawn

function rectraprule(level::Int64, depth::Int64,
                     f::Function, a::Float64,
                     b::Float64, n::Int64)
    if level == depth
        return traprule(f, a, b, n)
    else
        middle = a + (b - a)/2   # midpoint of [a, b]

        t = @spawn rectraprule(level+1, depth,
                               f, a, middle, n)
        return rectraprule(level+1, depth,
                           f, middle, b, n) + fetch(t)
    end
end
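The function traprule, called at the deepest level of the recursion, is not listed above. A minimal sketch of a composite trapezoidal rule with n subintervals, consistent with the call in rectraprule, could be:

```julia
# Composite trapezoidal rule for f on [a, b] with n subintervals,
# a sketch of the traprule assumed by rectraprule.
function traprule(f::Function, a::Float64, b::Float64, n::Int64)
    h = (b - a)/n
    result = (f(a) + f(b))/2
    for i in 1:n-1
        result += f(a + i*h)
    end
    return h*result
end
```

For instance, traprule(exp, 0.0, 1.0, 1000) approximates \(e - 1 \approx 1.7182818\), the value shown in the runs below.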

Using a depth of recursion of 4, the output of a couple of runs is shown below:

$ time JULIA_NUM_THREADS=2 julia traprulerecmt.jl 4
1.7182818284590451e+00
1.7182818292271964e+00   error : 7.68e-10

real    0m5.207s
user    0m9.543s
sys     0m0.734s
$ time JULIA_NUM_THREADS=4 julia traprulerecmt.jl 4
1.7182818284590451e+00
1.7182818292271964e+00   error : 7.68e-10

real    0m3.120s
user    0m9.872s
sys     0m0.727s
$ time JULIA_NUM_THREADS=8 julia traprulerecmt.jl 4
1.7182818284590451e+00
1.7182818292271964e+00   error : 7.68e-10

real    0m1.985s
user    0m10.617s
sys     0m0.735s
$

Observe the decrease of the wall clock time as the number of threads doubles.

On a Windows computer, replace time by Measure-Command and type

Measure-Command { julia -t 8 traprulerecmt.jl }

at the prompt in a PowerShell window.

Parallel Merge Sort

Merge sort works by divide and conquer, recursively as:

  1. If no or one element, then return.

  2. Split in two equal halves.

  3. Sort the first half.

  4. Sort the second half.

  5. Merge the sorted halves.

The two above sort statements are recursive.

The sort algorithm works in place, modifying its input rather than returning a new array. Instead of fetch, we use wait: wait(t) waits for task t to finish, discarding any return value.
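A minimal illustration of @spawn combined with wait (the names shared and t are hypothetical; the task writes into preallocated storage instead of returning a value):

```julia
import Base.Threads.@spawn

# The spawned task and the main task fill disjoint parts
# of a shared array; wait only synchronizes, it fetches nothing.
shared = zeros(Int, 10)
t = @spawn (shared[1:5] .= 1)
shared[6:10] .= 2      # runs concurrently with the task
wait(t)                # block until the task finishes
println(shared)
```

Because the two tasks touch disjoint index ranges, the result is deterministic regardless of the interleaving.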

The Julia function psort! is defined below.

"""
Sorts the elements of v in place, from position lo to hi.
"""
function psort!(v, lo::Int=1, hi::Int=length(v))
    if lo >= hi
        return v
    end
    if hi - lo < 100000         # no multithreading
        sort!(view(v, lo:hi), alg = MergeSort)
        return v
    end

    mid = (lo+hi)>>>1           # find the midpoint

    # task to sort the first half starts
    half = @spawn psort!(v, lo, mid)

    # runs with the current call below
    psort!(v, mid+1, hi)

    # wait for the lower half to finish
    wait(half)

    temp = v[lo:mid]            # workspace for merging
    i, k, j = 1, lo, mid+1      # merge the two sorted sub-arrays

    @inbounds while k < j <= hi
         if v[j] < temp[i]
             v[k] = v[j]
             j += 1
         else
             v[k] = temp[i]
             i += 1
         end
         k += 1
    end
    @inbounds while k < j
         v[k] = temp[i]
         k += 1
         i += 1
    end

    return v
end

The @inbounds macro skips the checking of index bounds when accessing array elements.
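For instance, in a loop where the index range is known to be valid, @inbounds removes those checks (a small sketch, not part of the lecture's program):

```julia
# eachindex guarantees valid indices, so eliding the
# bounds checks with @inbounds is safe in this loop.
function sumall(v::Vector{Float64})
    s = 0.0
    @inbounds for i in eachindex(v)
        s += v[i]
    end
    return s
end
```

For example, sumall([1.0, 2.0, 3.0]) returns 6.0.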

The main function of the program, which times the Julia code properly with @time after a warmup run, is listed below.

"""
Calls psort! once before the timing
to avoid measuring compilation overhead.
"""
function main()
    a = rand(100)
    b = copy(a)
    psort!(b)
    a = rand(20000000)
    b = copy(a)
    @time psort!(b)
end

Runs in a Bourne shell are listed below.

$ for n in 1 2 4 8; do JULIA_NUM_THREADS=$n julia mergesortmt.jl; done
  2.219275 seconds (3.31 k allocations: 686.950 MiB, 3.34% gc time)
  1.439491 seconds (3.59 k allocations: 686.959 MiB, 6.41% gc time)
  0.920875 seconds (3.63 k allocations: 686.963 MiB, 3.90% gc time)
  0.625733 seconds (3.73 k allocations: 686.969 MiB, 4.45% gc time)
$

Compare to the wall clock time:

$ time JULIA_NUM_THREADS=8 julia mergesortmt.jl
  0.618549 seconds (3.72 k allocations: 686.969 MiB, 4.78% gc time)

real    0m1.220s
user    0m3.579s
sys     0m1.015s
$

Basic Linear Algebra Subprograms

The built-in LinearAlgebra package of Julia offers access to the BLAS (Basic Linear Algebra Subprograms), which allow for multithreaded computations. This section illustrates the application of multithreading to solving linear algebra problems.

The in-place matrix-matrix multiplication is provided by mul!, as illustrated in the interactive session below.

julia> using LinearAlgebra

julia> A=[1.0 2.0; 3.0 4.0]; B=[1.0 1.0; 1.0 1.0];

julia> C = similar(B); mul!(C, A, B)
2×2 Array{Float64,2}:
 3.0  3.0
 7.0  7.0
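Besides mul!, the wrapped BLAS routines can be called directly. For the same matrices, gemm! computes C := alpha*A*B + beta*C in place (a sketch; the character arguments 'N' request no transposition):

```julia
using LinearAlgebra

# gemm! overwrites C with alpha*A*B + beta*C.
A = [1.0 2.0; 3.0 4.0]
B = [1.0 1.0; 1.0 1.0]
C = zeros(2, 2)
BLAS.gemm!('N', 'N', 1.0, A, B, 0.0, C)
```

This leaves the same product, [3.0 3.0; 7.0 7.0], in C as the call to mul! above.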

Basic Linear Algebra Subprograms (BLAS) specifies common elementary linear algebra operations.

help?> BLAS.set_num_threads
  set_num_threads(n)

  Set the number of threads the BLAS library should use.

Setting the number of threads enables a parallel matrix multiplication. Consider the program matmatmulmt.jl listed below.

using LinearAlgebra

if length(ARGS) < 2
    println("use as")
    print("        julia ", PROGRAM_FILE)
    println(" dimension nthreads")
else
    n = parse(Int, ARGS[1])
    p = parse(Int, ARGS[2])

    BLAS.set_num_threads(p)
    A = rand(n, n)
    B = rand(n, n)
    C = similar(B)
    @time mul!(C, A, B)
end

The output of runs is shown next:

$ julia matmatmulmt.jl 8000 1
 20.823673 seconds (2.70 M allocations: 130.252 MiB)

$ julia matmatmulmt.jl 8000 2
 11.338446 seconds (2.70 M allocations: 130.252 MiB)

$ julia matmatmulmt.jl 8000 4
  6.242092 seconds (2.70 M allocations: 130.252 MiB)

$ julia matmatmulmt.jl 8000 8
  3.853406 seconds (2.70 M allocations: 130.252 MiB)

$ julia matmatmulmt.jl 8000 16
  2.487637 seconds (2.70 M allocations: 130.252 MiB)

$ julia matmatmulmt.jl 8000 32
  1.864454 seconds (2.70 M allocations: 130.252 MiB)
$

The peak flops performance depends on the size of the problem. The function peakflops computes the peak flop rate of the computer by using the double precision gemm!.

julia> using LinearAlgebra

julia> peakflops(8000)
3.331289611013868e11

julia> peakflops(16000)
3.475269847112081e11

julia> peakflops(4000)
3.130204729573054e11

The size of the problem needs to be large enough to fully occupy the available computing resources.
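The returned flop rates can be related to a hand count: one matrix-matrix product of dimension \(n\) costs about \(2n^3\) floating-point operations, so timing mul! gives an estimate of the flop rate (a sketch with a hypothetical dimension):

```julia
using LinearAlgebra

# Estimate the flop rate of one matrix-matrix product.
n = 1000
A = rand(n, n); B = rand(n, n); C = similar(A)
mul!(C, A, B)                    # warmup run triggers compilation
elapsed = @elapsed mul!(C, A, B)
flops = 2.0*n^3/elapsed          # about 2n^3 operations in gemm
println("estimated flop rate : ", flops)
```

The estimate ignores lower order terms and will fluctuate between runs, but should be of the same magnitude as the value returned by peakflops(n).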

Exercises

  1. Execute the recursive trapezoidal rule for different numbers of function evaluations and increasing depths of recursion.

    For which values do you observe the best speedups?

  2. Run peakflops on your computer.

    For which dimension do you see the highest value?

    Compute the number of flops and relate this to the specifications of your computer.