MCS 572: Introduction to Supercomputing
The goal of the course is to study parallel algorithms and
their implementation on supercomputers.
The course is evaluated mainly through homework and
computer projects. Exams are scheduled as preparation for the
computational science prelim.
The recommended (not required) textbooks are
- "Parallel Programming. Techniques
and Applications Using Networked Workstations and Parallel
Computers"
by Barry Wilkinson and Michael Allen, Pearson Prentice Hall,
second edition, 2005.
- "Programming Massively Parallel Processors.
A Hands-on Approach"
by David B. Kirk and Wen-mei W. Hwu,
Elsevier/Morgan Kaufmann Publishers, 2010; fourth edition, 2023,
with Izzat El Hajj as third author.
Lecture notes are available; they are still a work in progress.
Slides for the lectures will be posted below.
0. Introduction
1. Distributed Memory Parallel Computing
2. Shared Memory Parallel Computing
- L-10 09/18/24: Introduction to OpenMP.
To write parallel programs for shared memory computers
we can use OpenMP; a small example follows this item.
Slides.
A Julia program:
mtcomptrap.jl.
Simple C programs:
hello_openmp0.c,
hello_openmp1.c,
hello_openmp2.c,
comptrap.c, and
comptrap_omp.c.
Prerecorded lecture:
youtube link.
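To accompany L-10, below is a minimal sketch of the composite
trapezoidal rule with an OpenMP reduction; the integrand and the
interval are chosen only for illustration and need not match
comptrap_omp.c.

    #include <stdio.h>
    #include <math.h>
    #include <omp.h>

    /* the integrand, an assumption for this sketch */
    double f ( double x ) { return exp(-x*x); }

    /* composite trapezoidal rule on [a,b] with n subintervals;
     * every thread accumulates into a private copy of sum,
     * the copies are added at the end of the loop */
    double comptrap ( double a, double b, int n )
    {
       double h = (b - a)/n;
       double sum = (f(a) + f(b))/2.0;
       int i;

       #pragma omp parallel for reduction(+:sum)
       for(i = 1; i < n; i++) sum += f(a + i*h);

       return h*sum;
    }

    int main ( void )
    {
       printf("approximation : %.15f\n", comptrap(0.0, 1.0, 1000000));
       return 0;
    }

Compile with gcc -fopenmp and set the number of threads via the
environment variable OMP_NUM_THREADS.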
- L-11 09/20/24: The Work Crew Model.
We define dynamic workload assignment for parallel
shared memory computers, using OpenMP, followed by
a brief introduction to pthreads and Julia;
a job queue sketch follows this item.
Slides.
Simple C programs:
jobqueue.h,
jobqueue.c,
test_jobqueue.c,
run_jobqueue_omp.c,
hello_pthreads.c, and
process_jobqueue.c.
Some Julia programs:
hellothreads.jl and
workcrew.jl.
Prerecorded lecture:
youtube link.
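To illustrate the dynamic assignment in the work crew model, here is
a much simplified sketch in C with OpenMP, not the organization of
the lecture's jobqueue.c: a shared counter, guarded by a critical
section, hands out the next job to whichever thread is idle. The
number of jobs and the job body are hypothetical.

    #include <stdio.h>
    #include <omp.h>

    #define NJOBS 20 /* hypothetical size of the job queue */

    /* placeholder for the actual work of one job */
    void do_job ( int job )
    {
       printf("thread %d does job %d\n", omp_get_thread_num(), job);
    }

    int main ( void )
    {
       int nextjob = 0; /* shared index of the next available job */

       #pragma omp parallel
       while(1)
       {
          int myjob;
          #pragma omp critical
          myjob = nextjob++;        /* grab the next job atomically */
          if(myjob >= NJOBS) break; /* queue is empty, leave the crew */
          do_job(myjob);
       }
       return 0;
    }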
- L-12 09/23/24: Tasking with OpenMP.
With OpenMP tasks, recursive functions can run in parallel.
Bernstein's conditions are applied for a dependency analysis.
We end with a parallel blocked matrix computation;
a tasking sketch follows this item.
Slides.
Simple C programs:
fib.c,
fibomp.c,
comptraprec.c,
comptraprecomp.c, and
matmulomp.c.
Prerecorded lecture:
youtube link.
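In the spirit of fibomp.c, a minimal tasking sketch (the cutoff value
is an assumption to limit task overhead): each recursive call becomes
a task and taskwait joins the two branches before the sum.

    #include <stdio.h>
    #include <omp.h>

    /* naive recursive Fibonacci with OpenMP tasks;
     * below the cutoff we recurse serially,
     * so the tasks do not become too fine grained */
    long fib ( int n )
    {
       long x, y;
       if(n < 2) return n;
       if(n < 20) return fib(n-1) + fib(n-2); /* serial cutoff */

       #pragma omp task shared(x)
       x = fib(n-1);
       #pragma omp task shared(y)
       y = fib(n-2);
       #pragma omp taskwait /* wait for both branches */

       return x + y;
    }

    int main ( void )
    {
       long result;
       #pragma omp parallel
       {
          #pragma omp single /* one thread spawns the root tasks */
          result = fib(40);
       }
       printf("fib(40) = %ld\n", result);
       return 0;
    }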
- L-13 09/25/24: Tasking with Julia.
We redo the last lecture, but now with Julia.
Slides.
A Slurm script to launch an MPI job:
mpi_hello_world.slurm.
Simple Julia programs:
showargs.jl,
showthreads.jl,
fibmt.jl,
traprulerec.jl,
traprulerecmt.jl,
mergesortmt.jl, and
matmatmulmt.jl.
Prerecorded lecture:
youtube link.
- L-14 09/27/24: Evaluating Parallel Performance.
We look at metrics, isoefficiency, critical path analysis,
and the roofline model
to evaluate the performance of parallel algorithms;
a worked example follows this item.
Slides.
Prerecorded lecture:
youtube link.
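As a worked example of one of the metrics: by Amdahl's law, if a
fraction f of the work is inherently sequential, then the speedup on
p processors is bounded by S(p) <= 1/(f + (1-f)/p). With f = 0.1 and
p = 8, the bound is 1/(0.1 + 0.9/8), about 4.7, no matter how fast
the parallel part runs.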
- L-15 09/30/24: Work Stealing.
Work stealing can be viewed as a hybrid between static and
dynamic work assignment, as illustrated by a Julia program.
The Threading Building Blocks (TBB) provide a high level
programming model for C++ programmers.
A simplified sketch in C follows this item.
Slides.
A simple Julia program:
worksteal.jl.
A Python script with Numba:
montecarlo4pi.py.
A Python script with Parsl:
parslmapreduce.py.
C++ programs:
hello_task_group.cpp,
powers_serial.cpp,
powers_tbb.cpp, and
parsum_tbb.cpp.
Prerecorded lecture:
youtube link.
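Below is a much simplified work stealing sketch in C with OpenMP,
only to illustrate the hybrid idea and not the lecture's Julia
formulation: jobs are first dealt out statically into per-thread
queues, and a thread whose own queue runs empty steals from the
queues of the others. Real schedulers use lock-free deques instead
of one lock per queue.

    #include <stdio.h>
    #include <omp.h>

    #define NJOBS 64 /* hypothetical total number of jobs */
    #define MAXTHREADS 64

    int head[MAXTHREADS], tail[MAXTHREADS]; /* queue t: head[t]..tail[t]-1 */
    omp_lock_t lock[MAXTHREADS];

    void do_job ( int job ) /* placeholder for the actual work */
    {
       printf("thread %d does job %d\n", omp_get_thread_num(), job);
    }

    int main ( void )
    {
       int p = omp_get_max_threads();
       if(p > MAXTHREADS) p = MAXTHREADS;

       for(int t = 0; t < p; t++) /* static phase: deal the jobs out */
       {
          head[t] = t*NJOBS/p;
          tail[t] = (t+1)*NJOBS/p;
          omp_init_lock(&lock[t]);
       }
       #pragma omp parallel num_threads(p)
       {
          int me = omp_get_thread_num();
          int victim = me; /* a thread works on its own queue first */
          while(1)
          {
             int job = -1;
             omp_set_lock(&lock[victim]);
             if(head[victim] < tail[victim]) job = head[victim]++;
             omp_unset_lock(&lock[victim]);
             if(job >= 0)
                do_job(job);
             else /* dynamic phase: steal from the next queue */
             {
                victim = (victim + 1) % p;
                if(victim == me) break; /* all queues are empty */
             }
          }
       }
       return 0;
    }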
3. Acceleration with Graphics Processing Units
4. Pipelining and Synchronized Computations
- L-25 10/23/24: Pipelined Computations.
As in the manufacturing of cars, arranging the stages of a
computation along a pipeline speeds up the computation;
a pipelined sum sketch follows this item.
Slides.
A Julia program:
leibnizunrolled.jl.
Some simple C programs:
pipe_ring.c and
pipe_sum.c.
Prerecorded lecture:
youtube link.
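A minimal sketch of a sum computed along a pipeline with MPI, in the
spirit of pipe_sum.c though not necessarily its code: each stage
receives the partial sum from its predecessor, adds its own
contribution, and sends the result downstream.

    #include <stdio.h>
    #include <mpi.h>

    /* each process adds its rank+1 to a running sum that travels
     * through the pipeline from process 0 to process p-1 */
    int main ( int argc, char *argv[] )
    {
       int myid, p, sum = 0;

       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &myid);
       MPI_Comm_size(MPI_COMM_WORLD, &p);

       if(myid > 0) /* receive the partial sum from the predecessor */
          MPI_Recv(&sum, 1, MPI_INT, myid-1, 0,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);

       sum += myid + 1; /* this stage adds its own contribution */

       if(myid < p-1) /* pass the partial sum to the successor */
          MPI_Send(&sum, 1, MPI_INT, myid+1, 0, MPI_COMM_WORLD);
       else
          printf("the sum 1 + 2 + ... + %d is %d\n", p, sum);

       MPI_Finalize();
       return 0;
    }

Run with, for instance, mpirun -np 4; the last process in the
pipeline prints the total.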
- L-26 10/25/24: Pipelined Sorting and Sieving.
We illustrate pipelined sorting algorithms with MPI.
Type 2 pipelines are introduced with sieving for primes.
Type 3 pipelines are illustrated by triangular linear systems.
A pipelined sieve sketch follows this item.
Slides.
One simple C program:
pipe_sort.c.
Prerecorded lecture:
youtube link.
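To illustrate a type 2 pipeline, below is a hedged sketch of sieving
for primes with MPI, an illustration rather than the lecture's code:
every worker keeps the first number it receives as its prime and
forwards only the nonmultiples; the sentinel -1 shuts the pipeline
down. Candidates that survive past the last process are silently
dropped, so enough processes are needed for the chosen limit.

    #include <stdio.h>
    #include <mpi.h>

    #define LIMIT 100 /* sieve the numbers 2, 3, ..., LIMIT */

    int main ( int argc, char *argv[] )
    {
       int myid, p, number, myprime = 0;

       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &myid);
       MPI_Comm_size(MPI_COMM_WORLD, &p);

       if(myid == 0) /* the first stage generates the numbers */
       {
          for(number = 2; number <= LIMIT; number++)
             MPI_Send(&number, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
          number = -1; /* the sentinel ends the stream */
          MPI_Send(&number, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
       }
       else
       {
          while(1)
          {
             MPI_Recv(&number, 1, MPI_INT, myid-1, 0,
                      MPI_COMM_WORLD, MPI_STATUS_IGNORE);
             if(number == -1) break;
             if(myprime == 0)
             {
                myprime = number; /* first number received is prime */
                printf("process %d : prime %d\n", myid, myprime);
             }
             else if(number % myprime != 0 && myid < p-1)
                MPI_Send(&number, 1, MPI_INT, myid+1, 0, MPI_COMM_WORLD);
          }
          if(myid < p-1) /* pass the sentinel on */
             MPI_Send(&number, 1, MPI_INT, myid+1, 0, MPI_COMM_WORLD);
       }
       MPI_Finalize();
       return 0;
    }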
- L-27 10/28/24: Solving Triangular Systems.
We consider solving triangular linear systems,
first with a high level OpenMP implementation and
then with a GPU accelerated solver in CUDA,
both in multiple double precision;
a forward substitution sketch follows this item.
Slides.
Some simple C++ programs:
qd4sqrt2.cpp,
triangularsolve.cpp,
triangularsolve_qd.cpp (with QD), and
triangularsolve_qd_omp.cpp
(with QD and OpenMP).
Prerecorded lecture:
youtube link.
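The lecture's solvers are in C++ with QD; as a plain C sketch of the
available parallelism, consider column oriented forward substitution
on a lower triangular system: once x[k] is known, the updates of the
remaining right hand side components are all independent. The test
system is made up.

    #include <stdio.h>
    #include <omp.h>

    #define N 4

    /* solves L*x = b for lower triangular L, overwriting b with x;
     * after computing x[k], all updates b[i] -= L[i][k]*x[k]
     * are independent and run in parallel */
    void forward_substitution ( int n, double L[N][N], double b[N] )
    {
       for(int k = 0; k < n; k++)
       {
          b[k] = b[k]/L[k][k];
          #pragma omp parallel for
          for(int i = k+1; i < n; i++) b[i] -= L[i][k]*b[k];
       }
    }

    int main ( void )
    {
       /* chosen so the solution is (1, 1, 1, 1) */
       double L[N][N] = {{2,0,0,0},{1,3,0,0},{4,1,5,0},{2,2,1,4}};
       double b[N] = {2, 4, 10, 9};

       forward_substitution(N, L, b);
       for(int i = 0; i < N; i++) printf("x[%d] = %.2f\n", i, b[i]);
       return 0;
    }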
- L-28 10/30/24: Barriers for Synchronization.
For message passing, we distinguish between a linear, a tree, and
a butterfly barrier. We end with the PRAM model and Brent's theorem;
a butterfly exchange sketch appears below.
Slides.
Two simple C programs with MPI:
use_sendrecv.c and
prefix_sum.c.
Prerecorded lecture:
youtube link.
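A minimal sketch of the butterfly pattern with MPI_Sendrecv,
computing a global sum in log2(p) stages, under the assumption that
the number of processes is a power of two; the same pairwise
exchanges drive the butterfly barrier.

    #include <stdio.h>
    #include <mpi.h>

    /* in stage d, every process swaps its partial sum with the
     * partner whose rank differs in bit d; after all stages,
     * every process holds the global sum */
    int main ( int argc, char *argv[] )
    {
       int myid, p, sum, received;

       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &myid);
       MPI_Comm_size(MPI_COMM_WORLD, &p);

       if((p & (p-1)) != 0) /* require a power of two */
       {
          if(myid == 0) printf("run with a power of two processes\n");
          MPI_Finalize();
          return 1;
       }
       sum = myid + 1; /* every process contributes its rank + 1 */

       for(int d = 1; d < p; d = 2*d)
       {
          int partner = myid ^ d; /* flip bit d of the rank */
          MPI_Sendrecv(&sum, 1, MPI_INT, partner, 0,
                       &received, 1, MPI_INT, partner, 0,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          sum += received;
       }
       printf("process %d has sum %d\n", myid, sum);

       MPI_Finalize();
       return 0;
    }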
- L-29 11/01/24: Parallel Iterative Methods for Linear Systems.
We look at Jacobi's method; a sketch follows this item.
Slides.
Some C programs:
use_allgather.c,
jacobi.c, and
jacobi_mpi.c.
Some Julia programs:
jacobi.jl,
mtmatvec.jl,
mtreduce.jl, and
mtjacobi.jl.
Prerecorded lecture:
youtube link.
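A minimal Jacobi sketch in C with OpenMP, with a made up diagonally
dominant test system, not necessarily the organization of jacobi.c:
every component of the new iterate depends only on the previous
iterate, so one sweep is a single parallel loop.

    #include <stdio.h>
    #include <omp.h>

    #define N 4

    /* one Jacobi sweep: y = (b - (A - D)*x)/D, with D = diag(A);
     * the components of y are independent of each other */
    void jacobi_sweep ( int n, double A[N][N], double b[N],
                        double x[N], double y[N] )
    {
       #pragma omp parallel for
       for(int i = 0; i < n; i++)
       {
          double s = b[i];
          for(int j = 0; j < n; j++)
             if(j != i) s -= A[i][j]*x[j];
          y[i] = s/A[i][i];
       }
    }

    int main ( void )
    {
       /* diagonally dominant system with solution (1, 1, 1, 1) */
       double A[N][N] = {{10,1,1,1},{1,10,1,1},{1,1,10,1},{1,1,1,10}};
       double b[N] = {13, 13, 13, 13};
       double x[N] = {0, 0, 0, 0}, y[N];

       for(int k = 0; k < 50; k++) /* fixed number of sweeps */
       {
          jacobi_sweep(N, A, b, x, y);
          for(int i = 0; i < N; i++) x[i] = y[i];
       }
       for(int i = 0; i < N; i++) printf("x[%d] = %.6f\n", i, x[i]);
       return 0;
    }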
- L-30 11/04/24: Domain Decomposition Methods.
We continue our investigation of parallel synchronized iterations,
discuss a parallel implementation of the Gauss-Seidel method
and a simple time stepping method for the heat equation;
a sketch of one parallelization appears below.
Slides.
Two simple C programs:
gauss_seidel.c and
gauss_seidel_omp.c.
Prerecorded lecture:
youtube link.
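One classic way to parallelize Gauss-Seidel on a grid is red-black
ordering, sketched below; this is an illustration and need not be
the formulation of gauss_seidel_omp.c. Points of one color depend
only on points of the other color, so each half sweep is a fully
parallel loop.

    #include <stdio.h>
    #include <omp.h>

    #define N 8        /* the interior grid is N by N */
    #define SWEEPS 100

    /* Gauss-Seidel relaxation for the Laplace equation:
     * points with (i+j) even ("red") depend only on points
     * with (i+j) odd ("black"), and vice versa */
    int main ( void )
    {
       double u[N+2][N+2] = {{0}};

       for(int j = 0; j < N+2; j++) u[0][j] = 1.0; /* boundary values */

       for(int k = 0; k < SWEEPS; k++)
          for(int color = 0; color < 2; color++)
          {
             #pragma omp parallel for
             for(int i = 1; i <= N; i++)
                for(int j = 1; j <= N; j++)
                   if((i + j) % 2 == color)
                      u[i][j] = (u[i-1][j] + u[i+1][j]
                               + u[i][j-1] + u[i][j+1])/4.0;
          }
       printf("center value : %.6f\n", u[N/2][N/2]);
       return 0;
    }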
- L-31 11/06/24: Memory Coalescing Techniques.
Data staging algorithms arrange data so adjacent threads access
adjacent memory locations.
Slides.
One simple Julia program:
gpupwr32cuda.jl.
Prerecorded lecture:
youtube link.
- L-32 11/08/24: An Introduction to Tensor Cores.
Tensor cores are hardware units dedicated to matrix multiplication.
Their use is explained with a simple matrix multiplication.
Slides.
Prerecorded lecture:
youtube link.
- L-33 11/11/24: Performance Considerations.
We define the notions of performance cliffs and thread coarsening.
At this point we have covered the fundamental aspects of
GPU acceleration.
Slides.
Prerecorded lecture:
youtube link.
6. Applications
- L-34 11/13/24: Parallel FFT and Sorting.
FFT and sorting both have quasi-linear complexity;
an FFTW example appears after this item.
Slides.
Using FFTW in C:
fftw_use.c,
fftw_timing.c, and
fftw_timing_omp.c.
Simple C programs using qsort:
use_qsort.c,
time_qsort.c,
part_qsort.c, and
part_qsort_omp.c.
Prerecorded lecture:
youtube link.
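A minimal example of calling FFTW from C, in the spirit of
fftw_use.c though the test signal is an assumption: allocate arrays
with fftw_malloc, create a plan, and execute it. Link with -lfftw3.

    #include <stdio.h>
    #include <fftw3.h>

    #define N 8

    /* plans and executes a 1D complex-to-complex forward FFT;
     * with FFTW_ESTIMATE, planning leaves the arrays untouched */
    int main ( void )
    {
       fftw_complex *in = fftw_malloc(N*sizeof(fftw_complex));
       fftw_complex *out = fftw_malloc(N*sizeof(fftw_complex));

       fftw_plan plan = fftw_plan_dft_1d(N, in, out,
                                         FFTW_FORWARD, FFTW_ESTIMATE);
       for(int k = 0; k < N; k++) /* a simple test signal */
       {
          in[k][0] = k; /* real part */
          in[k][1] = 0; /* imaginary part */
       }
       fftw_execute(plan);

       for(int k = 0; k < N; k++)
          printf("out[%d] = %10.6f + %10.6f*I\n",
                 k, out[k][0], out[k][1]);

       fftw_destroy_plan(plan);
       fftw_free(in); fftw_free(out);
       return 0;
    }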
- L-35 11/15/24: Parallel Gaussian Elimination.
We consider tiled Cholesky and LU factorizations;
a tiled Cholesky sketch follows this item.
Slides.
Prerecorded lecture:
youtube link.
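Below is a hedged sketch of a tiled Cholesky factorization with
OpenMP task dependencies; the tile sizes, the test matrix, and the
straightforward kernels are illustrations, not the lecture's code.
The depend clauses encode the data flow between tiles, so the
runtime can overlap independent factorizations, triangular solves,
and updates.

    #include <stdio.h>
    #include <math.h>
    #include <omp.h>

    #define T 4 /* number of tile rows and columns, an assumption */
    #define B 8 /* tile size, an assumption */

    /* unblocked Cholesky of a diagonal tile, lower triangular part */
    void potrf ( double a[B][B] )
    {
       for(int j = 0; j < B; j++)
       {
          for(int k = 0; k < j; k++) a[j][j] -= a[j][k]*a[j][k];
          a[j][j] = sqrt(a[j][j]);
          for(int i = j+1; i < B; i++)
          {
             for(int k = 0; k < j; k++) a[i][j] -= a[i][k]*a[j][k];
             a[i][j] /= a[j][j];
          }
       }
    }

    /* solves x*l^T = a for a subdiagonal tile, overwriting a */
    void trsm ( double l[B][B], double a[B][B] )
    {
       for(int i = 0; i < B; i++)
          for(int j = 0; j < B; j++)
          {
             for(int k = 0; k < j; k++) a[i][j] -= a[i][k]*l[j][k];
             a[i][j] /= l[j][j];
          }
    }

    /* the update c = c - a*b^T */
    void gemm ( double a[B][B], double b[B][B], double c[B][B] )
    {
       for(int i = 0; i < B; i++)
          for(int j = 0; j < B; j++)
             for(int k = 0; k < B; k++) c[i][j] -= a[i][k]*b[j][k];
    }

    double A[T][T][B][B]; /* the matrix stored by tiles */

    int main ( void )
    {
       int n = T*B;
       for(int i = 0; i < n; i++) /* ones plus n on the diagonal is SPD */
          for(int j = 0; j < n; j++)
             A[i/B][j/B][i%B][j%B] = 1.0 + n*(i == j);

       #pragma omp parallel
       #pragma omp single /* one thread spawns the tasks */
       for(int k = 0; k < T; k++)
       {
          #pragma omp task depend(inout: A[k][k])
          potrf(A[k][k]);
          for(int i = k+1; i < T; i++)
          {
             #pragma omp task depend(in: A[k][k]) depend(inout: A[i][k])
             trsm(A[k][k], A[i][k]);
          }
          for(int i = k+1; i < T; i++)
          {
             #pragma omp task depend(in: A[i][k]) depend(inout: A[i][i])
             gemm(A[i][k], A[i][k], A[i][i]); /* the symmetric update */
             for(int j = k+1; j < i; j++)
             {
                #pragma omp task depend(in: A[i][k], A[j][k]) depend(inout: A[i][j])
                gemm(A[i][k], A[j][k], A[i][j]);
             }
          }
       }
       printf("L[0][0] = %.6f, expected %.6f\n",
              A[0][0][0][0], sqrt(1.0 + n));
       return 0;
    }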
- L-36 11/18/24: GPU accelerated QR.
The blocked Householder QR is rich in matrix-matrix multiplications
and well suited for GPU acceleration.
GPUs capable of teraflop performance can compensate for the cost overhead
of quad double arithmetic.
Slides.
Prerecorded lecture:
youtube link.
- L-37 11/20/24: Case Study: Advanced MRI Reconstruction.
Slides.
Prerecorded lecture:
youtube link.
- L-38 11/22/24: GPU accelerated Newton's method for Taylor series.
Slides.
Prerecorded lecture:
youtube link.