MCS 572 Introduction to Supercomputing

MCS 572: Introduction to Supercomputing

The goal of the course is to study parallel algorithms and their implementation on supercomputers.

The evaluation of the course consists mainly of homework and computer projects. Exams are scheduled as preparation for the computational science prelim.

The recommended (not required) textbooks are

"Parallel Programming. Techniques and Applications Using Networked Workstations and Parallel Computers" by Barry Wilkinson and Michael Allen, Pearson Prentice Hall, second edition, 2005.
"Programming Massively Parallel Processors. A Hands-on Approach" by David B. Kirk and Wen-mei W. Hwu, Elsevier/Morgan Kaufmann Publishers, 2010; fourth edition, 2023, with Izzat El Hajj as third author.

Lecture notes are available

as one single pdf file; or
in html format.

The lecture notes are still a work in progress...

Slides for the lectures will be posted below.

0. Introduction

L-1 08/26/24: welcome to mcs 572. We define supercomputing, speedup and efficiency. Gustafson's Law reevaluates Ahmdahl's Law. Slides
Prerecorded lecture: youtube link.
L-2 08/28/24: scalability and classifications. We classify computers and introduce network topology and terminology. Slides
Prerecorded lecture: youtube link.
L-3 08/30/24: high level parallel processing. We consider multiprocessing with Python and multithreading with Julia. Slides
Python scripts: multiprocess.py, simpson4pi.py, simpson4pi1.py, and simpson4pi2.py.
Jupyter notebook: highlevelpnla.ipynb.
Julia programs: estimatepi1.jl, estimatepidp.jl, and estimatepimt.jl.
Prerecorded lecture: youtube link.

1. Distributed Memory Parallel Computing

L-4 09/04/24: basics of MPI. Message passing is still widely used to program distributed memory parallel computers. We introduce MPI using C, Python, and Julia. Our first MPI programs execute a broadcast of data. Slides
Simple C programs: mpi_hello_world.c, broadcast_integer.c, and broadcast_doubles.c.
Python scripts: mpi4py_hello_world3.py and mpi4py_broadcast3.py.
a Julia program: mpi_hello_world.jl.
Prerecorded lecture: youtube link.
L-5 09/06/24: using MPI. We introduced the broadcasting of data and gathering of results with MPI. Point-to-point communications happen with the send and receive commands. Slides
Simple C programs: parallel_sum.c and parallel_square.c.
Python scripts using mpi4py: mpi4py_point2point.py and mpi4py_parallel_sum.py.
a Julia program using MPI.jl: mpi_point2point.jl
Prerecorded lecture: youtube link.
L-6 09/09/24: parallel simulations. Simulations are examples of ideal parallel computations. However, the generation of the random numbers becomes very important. Slides
Prerecorded lecture: youtube link.
L-7 09/11/24: Load Balancing. We illustrate static and dynamic load balancing on the parallel computation of the Mandelbrot set. Slides
Some simple C programs: mandelbrot.c, static_loaddist.c, and dynamic_loaddist.c.
Static load assignment in Python: static_loaddist.py
Dynamic load balancing in Julia: dynamic_loaddist.jl
Prerecorded lecture: youtube link.
L-8 09/13/24: Hands On Supercomputing. Covers demonstrations of interactive supercomputing on a fast workstation and the preparation of submit scripts to compute on a real supercomputer. Slides
Prerecorded lecture: youtube link.
L-9 09/16/24: Data Partitioning. We define the fan out broadcast and the fanning in of results. Slides.
A simple C program: fan_out_integers.c.
Fan out in Python: fan_out_integers.py.
Fan out in Julia: fan_out_integers.jl.
Prerecorded lecture: youtube link.

2. Shared Memory Parallel Computing

L-10 09/18/24: Introduction to OpenMP. To write parallel programs for shared memory computers we can use OpenMP. Slides.
A Julia program: mtcomptrap.jl.
Simple C programs: hello_openmp0.c, hello_openmp1.c, hello_openmp2.c, comptrap.c, and comptrap_omp.c.
Prerecorded lecture: youtube link.
L-11 09/20/24: The Work Crew Model. We define dynamic work load assignments for parallel shared memory computers, using OpenMP, followed by a brief introduction to pthreads and Julia. Slides.
Simple C programs: jobqueue.h, jobqueue.c, test_jobqueue.c, run_jobqueue_omp.c, hello_pthreads.c, and process_jobqueue.c.
Some Julia programs: hellothreads.jl and workcrew.jl
Prerecorded lecture: youtube link.
L-12 09/23/24: Tasking with OpenMP. Recursive functions can run in parallel. Bernstein's conditions are applied for a dependency analysis. We end with a parallel blocked matrix computation. Slides.
Simple C programs: fib.c, fibomp.c, comptraprec.c, comptraprecomp.c, and matmulomp.c.
Prerecorded lecture: youtube link.
L-13 09/25/24: Tasking with Julia. We redo the last lecture, but now with Julia. Slides.
A slurm script to launch a mpi job: mpi_hello_world.slurm
Simple Julia programs: showargs.jl, showthreads.jl, fibmt.jl, traprulerec.jl, traprulerecmt.jl, mergesortmt.jl, and matmatmulmt.jl.
Prerecorded lecture: youtube link.
L-14 09/27/24: Evaluating Parallel Performance. We look at metrics, isoefficiency, critical path analysis, and the roofline model to evaluate the performance of parallel algorithms. Slides.
Prerecorded lecture: youtube link.
L-15 09/30/24: Work Stealing. For job assignment, we can view work stealing as a hybrid between static and dynamic work assignment, as defined by a Julia program. The Threading Building Blocks provide a high level programming model for C++ programmers. Slides.
A simple Julia program: worksteal.jl
A Python script with Numba: montecarlo4pi.py
A Python script with Parsl: parslmapreduce.py
C++ programs: hello_task_group.cpp, powers_serial.cpp, powers_tbb.cpp, and parsum_tbb.cpp
Prerecorded lecture: youtube link.

3. Acceleration with Graphics Processing Units

L-16 10/02/24: a Massively Parallel Processor: the GPU. We introduce the Graphics Processing Unit for massive parallel computing, applying data parallelism. Slides.
Prerecorded lecture: youtube link.
L-17 10/04/24: programming GPUs with PyCUDA and Julia. We can launch kernels in Python with PyCUDA and Julia provides vendor agnostic GPU programming. Slides.
a simple Python script: matmatmulcuda.py
simple Julia programs: gpuadd.jl, gpuadd1metal.jl, and mmmulmetal.jl
Prerecorded lecture: youtube link.
L-18 10/07/24: Introduction to CUDA. Writing CUDA programs is a 5-step process. Slides.
Our first CUDA programs: cudaDoubleComplex.cu and runCudaComplexSqrt.cu.
a simple Julia program: gpuadd3.jl
Prerecorded lecture: youtube link.
L-19 10/09/24: Data Parallelism and Matrix Multiplication. The CUDA program structure is illustrated on the matrix-matrix multiplication problem. Slides.
A simple C program: matmatmul0.c;
and two simple CUDA programs: matmatmul1.cu and matmatmul2.cu
and two Julia program: mmmulcuda2.jl. mmmulmetal2.jl.
Prerecorded lecture: youtube link.
L-20 10/11/24: Device Memories and Matrix Multiplication. Understanding different memories matters in the calculation of the expected performance of kernels. We define the Compute to Global Memory Access ratio. Slides.
a Julia program: dotprod.jl.
Prerecorded lecture: youtube link.
L-21 10/14/24: Thread Organization and Matrix Multiplication. We look at the matrix multiplication problem from the perspective of the organization of threads. Slides.
A simple CUDA program: organization.cu.
A corresponding Julia program with CUDA.jl: organization.jl.
Prerecorded lecture: youtube link.
L-22 10/16/24: Warps and reduction algorithms. We discuss the development of a kernel with less thread divergence to reduce a sequence of numbers. Slides.
Some simple Julia programs: gpusum32metal.jl, gpusum32metal2.jl, gpusum32cuda.jl, and firstsum.jl.
Prerecorded lecture: youtube link.
L-23 10/18/24: review for the midterm exam. Slides.
Prerecorded lecture: youtube link.

4. Pipelining and Synchronized Computations

L-25 10/23/24: Pipelined Computations. As in the manufactoring of cars, arranging the stages in a computation along a pipeline speeds up the computations. Slides.
A Julia program: leibnizunrolled.jl.
Some simple C programs: pipe_ring.c and pipe_sum.c.
Prerecorded lecture: youtube link.
L-26 10/25/24: Pipelined Sorting and Sieving. We illustrate pipelined sorting algorithms with MPI. Type 2 pipelines are introduced with sieving for primes. Type 3 pipelines are illustrated by triangular linear systems. Slides.
One simple C program: pipe_sort.c
Prerecorded lecture: youtube link.
L-27 10/28/24: Solving Triangular Systems. We consider the solving of triangular linear systems, using first a high level OpenMP implementation and then a GPU accelerated solver with CUDA, both in multiple double precision. Slides.
Some simple C++ programs: qd4sqrt2.cpp, triangularsolve.cpp, triangularsolve_qd.cpp (with QD), and triangularsolve_qd_omp.cpp (with QD and OpenMP).
Prerecorded lecture: youtube link.
L-28 10/30/24: Barriers for Synchronizations. For message passing, we distinguish between a linear, a tree, and a butterfly barrier. We end with the PRAM model and Brent's theorem. Slides.
Two simple C programs with MPI: use_sendrecv.c and prefix_sum.c.
Prerecorded lecture: youtube link.
L-29 11/01/24: Parallel Iterative Methods for Linear Systems. We look at Jacobi's method. Slides.
Some C programs: use_allgather.c, jacobi.c, and jacobi_mpi.c.
Some Julia programs: jacobi.jl, mtmatvec.jl, mtreduce.jl, and mtjacobi.jl.
Prerecorded lecture: youtube link.
L-30 11/04/24: Domain Decomposition Methods. We continue our investigations into parallel synchronized iterations, discuss a parallel implementation of the Gauss-Seidel method and a simple time stepping method for the heat equation. Slides.
Two simple C programs: gauss_seidel.c, and gauss_seidel_omp.c.
Prerecorded lecture: youtube link.
L-31 11/06/24: Memory Coalescing Techniques. Data staging algorithms arrange data so adjacent threads access adjacent memory locations. Slides.
One simple Julia program: gpupwr32cuda.jl.
Prerecorded lecture: youtube link.
L-32 11/08/24: An Introduction to Tensor Cores. Tensor cores are dedicated to matrix multiplication. A simple matrix multiplication is explained. Slides.
Prerecorded lecture: youtube link.
L-33 11/11/24: Performance Considerations. We define the notion of a performance cliff and thread coarsening. At this point we have covered the fundamental aspects of GPU acceleration. Slides.
Prerecorded lecture: youtube link.

6. Applications

L-34 11/13/24: Parallel FFT and Sorting. FFT and sorting both have quasi linear complexity. Slides.
Using FFTW in C: fftw_use.c, fftw_timing.c, and fftw_timing_omp.c.
Simple C programs using qsort: use_qsort.c, time_qsort.c, part_qsort.c, and part_qsort_omp.c.
Prerecorded lecture: youtube link.
L-35 11/15/24: parallel Gaussian elimination. We consider tiled Cholesky and LU factorizations. Slides.
Prerecorded lecture: youtube link.
L-36 11/18/24: GPU accelerated QR. The blocked Householder QR is rich in matrix matrix multiplications and well suited for GPU acceleration. GPUs capable of teraflop performance can compensate for the cost overhead of quad double arithmetic. Slides.
Prerecorded lecture: youtube link.
L-37 11/20/24: Case Study: Advanced MRI Reconstruction. Slides.
Prerecorded lecture: youtube link.
L-38 11/22/24: GPU accelerated Newton's method for Taylor series. Slides.
Prerecorded lecture: youtube link.

L-39 11/25/24: a second review. Slides.
Prerecorded lecture: youtube link.