The goal of the course is to study parallel algorithms and their implementation on supercomputers.

The evaluation of the course consists mainly of homework and computer projects. Exams are scheduled as preparation for the computational science prelim.

The recommended (not required) textbooks are

**"Parallel Programming. Techniques and Applications Using Networked Workstations and Parallel Computers"**by Barry Wilkinson and Michael Allen, Pearson Prentice Hall, second edition, 2005.**"Programming Massively Parallel Processors. A Hands-on Approach"**by David B. Kirk and Wen-mei W. Hwu, Elsevier/Morgan Kaufmann Publishers, 2010; second edition, 2013.

Slides for the lectures will be posted below.

- L-1 01/11/21: welcome to MCS 572. We define supercomputing, speedup, and efficiency. Gustafson's Law reevaluates Amdahl's Law. Slides
- L-2 01/13/21: scalability and classifications. We classify computers and introduce network topology and terminology. Slides
- L-3 01/15/21: high level parallel processing.
We consider multiprocessing with Python and multithreading with Julia.
Slides

Python scripts: multiprocess.py, simpson4pi.py, simpson4pi1.py, and simpson4pi2.py.

Julia programs: estimatepi1.jl and estimatepimt.jl.
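The simpson4pi scripts presumably distribute a composite Simpson rule for pi, using that the integral of 4/(1 + x^2) over [0, 1] equals pi. A minimal sketch of that idea with `multiprocessing`, assuming the interval is split evenly over the worker processes (function names here are illustrative, not those of the posted scripts):

```python
from multiprocessing import Pool

def f(x):
    """Integrand: the integral of 4/(1 + x^2) over [0, 1] equals pi."""
    return 4.0 / (1.0 + x * x)

def simpson(a, b, n):
    """Composite Simpson rule on [a, b] with n subintervals (n even)."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4.0 if i % 2 == 1 else 2.0) * f(a + i * h)
    return total * h / 3.0

def estimate_pi(nparts=4, n=1000):
    """Split [0, 1] into nparts pieces and integrate each in its own process."""
    width = 1.0 / nparts
    jobs = [(i * width, (i + 1) * width, n) for i in range(nparts)]
    with Pool(processes=nparts) as pool:
        return sum(pool.starmap(simpson, jobs))

if __name__ == "__main__":
    print(estimate_pi())
```

Because the subintervals are independent, the only communication is the final sum of the partial integrals.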

- L-4 01/20/21: basics of MPI.
We are getting started with message passing, using C and Python.
Our first MPI programs execute a broadcast of data.
Slides

Simple C programs: mpi_hello_world.c, broadcast_integer.c, and broadcast_doubles.c.

Python scripts: mpi4py_hello_world3.py and mpi4py_broadcast3.py.

- L-5 01/22/21: using MPI.
We introduced the broadcasting of data and gathering of results with MPI.
Point-to-point communications happen with the send and receive commands.
Slides

Simple C programs: parallel_sum.c and parallel_square.c.

Python scripts using mpi4py: mpi4py_point2point.py and mpi4py_parallel_sum.py.

- L-6 01/25/21: pleasingly parallel computations. The focus is on interactive supercomputing. Slides
- L-7 01/27/21: simulations.
Simulations are examples of ideal parallel computations.
However, the generation of the random numbers becomes very important.
Slides

Simple C programs: sprng_hello.c, sprng_seed.c, sprng_estpi.c, sprng_estpi_mpi.c, sprng_normal.c, sprng_normald.c, and sprng_mtbf.c.
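The sprng_estpi programs presumably estimate pi by the classic Monte Carlo simulation: sample points in the unit square and count those inside the quarter unit disk. A Python sketch of that computation, using the standard `random` module in place of the SPRNG library:

```python
import random

def estimate_pi(n, seed=42):
    """Estimate pi by sampling n points uniformly in the unit square and
    counting how many fall inside the quarter unit disk."""
    rng = random.Random(seed)  # one generator per process; with MPI, seed by rank
    inside = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n

if __name__ == "__main__":
    print(estimate_pi(100000))
```

In a parallel run, each process needs an independent random number stream, which is exactly the problem SPRNG solves.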

The sprng_makefile contains the compilation instructions. Save as `makefile`.

- L-8 01/29/21: introduction to supercomputing at UIC.
- L-9 02/01/21: Load Balancing.
We illustrate static and dynamic load balancing on the parallel
computation of the Mandelbrot set.
Slides
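The escape-time iteration at the heart of the Mandelbrot computation illustrates why load balancing matters: the cost per pixel varies wildly. A sketch of that iteration count (the bound and iteration limit are illustrative choices, not taken from the posted programs):

```python
def escape_count(c, bound=2.0, maxit=255):
    """Number of iterations of z -> z**2 + c before |z| exceeds bound.
    Points in the Mandelbrot set never escape and cost the full maxit
    iterations, which makes the work per pixel very unevenly distributed."""
    z = 0j
    for i in range(maxit):
        z = z * z + c
        if abs(z) > bound:
            return i
    return maxit

if __name__ == "__main__":
    # a point inside the set costs maxit iterations, a far-away point almost none
    print(escape_count(0j), escape_count(2 + 2j))
```

A static distribution assigns rows of pixels up front; a dynamic distribution hands a new row to whichever worker finishes first.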

Some simple C programs: mandelbrot.c, static_loaddist.c, and dynamic_loaddist.c.

- L-10 02/03/21: Load Balancing with mpi4py.
In this lecture we cover the Python equivalents to the C programs
for load balancing of the last lecture.
Slides

Two Python scripts: static_loaddist.py and dynamic_loaddist.py.

- L-11 02/05/21: Data Partitioning.
We define the fan out broadcast and the fanning in of results.
Slides.

A simple C program: fan_out_integers.c.
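One common reading of the fan out broadcast is the doubling scheme: in each stage, every node that already holds the data forwards it to one node that does not, so p nodes are reached in about log2(p) stages. A Python sketch that computes such a schedule, under that assumption about the scheme:

```python
def fan_out_schedule(p):
    """Schedule a fan-out broadcast among p nodes: in every stage, each
    node that already holds the data sends to one node that does not,
    so the set of reached nodes doubles until all p are covered."""
    have = {0}                # node 0 owns the data initially
    stages = []
    while len(have) < p:
        size = len(have)      # snapshot the size before this stage's sends
        sends = [(src, src + size) for src in sorted(have) if src + size < p]
        for _, dest in sends:
            have.add(dest)
        stages.append(sends)
    return stages
```

Fanning in the results runs the same schedule in reverse, summing partial results toward node 0.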

- L-12 02/08/21: Introduction to OpenMP.
To write parallel programs for shared memory computers
we can use OpenMP.
Slides.

Simple C programs: hello_openmp0.c, hello_openmp1.c, hello_openmp2.c, comptrap.c, and comptrap_omp.c.

- L-13 02/10/21: The Work Crew Model.
We define dynamic work load assignments for parallel
shared memory computers, using OpenMP, followed by
a brief introduction to pthreads.
Slides.

Simple C programs: jobqueue.h, jobqueue.c, test_jobqueue.c, run_jobqueue_omp.c, hello_pthreads.c, and process_jobqueue.c.
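The work crew model can be sketched in pure Python with `queue.Queue` and threads standing in for the OpenMP threads or pthreads of the C programs; the function below is a minimal illustration, not a translation of the posted jobqueue code:

```python
import queue
import threading

def work_crew(jobs, nthreads=4):
    """Process a list of jobs with a crew of threads: each thread
    repeatedly takes the next job from a shared queue until it is empty,
    so faster threads automatically pick up more work."""
    todo = queue.Queue()
    for job in jobs:
        todo.put(job)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                job = todo.get_nowait()
            except queue.Empty:
                return            # queue drained: this crew member is done
            outcome = job()       # do the work
            with lock:            # guard the shared results list
                results.append(outcome)

    crew = [threading.Thread(target=worker) for _ in range(nthreads)]
    for t in crew:
        t.start()
    for t in crew:
        t.join()
    return results
```

The queue itself is the load balancer: no job is assigned before a thread is ready for it.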

The jobqueue_makefile contains the compilation instructions. Save as `makefile`.

- L-14 02/12/21: Introduction to Tasking. We illustrate tasking with the Intel Threading Building Blocks. Slides.
- L-15 02/15/21: Tasking with OpenMP.
Recursive functions can run in parallel.
Bernstein's conditions are applied for a dependency analysis.
We end with a parallel blocked matrix computation.
Slides.

Simple C programs: fibomp.c, comptraprec.c, comptraprecomp.c, and matmulomp.c.

- L-16 02/17/21: Tasking with Julia.
We redo the last lecture, but now with Julia.
Slides.

Simple Julia programs: showargs.jl, showthreads.jl, fibmt.jl, traprulerec.jl, traprulerecmt.jl, mergesortmt.jl, and matmatmulmt.jl.

- L-17 02/19/21: Evaluating Parallel Performance. We look at metrics, isoefficiency, and critical path analysis to evaluate the performance of parallel algorithms. Slides.
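The basic metrics of L-17, together with the laws from L-1, fit in a few lines of Python; a minimal sketch:

```python
def speedup(t_serial, t_parallel):
    """S(p) = sequential time divided by parallel time."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, p):
    """E(p) = S(p)/p, the fraction of the ideal speedup achieved."""
    return speedup(t_serial, t_parallel) / p

def amdahl_bound(s, p):
    """Amdahl's Law: with sequential fraction s of a fixed-size problem,
    the speedup on p processors is at most 1/(s + (1 - s)/p)."""
    return 1.0 / (s + (1.0 - s) / p)

def gustafson_speedup(s, p):
    """Gustafson's Law: the scaled speedup s + p*(1 - s) for a problem
    whose size grows with the number of processors."""
    return s + p * (1.0 - s)
```

For example, a 10% sequential fraction caps the Amdahl speedup on 10 processors at about 5.26, while the Gustafson scaled speedup is 9.1.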

- L-18 02/22/21: a Massively Parallel Processor: the GPU. We introduce the Graphics Processing Unit for massive parallel computing, applying data parallelism. Slides.
- L-19 02/24/21: the Evolution of Graphics Pipelines. The historical evolution of graphics processors explains the current execution model of GPUs. Slides.
- L-20 02/26/21: programming GPUs.
We introduced OpenCL, PyOpenCL, and PyCUDA.
Slides.

Simple Python scripts: matmatmulocl.py, matmatmulcuda.py, and matmatmulsdk.py.

- L-21 03/01/21: Introduction to CUDA.
Writing CUDA programs is a five-step process.
Slides.

Our first CUDA programs: cudaDoubleComplex.cu and runCudaComplexSqrt.cu.

- L-22 03/03/21: Data Parallelism and Matrix Multiplication.
The CUDA program structure is illustrated on the matrix-matrix
multiplication problem.
Slides.

A simple C program: matmatmul0.c;

and two simple CUDA programs: matmatmul1.cu and matmatmul2.cu.

- L-23 03/05/21: Device Memories and Matrix Multiplication. Understanding different memories matters in the calculation of the expected performance of kernels. We define the Compute to Global Memory Access ratio. Slides.
- L-24 03/08/21: Thread Organization and Matrix Multiplication.
We look at the matrix multiplication problem from the perspective
of the organization of threads.
Slides.

A simple CUDA program: organization.cu.
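In the CUDA execution model, a kernel launches a 2D grid of thread blocks and each thread computes one entry of the product from its block and thread indices. A sequential Python sketch that simulates this thread organization (the block size and guard are illustrative; this is not the posted organization.cu):

```python
def matmul_by_threads(A, B, block=2):
    """Simulate the CUDA thread organization for matrix multiplication:
    the thread with indices (bx, by, tx, ty) computes the single entry
    C[row][col] with row = by*block + ty and col = bx*block + tx."""
    n, m, k = len(A), len(B[0]), len(B)
    C = [[0.0] * m for _ in range(n)]
    grid_y = (n + block - 1) // block   # number of blocks along the rows
    grid_x = (m + block - 1) // block   # number of blocks along the columns
    for by in range(grid_y):            # blockIdx.y
        for bx in range(grid_x):        # blockIdx.x
            for ty in range(block):     # threadIdx.y
                for tx in range(block): # threadIdx.x
                    row = by * block + ty
                    col = bx * block + tx
                    if row < n and col < m:  # guard threads past the matrix edge
                        C[row][col] = sum(A[row][i] * B[i][col] for i in range(k))
    return C
```

On the GPU the four loops disappear: every (block, thread) pair runs concurrently, and only the index arithmetic and the guarded inner product remain in the kernel.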

- L-27 03/15/21: Pipelined Computations.
As in the manufacturing of cars, arranging the stages of a computation
along a pipeline speeds up the computation.
Slides.

Some simple C programs: pipe_ring.c and pipe_sum.c.

- L-28 03/17/21: Pipelined Sorting and Sieving.
We illustrate pipelined sorting algorithms with MPI.
Type 2 pipelines are introduced with sieving for primes.
Slides.
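The sieving pipeline can be sketched in Python with generators standing in for the MPI processes: each stage keeps the first number it receives as its prime and forwards only the numbers that prime does not divide. This is a sketch of the pipeline structure, not of the posted MPI code:

```python
def numbers_from(n):
    """Source stage of the pipeline: the stream 2, 3, 4, ..."""
    while True:
        yield n
        n += 1

def sieve_stage(stream):
    """One pipeline stage: keep the first incoming number as a prime,
    then pass along only the numbers it does not divide; each new prime
    spawns the next stage of the pipeline."""
    p = next(stream)
    yield p
    yield from sieve_stage(x for x in stream if x % p != 0)

def first_primes(k):
    """Drain the first k primes from the head of the pipeline."""
    stream = sieve_stage(numbers_from(2))
    return [next(stream) for _ in range(k)]
```

In the type 2 pipeline, the speedup comes from all stages filtering different numbers at the same time.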

One simple C program: pipe_sort.c.

- L-29 03/19/21: Solving Triangular Systems.
We introduce the third type of pipelines
and consider the solving of triangular linear systems.
Slides.

Some simple C++ programs: triangularsolve.cpp, triangularsolve_qd.cpp (with QD), and triangularsolve_qd_omp.cpp (with QD and OpenMP).
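The underlying algorithm is forward substitution: component i of the solution depends only on components 0 through i-1, which is exactly the dependency pattern a pipeline exploits. A sequential Python sketch:

```python
def forward_substitution(L, b):
    """Solve L x = b for a lower triangular matrix L.  Since x[i] uses
    only x[0], ..., x[i-1], each component can be computed as soon as the
    previous ones arrive, which makes the solve amenable to a pipeline."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        s = sum(L[i][j] * x[j] for j in range(i))
        x[i] = (b[i] - s) / L[i][i]
    return x
```

In the pipelined version, processor i accumulates the inner product for row i as the earlier components stream past.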

- L-30 03/29/21: Barriers for Synchronizations. For message passing, we distinguish between a linear, a tree, and a butterfly barrier. We end with a simple illustration of barriers with Pthreads. Slides.
- L-31 03/31/21: Parallel Iterative Methods for Linear Systems.
We look at Jacobi's method. Slides.
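A sequential Python sketch of Jacobi's method; the key observation for the parallel version is that every component of the new iterate is computed from the previous iterate only:

```python
def jacobi(A, b, x0=None, tol=1e-10, maxit=1000):
    """Jacobi's method for A x = b.  All components of the new iterate
    use only the previous iterate, so they can be updated in parallel,
    followed by one all-gather of the new x per iteration."""
    n = len(b)
    x = list(x0) if x0 is not None else [0.0] * n
    for _ in range(n * 0 + maxit):
        y = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
             for i in range(n)]
        if max(abs(y[i] - x[i]) for i in range(n)) < tol:
            return y
        x = y
    return x
```

Convergence requires a suitable matrix, for example one that is strictly diagonally dominant.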

Some simple C programs: use_allgather.c, jacobi.c, and jacobi_mpi.c.

- L-32 04/02/21: Solving the Heat Equation.
We continue our investigations into parallel synchronized iterations.
Slides.

Two simple C programs: gauss_seidel.c and gauss_seidel_omp.c.

- L-33 04/05/21: Warps and reduction algorithms. We discuss the development of a kernel with less thread divergence to reduce a sequence of numbers. Slides.
- L-34 04/07/21: Memory Coalescing Techniques. Slides.
- L-35 04/09/21: Performance Considerations. Slides.
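The reduction kernel of L-33 can be simulated sequentially in Python: in each step, thread t (for t below the stride) adds the element a stride away into its own, and the stride halves. Keeping the active threads contiguous is what reduces divergence within warps. A sketch, assuming a power-of-two length as such kernels typically do:

```python
def reduce_sum(data):
    """Simulate a stride-halving reduction: in each step, thread t
    (for t < stride) adds data[t + stride] into data[t].  The active
    threads t = 0, ..., stride-1 stay contiguous, so on a GPU whole
    warps retire together instead of diverging."""
    data = list(data)
    stride = len(data) // 2   # length assumed a power of two
    while stride > 0:
        for t in range(stride):   # these iterations run as parallel threads
            data[t] += data[t + stride]
        stride //= 2
    return data[0]
```

After log2(n) steps the total sits in position 0, just as the kernel leaves the result in element 0 of shared memory.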

- L-36 04/12/21: Parallel FFT and Sorting. FFT and sorting both have quasi linear complexity. Slides.
- L-37 04/14/21: parallel Gaussian elimination. We consider tiled Cholesky and LU factorizations. Slides.
- L-38 04/16/21: parallel numerical linear algebra. We continue the previous lecture with a tiled QR decomposition and Krylov subspace iterations. Slides.
- L-39 04/19/21: quad doubles on a GPU. Slides.
- L-40 04/21/21: Case Study: Advanced MRI Reconstruction. Slides.
- L-41 04/23/21: Concurrent Kernels and Multiple GPUs. Slides.
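The tiled Cholesky factorization of L-37 applies the classical formulas block by block; the unblocked algorithm they come from fits in a short Python sketch (for a symmetric positive definite input):

```python
from math import sqrt

def cholesky(A):
    """Classical Cholesky factorization A = L L^T of a symmetric positive
    definite matrix.  Tiled versions apply these same formulas to blocks:
    factor a diagonal tile, solve triangular systems for the tiles below
    it, then update the trailing tiles."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        s = A[j][j] - sum(L[j][k] ** 2 for k in range(j))
        L[j][j] = sqrt(s)                       # factor the diagonal entry
        for i in range(j + 1, n):               # entries below the diagonal
            L[i][j] = (A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return L
```

In the tiled version, the trailing updates of different tiles are independent tasks, which is what exposes the parallelism.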