Programming GPUs with PyCUDA and Julia
======================================

High level GPU programming can be done in Python or Julia.
Before introducing PyCUDA and GPU programming with Julia,
the data parallel model is described.

Data Parallelism
----------------

The programming model is
:index:`Single Instruction Multiple Data` (:index:`SIMD`).
In the application of :index:`data parallelism`,
blocks of threads read from memory, execute the same instruction(s),
and then write the results back to memory.
To fully occupy a massively parallel processor such as the GPU,
at least 10,000 threads are needed.
Code that runs on the GPU is defined in a function,
the so-called :index:`kernel`.
The scalable programming model of the GPU is illustrated
in :numref:`figscalprogmodel`.

.. _figscalprogmodel:

.. figure:: ./figscalprogmodel.png
   :align: center

   A scalable programming model.
   Each core represents a streaming multiprocessor.

Streaming multiprocessors schedule the blocks of threads for execution
in groups of 32 threads, called a :index:`warp`.
The dual warp scheduler is illustrated in :numref:`figwarpscheduler`.

.. _figwarpscheduler:

.. figure:: ./figwarpscheduler.png
   :align: center

   Dual warp scheduler, copied from the NVIDIA documentation.

A kernel launch creates a grid of blocks,
and each block has one or more threads.
The organization of the grid and of the blocks can be 1D, 2D, or 3D.
During the execution of the kernel,
threads in the same block are executed simultaneously.

The programming model for NVIDIA GPUs is called :index:`CUDA`,
which stands for Compute Unified Device Architecture.
Programming GPUs is more complicated than programming CPUs,
in part because it is not possible to quickly add print statements
when debugging the code.
As illustrated in :numref:`figcpuGPUrendering`,
every GPU (called the device) is connected to a CPU (called the host).
Kernels are launched by the host for execution on the device.

.. _figcpuGPUrendering:

.. figure:: ./figcpuGPUrendering.png
   :align: center

   From the GeForce 8 and 9 Series GPU Programming Guide (NVIDIA).

Matrix Matrix Multiplication
----------------------------

Our scientific running example of data parallel computations
is the multiplication of two matrices.
Consider the multiplication of matrices :math:`A` and :math:`B`,
which results in :math:`C = A \cdot B`, with

.. math::

   A = [a_{i,j}] \in {\mathbb R}^{n \times m}, \quad
   B = [b_{i,j}] \in {\mathbb R}^{m \times p}, \quad
   C = [c_{i,j}] \in {\mathbb R}^{n \times p}.

The entry :math:`c_{i,j}` is the inner product of the *i*-th row
of :math:`A` with the *j*-th column of :math:`B`:

.. math::

   c_{i,j} = \sum_{k=1}^m a_{i,k} \cdot b_{k,j}.

All entries :math:`c_{i,j}` can be computed independently from each other.
For :math:`n = m = p = 1024` there are 1,048,576 inner products.
The matrix multiplication is illustrated in :numref:`figdataparmatmul`.

.. _figdataparmatmul:

.. figure:: ./figdataparmatmul.png
   :align: center

   Data parallelism in matrix multiplication.
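To make the independence of the inner products concrete,
the following small NumPy sketch (a plain CPU illustration,
separate from the GPU code in the next section) computes each
:math:`c_{i,j}` with its own inner product and compares the result
with the built-in matrix product, for the same dimensions
:math:`(n, m, p) = (3, 4, 5)` and 0/1 test matrices used later.

::

   import numpy as np

   (n, m, p) = (3, 4, 5)
   A = np.random.randint(2, size=(n, m)).astype(np.float32)
   B = np.random.randint(2, size=(m, p)).astype(np.float32)

   C = np.zeros((n, p), dtype=np.float32)
   for i in range(n):        # every entry of C ...
       for j in range(p):    # ... is one independent inner product
           C[i, j] = sum(A[i, k]*B[k, j] for k in range(m))

   print(np.allclose(C, A @ B))

On the GPU, each of these independent inner products
can be assigned to its own thread.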
PyCUDA
------

Code for the GPU can be generated in Python,
see :numref:`figpycudaprinciple`, as described in the following paper
by A. Kloeckner, N. Pinto, Y. Lee, B. Catanzaro, P. Ivanov, and A. Fasih:
**PyCUDA and PyOpenCL: A scripting-based approach to GPU run-time
code generation.** *Parallel Computing*, 38(3):157--174, 2012.

.. _figpycudaprinciple:

.. figure:: ./figpycudaprinciple.png
   :align: center

   The operating principle of GPU code generation.

To verify whether PyCUDA is correctly installed on our computer,
we can run an interactive Python session as follows.

::

   >>> import pycuda
   >>> import pycuda.autoinit
   >>> from pycuda.tools import make_default_context
   >>> c = make_default_context()
   >>> d = c.get_device()
   >>> d.name()
   'Tesla P100-PCIE-16GB'

We illustrate the matrix-matrix multiplication on the GPU
with code generated in Python.
We multiply an *n*-by-*m* matrix with an *m*-by-*p* matrix,
using a two dimensional grid of :math:`n \times p` threads.
For testing we use 0/1 matrices.

::

   $ python matmatmul.py
   matrix A:
   [[ 0.  0.  1.  0.]
    [ 0.  0.  1.  1.]
    [ 0.  1.  1.  0.]]
   matrix B:
   [[ 1.  1.  0.  1.  1.]
    [ 1.  0.  1.  0.  0.]
    [ 0.  0.  1.  1.  0.]
    [ 0.  0.  1.  1.  0.]]
   multiplied A*B:
   [[ 0.  0.  1.  1.  0.]
    [ 0.  0.  2.  2.  0.]
    [ 1.  0.  2.  1.  0.]]
   $

The script starts with the import of the modules
and the type declarations.

::

   import pycuda.driver as cuda
   import pycuda.autoinit
   from pycuda.compiler import SourceModule
   import numpy

   (n, m, p) = (3, 4, 5)
   n = numpy.int32(n)
   m = numpy.int32(m)
   p = numpy.int32(p)
   a = numpy.random.randint(2, size=(n, m))
   b = numpy.random.randint(2, size=(m, p))
   c = numpy.zeros((n, p), dtype=numpy.float32)
   a = a.astype(numpy.float32)
   b = b.astype(numpy.float32)

The script then continues with the memory allocation
and the copying from host to device.

::

   a_gpu = cuda.mem_alloc(a.size * a.dtype.itemsize)
   b_gpu = cuda.mem_alloc(b.size * b.dtype.itemsize)
   c_gpu = cuda.mem_alloc(c.size * c.dtype.itemsize)
   cuda.memcpy_htod(a_gpu, a)
   cuda.memcpy_htod(b_gpu, b)

The definition of the kernel follows:

::

   mod = SourceModule("""
   __global__ void multiply
    ( int n, int m, int p, float *a, float *b, float *c )
   {
       // c[i][j] is the inner product of row i of a with column j of b
       int idx = p*threadIdx.x + threadIdx.y;
       c[idx] = 0.0;
       for(int k=0; k<m; k++)
           c[idx] = c[idx] + a[m*threadIdx.x + k]*b[p*k + threadIdx.y];
   }
   """)
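What remains in the script is to retrieve the compiled kernel,
launch it, and copy the product back to the host.
The sketch below uses the standard PyCUDA calls for these steps;
the launch configuration of one block of :math:`n \times p` threads
is our choice, made to match the index formula in the kernel.

::

   func = mod.get_function("multiply")
   # one block of n-by-p threads:
   # threadIdx.x indexes the rows and threadIdx.y the columns of C
   func(n, m, p, a_gpu, b_gpu, c_gpu,
        block=(int(n), int(p), 1), grid=(1, 1))
   cuda.memcpy_dtoh(c, c_gpu)   # copy the product from device to host
   print("matrix A:")
   print(a)
   print("matrix B:")
   print(b)
   print("multiplied A*B:")
   print(c)

As a check on the host, ``numpy.allclose(c, numpy.dot(a, b))``
should return ``True``.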
GPU Computing in Julia
----------------------

The `JuliaGPU <https://juliagpu.org>`_ web site documents
the organization to unify the many packages for programming GPUs in Julia.
A first kernel is taken from the tutorials at the JuliaGPU site:

::

   using CUDA
   using Test

   function gpu_add1!(y, x)
       for i = 1:length(y)
           @inbounds y[i] += x[i]
       end
       return nothing
   end

   N = 2^20                    # adding one million Float32 numbers
   x_d = CUDA.fill(1.0f0, N)   # stored on the GPU, filled with 1.0
   y_d = CUDA.fill(2.0f0, N)   # stored on the GPU, filled with 2.0
   fill!(y_d, 2)
   @cuda gpu_add1!(y_d, x_d)
   result = (@test all(Array(y_d) .== 3.0f0))
   println(result)

Observe that, unlike with PyCUDA, the kernel is defined as a Julia function.
To make the point of vendor agnostic GPU computing,
in addition to CUDA.jl, the following packages are available:

* AMDGPU.jl for AMD GPUs,

* oneAPI.jl for the Intel oneAPI,

* Metal.jl to program the GPUs in Apple hardware.

The code to add vectors on an M1 MacBook Air GPU with Metal
looks very similar to the code with CUDA.jl.

::

   using Metal
   using Test

   function gpu_add1!(y, x)
       for i = 1:length(y)
           @inbounds y[i] += x[i]
       end
       return nothing
   end

   N = 32
   x_d = Metal.fill(1.0f0, N)   # filled with Float32 1.0 on the GPU
   y_d = Metal.fill(2.0f0, N)   # filled with Float32 2.0 on the GPU
   # run with N threads
   @metal threads=N gpu_add1!(y_d, x_d)
   result = (@test all(Array(y_d) .== 3.0f0))
   println(result)

As a last illustration, consider the multiplication of matrices
with ``Metal``.
The CUDA version of the example is copied from the
Julia for High-Performance Scientific Computing web site, adjusted to

1. use ``Metal`` instead of ``CUDA``,

2. work with ``Float32`` instead of ``Float64``,

3. use ``MtlArray`` instead of ``CuArray``.

The code is below, using the operator ``*``:

::

   using Metal
   using BenchmarkTools

   dim = 2^10
   A_h = rand(Float32, dim, dim);
   A_d = MtlArray(A_h);

   @btime $A_h * $A_h;
   @btime $A_d * $A_d;

When executed on a 2020 M1 MacBook Air,
the reported times are 6.229 ms for the CPU
and 23.625 :math:`\mu\mbox{s}` for the GPU.
Observe the difference in orders of magnitude,
milliseconds versus microseconds.
Even though the GPU on this laptop is not intended
for high performance scientific computing, it outperforms the CPU.

Bibliography
------------

1. T. Besard, C. Foket, and B. De Sutter:
   **Effective Extensible Programming: Unleashing Julia on GPUs.**
   *IEEE Transactions on Parallel and Distributed Systems*,
   30(4):827--841, 2019.

2. A. Kloeckner, N. Pinto, Y. Lee, B. Catanzaro, P. Ivanov, and A. Fasih:
   **PyCUDA and PyOpenCL: A scripting-based approach to GPU run-time
   code generation.**
   *Parallel Computing*, 38(3):157--174, 2012.

3. U. Utkarsh, V. Churavy, Y. Ma, T. Besard, P. Srisuma, T. Gymnich,
   A. R. Gerlach, A. Edelman, G. Barbastathis, R. D. Braatz,
   and C. Rackauckas:
   **Automated Translation and Accelerated Solving of Differential
   Equations on Multiple GPU Platforms.**
   *Computer Methods in Applied Mechanics and Engineering*,
   419:116591, 2024.

Exercises
---------

1. The matrix matrix multiplication example with PyCUDA uses
   :math:`(3, 4, 5)` as values for :math:`(n, m, p)`.
   Considering massive parallelism, what are the largest dimensions
   you could consider for one block of threads
   on the P100 and/or the A100?
   Illustrate your values for the dimensions experimentally.

2. In the PyCUDA matrix matrix multiplication,
   change the ``float32`` types into ``float64``
   and redo the previous exercise.
   Time the code.  Do you notice a difference?

3. On your own computer, check the vendor of the GPU
   and run the equivalent of ``gpuadd.jl``
   after installing the proper Julia package.
   Report on the performance, relative to the CPU in your computer.