Data Parallelism and Matrix Multiplication
==========================================

Matrix multiplication is one of the fundamental building blocks in
numerical linear algebra, and therefore in scientific computation.
In this section, we investigate how data parallelism may be applied
to solve this problem.

Data Parallelism
----------------

Many applications process large amounts of data.
Data parallelism refers to the property where many arithmetic
operations can be safely performed on the data simultaneously.
Consider the multiplication of matrices :math:`A` and :math:`B`
which results in :math:`C = A \cdot B`, with

.. math::

   A = [a_{i,j}] \in {\mathbb R}^{n \times m}, \quad
   B = [b_{i,j}] \in {\mathbb R}^{m \times p}, \quad
   C = [c_{i,j}] \in {\mathbb R}^{n \times p}.

:math:`c_{i,j}` is the inner product of the *i*-th row of :math:`A`
with the *j*-th column of :math:`B`:

.. math::

   c_{i,j} = \sum_{k=1}^m a_{i,k} \cdot b_{k,j}.

All :math:`c_{i,j}`'s can be computed independently from each other.
For :math:`n = m = p = 1,000` we have 1,000,000 inner products.
The matrix multiplication is illustrated in :numref:`figdataparmatmul`.

.. _figdataparmatmul:

.. figure:: ./figdataparmatmul.png
   :align: center

   Data parallelism in matrix multiplication.

Code for Matrix-Matrix Multiplication
-------------------------------------

Code for a device (the GPU) is defined in functions
using the keyword ``__global__`` before the function definition.
Data parallel functions are called *kernels*.
Kernel functions generate a large number of threads.

In matrix-matrix multiplication, the computation can be implemented
as a kernel where each thread computes one element in the result matrix.
To multiply two 1,000-by-1,000 matrices, the kernel using one thread
to compute one element generates 1,000,000 threads when invoked.

CUDA threads are much lighter weight than CPU threads:
they take very few cycles to generate and schedule,
thanks to efficient hardware support,
whereas CPU threads may require thousands of cycles.

A CUDA program consists of several phases, executed either on the host
(when there is no data parallelism) or on the device
(for data parallel algorithms).
The NVIDIA C compiler ``nvcc`` separates the phases at compilation:
code for the host is compiled with the host's standard C compilers
and runs as an ordinary CPU process,
while device code is written in C with keywords for data parallel
functions and is further compiled by ``nvcc``.

A CUDA program has the following structure:

::

   CPU code
   kernel<<<numBlocks, numThreadsPerBlock>>>(args)
   CPU code

The number of blocks in a grid is illustrated in :numref:`figgpugridblocks`.

.. _figgpugridblocks:

.. figure:: ./figgpugridblocks.png
   :align: center

   The organization of the threads of execution in a CUDA program.

For the matrix multiplication :math:`C = A \cdot B`,
we run through the following stages:

1. Allocate device memory for :math:`A`, :math:`B`, and :math:`C`.
2. Copy :math:`A` and :math:`B` from the host to the device.
3. Invoke the kernel to have the device compute :math:`C = A \cdot B`.
4. Copy :math:`C` from the device to the host.
5. Free memory space on the device.

A linear address system is used for the 2-dimensional array.
Consider a 3-by-5 matrix stored row-wise (as in C),
as shown in :numref:`figlinearmatrix`.

.. _figlinearmatrix:

.. figure:: ./figlinearmatrix.png
   :align: center

   Storing a matrix as a one dimensional array.
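In this row-wise storage, the entry in row :math:`i` and column :math:`j`
of an :math:`n`-by-:math:`m` matrix sits at position :math:`i \cdot m + j`
of the one dimensional array.
The small standalone sketch below, which is not part of the CUDA program
that follows, only illustrates this index calculation on a 3-by-5 matrix:

::

   #include <stdio.h>

   int main ( void )
   {
      const int n = 3, m = 5;
      float x[15];                  /* a 3-by-5 matrix stored row-wise */

      for(int i=0; i<n; i++)        /* entry (i,j) gets the value 10*i + j */
         for(int j=0; j<m; j++)
            x[i*m + j] = 10.0f*i + j;

      /* entry (2,3) is stored at index 2*m + 3 = 13 and prints as 23.0 */
      printf("entry (2,3) is stored at index %d : %.1f\n", 2*m + 3, x[2*m + 3]);

      return 0;
   }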
Code to generate a random matrix is below:

::

   #include <stdlib.h>

   __host__ void randomMatrix ( int n, int m, float *x, int mode )
   /*
    * Fills up the n-by-m matrix x with random
    * values of zeroes and ones if mode == 1,
    * or random floats if mode == 0. */
   {
      int i,j;
      float r;
      float *p = x;

      for(i=0; i<n; i++)
         for(j=0; j<m; j++)
         {
            if(mode == 1)
               r = (float) (rand() % 2);
            else
               r = ((float) rand())/RAND_MAX;
            *(p++) = r;
         }
   }

The function below writes a matrix to screen:

::

   #include <stdio.h>

   __host__ void writeMatrix ( int n, int m, float *x )
   /*
    * Writes the n-by-m matrix x to screen. */
   {
      int i,j;
      float *p = x;

      for(i=0; i<n; i++,printf("\n"))
         for(j=0; j<m; j++)
            printf(" %.4f", *(p++));
   }

The first kernel uses a one dimensional organization of the threads:
the *i*-th thread computes the *i*-th row of :math:`C`.

::

   __global__ void matrixMultiply
    ( int n, int m, int p, float *A, float *B, float *C )
   /*
    * Multiplies the n-by-m matrix A
    * with the m-by-p matrix B into the matrix C.
    * The i-th thread computes the i-th row of C. */
   {
      int i = threadIdx.x;

      for(int j=0; j<p; j++)
      {
         C[i*p+j] = 0.0;
         for(int k=0; k<m; k++)
            C[i*p+j] += A[i*m+k]*B[k*p+j];
      }
   }

The relevant part of the main program runs through the five stages
listed above:

::

   /* allocate memory on the device for A, B, and C */
   size_t sA = n*m*sizeof(float);
   size_t sB = m*p*sizeof(float);
   size_t sC = n*p*sizeof(float);
   float *Adevice, *Bdevice, *Cdevice;
   cudaMalloc((void**)&Adevice,sA);
   cudaMalloc((void**)&Bdevice,sB);
   cudaMalloc((void**)&Cdevice,sC);

   /* copy matrices A and B from the host to the device */
   cudaMemcpy(Adevice,Ahost,sA,cudaMemcpyHostToDevice);
   cudaMemcpy(Bdevice,Bhost,sB,cudaMemcpyHostToDevice);

   /* kernel invocation launching n threads */
   matrixMultiply<<<1,n>>>(n,m,p,Adevice,Bdevice,Cdevice);

   /* copy matrix C from device to the host */
   cudaMemcpy(Chost,Cdevice,sC,cudaMemcpyDeviceToHost);

   /* freeing memory on the device */
   cudaFree(Adevice); cudaFree(Bdevice); cudaFree(Cdevice);

   if(mode == 1)
   {
      printf("the resulting %d-by-%d matrix C :\n",n,p);
      writeMatrix(n,p,Chost);
   }

Two Dimensional Arrays of Threads
---------------------------------

Using ``threadIdx.x`` and ``threadIdx.y``
instead of a one dimensional organization of the threads in a block,
we can make the :math:`(i,j)`-th thread compute :math:`c_{i,j}`.
The main program is then changed into

::

   /* kernel invocation launching n*p threads */
   dim3 dimGrid(1,1);
   dim3 dimBlock(n,p);
   matrixMultiply<<<dimGrid,dimBlock>>>
      (n,m,p,Adevice,Bdevice,Cdevice);

The above construction creates a grid of one block.
The block has :math:`n \times p` threads:

* ``threadIdx.x`` will range between 0 and :math:`n-1`, and
* ``threadIdx.y`` will range between 0 and :math:`p-1`.

The new kernel is then:

::

   __global__ void matrixMultiply
    ( int n, int m, int p, float *A, float *B, float *C )
   /*
    * Multiplies the n-by-m matrix A
    * with the m-by-p matrix B into the matrix C.
    * The (i,j)-th thread computes the (i,j)-th element of C. */
   {
      int i = threadIdx.x;
      int j = threadIdx.y;
      int ell = i*p + j;

      C[ell] = 0.0;
      float *pB;
      for(int k=0; k<m; k++)
      {
         pB = &B[k*p+j];            /* the (k,j)-th element of B */
         C[ell] += A[i*m+k]*(*pB);  /* add a(i,k)*b(k,j) to c(i,j) */
      }
   }
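To verify the kernels on small dimensions, one may recompute the product
on the host and compare with the matrix copied back from the device.
The helper below is a minimal sketch, not part of the original program,
and the name ``checkProduct`` is hypothetical:

::

   #include <math.h>
   #include <stdio.h>

   __host__ void checkProduct
    ( int n, int m, int p, float *A, float *B, float *C )
   /*
    * Recomputes the product of the n-by-m matrix A with the m-by-p
    * matrix B on the host and reports the largest difference with
    * the n-by-p matrix C computed on the device. */
   {
      float maxerr = 0.0;

      for(int i=0; i<n; i++)
         for(int j=0; j<p; j++)
         {
            float sum = 0.0;
            for(int k=0; k<m; k++)
               sum += A[i*m+k]*B[k*p+j];

            float err = fabsf(sum - C[i*p+j]);
            if(err > maxerr) maxerr = err;
         }

      printf("largest difference with the host : %.3e\n", maxerr);
   }

Calling ``checkProduct(n,m,p,Ahost,Bhost,Chost)`` after the copy of
:math:`C` from the device gives a quick sanity check; for modest
dimensions the reported difference should be small, of the order of
the roundoff in single precision.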