Thread Organization and Matrix Multiplication
=============================================

In this lecture we look at the problem of multiplying two matrices,
from the perspective of the thread organization.

Thread Organization
-------------------

The code that runs on the GPU is defined in a function, the kernel.
A kernel launch creates a grid of blocks,
and each block has one or more threads.
The organization of the grids and blocks can be 1D, 2D, or 3D.

During the running of the kernel:

* Threads in the same block are executed simultaneously.

* Blocks are scheduled by the streaming multiprocessors.

The NVIDIA Tesla C2050 has 14 streaming multiprocessors,
each with 32 cores, executing threads in groups of 32 (the warp size),
so :math:`14 \times 32 = 448` threads can run simultaneously.
For the K20C, the corresponding numbers are 13 multiprocessors,
192 cores per multiprocessor, and 2,496 cores in total;
for the P100, they are respectively 56, 64, and 3,584.
A picture of the scalable programming model was shown
in :numref:`figscalprogmodel`.

All threads execute the same code, defined by the kernel.
The builtin variable ``threadIdx``

* identifies every thread in a block uniquely; and

* defines the data processed by the thread.

The builtin variable ``blockDim`` holds the number of threads
in a block.  In a one dimensional organization,
we use only ``threadIdx.x`` and ``blockDim.x``.
For 2D and 3D organizations, the other components are used as well:

* ``threadIdx.y`` ranges from 0 to ``blockDim.y`` - 1;

* ``threadIdx.z`` ranges from 0 to ``blockDim.z`` - 1.

The grid consists of *N* blocks,
with :math:`{\tt blockIdx.x} \in \{0,1,\ldots,N-1\}`.
Within each block,
:math:`{\tt threadIdx.x} \in \{0,1,\ldots,{\tt blockDim.x}-1\}`.
The organization of the data for each thread
is shown in :numref:`figdata4thread`.

.. _figdata4thread:

.. figure:: ./figdata4thread.png
   :align: center

   Data mapped to threads with block and thread indices.

Suppose the kernel is defined by the function ``F``,
with input argument ``x`` and output argument ``y``,
then the execution configuration parameters are set as below:

::

   dim3 dimGrid(128,1,1);
   dim3 dimBlock(32,1,1);
   F<<<dimGrid,dimBlock>>>(x,y);

which launches a grid of 128 blocks.
The grid is a one dimensional array.
Each block in the grid is also one dimensional and has 32 threads.

For a multidimensional thread organization,
the limitations of the Tesla C2050/C2070 are as follows:

* Maximum number of threads per block: 1,024.

* Maximum sizes of each dimension of a block:
  :math:`1,024 \times 1,024 \times 64`.
  Because 1,024 is the upper limit for the number of threads
  in a block, the largest square 2D block is :math:`32 \times 32`,
  as :math:`32^2 = 1,024`.

* Maximum sizes of each dimension of a grid:
  :math:`65,535 \times 65,535 \times 65,535`.
  65,535 is the upper limit for the builtin variables
  ``gridDim.x``, ``gridDim.y``, and ``gridDim.z``.

On the K20C and the P100, the limitations are as follows:

* Maximum number of threads per block: 1,024.

* Maximum sizes of each dimension of a block:
  :math:`1,024 \times 1,024 \times 64`.

* Maximum sizes of each dimension of a grid:
  :math:`2,147,483,647 \times 65,535 \times 65,535`.

Consider the following 3D example.
Suppose the function ``F`` defines the kernel, with argument ``x``, then

::

   dim3 dimGrid(3,2,4);
   dim3 dimBlock(5,6,2);
   F<<<dimGrid,dimBlock>>>(x);

launches a grid with

* :math:`3 \times 2 \times 4` blocks; and

* each block has :math:`5 \times 6 \times 2` threads.
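To make this concrete, the kernel below is a minimal sketch
(the name ``whoAmI`` and the index formula are illustrations,
not part of the original code): each of the
:math:`24 \times 60 = 1,440` threads created by the configuration
above computes a linear index that is unique over the whole grid.

::

   __global__ void whoAmI ( int *x )
   /*
    * Each thread computes a globally unique index
    * from its block and thread coordinates
    * and writes that index to the array x. */
   {
      // linear index of the block within the grid
      int b = (blockIdx.z*gridDim.y + blockIdx.y)*gridDim.x
            + blockIdx.x;
      // linear index of the thread within its block
      int t = (threadIdx.z*blockDim.y + threadIdx.y)*blockDim.x
            + threadIdx.x;
      // globally unique index of the thread
      int i = b*(blockDim.x*blockDim.y*blockDim.z) + t;

      x[i] = i;
   }

Launched as ``whoAmI<<<dimGrid,dimBlock>>>(x)``,
with ``x`` an array of 1,440 integers in global memory,
every entry of ``x`` is written exactly once.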
Matrix Matrix Multiplication
----------------------------

With a three dimensional grid we can define submatrices.
Consider for example a grid of dimension :math:`2 \times 2 \times 1`
to store a 4-by-4 matrix in tiles of dimension
:math:`2 \times 2 \times 1`, as in :numref:`figsubmatrixgrid`.

.. _figsubmatrixgrid:

.. figure:: ./figsubmatrixgrid.png
   :align: center

   Storing a tiled matrix in a grid.

A kernel launch with a grid of dimensions :math:`2 \times 2 \times 1`,
where each block has dimensions :math:`2 \times 2 \times 1`,
creates 16 threads.
The mapping of the entries in the matrix to the threads
is illustrated in :numref:`figsubmatrixgridmap`.

.. _figsubmatrixgridmap:

.. figure:: ./figsubmatrixgridmap.png
   :align: center

   Mapping threads to entries in the matrix.

The linear address calculation
is illustrated in :numref:`figsubmatrixgridaddress`.

.. _figsubmatrixgridaddress:

.. figure:: ./figsubmatrixgridaddress.png
   :align: center

   Linear address calculation for threads and submatrices.

The main function in the CUDA code to organize the threads
is listed below.

::

   #include <stdio.h>
   #include <stdlib.h>

   int main ( int argc, char* argv[] )
   {
      const int xb = 2; /* gridDim.x */
      const int yb = 2; /* gridDim.y */
      const int zb = 1; /* gridDim.z */
      const int xt = 2; /* blockDim.x */
      const int yt = 2; /* blockDim.y */
      const int zt = 1; /* blockDim.z */
      const int n = xb*yb*zb*xt*yt*zt;

      printf("allocating array of length %d...\n",n);

      /* allocating and initializing on the host */
      int *xhost = (int*)calloc(n,sizeof(int));
      for(int i=0; i<n; i++) xhost[i] = 0;

      /* allocating memory on the device and copying the host data */
      size_t sx = n*sizeof(int);
      int *xdevice;
      cudaMalloc((void**)&xdevice,sx);
      cudaMemcpy(xdevice,xhost,sx,cudaMemcpyHostToDevice);

      /* launching the kernel with a 3D thread organization */
      dim3 dimGrid(xb,yb,zb);
      dim3 dimBlock(xt,yt,zt);
      matrixFill<<<dimGrid,dimBlock>>>(xdevice);

The kernel is defined in the code below.

::

   __global__ void matrixFill ( int *x )
   /*
    * Fills the matrix using blockIdx and threadIdx. */
   {
      int bx = blockIdx.x;
      int by = blockIdx.y;
      int tx = threadIdx.x;
      int ty = threadIdx.y;

      int row = by*blockDim.y + ty;
      int col = bx*blockDim.x + tx;
      int dim = gridDim.x*blockDim.x;
      int i = row*dim + col;

      x[i] = i;
   }

Then the main program continues with the copying to the host
and the writing of the result.

::

      /* copy data from device to host */
      cudaMemcpy(xhost,xdevice,sx,cudaMemcpyDeviceToHost);
      cudaFree(xdevice);

      int *p = xhost;
      for(int i1=0; i1 < xb; i1++)
         for(int i2=0; i2 < yb; i2++)
            for(int i3=0; i3 < zb; i3++)
               for(int i4=0; i4 < xt; i4++)
                  for(int i5=0; i5 < yt; i5++)
                     for(int i6=0; i6 < zt; i6++)
                        printf("x[%d][%d][%d][%d][%d][%d] = %d\n",
                               i1,i2,i3,i4,i5,i6,*(p++));

      return 0;
   }
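Because the kernel stores at position ``i`` the value ``i`` itself,
the mapping is easy to check on the host.
The snippet below is a minimal sketch, not part of the original
program; it could replace the nested loops above to verify that
every entry holds its own linear address.

::

   /* check that entry i holds the value i, which confirms
      the row major mapping i = row*dim + col of the kernel */
   int errors = 0;
   for(int i=0; i<n; i++)
      if(xhost[i] != i) errors++;
   printf("number of wrong entries : %d\n",errors);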
In a block, all threads run independently.
CUDA allows threads in the same block to coordinate their activities
with the barrier synchronization function ``__syncthreads()``.
A thread executing ``__syncthreads()`` is held at the calling location
in the code until every thread in the block has reached that location.
Placing a ``__syncthreads()`` thus ensures that all threads in a block
have completed a task before moving on.

Consider the tiled matrix multiplication,
as shown in :numref:`figsubmatrixtiledmul`.

.. _figsubmatrixtiledmul:

.. figure:: ./figsubmatrixtiledmul.png
   :align: center

   Tiled matrix multiplication with shared memory.

In tiled matrix matrix multiplication with shared memory,
all threads in a block collaborate to copy the tiles :math:`A_{i,k}`
and :math:`B_{k,j}` from global memory to shared memory.
Here is the need for thread synchronization.
Before the calculation of the inner products,
all threads must have finished their copy statement,
so they all execute a ``__syncthreads()``.
Every thread then computes one inner product.
After this computation a second synchronization is needed:
before moving on to the next tile, all threads must have finished,
so they all execute a ``__syncthreads()`` after computing
their inner product and before moving on to the next phase.

Let us then revisit the kernel of ``matrixMul``
and consider the code below.

::

   template <int BLOCK_SIZE> __global__ void matrixMul
      ( float* C, float* A, float* B, int wA, int wB )
   {
      int bx = blockIdx.x;  // Block index
      int by = blockIdx.y;
      int tx = threadIdx.x; // Thread index
      int ty = threadIdx.y;

      // Index of the first sub-matrix of A processed by the block
      int aBegin = wA * BLOCK_SIZE * by;
      // Index of the last sub-matrix of A processed by the block
      int aEnd = aBegin + wA - 1;
      // Step size used to iterate through the sub-matrices of A
      int aStep = BLOCK_SIZE;
      // Index of the first sub-matrix of B processed by the block
      int bBegin = BLOCK_SIZE * bx;
      // Step size used to iterate through the sub-matrices of B
      int bStep = BLOCK_SIZE * wB;

      // Csub is used to store the element of the block sub-matrix
      // that is computed by the thread
      float Csub = 0;

      // Loop over all the sub-matrices of A and B
      // required to compute the block sub-matrix
      for (int a = aBegin, b = bBegin;
           a <= aEnd; a += aStep, b += bStep)
      {
         // Declaration of the shared memory array As used to
         // store the sub-matrix of A
         __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
         // Declaration of the shared memory array Bs used to
         // store the sub-matrix of B
         __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

         // Load the matrices from device memory to shared memory;
         // each thread loads one element of each matrix
         As[ty][tx] = A[a + wA * ty + tx];
         Bs[ty][tx] = B[b + wB * ty + tx];

         // Synchronize to make sure the matrices are loaded
         __syncthreads();

         // Multiply the two matrices together; each thread
         // computes one element of the block sub-matrix
   #pragma unroll
         for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[ty][k] * Bs[k][tx];

         // Synchronize to make sure that the preceding computation
         // is done before loading two new sub-matrices
         // of A and B in the next iteration
         __syncthreads();
      }
      // Write the block sub-matrix to device memory;
      // each thread writes one element
      int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
      C[c + wB * ty + tx] = Csub;
   }

Bibliography
------------

1. NVIDIA CUDA Programming Guide.

2. Vasily Volkov and James W. Demmel: **Benchmarking GPUs to tune
   dense linear algebra**. In *Proceedings of the 2008 ACM/IEEE
   Conference on Supercomputing*, Article No. 31. IEEE Press, 2008.

Exercises
---------

1. Investigate the performance of the matrix-matrix multiplication
   with PyCUDA, comparing with the ``numpy`` implementation.

2. Find the limitations of the grid and block sizes
   for the graphics card in your laptop or desktop.

3. Extend the simple code with the three dimensional thread
   organization to a tiled matrix-vector multiplication,
   for numbers generated at random as 0 or 1.
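As a starting point for the second exercise, the limits of
a particular card can be queried at run time with
``cudaGetDeviceProperties``.  The program below is a minimal sketch,
not part of the original code; it prints the relevant fields
of the ``cudaDeviceProp`` structure.

::

   #include <stdio.h>
   #include <cuda_runtime.h>

   int main ( void )
   {
      cudaDeviceProp p;
      cudaGetDeviceProperties(&p,0); // properties of device 0

      printf("name : %s\n",p.name);
      printf("maximum number of threads per block : %d\n",
             p.maxThreadsPerBlock);
      printf("maximum sizes of each dimension of a block :"
             " %d x %d x %d\n",
             p.maxThreadsDim[0],p.maxThreadsDim[1],p.maxThreadsDim[2]);
      printf("maximum sizes of each dimension of a grid :"
             " %d x %d x %d\n",
             p.maxGridSize[0],p.maxGridSize[1],p.maxGridSize[2]);

      return 0;
   }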