Thread Organization and Matrix Multiplication
=============================================

In this lecture we look at the problem of multiplying two matrices,
from the perspective of the thread organization.

Thread Organization
-------------------

The code that runs on the GPU is defined in a function, the kernel.
A kernel launch creates a grid of blocks,
and each block has one or more threads.
The organization of the grids and blocks can be 1D, 2D, or 3D.

During the running of the kernel:

* Threads in the same block are executed simultaneously.

* Blocks are scheduled by the streaming multiprocessors.

The NVIDIA Tesla C2050 has 14 streaming multiprocessors,
each with 32 cores, executing threads in groups of 32 (the warp size),
so :math:`14 \times 32 = 448` threads can run simultaneously.
For the K20C, the corresponding numbers are 13 multiprocessors,
192 cores per multiprocessor, and 2,496 cores in total;
for the P100, they are respectively 56, 64, and 3,584.
A picture of the scalable programming model was shown
in :numref:`figscalprogmodel`.

All threads execute the same code, defined by the kernel.
The builtin variable ``threadIdx``

* identifies every thread in a block uniquely; and

* defines the data processed by the thread.

The builtin variable ``blockDim`` holds the number of threads
in a block.  In a one dimensional organization,
we use only ``threadIdx.x`` and ``blockDim.x``.
For 2D and 3D organizations, the other components are used as well:

* ``threadIdx.y`` ranges from 0 to ``blockDim.y`` - 1;

* ``threadIdx.z`` ranges from 0 to ``blockDim.z`` - 1.

The grid consists of *N* blocks,
with :math:`{\tt blockIdx.x} \in \{0,1,\ldots,N-1\}`.
Within each block,
:math:`{\tt threadIdx.x} \in \{0,1,\ldots,{\tt blockDim.x}-1\}`.
The organization of the data for each thread
is shown in :numref:`figdata4thread`.

.. _figdata4thread:

.. figure:: ./figdata4thread.png
   :align: center

   Data mapped to threads with block and thread indices.

Suppose the kernel is defined by the function ``F``,
with input argument ``x`` and output argument ``y``,
then the execution configuration parameters are set as below:

::

   dim3 dimGrid(128,1,1);
   dim3 dimBlock(32,1,1);
   F<<<dimGrid,dimBlock>>>(x,y);

which launches a grid of 128 blocks.
The grid is a one dimensional array.
Each block in the grid is also one dimensional and has 32 threads.

For a multidimensional thread organization,
the limitations of the Tesla C2050/C2070 are as follows:

* Maximum number of threads per block: 1,024.

* Maximum sizes of each dimension of a block:
  :math:`1,024 \times 1,024 \times 64`.
  Because 1,024 is the upper limit for the number of threads
  in a block, the largest square 2D block is :math:`32 \times 32`,
  as :math:`32^2 = 1,024`.

* Maximum sizes of each dimension of a grid:
  :math:`65,535 \times 65,535 \times 65,535`.
  65,535 is the upper limit for the builtin variables
  ``gridDim.x``, ``gridDim.y``, and ``gridDim.z``.

On the K20C and the P100, the limitations are as follows:

* Maximum number of threads per block: 1,024.

* Maximum sizes of each dimension of a block:
  :math:`1,024 \times 1,024 \times 64`.

* Maximum sizes of each dimension of a grid:
  :math:`2,147,483,647 \times 65,535 \times 65,535`.

Consider the following 3D example.
Suppose the function ``F`` defines the kernel, with argument ``x``, then

::

   dim3 dimGrid(3,2,4);
   dim3 dimBlock(5,6,2);
   F<<<dimGrid,dimBlock>>>(x);

launches a grid with

* :math:`3 \times 2 \times 4` blocks; and

* each block has :math:`5 \times 6 \times 2` threads.
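To make this concrete, the kernel below is a minimal sketch
(the name ``whoAmI`` and the index formula are illustrations,
not part of the original code): each of the
:math:`24 \times 60 = 1,440` threads created by the configuration
above computes a linear index that is unique over the whole grid.

::

   __global__ void whoAmI ( int *x )
   /*
    * Each thread computes a globally unique index
    * from its block and thread coordinates
    * and writes that index to the array x. */
   {
      // linear index of the block within the grid
      int b = (blockIdx.z*gridDim.y + blockIdx.y)*gridDim.x
            + blockIdx.x;
      // linear index of the thread within its block
      int t = (threadIdx.z*blockDim.y + threadIdx.y)*blockDim.x
            + threadIdx.x;
      // globally unique index of the thread
      int i = b*(blockDim.x*blockDim.y*blockDim.z) + t;

      x[i] = i;
   }

Launched as ``whoAmI<<<dimGrid,dimBlock>>>(x)``,
with ``x`` an array of 1,440 integers in global memory,
every entry of ``x`` is written exactly once.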
Matrix Matrix Multiplication
----------------------------

With a three dimensional grid we can define submatrices.
Consider for example a grid of dimension :math:`2 \times 2 \times 1`
to store a 4-by-4 matrix in tiles of dimension
:math:`2 \times 2 \times 1`, as in :numref:`figsubmatrixgrid`.

.. _figsubmatrixgrid:

.. figure:: ./figsubmatrixgrid.png
   :align: center

   Storing a tiled matrix in a grid.

A kernel launch with a grid of dimensions :math:`2 \times 2 \times 1`,
where each block has dimensions :math:`2 \times 2 \times 1`,
creates 16 threads.
The mapping of the entries in the matrix to the threads
is illustrated in :numref:`figsubmatrixgridmap`.

.. _figsubmatrixgridmap:

.. figure:: ./figsubmatrixgridmap.png
   :align: center

   Mapping threads to entries in the matrix.

The linear address calculation
is illustrated in :numref:`figsubmatrixgridaddress`.

.. _figsubmatrixgridaddress:

.. figure:: ./figsubmatrixgridaddress.png
   :align: center

   Linear address calculation for threads and submatrices.

The main function in the CUDA code to organize the threads
is listed below.

::

   #include <stdio.h>
   #include <stdlib.h>

   int main ( int argc, char* argv[] )
   {
      const int xb = 2; /* gridDim.x */
      const int yb = 2; /* gridDim.y */
      const int zb = 1; /* gridDim.z */
      const int xt = 2; /* blockDim.x */
      const int yt = 2; /* blockDim.y */
      const int zt = 1; /* blockDim.z */
      const int n = xb*yb*zb*xt*yt*zt;

      printf("allocating array of length %d...\n",n);

      /* allocating and initializing on the host */
      int *xhost = (int*)calloc(n,sizeof(int));
      for(int i=0; i<n; i++) xhost[i] = 0;

      /* allocating memory on the device and copying the host data */
      size_t sx = n*sizeof(int);
      int *xdevice;
      cudaMalloc((void**)&xdevice,sx);
      cudaMemcpy(xdevice,xhost,sx,cudaMemcpyHostToDevice);

      /* launching the kernel with a 3D thread organization */
      dim3 dimGrid(xb,yb,zb);
      dim3 dimBlock(xt,yt,zt);
      matrixFill<<<dimGrid,dimBlock>>>(xdevice);

The kernel is defined in the code below.

::

   __global__ void matrixFill ( int *x )
   /*
    * Fills the matrix using blockIdx and threadIdx. */
   {
      int bx = blockIdx.x;
      int by = blockIdx.y;
      int tx = threadIdx.x;
      int ty = threadIdx.y;

      int row = by*blockDim.y + ty;
      int col = bx*blockDim.x + tx;
      int dim = gridDim.x*blockDim.x;
      int i = row*dim + col;

      x[i] = i;
   }

Then the main program continues with the copying to the host
and the writing of the result.

::

      /* copy data from device to host */
      cudaMemcpy(xhost,xdevice,sx,cudaMemcpyDeviceToHost);
      cudaFree(xdevice);

      int *p = xhost;
      for(int i1=0; i1 < xb; i1++)
         for(int i2=0; i2 < yb; i2++)
            for(int i3=0; i3 < zb; i3++)
               for(int i4=0; i4 < xt; i4++)
                  for(int i5=0; i5 < yt; i5++)
                     for(int i6=0; i6 < zt; i6++)
                        printf("x[%d][%d][%d][%d][%d][%d] = %d\n",
                               i1,i2,i3,i4,i5,i6,*(p++));

      return 0;
   }
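Because the kernel stores at position ``i`` the value ``i`` itself,
the mapping is easy to check on the host.
The snippet below is a minimal sketch, not part of the original
program; it could replace the nested loops above to verify that
every entry holds its own linear address.

::

   /* check that entry i holds the value i, which confirms
      the row major mapping i = row*dim + col of the kernel */
   int errors = 0;
   for(int i=0; i<n; i++)
      if(xhost[i] != i) errors++;
   printf("number of wrong entries : %d\n",errors);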
In a block, all threads run independently.
CUDA allows threads in the same block to coordinate their activities
with the barrier synchronization function ``__syncthreads()``.
A thread executing ``__syncthreads()`` is held at the calling location
in the code until every thread in the block has reached that location.
Placing a ``__syncthreads()`` thus ensures that all threads in a block
have completed a task before moving on.

Consider the tiled matrix multiplication,
as shown in :numref:`figsubmatrixtiledmul`.

.. _figsubmatrixtiledmul:

.. figure:: ./figsubmatrixtiledmul.png
   :align: center

   Tiled matrix multiplication with shared memory.

In tiled matrix matrix multiplication with shared memory,
all threads in a block collaborate to copy the tiles :math:`A_{i,k}`
and :math:`B_{k,j}` from global memory to shared memory.
Here is the need for thread synchronization.
Before the calculation of the inner products,
all threads must have finished their copy statement,
so they all execute a ``__syncthreads()``.
Every thread then computes one inner product.
After this computation a second synchronization is needed:
before moving on to the next tile, all threads must have finished,
so they all execute a ``__syncthreads()`` after computing
their inner product and before moving on to the next phase.

Let us then revisit the kernel of ``matrixMul``
and consider the code below.

::

   template <int BLOCK_SIZE> __global__ void matrixMul
      ( float* C, float* A, float* B, int wA, int wB )
   {
      int bx = blockIdx.x;  // Block index
      int by = blockIdx.y;
      int tx = threadIdx.x; // Thread index
      int ty = threadIdx.y;

      // Index of the first sub-matrix of A processed by the block
      int aBegin = wA * BLOCK_SIZE * by;
      // Index of the last sub-matrix of A processed by the block
      int aEnd = aBegin + wA - 1;
      // Step size used to iterate through the sub-matrices of A
      int aStep = BLOCK_SIZE;
      // Index of the first sub-matrix of B processed by the block
      int bBegin = BLOCK_SIZE * bx;
      // Step size used to iterate through the sub-matrices of B
      int bStep = BLOCK_SIZE * wB;

      // Csub is used to store the element of the block sub-matrix
      // that is computed by the thread
      float Csub = 0;

      // Loop over all the sub-matrices of A and B
      // required to compute the block sub-matrix
      for (int a = aBegin, b = bBegin;
           a <= aEnd; a += aStep, b += bStep)
      {
         // Declaration of the shared memory array As used to
         // store the sub-matrix of A
         __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
         // Declaration of the shared memory array Bs used to
         // store the sub-matrix of B
         __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

         // Load the matrices from device memory to shared memory;
         // each thread loads one element of each matrix
         As[ty][tx] = A[a + wA * ty + tx];
         Bs[ty][tx] = B[b + wB * ty + tx];

         // Synchronize to make sure the matrices are loaded
         __syncthreads();

         // Multiply the two matrices together; each thread
         // computes one element of the block sub-matrix
   #pragma unroll
         for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[ty][k] * Bs[k][tx];

         // Synchronize to make sure that the preceding computation
         // is done before loading two new sub-matrices
         // of A and B in the next iteration
         __syncthreads();
      }
      // Write the block sub-matrix to device memory;
      // each thread writes one element
      int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
      C[c + wB * ty + tx] = Csub;
   }

Bibliography
------------

1. NVIDIA CUDA Programming Guide.

2. Vasily Volkov and James W. Demmel: **Benchmarking GPUs to tune
   dense linear algebra**. In *Proceedings of the 2008 ACM/IEEE
   Conference on Supercomputing*, Article No. 31. IEEE Press, 2008.

Exercises
---------

1. Investigate the performance of the matrix-matrix multiplication
   with PyCUDA, comparing with the ``numpy`` implementation.

2. Find the limitations of the grid and block sizes
   for the graphics card in your laptop or desktop.

3. Extend the simple code with the three dimensional thread
   organization to a tiled matrix-vector multiplication,
   for numbers generated at random as 0 or 1.
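As a starting point for the second exercise, the limits of
a particular card can be queried at run time with
``cudaGetDeviceProperties``.  The program below is a minimal sketch,
not part of the original code; it prints the relevant fields
of the ``cudaDeviceProp`` structure.

::

   #include <stdio.h>
   #include <cuda_runtime.h>

   int main ( void )
   {
      cudaDeviceProp p;
      cudaGetDeviceProperties(&p,0); // properties of device 0

      printf("name : %s\n",p.name);
      printf("maximum number of threads per block : %d\n",
             p.maxThreadsPerBlock);
      printf("maximum sizes of each dimension of a block :"
             " %d x %d x %d\n",
             p.maxThreadsDim[0],p.maxThreadsDim[1],p.maxThreadsDim[2]);
      printf("maximum sizes of each dimension of a grid :"
             " %d x %d x %d\n",
             p.maxGridSize[0],p.maxGridSize[1],p.maxGridSize[2]);

      return 0;
   }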