Thread Organization and Matrix Multiplication
=============================================

In this lecture we look at the problem of multiplying two matrices,
from the perspective of the thread organization.

Thread Organization
-------------------

The code that runs on the GPU is defined in a function, the kernel.
A kernel launch creates a grid of blocks,
and each block has one or more threads.
The organization of the grids and blocks can be 1D, 2D, or 3D.

During the execution of the kernel:

* Threads in the same block are executed simultaneously.

* Blocks are scheduled by the streaming multiprocessors.

The P100 has 56 streaming multiprocessors (SMs),
and threads are executed in groups of 32 (the warp size).
Each SM has 64 cores, which implies that
:math:`56 \times 64 = 3584` threads can run simultaneously.
The A100 has 108 SMs, also with 64 cores each.

A picture of the scalable programming model was shown
in :numref:`figscalprogmodel`.

All threads execute the same code, defined by the kernel.
The builtin variable ``threadIdx``

* identifies every thread in a block uniquely; and

* defines the data processed by the thread.

The builtin variable ``blockDim`` holds the number of threads in a block.
In a one dimensional organization,
we use only ``threadIdx.x`` and ``blockDim.x``.
For 2D and 3D organizations, the other components are used as well:

* ``threadIdx.y`` ranges over 0, 1, ..., ``blockDim.y`` - 1; and

* ``threadIdx.z`` ranges over 0, 1, ..., ``blockDim.z`` - 1.

The grid consists of *N* blocks,
with :math:`{\tt blockIdx.x} \in \{0,1,\ldots,N-1\}`.
Within each block,
:math:`{\tt threadIdx.x} \in \{0,1,\ldots,{\tt blockDim.x}-1\}`.
The organization of the data for each thread
is shown in :numref:`figdata4thread`.

.. _figdata4thread:

.. figure:: ./figdata4thread.png
   :align: center

   Data mapped to threads with block and thread indices.

Suppose the kernel is defined by the function ``F``,
with input argument ``x`` and output argument ``y``,
then the execution configuration parameters are set as below:

::

   dim3 dimGrid(128,1,1);
   dim3 dimBlock(32,1,1);
   F<<<dimGrid,dimBlock>>>(x,y);

which launches a grid of 128 blocks.
The grid is a one dimensional array.
Each block in the grid is also one dimensional and has 32 threads.

The limitations of the P100 and V100 are as follows:

* Maximum number of threads per block: 1,024.

* Maximum sizes of each dimension of a block:
  :math:`1,024 \times 1,024 \times 64`.
  Because 1,024 is the upper limit for the number of threads in a block,
  the largest square 2D block is :math:`32 \times 32`,
  as :math:`32^2 = 1,024`.

* Maximum sizes of each dimension of a grid:
  :math:`2,147,483,647 \times 65,535 \times 65,535`.
  2,147,483,647 is the upper limit for the builtin variable ``gridDim.x``,
  while 65,535 is the upper limit for the builtin variables
  ``gridDim.y`` and ``gridDim.z``.

The same limitations apply to the A100.
These limits can be queried at run time,
as in the sketch following :numref:`figsubmatrixgrid` below.

Consider the following 3D example.
Suppose the function ``F`` defines the kernel, with argument ``x``, then

::

   dim3 dimGrid(3,2,4);
   dim3 dimBlock(5,6,2);
   F<<<dimGrid,dimBlock>>>(x);

launches a grid with

* :math:`3 \times 2 \times 4` blocks; and

* each block has :math:`5 \times 6 \times 2` threads.

Matrix-Matrix Multiplication
----------------------------

With a three dimensional grid we can define submatrices.
Consider for example a grid of dimension :math:`2 \times 2 \times 1`
to store a 4-by-4 matrix in tiles of dimension :math:`2 \times 2`,
as in :numref:`figsubmatrixgrid`.

.. _figsubmatrixgrid:

.. figure:: ./figsubmatrixgrid.png
   :align: center

   Storing a tiled matrix in a grid.
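As announced above, the limits on the thread organization need not be
hardcoded: the CUDA runtime can report them.
The code below is a minimal sketch, not part of the lecture code,
that queries the properties of device 0 and prints the fields
of ``cudaDeviceProp`` relevant to the thread organization.

::

   #include <stdio.h>
   #include <cuda_runtime.h>

   int main ( void )
   {
      cudaDeviceProp p;
      cudaGetDeviceProperties(&p,0); /* properties of device 0 */

      printf("name : %s\n",p.name);
      printf("multiprocessors : %d\n",p.multiProcessorCount);
      printf("warp size : %d\n",p.warpSize);
      printf("max threads per block : %d\n",p.maxThreadsPerBlock);
      printf("max block dimensions : %d x %d x %d\n",
             p.maxThreadsDim[0],p.maxThreadsDim[1],p.maxThreadsDim[2]);
      printf("max grid dimensions : %d x %d x %d\n",
             p.maxGridSize[0],p.maxGridSize[1],p.maxGridSize[2]);

      return 0;
   }

Compiled with ``nvcc``, this prints the warp size and the maximum
block and grid dimensions for the card in the machine.
We now return to the 4-by-4 matrix stored in tiles.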
A kernel launch with a grid of dimensions :math:`2 \times 2 \times 1`,
where each block has dimensions :math:`2 \times 2 \times 1`,
creates 16 threads.
The mapping of the entries in the matrix to threads
is illustrated in :numref:`figsubmatrixgridmap`.

.. _figsubmatrixgridmap:

.. figure:: ./figsubmatrixgridmap.png
   :align: center

   Mapping threads to entries in the matrix.

The linear address calculation is illustrated
in :numref:`figsubmatrixgridaddress`.

.. _figsubmatrixgridaddress:

.. figure:: ./figsubmatrixgridaddress.png
   :align: center

   Linear address calculation for threads and submatrices.

The main function in the CUDA code to organize the threads
is listed below.

::

   int main ( int argc, char* argv[] )
   {
      const int xb = 2; /* gridDim.x */
      const int yb = 2; /* gridDim.y */
      const int zb = 1; /* gridDim.z */
      const int xt = 2; /* blockDim.x */
      const int yt = 2; /* blockDim.y */
      const int zt = 1; /* blockDim.z */
      const int n = xb*yb*zb*xt*yt*zt;
      printf("allocating array of length %d...\n",n);
      /* allocating and initializing on the host */
      int *xhost = (int*)calloc(n,sizeof(int));
      for(int i=0; i<n; i++) xhost[i] = 0;
      /* copying the data to the device */
      int *xdevice;
      size_t sx = n*sizeof(int);
      cudaMalloc((void**)&xdevice,sx);
      cudaMemcpy(xdevice,xhost,sx,cudaMemcpyHostToDevice);
      /* launching the kernel */
      dim3 dimGrid(xb,yb,zb);
      dim3 dimBlock(xt,yt,zt);
      matrixFill<<<dimGrid,dimBlock>>>(xdevice);

The kernel is defined in the code below.

::

   __global__ void matrixFill ( int *x )
   /*
    * Fills the matrix using blockIdx and threadIdx. */
   {
      int bx = blockIdx.x;
      int by = blockIdx.y;
      int tx = threadIdx.x;
      int ty = threadIdx.y;
      int row = by*blockDim.y + ty;
      int col = bx*blockDim.x + tx;
      int dim = gridDim.x*blockDim.x;
      int i = row*dim + col;

      x[i] = i;
   }

Then the main program continues with the copying to the host
and the writing of the result.

::

      /* copy data from device to host */
      cudaMemcpy(xhost,xdevice,sx,cudaMemcpyDeviceToHost);
      cudaFree(xdevice);

      int *p = xhost;
      for(int i1=0; i1 < xb; i1++)
         for(int i2=0; i2 < yb; i2++)
            for(int i3=0; i3 < zb; i3++)
               for(int i4=0; i4 < xt; i4++)
                  for(int i5=0; i5 < yt; i5++)
                     for(int i6=0; i6 < zt; i6++)
                        printf("x[%d][%d][%d][%d][%d][%d] = %d\n",
                               i1,i2,i3,i4,i5,i6,*(p++));
      return 0;
   }

Submatrices with Threads in CUDA.jl
-----------------------------------

The equivalent computation in Julia, on an NVIDIA GPU with CUDA.jl,
uses the kernel below:

::

   using CUDA

   """
       function matFill!(A)

   fills the array using the blockIdx and threadIdx.
   """
   function matFill!(A)
       bx = blockIdx().x - 1
       by = blockIdx().y - 1
       tx = threadIdx().x - 1
       ty = threadIdx().y - 1
       row = by*blockDim().y + ty
       col = bx*blockDim().x + tx
       dim = gridDim().x*blockDim().x
       idx = 1 + row*dim + col
       A[idx] = idx
       return nothing
   end

Observe the ``- 1`` after the block and thread indices.
Just as arrays in Julia start at index ``1``,
the block and thread indices in CUDA.jl also start at ``1``.
The ``- 1`` makes the index calculation the same as in the C code.

The kernel ``matFill!`` is launched as follows:

::

   xb = 2 # gridDim.x
   yb = 2 # gridDim.y
   zb = 1 # gridDim.z
   xt = 2 # blockDim.x
   yt = 2 # blockDim.y
   zt = 1 # blockDim.z
   dim = xb*yb*zb*xt*yt*zt
   A_h = zeros(dim)
   A_d = CuArray(A_h)
   @cuda threads=(xt, yt, zt) blocks=(xb, yb, zb) matFill!(A_d)
   A_h = Array(A_d)
   println(A_d)
   println(A_h)

Thread Synchronization
----------------------

In a block, all threads run independently.
CUDA allows threads in the same block to coordinate their activities
with the barrier synchronization function ``__syncthreads()``.
A thread executing ``__syncthreads()`` is held at the calling location
in the code until every thread in the block reaches that location.
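To make the effect of the barrier concrete, consider the small kernel
below, a minimal sketch that is not part of the lecture code.
Each thread of a block copies one element into shared memory, and only
after all copies are done, which the ``__syncthreads()`` guarantees,
does each thread read the element written by another thread.
Without the barrier, a thread could read a location
before it is written.

::

   #define BLOCK_SIZE 32

   __global__ void reverseBlock ( float *x )
   /*
    * Reverses BLOCK_SIZE numbers of x within one block
    * of BLOCK_SIZE threads, staging the data in shared memory. */
   {
      __shared__ float buffer[BLOCK_SIZE];
      int i = threadIdx.x;

      buffer[i] = x[i]; /* each thread copies one element */
      __syncthreads();  /* wait until all copies are done */
      x[i] = buffer[blockDim.x-1-i];
   }

The kernel would be launched as ``reverseBlock<<<1,BLOCK_SIZE>>>(x_d)``,
on a device array ``x_d`` of ``BLOCK_SIZE`` floats.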
Placing a ``__syncthreads()`` ensures that all threads in a block
have completed a task before moving on.
Consider the tiled matrix multiplication,
as shown in :numref:`figsubmatrixtiledmul`.

.. _figsubmatrixtiledmul:

.. figure:: ./figsubmatrixtiledmul.png
   :align: center

   Tiled matrix multiplication with shared memory.

In the tiled matrix-matrix multiplication with shared memory,
all threads in a block collaborate to copy the tiles
:math:`A_{i,k}` and :math:`B_{k,j}`
from global memory to shared memory.
Here arises the need for thread synchronization:
before the calculation of the inner products,
all threads must have finished their copy statement,
so they all execute a ``__syncthreads()``.
Every thread then computes one inner product.
After this computation, another synchronization is needed:
all threads must be done with the current tiles,
so they all execute a ``__syncthreads()``
after computing their inner product,
before moving on to the next tile.

Let us then revisit the kernel of ``matrixMul``
and consider the code below.
The macros ``AS`` and ``BS`` abbreviate the indexing
into the shared memory arrays.

::

   #define AS(i, j) As[i][j]
   #define BS(i, j) Bs[i][j]

   template <int BLOCK_SIZE> __global__ void matrixMul
    ( float* C, float* A, float* B, int wA, int wB )
   {
       int bx = blockIdx.x;  // Block index
       int by = blockIdx.y;
       int tx = threadIdx.x; // Thread index
       int ty = threadIdx.y;
       // Index of the first sub-matrix of A processed by the block
       int aBegin = wA * BLOCK_SIZE * by;
       // Index of the last sub-matrix of A processed by the block
       int aEnd = aBegin + wA - 1;
       // Step size used to iterate through the sub-matrices of A
       int aStep = BLOCK_SIZE;
       // Index of the first sub-matrix of B processed by the block
       int bBegin = BLOCK_SIZE * bx;
       // Step size used to iterate through the sub-matrices of B
       int bStep = BLOCK_SIZE * wB;
       // Csub is used to store the element of the block sub-matrix
       // that is computed by the thread
       float Csub = 0;
       // Loop over all the sub-matrices of A and B
       // required to compute the block sub-matrix
       for (int a = aBegin, b = bBegin;
            a <= aEnd; a += aStep, b += bStep)
       {
           // Declaration of the shared memory array As used to
           // store the sub-matrix of A
           __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
           // Declaration of the shared memory array Bs used to
           // store the sub-matrix of B
           __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
           // Load the matrices from device memory
           // to shared memory; each thread loads
           // one element of each matrix
           AS(ty, tx) = A[a + wA * ty + tx];
           BS(ty, tx) = B[b + wB * ty + tx];
           // Synchronize to make sure the matrices are loaded
           __syncthreads();
           // Multiply the two matrices together;
           // each thread computes one element
           // of the block sub-matrix
   #pragma unroll
           for (int k = 0; k < BLOCK_SIZE; ++k)
               Csub += AS(ty, k) * BS(k, tx);
           // Synchronize to make sure that the preceding
           // computation is done before loading two new
           // sub-matrices of A and B in the next iteration
           __syncthreads();
       }
       // Write the block sub-matrix to device memory;
       // each thread writes one element
       int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
       C[c + wB * ty + tx] = Csub;
   }
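The listing above shows only the kernel.
For completeness, a host-side launch could look as in the sketch
below; it assumes the kernel is in the same file, and the names and
matrix dimensions are chosen only for illustration, with all
dimensions multiples of ``BLOCK_SIZE``, as the kernel requires.

::

   #include <cuda_runtime.h>

   #define BLOCK_SIZE 16

   int main ( void )
   {
      const int hA = 64, wA = 64, wB = 64; /* illustration only */
      const size_t sA = hA*wA*sizeof(float);
      const size_t sB = wA*wB*sizeof(float);
      const size_t sC = hA*wB*sizeof(float);

      float *A_d, *B_d, *C_d;
      cudaMalloc((void**)&A_d,sA); /* in a real program, A and B */
      cudaMalloc((void**)&B_d,sB); /* would be copied from host  */
      cudaMalloc((void**)&C_d,sC); /* arrays with cudaMemcpy     */

      /* one block per BLOCK_SIZE-by-BLOCK_SIZE tile of C */
      dim3 dimBlock(BLOCK_SIZE,BLOCK_SIZE);
      dim3 dimGrid(wB/BLOCK_SIZE,hA/BLOCK_SIZE);
      matrixMul<BLOCK_SIZE><<<dimGrid,dimBlock>>>(C_d,A_d,B_d,wA,wB);
      cudaDeviceSynchronize();

      cudaFree(A_d); cudaFree(B_d); cudaFree(C_d);
      return 0;
   }

Observe that the grid is sized after the result matrix :math:`C`:
every block computes one tile of :math:`C`,
and every thread one entry of that tile.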
Exercises
---------

1. Investigate the performance of the matrix-matrix multiplication
   with PyCUDA, comparing with the ``numpy`` implementation.

2. Find the limitations of the grid and block sizes
   for the graphics card on your laptop or desktop.

3. Extend the simple code with the three dimensional
   thread organization to a tiled matrix-vector multiplication,
   for numbers generated at random as 0 or 1.