Four Sample Questions
=====================

The four questions in this lecture are representative of some of the topics
covered in the course.

Scaled Speedup
--------------

Benchmarking of a program running on a 12-processor machine shows that
5% of the operations are done sequentially, that is: 5% of the time
only a single processor is working while the rest are idle.
Compute the scaled speedup.

The formula for scaled speedup is
:math:`\displaystyle S_s(p) \leq \frac{st + p(1-s)t}{t} = s + p(1-s) = p + (1-p) s`.
Evaluating this formula for :math:`s = 0.05`, :math:`p = 12` yields

.. math::

   S_s(12) = 12 + (1-12) \, 0.05 = 11.45.

Network Topologies
------------------

Show that a hypercube network topology has enough connections
for a fan-in gathering of results.

Consider :numref:`figfaninreview`, which illustrates the fan-in algorithm
for 8 nodes.

.. _figfaninreview:

.. figure:: ./figfaninreview.png
   :align: center

   Fanning in the result for 8 nodes.

For the example in :numref:`figfaninreview`, three steps are executed:

1. 001 :math:`\rightarrow` 000; 011 :math:`\rightarrow` 010;
   101 :math:`\rightarrow` 100; 111 :math:`\rightarrow` 110
2. 010 :math:`\rightarrow` 000; 110 :math:`\rightarrow` 100
3. 100 :math:`\rightarrow` 000

To show that a hypercube network has sufficiently many connections
for the fan-in algorithm, we use a proof by induction.

* The base case: we verified the claim for 1, 2, 4, and 8 nodes.

* Assume we have enough connections for a :math:`2^k` hypercube.
  We need to show that we have enough connections
  for a :math:`2^{k+1}` hypercube:

  1. In the first :math:`k` steps:

     * node 0 gathers from nodes :math:`1, 2, \ldots, 2^k-1`;
     * node :math:`2^k` gathers from nodes :math:`2^k+1, 2^k+2, \ldots, 2^{k+1} - 1`.

  2. In step :math:`k+1`: node :math:`2^k` can send to node 0,
     because only one bit in :math:`2^k` differs from 0.

Task Graph Scheduling
---------------------

Given are two vectors :math:`{\bf x}` and :math:`{\bf y}`,
both of length :math:`n`, with :math:`x_i \not= x_j` for all :math:`i \not= j`.
Consider the code below:

::

   for i from 2 to n do
       for j from i to n do
           y[j] = (y[i-1] - y[j])/(x[i-1] - x[j])

1. Define the task graph for a parallel computation of :math:`{\bf y}`.
2. Do a critical path analysis on the graph to determine
   the upper limit of the speedup.

For :math:`n=4`, the computations are shown in the table below:

.. math::

   \begin{array}{ccc}
   {\displaystyle y_2 = \frac{y_1 - y_2}{x_1 - x_2}} & & \\
   \vspace{-2mm} \\
   {\displaystyle y_3 = \frac{y_1 - y_3}{x_1 - x_3}} &
   {\displaystyle y_3 = \frac{y_2 - y_3}{x_2 - x_3}} & \\
   \vspace{-2mm} \\
   {\displaystyle y_4 = \frac{y_1 - y_4}{x_1 - x_4}} &
   {\displaystyle y_4 = \frac{y_2 - y_4}{x_2 - x_4}} &
   {\displaystyle y_4 = \frac{y_3 - y_4}{x_3 - x_4}} \\
   \end{array}

If the computations happen row by row, then there is no parallelism.
Observe that the elements in each column can be computed
independently of each other. Label the computation in row :math:`i`
and column :math:`j` by :math:`T_{i,j}` and consider the graph
in :numref:`figtaskgraphreview`.

.. _figtaskgraphreview:

.. figure:: ./figtaskgraphreview.png
   :align: center

   The task graph for :math:`n=4`.

For :math:`n = 4`, with 3 processors, it takes three steps
(one per column) to compute the table. As the sequential computation
takes 6 operations, the speedup is 6/3 = 2. Each path leading
to :math:`T_{4,4}` has two edges or three nodes, so the length
of the critical path is 2. For any :math:`n`, with :math:`n-1` processors,
it takes :math:`n-1` steps, while the sequential computation takes
:math:`n(n-1)/2` operations, leading to a speedup of
:math:`n(n-1)/2 \times 1/(n-1) = n/2`.
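
To make the column-wise independence concrete, below is a minimal Python
sketch (not part of the original question) that computes the table column
by column and evaluates each column as an independent map over a thread
pool. The function name ``divided_differences`` is an illustrative choice,
and the thread pool only exposes the dependence structure; it does not
yield real speedup for pure Python arithmetic.

::

   from concurrent.futures import ThreadPoolExecutor

   def divided_differences(x, y, workers=3):
       """Update y column by column, as in the loop above.

       Within column i, every update of y[j] reads only y[i-1] and y[j],
       so the inner loop is an independent map that may run in parallel.
       """
       n = len(x)
       y = list(y)                           # work on a copy
       with ThreadPoolExecutor(max_workers=workers) as pool:
           for i in range(1, n):             # 0-based column index
               cols = list(range(i, n))
               new = list(pool.map(
                   lambda j: (y[i-1] - y[j]) / (x[i-1] - x[j]), cols))
               for j, v in zip(cols, new):   # write back once the column is done
                   y[j] = v
       return y

   # For n = 4, with y sampled from the line 2x + 1:
   print(divided_differences([1.0, 2.0, 4.0, 8.0], [3.0, 5.0, 9.0, 17.0]))
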
Compute Bound or Memory Bound
-----------------------------

A kernel performs 36 floating-point operations and seven 32-bit
global memory accesses per thread. Consider two GPUs :math:`A`
and :math:`B`, with the following properties:

* :math:`A` has a peak performance of 200 GFLOPS
  and a peak memory bandwidth of 100 GB/second;
* :math:`B` has a peak performance of 300 GFLOPS
  and a peak memory bandwidth of 250 GB/second.

For each GPU, is the kernel compute bound or memory bound?

The CGMA ratio of the kernel is
:math:`\displaystyle \frac{36}{7 \times 4} = \frac{36}{28} = \frac{9}{7} \frac{\mbox{operations}}{\mbox{byte}}`.

Taking the ratio of the peak performance and the peak memory bandwidth
of GPU :math:`A` gives :math:`200/100 = 2` operations per byte.
As :math:`9/7 < 2`, the kernel is memory bound on GPU :math:`A`.
Alternatively, per thread, GPU :math:`A` spends
:math:`\displaystyle \frac{36}{200 \times 2^{30}}` seconds on the operations
and :math:`\displaystyle \frac{28}{100 \times 2^{30}}` seconds on the memory transfers.
As :math:`36/200 = 0.18 < 0.28 = 28/100`,
more time is spent on transfers than on operations.

For GPU :math:`B`, the ratio is :math:`300/250 = 6/5` operations per byte.
As :math:`9/7 > 6/5`, the kernel is compute bound on GPU :math:`B`.
Alternatively, per thread, GPU :math:`B` spends
:math:`\displaystyle \frac{36}{300 \times 2^{30}}` seconds on the operations
and :math:`\displaystyle \frac{28}{250 \times 2^{30}}` seconds on the memory transfers.
As :math:`36/300 = 0.12 > 0.112 = 28/250`,
more time is spent on computations than on transfers.
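
The ratio test generalizes to any kernel and device. Below is a minimal
Python sketch (not part of the original question) that compares the
kernel's operations-per-byte ratio with the device's; the function name
``is_compute_bound`` and its argument names are illustrative.

::

   def is_compute_bound(flops_per_thread, bytes_per_thread,
                        peak_gflops, peak_gb_per_s):
       """Return True if the kernel is compute bound on the given device."""
       kernel_ratio = flops_per_thread / bytes_per_thread   # CGMA, operations per byte
       device_ratio = peak_gflops / peak_gb_per_s           # operations the device can do per byte moved
       return kernel_ratio > device_ratio

   # The kernel from the question: 36 operations and 7 x 4 = 28 bytes per thread.
   print(is_compute_bound(36, 28, 200, 100))   # GPU A: False, so memory bound
   print(is_compute_bound(36, 28, 300, 250))   # GPU B: True, so compute bound
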