Four Sample Questions
=====================

The four questions in this lecture are representative of some of the topics
covered in the course.

Scaled Speedup
--------------

Benchmarking of a program running on a 12-processor machine shows that
5% of the operations are done sequentially, that is: 5% of the time
only a single processor is working while the rest are idle.
Compute the scaled speedup.

The formula for scaled speedup is
:math:`\displaystyle S_s(p) \leq \frac{st + p(1-s)t}{t} = s + p(1-s) = p + (1-p) s`.
Evaluating this formula for :math:`s = 0.05`, :math:`p = 12` yields

.. math::

   S_s(12) = 12 + (1-12) \, 0.05 = 11.45.

Network Topologies
------------------

Show that a hypercube network topology has enough connections
for a fan-in gathering of results.

Consider :numref:`figfaninreview`, which illustrates the fan-in algorithm
for 8 nodes.

.. _figfaninreview:

.. figure:: ./figfaninreview.png
   :align: center

   Fanning in the result for 8 nodes.

For the example in :numref:`figfaninreview`, three steps are executed:

1. 001 :math:`\rightarrow` 000; 011 :math:`\rightarrow` 010;
   101 :math:`\rightarrow` 100; 111 :math:`\rightarrow` 110
2. 010 :math:`\rightarrow` 000; 110 :math:`\rightarrow` 100
3. 100 :math:`\rightarrow` 000

To show that a hypercube network has sufficiently many connections
for the fan-in algorithm, we use a proof by induction.

* The base case: we verified the claim for 1, 2, 4, and 8 nodes.

* Assume we have enough connections for a :math:`2^k` hypercube.
  We need to show that we have enough connections
  for a :math:`2^{k+1}` hypercube:

  1. In the first :math:`k` steps:

     * node 0 gathers from nodes :math:`1, 2, \ldots, 2^k-1`;
     * node :math:`2^k` gathers from nodes :math:`2^k+1, 2^k+2, \ldots, 2^{k+1} - 1`.

  2. In step :math:`k+1`: node :math:`2^k` can send to node 0,
     because only one bit in :math:`2^k` differs from 0.

Task Graph Scheduling
---------------------

Given are two vectors :math:`{\bf x}` and :math:`{\bf y}`,
both of length :math:`n`, with :math:`x_i \not= x_j` for all :math:`i \not= j`.
Consider the code below:

::

   for i from 2 to n do
       for j from i to n do
           y[j] = (y[i-1] - y[j])/(x[i-1] - x[j])

1. Define the task graph for a parallel computation of :math:`{\bf y}`.
2. Do a critical path analysis on the graph to determine
   the upper limit of the speedup.

For :math:`n=4`, the computations are shown in the table below:

.. math::

   \begin{array}{ccc}
   {\displaystyle y_2 = \frac{y_1 - y_2}{x_1 - x_2}} & & \\
   \vspace{-2mm} \\
   {\displaystyle y_3 = \frac{y_1 - y_3}{x_1 - x_3}} &
   {\displaystyle y_3 = \frac{y_2 - y_3}{x_2 - x_3}} & \\
   \vspace{-2mm} \\
   {\displaystyle y_4 = \frac{y_1 - y_4}{x_1 - x_4}} &
   {\displaystyle y_4 = \frac{y_2 - y_4}{x_2 - x_4}} &
   {\displaystyle y_4 = \frac{y_3 - y_4}{x_3 - x_4}} \\
   \end{array}

If the computations happen row by row, then there is no parallelism.
Observe that the elements in each column can be computed
independently of each other. Label the computation in row :math:`i`
and column :math:`j` by :math:`T_{i,j}` and consider the graph
in :numref:`figtaskgraphreview`.

.. _figtaskgraphreview:

.. figure:: ./figtaskgraphreview.png
   :align: center

   The task graph for :math:`n=4`.

For :math:`n = 4`, with 3 processors, it takes three steps
(one per column) to compute the table. As the sequential computation
takes 6 operations, the speedup is 6/3 = 2. Each path leading
to :math:`T_{4,4}` has two edges or three nodes, so the length
of the critical path is 2. For any :math:`n`, with :math:`n-1` processors,
it takes :math:`n-1` steps, while the sequential computation takes
:math:`n(n-1)/2` operations, leading to a speedup of
:math:`n(n-1)/2 \times 1/(n-1) = n/2`.
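
To make the column-wise independence concrete, below is a minimal Python
sketch (not part of the original question) that computes the table column
by column and evaluates each column as an independent map over a thread
pool. The function name ``divided_differences`` is an illustrative choice,
and the thread pool only exposes the dependence structure; it does not
yield real speedup for pure Python arithmetic.

::

   from concurrent.futures import ThreadPoolExecutor

   def divided_differences(x, y, workers=3):
       """Update y column by column, as in the loop above.

       Within column i, every update of y[j] reads only y[i-1] and y[j],
       so the inner loop is an independent map that may run in parallel.
       """
       n = len(x)
       y = list(y)                           # work on a copy
       with ThreadPoolExecutor(max_workers=workers) as pool:
           for i in range(1, n):             # 0-based column index
               cols = list(range(i, n))
               new = list(pool.map(
                   lambda j: (y[i-1] - y[j]) / (x[i-1] - x[j]), cols))
               for j, v in zip(cols, new):   # write back once the column is done
                   y[j] = v
       return y

   # For n = 4, with y sampled from the line 2x + 1:
   print(divided_differences([1.0, 2.0, 4.0, 8.0], [3.0, 5.0, 9.0, 17.0]))
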
Compute Bound or Memory Bound
-----------------------------

A kernel performs 36 floating-point operations and seven 32-bit
global memory accesses per thread. Consider two GPUs :math:`A`
and :math:`B`, with the following properties:

* :math:`A` has a peak performance of 200 GFLOPS
  and a peak memory bandwidth of 100 GB/second;
* :math:`B` has a peak performance of 300 GFLOPS
  and a peak memory bandwidth of 250 GB/second.

For each GPU, is the kernel compute bound or memory bound?

The CGMA ratio of the kernel is
:math:`\displaystyle \frac{36}{7 \times 4} = \frac{36}{28} = \frac{9}{7} \frac{\mbox{operations}}{\mbox{byte}}`.

Taking the ratio of the peak performance and the peak memory bandwidth
of GPU :math:`A` gives :math:`200/100 = 2` operations per byte.
As :math:`9/7 < 2`, the kernel is memory bound on GPU :math:`A`.
Alternatively, per thread, GPU :math:`A` spends
:math:`\displaystyle \frac{36}{200 \times 2^{30}}` seconds on the operations
and :math:`\displaystyle \frac{28}{100 \times 2^{30}}` seconds on the memory transfers.
As :math:`36/200 = 0.18 < 0.28 = 28/100`,
more time is spent on transfers than on operations.

For GPU :math:`B`, the ratio is :math:`300/250 = 6/5` operations per byte.
As :math:`9/7 > 6/5`, the kernel is compute bound on GPU :math:`B`.
Alternatively, per thread, GPU :math:`B` spends
:math:`\displaystyle \frac{36}{300 \times 2^{30}}` seconds on the operations
and :math:`\displaystyle \frac{28}{250 \times 2^{30}}` seconds on the memory transfers.
As :math:`36/300 = 0.12 > 0.112 = 28/250`,
more time is spent on computations than on transfers.
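
The ratio test generalizes to any kernel and device. Below is a minimal
Python sketch (not part of the original question) that compares the
kernel's operations-per-byte ratio with the device's; the function name
``is_compute_bound`` and its argument names are illustrative.

::

   def is_compute_bound(flops_per_thread, bytes_per_thread,
                        peak_gflops, peak_gb_per_s):
       """Return True if the kernel is compute bound on the given device."""
       kernel_ratio = flops_per_thread / bytes_per_thread   # CGMA, operations per byte
       device_ratio = peak_gflops / peak_gb_per_s           # operations the device can do per byte moved
       return kernel_ratio > device_ratio

   # The kernel from the question: 36 operations and 7 x 4 = 28 bytes per thread.
   print(is_compute_bound(36, 28, 200, 100))   # GPU A: False, so memory bound
   print(is_compute_bound(36, 28, 300, 250))   # GPU B: True, so compute bound
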