Solving Triangular Systems
==========================

We apply a type 3 pipeline to solve a triangular linear system.
Rewriting the formulas for the forward substitution
we arrive at an implementation for shared memory computers with OpenMP.
For accurate results, we compute with quad double arithmetic.

Forward Substitution Formulas
-----------------------------

The LU factorization of a matrix :math:`A` reduces the solving
of a linear system to solving two triangular systems.
To solve an *n*-dimensional linear system :math:`A {\bf x} = {\bf b}`
we factor *A* as a product of two triangular matrices,
:math:`A = L U`:

* :math:`L` is lower triangular, :math:`L = [\ell_{i,j}]`,
  :math:`\ell_{i,j} = 0` if :math:`j > i`
  and :math:`\ell_{i,i} = 1`.

* :math:`U` is upper triangular :math:`U = [u_{i,j}]`,
  :math:`u_{i,j} = 0` if :math:`i > j`.

Solving :math:`A {\bf x} = {\bf b}` is equivalent 
to solving :math:`L(U {\bf x}) = {\bf b}`:

1. Forward substitution: :math:`L {\bf y} = {\bf b}`.

2. Backward substitution: :math:`U {\bf x} = {\bf y}`.

Factoring :math:`A` costs :math:`O(n^3)`,
solving triangular systems costs :math:`O(n^2)`.

Expanding the matrix-vector product :math:`L {\bf y}`
in :math:`L {\bf y} = {\bf b}` 
leads to formulas for forward substitution:

.. math::

   \left\{
      \begin{array}{lcr}
         y_1 & = & b_1 \\
         \ell_{2,1} y_1 + y_2 & = & b_2 \\
         \ell_{3,1} y_1 + \ell_{3,2} y_2 + y_3 & = & b_3 \\
           & \vdots & \\
         \ell_{n,1} y_1 + \ell_{n,2} y_2 + \ell_{n,3} y_3
                       + \cdots + \ell_{n,n-1} y_{n-1} + y_n & = & b_n
      \end{array}
   \right.

and solving for the diagonal elements gives

.. math::

   \begin{array}{lcl}
      y_1 & = & b_1 \\
      y_2 & = & b_2 - \ell_{2,1} y_1 \\
      y_3 & = & b_3 - \ell_{3,1} y_1 - \ell_{3,2} y_2 \\
       & \vdots & \\
      y_n & = & b_n - \ell_{n,1} y_1 - \ell_{n,2} y_2
                    - \cdots - \ell_{n,n-1} y_{n-1} \\
   \end{array}

The formulas lead to an algorithm.
For :math:`k=1,2,\ldots,n`:

.. math::

   y_k = b_k - \sum_{i=1}^{k-1} \ell_{k,i} y_i.

Formulated as an algorithm, in pseudocode:

::

   for k from 1 to n do
       y[k] := b[k]
       for i from 1 to k-1 do
           y[k] := y[k] - L[k][i]*y[i].

We count :math:`1 + 2 + \cdots + n-1 = \displaystyle \frac{n(n-1)}{2}`
multiplications and subtractions.

Pipelines are classified into three types:

.. index:: type 1 pipeline

1. Type 1: Speedup only if multiple instances.
   Example: instruction pipeline.

.. index:: type 2 pipeline

2. Type 2: Speedup already if one instance.  Example: pipeline sorting.

.. index:: type 3 pipeline

3. Type 3: Worker continues after passing information through.
   Example: solve :math:`L {\bf y} = {\bf b}`.

Typical for the 3rd type of pipeline is the varying length
of each job, as exemplified in :numref:`figvarypipe`.

.. _figvarypipe:

.. figure:: ./figvarypipe.png
    :align: center

    Space-time diagram for pipeline with stages of varying length.

Parallel Solving
----------------

Using an *n*-stage pipeline,
we assume that *L* is available on every processor. 

.. math:

   \begin{array}{ll}
      {\rm for~} n = 4 = p: &
      y_1 := b_1 \\
    & y_2 := b_2 - \ell_{2,1} \star y_1 \\
    & y_3 := b_3 - \ell_{3,1} \star y_1 - \ell_{3,2} \star y_2 \\
    & y_4 := b_4 - \ell_{4,1} \star y_1 - \ell_{4,2} \star y_2
                 - \ell_{4,3} \star y_3
   \end{array}

The corresponding 4-stage pipeline is shown in :numref:`figlowerpipe`
with the space-time diagram in :numref:`figlowerdiagram`.

.. _figlowerpipe:

.. figure:: ./figlowerpipe.png
    :align: center

    4-stage pipeline to solve a 4-by-4 lower triangular system.

.. _figlowerdiagram:

.. figure:: ./figlowerdiagram.png
    :align: center

    Space-time diagram for solving a 4-by-4 lower triangular system.

In type 3 pipelining, a worker continues after passing results through.
The making of :math:`y_1` available in the next pipeline cycle
is illustrated in :numref:`figtype3pipe`.
The corresponding space-time diagram is
in :numref:`figtype3diagram` and the space-time diagram
in :numref:`figtype3data` shows 
at what time step which component of the solution is.

.. _figtype3pipe:

.. figure:: ./figtype3pipe.png
    :align: center

    Passing :math:`y_1` through the type 3 pipeline.

.. _figtype3diagram:

.. figure:: ./figtype3diagram.png
    :align: center

    Space-time diagram of a type 3 pipeline.

.. _figtype3data:

.. figure:: ./figtype3data.png
    :align: center

    Space-time diagram illustrates the component of the solutions.

We count the steps for :math:`p = 4` or in general, 
for :math:`p = n` as follows.
The latency takes 4 steps for :math:`y_1` to be at :math:`P_4`,
or in general: *n* steps for :math:`y_1` to be at :math:`P_n`. 
It takes then 6 additional steps for :math:`y_4`
to be computed by :math:`P_4`,
or in general: :math:`2n-2` additional steps 
for :math:`y_n` to be computed by :math:`P_n`.
So it takes :math:`n + 2n - 2 = 3n - 2` steps to solve 
an *n*-dimensional triangular system by an *n*-stage pipeline.

::

    y := b
    for i from 2 to n do
        for j from i to n do
            y[j] := y[j] - L[j][i-1]*y[i-1]

Consider for example the
solving of :math:`L {\bf y} = {\bf b}` for :math:`n = 5`.

1. :math:`y := {\bf b}`;

2. :math:`y_2 := y_2 - \ell_{2,1} \star y_1`;

   :math:`y_3 := y_3 - \ell_{3,1} \star y_1`;

   :math:`y_4 := y_4 - \ell_{4,1} \star y_1`;

   :math:`y_5 := y_5 - \ell_{5,1} \star y_1`;

3. :math:`y_3 := y_3 - \ell_{3,2} \star y_2`;

   :math:`y_4 := y_4 - \ell_{4,2} \star y_2`;

   :math:`y_5 := y_5 - \ell_{5,2} \star y_2`;

4. :math:`y_4 := y_4 - \ell_{4,3} \star y_3`;

   :math:`y_5 := y_5 - \ell_{5,3} \star y_3`;

5. :math:`y_5 := y_5 - \ell_{5,4} \star y_4`;

Observe that
all instructions in the *j* loop are independent from each other!

Consider the inner loop in the algorithm to solve :math:`L {\bf y} = {\bf b}`.
We distribute the update of :math:`y_i, y_{i+1}, \ldots, y_n`
among *p* processors.
If :math:`n \gg p`, then we expect a close to optimal speedup.

Parallel Solving with OpenMP
----------------------------

For our parallel solver for triangular systems,
the setup is as follows.
For :math:`L = [ \ell_{i,j} ]`, we generate random
numbers for :math:`\ell_{i,j} \in [0,1]`.
The exact solution :math:`\bf y`: :math:`y_i = 1`,
for :math:`i=1,2,\ldots,n`.
We compute the right hand side :math:`{\bf b} = L {\bf y}`.

For dimensions :math:`n > 800`,
hardware doubles are insufficient.
With hardware doubles, the accumulation of round off
is such that we loose all accuracy in :math:`y_n`.
We recall that condition numbers of *n*-dimensional 
triangular systems can grow as large as :math:`2^n`.
Therefore, we use :index:`quad double` arithmetic.

Relying on hardware doubles is problematic:

::

   $ time /tmp/trisol 10
   last number : 1.0000000000000009e+00

   real   0m0.003s    user   0m0.001s    sys   0m0.002s

   $ time /tmp/trisol 100
   last number : 9.9999999999974221e-01

   real   0m0.005s    user   0m0.001s    sys   0m0.002s

   $ time /tmp/trisol 1000
   last number : 2.7244600009080568e+04

   real   0m0.036s    user   0m0.025s    sys   0m0.009s

Allocating a matrix of quad double in the main program
is done as follows:

::

   {
      qd_real b[n],y[n];
      int i,j;

      qd_real **L;
      L = (qd_real**) calloc(n,sizeof(qd_real*));
      for(i=0; i<n; i++)
         L[i] = (qd_real*) calloc(n,sizeof(qd_real));

      srand(time(NULL));
      random_triangular_system(n,L,b);

The function for a random triangular system is defined next:

::

   void random_triangular_system ( int n, qd_real **L, qd_real *b )
   {
      int i,j;
      for(i=0; i<n; i++)
      {
         L[i][i] = 1.0;
         for(j=0; j<i; j++)
         {
            double r = ((double) rand())/RAND_MAX;
            L[i][j] = qd_real(r);
         }
         for(j=i+1; j<n; j++)
            L[i][j] = qd_real(0.0);
      }
      for(i=0; i<n; i++)
      {
         b[i] = qd_real(0.0);
         for(j=0; j<n; j++)
            b[i] = b[i] + L[i][j];
      }
   }

Solving the system is done by the function below.

::

   void solve_triangular_system_swapped ( int n, qd_real **L, qd_real *b, qd_real *y )
   {
      int i,j;

      for(i=0; i<n; i++) y[i] = b[i];
        
      for(i=1; i<n; i++)
      {
         for(j=i; j<n; j++)
            y[j] = y[j] - L[j][i-1]*y[i-1];
      }
   }

Now we insert the OpenMP directives:

.. index:: shared clause

::

   void solve_triangular_system_swapped ( int n, qd_real **L, qd_real *b, qd_real *y )
   {
      int i,j;

      for(i=0; i<n; i++) y[i] = b[i];
     
      for(i=1; i<n; i++)
      {
         #pragma omp parallel shared(L,y) private(j)
         {
            #pragma omp for
            for(j=i; j<n; j++)
               y[j] = y[j] - L[j][i-1]*y[i-1];
         }
      }
   }

Running ``time /tmp/trisol_qd_omp n p``,
for dimension :math:`n = 8,000` for varying number *p* of cores
gives times as in table :numref:`tabtimestrisolqdomp`.

.. _tabtimestrisolqdomp:

.. table:: Times on solving a triangular system with quad doubles.

   +-----+----------+---------+---------+--------+
   | *p* | cpu time | real    | user    | sys    |
   +=====+==========+=========+=========+========+
   | 1   | 21.240s  | 35.095s | 34.493s | 0.597s |
   +-----+----------+---------+---------+--------+
   | 2   | 22.790s  | 25.237s | 36.001s | 0.620s |
   +-----+----------+---------+---------+--------+
   | 4   | 22.330s  | 19.433s | 35.539s | 0.633s |
   +-----+----------+---------+---------+--------+
   | 8   | 23.200s  | 16.726s | 36.398s | 0.611s |
   +-----+----------+---------+---------+--------+
   | 12  | 23.260s  | 15.781s | 36.457s | 0.626s |
   +-----+----------+---------+---------+--------+

The serial part is the generation of the random numbers for *L*
and the computation of :math:`{\bf b} = L {\bf y}`.
Recall Amdahl's Law.

We can compute the serial time, subtracting for :math:`p=1`,
from the real time the cpu time spent in the solver,
i.e.: :math:`35.095 - 21.240 = 13.855`.

For :math:`p = 12`, time spent on the solver is
:math:`15.781 - 13.855 = 1.926`.  Compare 1.926 to 21.240/12 = 1.770.

Exercises
---------

1. Consider the upper triangular system :math:`U {\bf x} = {\bf y}`,
   with :math:`U = [u_{i,j}]`, :math:`u_{i,j} = 0` if :math:`i > j`.
   Derive the formulas and general algorithm to
   compute the components of the solution :math:`{\bf x}`.
   For :math:`n=4`, draw the third type of pipeline.

2. Write a parallel solver with OpenMP to solve :math:`U {\bf x} = {\bf y}`.
   Take for :math:`U` a matrix with random numbers in :math:`[0,1]`,
   compute :math:`{\bf y}` so all components of :math:`{\bf x}` equal one.  
   Test the speedup of your program,
   for large enough values of :math:`n` and a varying number of cores.

3. Describe a parallel solver for upper triangular systems 
   :math:`U {\bf y} = {\bf b}` for distributed memory computers.
   Write a prototype implementation using MPI
   and discuss its scalability.