Parallel Gaussian Elimination
=============================

LU and Cholesky Factorization
-----------------------------

Consider as given a square *n*-by-*n* matrix *A* and a corresponding
right hand side vector :math:`\bf b` of length *n*.
To solve an *n*-dimensional linear system :math:`A {\bf x} = {\bf b}`
we factor *A* as a product of two triangular matrices, :math:`A = L U`:

1. *L* is lower triangular, :math:`L = [\ell_{i,j}]`,
   :math:`\ell_{i,j} = 0` if :math:`j > i` and :math:`\ell_{i,i} = 1`.

2. *U* is upper triangular, :math:`U = [u_{i,j}]`,
   :math:`u_{i,j} = 0` if :math:`i > j`.

Solving :math:`A {\bf x} = {\bf b}` is then equivalent to solving
:math:`L(U {\bf x}) = {\bf b}` in two stages:

1. Forward substitution: :math:`L {\bf y} = {\bf b}`.

2. Backward substitution: :math:`U {\bf x} = {\bf y}`.

Factoring *A* costs :math:`O(n^3)` operations;
solving the two triangular systems costs :math:`O(n^2)`.
For numerical stability, we apply partial pivoting and compute
:math:`PA = LU`, where *P* is a permutation matrix.
The steps in the LU factorization of the matrix *A* are shown in
:numref:`figlufacsteps`.

.. _figlufacsteps:

.. figure:: ./figlufacsteps.png
   :align: center

   Steps in the LU factorization.

For column :math:`j = 1,2,\ldots,n-1` in *A* do

1. find the largest element :math:`a_{i,j}` in column :math:`j`
   (for :math:`i \geq j`);

2. if :math:`i \not= j`, then swap rows :math:`i` and :math:`j`;

3. for :math:`i = j+1, \ldots, n`, for :math:`k = j+1, \ldots, n` do

   :math:`\displaystyle a_{i,k} := a_{i,k}
   - \left( \frac{a_{i,j}}{a_{j,j}} \right) a_{j,k}`.

If :math:`A` is symmetric, :math:`A^T = A`, and positive definite:
:math:`\forall {\bf x} \not= {\bf 0}: {\bf x}^T A {\bf x} > 0`,
then it is better to compute a Cholesky factorization
:math:`A = L L^T`, where :math:`L` is a lower triangular matrix.
Because :math:`A` is positive definite, no pivoting is needed,
and we need about half as many operations as LU.
Pseudo code is listed below:

::

   for j from 1 to n do
       for k from 1 to j-1 do
           a[j][j] := a[j][j] - a[j][k]**2
       a[j][j] := sqrt(a[j][j])
       for i from j+1 to n do
           for k from 1 to j-1 do
               a[i][j] := a[i][j] - a[i][k]*a[j][k]
           a[i][j] := a[i][j]/a[j][j]

Let *A* be a symmetric, positive definite *n*-by-*n* matrix.
In preparation for a parallel implementation, we consider tiled matrices.
For tile size *b*, let :math:`n = p \times b` and consider

.. math::

   A =
   \left[
   \begin{array}{cccc}
      A_{1,1} & A_{2,1}^T & \cdots & A_{p,1}^T \\
      A_{2,1} & A_{2,2}   & \cdots & A_{p,2}^T \\
      \vdots  & \vdots    & \ddots & \vdots \\
      A_{p,1} & A_{p,2}   & \cdots & A_{p,p}
   \end{array}
   \right],

where :math:`A_{i,j}` is a *b*-by-*b* matrix.

A crude classification of memory hierarchies distinguishes between
registers (small), cache (medium), and main memory (large).
To reduce data movement, we want to keep data
in registers and cache as much as possible.
Pseudo code for a tiled Cholesky factorization is listed below.

::

   for k from 1 to p do
       DPOTF2(A[k][k], L[k][k])               # L[k][k] := Cholesky(A[k][k])
       for i from k+1 to p do
           DTRSM(L[k][k], A[i][k], L[i][k])   # L[i][k] := A[i][k]*L[k][k]**(-T)
       end for
       for i from k+1 to p do
           for j from k+1 to p do
               DGSMM(L[i][k], L[j][k], A[i][j])
               # A[i][j] := A[i][j] - L[i][k]*L[j][k]**T
           end for
       end for
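To make the tile operations concrete, below is a minimal sequential C
sketch of this tiled Cholesky factorization.
It is an illustration under stated assumptions, not PLASMA code:
the tile layout (tiles stored contiguously, tile row by tile row),
the accessor ``tile``, and the helpers ``potf2``, ``trsm``, and ``gsmm``
are stand-ins of our own for the DPOTF2, DTRSM, and DGSMM calls
of the pseudo code, and the update loop only visits tiles on
or below the diagonal because *A* is symmetric.

::

   #include <math.h>
   #include <stddef.h>

   /* Pointer to tile A[i][j] (0-based), assuming the p*p tiles of
      size b-by-b are stored contiguously, tile row by tile row. */
   double *tile(double *A, int p, int b, int i, int j) {
       return A + ((size_t)i * p + j) * b * b;
   }

   /* DPOTF2 stand-in: unblocked Cholesky of one b-by-b tile,
      in place, overwriting its lower triangle with L. */
   void potf2(double *a, int b) {
       for (int j = 0; j < b; j++) {
           for (int k = 0; k < j; k++)
               a[j*b+j] -= a[j*b+k] * a[j*b+k];
           a[j*b+j] = sqrt(a[j*b+j]);
           for (int i = j + 1; i < b; i++) {
               for (int k = 0; k < j; k++)
                   a[i*b+j] -= a[i*b+k] * a[j*b+k];
               a[i*b+j] /= a[j*b+j];
           }
       }
   }

   /* DTRSM stand-in: a := a * l^(-T) for a lower triangular tile l,
      solving one row of a at a time. */
   void trsm(const double *l, double *a, int b) {
       for (int i = 0; i < b; i++)
           for (int j = 0; j < b; j++) {
               for (int k = 0; k < j; k++)
                   a[i*b+j] -= a[i*b+k] * l[j*b+k];
               a[i*b+j] /= l[j*b+j];
           }
   }

   /* DGSMM stand-in: a := a - l1 * l2^T. */
   void gsmm(const double *l1, const double *l2, double *a, int b) {
       for (int i = 0; i < b; i++)
           for (int j = 0; j < b; j++)
               for (int k = 0; k < b; k++)
                   a[i*b+j] -= l1[i*b+k] * l2[j*b+k];
   }

   /* Tiled Cholesky, mirroring the pseudo code above. */
   void tiled_cholesky(double *A, int p, int b) {
       for (int k = 0; k < p; k++) {
           potf2(tile(A, p, b, k, k), b);
           for (int i = k + 1; i < p; i++)
               trsm(tile(A, p, b, k, k), tile(A, p, b, i, k), b);
           for (int i = k + 1; i < p; i++)
               for (int j = k + 1; j <= i; j++)
                   gsmm(tile(A, p, b, i, k), tile(A, p, b, j, k),
                        tile(A, p, b, i, j), b);
       }
   }

A real implementation would replace the triple loops
by optimized core BLAS routines; only the driver loop would remain.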
Blocked LU Factorization
------------------------

In deriving blocked formulations of LU, consider a 3-by-3 blocked matrix.
The optimal size of the blocks is machine dependent.

.. math::

   \left[
   \begin{array}{ccc}
      A_{1,1} & A_{1,2} & A_{1,3} \\
      A_{2,1} & A_{2,2} & A_{2,3} \\
      A_{3,1} & A_{3,2} & A_{3,3}
   \end{array}
   \right]
   =
   \left[
   \begin{array}{ccc}
      L_{1,1} &         & \\
      L_{2,1} & L_{2,2} & \\
      L_{3,1} & L_{3,2} & L_{3,3}
   \end{array}
   \right]
   \left[
   \begin{array}{ccc}
      U_{1,1} & U_{1,2} & U_{1,3} \\
              & U_{2,2} & U_{2,3} \\
              &         & U_{3,3}
   \end{array}
   \right]

Expanding the product on the right hand side and equating blocks
with the matrix on the left gives the defining formulas
for the blocked LU factorization.

.. math::

   \begin{array}{lll}
      A_{1,1} = L_{1,1} U_{1,1}
    & A_{1,2} = L_{1,1} U_{1,2}
    & A_{1,3} = L_{1,1} U_{1,3} \\
      A_{2,1} = L_{2,1} U_{1,1}
    & A_{2,2} = L_{2,1} U_{1,2} + L_{2,2} U_{2,2}
    & A_{2,3} = L_{2,1} U_{1,3} + L_{2,2} U_{2,3} \\
      A_{3,1} = L_{3,1} U_{1,1}
    & A_{3,2} = L_{3,1} U_{1,2} + L_{3,2} U_{2,2}
    & A_{3,3} = L_{3,1} U_{1,3} + L_{3,2} U_{2,3} + L_{3,3} U_{3,3} \\
   \end{array}

The formulas we are deriving give rise to the right looking LU.
We store the :math:`L_{i,j}`'s and :math:`U_{i,j}`'s in the original matrix:

.. math::

   \left[
   \begin{array}{ccc}
      A_{1,1} & A_{1,2} & A_{1,3} \\
      A_{2,1} & A_{2,2} & A_{2,3} \\
      A_{3,1} & A_{3,2} & A_{3,3}
   \end{array}
   \right]
   =
   \left[
   \begin{array}{ccc}
      L_{1,1} &   & \\
      L_{2,1} & I & \\
      L_{3,1} &   & I
   \end{array}
   \right]
   \left[
   \begin{array}{ccc}
      U_{1,1} & U_{1,2} & U_{1,3} \\
              & B_{2,2} & B_{2,3} \\
              & B_{3,2} & B_{3,3}
   \end{array}
   \right]

The matrices :math:`B_{i,j}` are obtained after the first block LU step.
To find :math:`L_{2,2}`, :math:`L_{3,2}`, and :math:`U_{2,2}` we use

.. math::

   \left\{
      \begin{array}{l}
         A_{2,2} = L_{2,1} U_{1,2} + L_{2,2} U_{2,2} \\
         A_{3,2} = L_{3,1} U_{1,2} + L_{3,2} U_{2,2}
      \end{array}
   \right.
   \quad {\rm and} \quad
   \left\{
      \begin{array}{l}
         A_{2,2} = L_{2,1} U_{1,2} + B_{2,2} \\
         A_{3,2} = L_{3,1} U_{1,2} + B_{3,2}
      \end{array}
   \right.

Eliminating the common expressions :math:`A_{2,2} - L_{2,1} U_{1,2}`
and :math:`A_{3,2} - L_{3,1} U_{1,2}` gives

.. math::

   \left\{
      \begin{array}{c}
         B_{2,2} = L_{2,2} U_{2,2} \\
         B_{3,2} = L_{3,2} U_{2,2}
      \end{array}
   \right.

Via LU on :math:`B_{2,2}` we obtain :math:`L_{2,2}` and :math:`U_{2,2}`.
Then: :math:`L_{3,2} := B_{3,2} U_{2,2}^{-1}`.
The formulas we derived are similar to the scalar case
and are called *right looking*.
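To make the right looking organization concrete, below is a minimal C
sketch of one sweep of blocked right looking LU *without* pivoting;
the name ``blocked_lu``, the row major layout, and the assumption
that *n* is a multiple of the block size ``nb`` are ours,
for illustration only.
A production code would add partial pivoting
and call BLAS kernels for the panel solves and the trailing update.

::

   /* Right looking blocked LU without pivoting on an n-by-n row major
      matrix, n a multiple of nb; L (unit diagonal) and U overwrite A. */
   void blocked_lu(double *A, int n, int nb) {
       for (int k = 0; k < n; k += nb) {
           int e = k + nb;          /* end of the current block */
           /* factor the diagonal block: A(k:e,k:e) = L[k][k]*U[k][k] */
           for (int j = k; j < e; j++)
               for (int i = j + 1; i < e; i++) {
                   A[i*n + j] /= A[j*n + j];        /* multiplier */
                   for (int c = j + 1; c < e; c++)
                       A[i*n + c] -= A[i*n + j] * A[j*n + c];
               }
           /* block column of L: solve L(i,k)*U[k][k] = A(i,k) */
           for (int i = e; i < n; i++)
               for (int j = k; j < e; j++) {
                   for (int c = k; c < j; c++)
                       A[i*n + j] -= A[i*n + c] * A[c*n + j];
                   A[i*n + j] /= A[j*n + j];
               }
           /* block row of U: solve L[k][k]*U(k,j) = A(k,j), unit diag */
           for (int j = e; j < n; j++)
               for (int i = k + 1; i < e; i++)
                   for (int c = k; c < i; c++)
                       A[i*n + j] -= A[i*n + c] * A[c*n + j];
           /* right looking update of the trailing submatrix */
           for (int i = e; i < n; i++)
               for (int j = e; j < n; j++)
                   for (int c = k; c < e; c++)
                       A[i*n + j] -= A[i*n + c] * A[c*n + j];
       }
   }

The four stages match the block formulas above:
the diagonal factorization gives :math:`L_{k,k}` and :math:`U_{k,k}`,
the block column computes :math:`L_{i,k} = A_{i,k} U_{k,k}^{-1}`,
the block row computes :math:`U_{k,j} = L_{k,k}^{-1} A_{k,j}`,
and the update subtracts :math:`L_{i,k} U_{k,j}` from :math:`A_{i,j}`.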
But we may organize the LU factorization differently,
as shown in :numref:`figrightleftlooking`.

.. _figrightleftlooking:

.. figure:: ./figrightleftlooking.png
   :align: center

   Right and left looking organization of tiled LU factorization.

What is good looking?  For data access, left looking is best.
We derive the left looking formulas, going from :math:`P_1 A`
to :math:`P_2 P_1 A`, where :math:`P_1` and :math:`P_2` are
the permutation matrices of the first and second block step:

.. math::

   \left[
   \begin{array}{ccc}
      L_{1,1} &   & \\
      L_{2,1} & I & \\
      L_{3,1} &   & I
   \end{array}
   \right]
   \left[
   \begin{array}{ccc}
      U_{1,1} & A_{1,2} & A_{1,3} \\
              & A_{2,2} & A_{2,3} \\
              & A_{3,2} & A_{3,3}
   \end{array}
   \right]
   \rightarrow
   \left[
   \begin{array}{ccc}
      L_{1,1} &         & \\
      L_{2,1} & L_{2,2} & \\
      L_{3,1} & L_{3,2} & I
   \end{array}
   \right]
   \left[
   \begin{array}{ccc}
      U_{1,1} & U_{1,2} & A_{1,3} \\
              & U_{2,2} & A_{2,3} \\
              &         & A_{3,3}
   \end{array}
   \right]

We keep the original :math:`A_{i,j}`'s and postpone updating to the right.

1. We get :math:`U_{1,2}` via :math:`A_{1,2} = L_{1,1} U_{1,2}`
   and compute :math:`U_{1,2} = L_{1,1}^{-1} A_{1,2}`.

2. To compute :math:`L_{2,2}` and :math:`L_{3,2}`, do
   :math:`\displaystyle \left[ \begin{array}{c} B_{2,2} \\ B_{3,2}
   \end{array} \right] = \left[ \begin{array}{c} A_{2,2} \\ A_{3,2}
   \end{array} \right] - \left[ \begin{array}{c} L_{2,1} \\ L_{3,1}
   \end{array} \right] U_{1,2}` and factor
   :math:`\displaystyle P_2 \left[ \begin{array}{c} B_{2,2} \\ B_{3,2}
   \end{array} \right] = \left[ \begin{array}{c} L_{2,2} \\ L_{3,2}
   \end{array} \right] U_{2,2}` as before.  Replace
   :math:`\displaystyle \left[ \begin{array}{c} A_{2,3} \\ A_{3,3}
   \end{array} \right] := P_2 \left[ \begin{array}{c} A_{2,3} \\ A_{3,3}
   \end{array} \right]` and
   :math:`\displaystyle \left[ \begin{array}{c} L_{2,1} \\ L_{3,1}
   \end{array} \right] := P_2 \left[ \begin{array}{c} L_{2,1} \\ L_{3,1}
   \end{array} \right]`.

Pseudo code for the tiled algorithm for LU factorization is below.

::

   for k from 1 to p do
       DGETF(A[k][k], L[k][k], U[k][k], P[k][k])
       for j from k+1 to p do
           DGESSM(A[k][j], L[k][k], P[k][k], U[k][j])
       for i from k+1 to p do
           DTSTRF(U[k][k], A[i][k], P[i][k])
           for j from k+1 to p do
               DSSSM(U[k][j], A[i][j], L[i][k], P[i][k])

where the function calls correspond to the following formulas:

* ``DGETF(A[k][k], L[k][k], U[k][k], P[k][k])`` corresponds to
  :math:`L_{k,k}, U_{k,k}, P_{k,k} := \mbox{LU}(A_{k,k})`.

* ``DGESSM(A[k][j], L[k][k], P[k][k], U[k][j])`` corresponds to
  :math:`U_{k,j} := L_{k,k}^{-1} P_{k,k} A_{k,j}`.

* ``DTSTRF(U[k][k], A[i][k], P[i][k])`` corresponds to
  :math:`\displaystyle U_{k,k}, L_{i,k}, P_{i,k} := \mbox{LU}
  \left( \left[ \begin{array}{c} U_{k,k} \\ A_{i,k}
  \end{array} \right] \right)`.

* ``DSSSM(U[k][j], A[i][j], L[i][k], P[i][k])`` corresponds to
  :math:`\displaystyle \left[ \begin{array}{c} U_{k,j} \\ A_{i,j}
  \end{array} \right] := L_{i,k}^{-1} P_{i,k}
  \left[ \begin{array}{c} U_{k,j} \\ A_{i,j} \end{array} \right]`.

The PLASMA Software Library
---------------------------

PLASMA stands for Parallel Linear Algebra Software
for Multicore Architectures.
The software is written in FORTRAN and C
and is designed for efficiency on homogeneous multicore processors
and on multi-socket systems of multicore processors.
It is built from a small set of sequential routines,
referred to as the *core BLAS*, and is free to download.

Capabilities and limitations: PLASMA can solve dense linear systems
and least squares problems.
Unlike LAPACK, PLASMA currently does not solve eigenvalue
or singular value problems and provides no support for band matrices.

The Basic Linear Algebra Subprograms (BLAS) come in three levels:

1. Level-1 BLAS: vector-vector operations, :math:`O(n)` cost,
   e.g.: inner products, norms, :math:`{\bf x} \pm {\bf y}`,
   :math:`\alpha {\bf x} + {\bf y}`.

2. Level-2 BLAS: matrix-vector operations, :math:`O(mn)` cost, e.g.:

   * :math:`{\bf y} = \alpha A {\bf x} + \beta {\bf y}`;

   * :math:`A = A + \alpha {\bf x} {\bf y}^T`, a rank one update;

   * :math:`{\bf x} = T^{-1} {\bf b}`, for :math:`T` a triangular matrix.

3. Level-3 BLAS: matrix-matrix operations, :math:`O(kmn)` cost, e.g.:

   * :math:`C = \alpha A B + \beta C`;

   * :math:`C = \alpha A A^T + \beta C`, a rank *k* update
     of a symmetric matrix;

   * :math:`B = \alpha T B`, for :math:`T` a triangular matrix;

   * :math:`B = \alpha T^{-1} B`, solving a linear system
     with many right hand sides.

The execution is asynchronous and graph driven.
We view a blocked algorithm as a Directed Acyclic Graph (DAG):
nodes are computational tasks performed in kernel subroutines,
and edges represent the dependencies among the tasks.
Given a DAG, tasks are scheduled asynchronously and independently,
as long as the dependencies imposed by the edges of the DAG
are not violated.
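As an illustration of graph driven execution, the sketch below
expresses the tiled Cholesky loop with OpenMP task dependences,
reusing the hypothetical ``tile``, ``potf2``, ``trsm``, and ``gsmm``
of the earlier sketch.
This is our illustration only: PLASMA schedules its tasks
with its own runtime (QUARK), not with OpenMP.
The ``depend`` clauses encode the edges of the DAG,
and the runtime may run any tasks whose dependencies are satisfied.

::

   /* compile with: cc -fopenmp, linking the kernels sketched earlier */
   double *tile(double *A, int p, int b, int i, int j);
   void potf2(double *a, int b);
   void trsm(const double *l, double *a, int b);
   void gsmm(const double *l1, const double *l2, double *a, int b);

   /* Tiled Cholesky as a task graph: each kernel call becomes a task
      and the depend clauses encode the edges of the DAG. */
   void tiled_cholesky_tasks(double *A, int p, int b) {
       #pragma omp parallel
       #pragma omp single
       for (int k = 0; k < p; k++) {
           double *akk = tile(A, p, b, k, k);
           #pragma omp task depend(inout: akk[0:b*b])
           potf2(akk, b);                      /* factor diagonal tile */
           for (int i = k + 1; i < p; i++) {
               double *aik = tile(A, p, b, i, k);
               #pragma omp task depend(in: akk[0:b*b]) \
                                depend(inout: aik[0:b*b])
               trsm(akk, aik, b);              /* panel solve */
           }
           for (int i = k + 1; i < p; i++) {
               double *aik = tile(A, p, b, i, k);
               double *aii = tile(A, p, b, i, i);
               #pragma omp task depend(in: aik[0:b*b]) \
                                depend(inout: aii[0:b*b])
               gsmm(aik, aik, aii, b);         /* diagonal tile update */
               for (int j = k + 1; j < i; j++) {
                   double *ajk = tile(A, p, b, j, k);
                   double *aij = tile(A, p, b, i, j);
                   #pragma omp task depend(in: aik[0:b*b], ajk[0:b*b]) \
                                    depend(inout: aij[0:b*b])
                   gsmm(aik, ajk, aij, b);     /* trailing update */
               }
           }
       }   /* implicit barrier at the end of the parallel region */
   }

Because the dependences are declared per tile, updates of step
:math:`k+1` may start as soon as the tiles they need are ready,
before all tasks of step :math:`k` have finished.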
A critical path in the DAG is a longest path of dependent tasks;
no matter how many cores are available,
the tasks on this path must run one after the other,
so its length bounds the parallel execution time from below.
The scheduling policy assigns higher priority
to those tasks that lie on the critical path.

The directory ``examples``
in ``/usr/local/plasma-installer_2.6.0/build/plasma_2.6.0``
contains ``example_cposv``, an example for the Cholesky factorization
of a positive definite matrix, in single precision complex arithmetic;
``example_dposv`` is the double precision real version.
Running ``make`` in the ``examples`` directory:

::

   [root@kepler examples]# make example_cposv
   gcc -O2 -DADD_ -I../include -I../quark \
       -I/usr/local/plasma-installer_2.6.0/install/include -c \
       example_cposv.c -o example_cposv.o
   gfortran example_cposv.o -o example_cposv -L../lib -lplasma \
       -lcoreblasqw -lcoreblas -lplasma -L../quark -lquark \
       -L/usr/local/plasma-installer_2.6.0/install/lib -lcblas \
       -L/usr/local/plasma-installer_2.6.0/install/lib -llapacke \
       -L/usr/local/plasma-installer_2.6.0/install/lib -ltmg -llapack \
       -L/usr/local/plasma-installer_2.6.0/install/lib -lrefblas \
       -lpthread -lm
   [root@kepler examples]#

A run of ``example_dposv``, the Cholesky factorization
on a matrix of dimension 10,000:

::

   [root@kepler examples]# time ./example_dposv
   -- PLASMA is initialized to run on 2 cores.
   ============
   Checking the Residual of the solution
   -- ||Ax-B||_oo/((||A||_oo||x||_oo+||B||_oo).N.eps) = 2.710205e-03
   -- The solution is CORRECT !
   -- Run of DPOSV example successful !

   real    1m22.271s
   user    2m42.534s
   sys     0m1.064s
   [root@kepler examples]#

The wall clock times for Cholesky on dimension 10,000
are listed in :numref:`tabtimescholesky`.

.. _tabtimescholesky:

.. table:: Wall clock times for Cholesky factorization.

   +--------+------------+---------+
   | #cores | real time  | speedup |
   +========+============+=========+
   |      1 | 2m 40.913s |         |
   +--------+------------+---------+
   |      2 | 1m 22.271s |    1.96 |
   +--------+------------+---------+
   |      4 |    42.621s |    3.78 |
   +--------+------------+---------+
   |      8 |    23.480s |    6.85 |
   +--------+------------+---------+
   |     16 |    12.647s |   12.72 |
   +--------+------------+---------+

Bibliography
------------

1. E. Agullo, J. Dongarra, B. Hadri, J. Kurzak, J. Langou, H. Ltaief,
   P. Luszczek, and A. YarKhan.
   *PLASMA Users' Guide.  Parallel Linear Algebra Software
   for Multicore Architectures.  Version 2.0.*

2. A. Buttari, J. Langou, J. Kurzak, and J. Dongarra.
   A class of parallel tiled linear algebra algorithms
   for multicore architectures.
   *Parallel Computing*, 35:38-53, 2009.

Exercises
---------

1. Write your own parallel shared memory version
   of the Cholesky factorization,
   using OpenMP, Pthreads, or the Intel TBB.

2. Derive right looking LU factorization formulas with pivoting,
   i.e., introducing permutation matrices *P*.
   First develop the formulas for a 3-by-3 block matrix
   and then generalize them into an algorithm
   for any *p*-by-*p* block matrix.

3. Take an appropriate example from the PLASMA installation
   to test the speedup of the multicore LU factorization routines.