Domain Decomposition Methods
============================

The method of Jacobi is an iterative method which does not work in place:
we do not overwrite the current solution with new components
as soon as these become available.
In contrast, the method of Gauss-Seidel does update the current solution
with newly computed components as soon as they become available.
Domain decomposition methods to solve partial differential equations
are another important class of synchronized parallel computations,
and they explain where the need to solve large linear systems originates.
This chapter ends with an introduction to the software PETSc,
the Portable, Extensible Toolkit for Scientific Computation.

Gauss-Seidel Relaxation
-----------------------

The method of Gauss-Seidel is an iterative method for solving linear systems.
We want to solve :math:`A {\bf x} = {\bf b}` for a very large dimension *n*.
Writing the method of Jacobi componentwise:

.. math::

   x_i^{(k+1)} := x_i^{(k)} + \frac{1}{a_{i,i}}
   \left( b_i - \sum_{j=1}^n a_{i,j} x_j^{(k)} \right),
   \quad i=1,2,\ldots,n.

We observe that we can already use :math:`x_j^{(k+1)}` for :math:`j < i`.
This leads to the following formula:

.. math::

   x_i^{(k+1)} := x_i^{(k)} + \frac{1}{a_{i,i}}
   \left( b_i - \sum_{j=1}^{i-1} a_{i,j} x_j^{(k+1)}
              - \sum_{j=i}^{n} a_{i,j} x_j^{(k)} \right),
   \quad i=1,2,\ldots,n.

C code for the Gauss-Seidel method is below.

::

   void run_gauss_seidel_method
    ( int n, double **A, double *b,
      double epsilon, int maxit,
      int *numit, double *x )
   /*
    * Runs the method of Gauss-Seidel for A*x = b.
    *
    * ON ENTRY :
    *   n        the dimension of the system;
    *   A        an n-by-n matrix with A[i][i] /= 0;
    *   b        an n-dimensional vector;
    *   epsilon  accuracy requirement;
    *   maxit    maximal number of iterations;
    *   x        start vector for the iteration.
    *
    * ON RETURN :
    *   numit    number of iterations used;
    *   x        approximate solution to A*x = b. */
   {
      double *dx = (double*) calloc(n,sizeof(double));
      int i,j,k;

      for(k=0; k<maxit; k++)
      {
         double sum = 0.0;
         for(i=0; i<n; i++)
         {
            dx[i] = b[i];              /* compute the i-th residual ... */
            for(j=0; j<n; j++)
               dx[i] -= A[i][j]*x[j];  /* ... as b[i] - (A*x)[i] */
            dx[i] /= A[i][i];          /* correction for x[i] */
            x[i] += dx[i];             /* update x[i] immediately */
            sum += ( (dx[i] >= 0.0) ? dx[i] : -dx[i]);
         }
         printf("%4d : %.3e\n",k,sum);
         if(sum <= epsilon) break;
      }
      *numit = k+1;
      free(dx);
   }

Running on the same example as in the previous chapter goes much faster:

::

   $ time /tmp/gauss_seidel 1000
      0 : 1.264e+03
      1 : 3.831e+02
      2 : 6.379e+01
      3 : 1.394e+01
      4 : 3.109e+00
      5 : 5.800e-01
      6 : 1.524e-01
      7 : 2.521e-02
      8 : 7.344e-03
      9 : 1.146e-03
     10 : 3.465e-04
     11 : 5.419e-05
   computed 12 iterations    <----- 8407 with Jacobi
   error : 1.477e-05

   real    0m0.069s          <----- 0m42.411s
   user    0m0.063s          <----- 0m42.377s
   sys     0m0.005s          <----- 0m0.028s

Parallel Gauss-Seidel with OpenMP
---------------------------------

Using ``p`` threads, every thread computes a slice of the inner product
needed to update one component, and the partial sums are subtracted from
the right hand side in a critical section:

::

   void run_gauss_seidel_method
    ( int p, int n, double **A, double *b,
      double epsilon, int maxit,
      int *numit, double *x )
   {
      double *dx;
      dx = (double*) calloc(n,sizeof(double));
      int i,j,k,id,jstart,jstop;
      int dnp = n/p;                 /* assumes p divides n */
      double dxi;

      for(k=0; k<maxit; k++)
      {
         double sum = 0.0;
         for(i=0; i<n; i++)
         {
            dx[i] = b[i];
            #pragma omp parallel num_threads(p) shared(A,x) \
               private(id,jstart,jstop,j,dxi)
            {
               id = omp_get_thread_num();
               jstart = id*dnp;            /* each thread handles dnp columns */
               jstop = jstart + dnp;
               dxi = 0.0;
               for(j=jstart; j<jstop; j++) /* partial inner product */
                  dxi += A[i][j]*x[j];
               #pragma omp critical
                  dx[i] -= dxi;            /* subtract the partial sum */
            }
            dx[i] /= A[i][i];
            x[i] += dx[i];
            sum += ( (dx[i] >= 0.0) ? dx[i] : -dx[i]);
         }
         printf("%4d : %.3e\n",k,sum);
         if(sum <= epsilon) break;
      }
      *numit = k+1;
      free(dx);
   }

PETSc: the Portable, Extensible Toolkit for Scientific Computation
-------------------------------------------------------------------

PETSc, the Portable, Extensible Toolkit for Scientific Computation,
provides data structures and routines for the scalable solution of
applications modeled by partial differential equations.
PETSc is installed on ``kepler`` in the directory
``/home/jan/Downloads/petsc-3.7.4``.
The source code contains the directory ``ts`` on time stepping methods.
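Every PETSc program follows the same basic structure: initialize the library
(which also initializes MPI), do the computational work, and finalize.
The sketch below is not one of the PETSc examples; it only illustrates that
minimal structure, assuming a standard PETSc installation, and it would be
compiled and linked in the same way as ``ex3`` in the makefile entry shown
further below.

::

   #include <petscsys.h>

   int main(int argc, char **argv)
   {
      PetscErrorCode ierr;

      /* initialize PETSc (and MPI) with the command line arguments */
      ierr = PetscInitialize(&argc, &argv, NULL, NULL);
      if(ierr != 0) return ierr;

      /* PetscPrintf writes only once, from the first process
         in the communicator PETSC_COMM_WORLD */
      ierr = PetscPrintf(PETSC_COMM_WORLD, "PETSc is initialized.\n");
      CHKERRQ(ierr);

      ierr = PetscFinalize();
      return ierr;
   }

The tutorial codes do their actual work, such as defining and solving the
time stepping problem in ``ex3``, between the initialize and finalize calls.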
We will run ``ex3`` from
``/home/jan/Downloads/petsc-3.7.4/src/ts/examples/tutorials``.
After making the code, a sequential run goes as follows:

::

   $ /tmp/ex3 -draw_pause 1

The entry in a ``makefile`` to compile ``ex3`` is below.

::

   PETSC_DIR=/usr/local/petsc-3.7.4
   PETSC_ARCH=arch-linux2-c-debug

   ex3:
   	$(PETSC_DIR)/$(PETSC_ARCH)/bin/mpicc ex3.c \
   	   -I$(PETSC_DIR)/include \
   	   $(PETSC_DIR)/$(PETSC_ARCH)/lib/libpetsc.so \
   	   $(PETSC_DIR)/$(PETSC_ARCH)/lib/libopa.so \
   	   $(PETSC_DIR)/$(PETSC_ARCH)/lib/libfblas.a \
   	   $(PETSC_DIR)/$(PETSC_ARCH)/lib/libflapack.a \
   	   /usr/lib64/libblas.so.3 \
   	   -L/usr/X11R6/lib -lX11 -lgfortran -o /tmp/ex3

On a Mac OS X laptop, the ``PETSC_ARCH`` will be
``PETSC_ARCH=arch-darwin-c-debug``.

Bibliography
------------

1. S. Balay, J. Brown, K. Buschelman, V. Eijkhout, W. Gropp, D. Kaushik,
   M. Knepley, L. Curfman McInnes, B. Smith, and H. Zhang.
   *PETSc Users Manual. Revision 3.2*.
   Mathematics and Computer Science Division,
   Argonne National Laboratory, September 2011.

2. Ronald F. Boisvert, L. A. Drummond, and Osni A. Marques.
   Introduction to the special issue on the Advanced CompuTational
   Software (ACTS) collection.
   *ACM TOMS* 31(3):281--281, 2005.

Exercises
---------

1. Take the running times of the OpenMP version of the method of
   Gauss-Seidel and compute the efficiency for each of the 9 cases.
   What can you conclude about the scalability?

2. Use MPI to write a parallel version of the method of Gauss-Seidel.
   Compare the speedups with the OpenMP version.

3. Run an example of the PETSc tutorials collection with an increasing
   number of processes to investigate the speedup.
   A possible launch sequence is sketched after this list.
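For the last exercise, note that a PETSc executable is an ordinary MPI
program, so it can be started through the MPI launcher.  A possible sequence
of timed runs, with process counts that are only a suggestion, is:

::

   $ time mpiexec -n 1 /tmp/ex3
   $ time mpiexec -n 2 /tmp/ex3
   $ time mpiexec -n 4 /tmp/ex3

The speedup on ``p`` processes is then the ratio of the wall clock time on
one process over the wall clock time on ``p`` processes.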