Domain Decomposition Methods
============================

The method of Jacobi is an iterative method which is not in place:
we do not overwrite the current solution with new components
as soon as these become available.
In contrast, the method of Gauss-Seidel does update the current solution
with newly computed components of the solution as soon as they are computed.

Domain decomposition methods to solve partial differential equations
are another important class of synchronized parallel computations,
explaining the origin of the need to solve large linear systems.
This chapter ends with an introduction to the software PETSc,
the Portable, Extensible Toolkit for Scientific Computation.

Gauss-Seidel Relaxation
-----------------------

The method of Gauss-Seidel is an iterative method for solving linear systems.
We want to solve :math:`A {\bf x} = {\bf b}` for a very large dimension *n*.
Writing the method of Jacobi componentwise:

.. math::

   x_i^{(k+1)} := x_i^{(k)} + \frac{1}{a_{i,i}}
   \left( b_i - \sum_{j=1}^n a_{i,j} x_j^{(k)} \right),
   \quad i=1,2,\ldots,n.

We observe that we can already use :math:`x_j^{(k+1)}` for :math:`j < i`.
This leads to the following formulas:

.. math::

   x_i^{(k+1)} := x_i^{(k)} + \frac{1}{a_{i,i}}
   \left( b_i - \sum_{j=1}^{i-1} a_{i,j} x_j^{(k+1)}
              - \sum_{j=i}^n a_{i,j} x_j^{(k)} \right),
   \quad i=1,2,\ldots,n.

The method of Gauss-Seidel is an *in-place method*:
old values are overwritten by new ones as soon as they are computed.
In translating the formulas into code, we end up with two loops:

.. math::

   \begin{array}{l}
   \mbox{for } j \mbox{ from 1 to } i-1 \mbox{ do }
     \Delta x_i := \Delta x_i - a_{i,j} x_j^{(k+1)} \\
   \mbox{for } j \mbox{ from } i \mbox{ to } n \mbox{ do }
     \Delta x_i := \Delta x_i - a_{i,j} x_j^{(k)}
   \end{array}

Because the new values overwrite the old ones in the same vector ``x``,
the two loops are fused into one loop as done below:

.. math::

   \begin{array}{l}
   \mbox{for } j \mbox{ from 1 to } n \mbox{ do }
     \Delta x_i := \Delta x_i - a_{i,j} x_j
   \end{array}

C code for the Gauss-Seidel method is below.

::

   void run_gauss_seidel_method
    ( int n, double **A, double *b,
      double epsilon, int maxit,
      int *numit, double *x )
   /*
    * Runs the method of Gauss-Seidel for A*x = b.
    *
    * ON ENTRY :
    *   n        the dimension of the system;
    *   A        an n-by-n matrix A[i][i] /= 0;
    *   b        an n-dimensional vector;
    *   epsilon  accuracy requirement;
    *   maxit    maximal number of iterations;
    *   x        start vector for the iteration.
    *
    * ON RETURN :
    *   numit    number of iterations used;
    *   x        approximate solution to A*x = b.
    */
   {
      double *dx = (double*) calloc(n,sizeof(double));
      int i,j,k;

      for(k=0; k<maxit; k++)       /* the iteration loop */
      {
         double sum = 0.0;
         for(i=0; i<n; i++)        /* compute the i-th update */
         {
            dx[i] = b[i];
            for(j=0; j<n; j++)
               dx[i] -= A[i][j]*x[j];
            dx[i] /= A[i][i];
            x[i] += dx[i];         /* new x[i] is used for the next i */
            sum += ( (dx[i] >= 0.0) ? dx[i] : -dx[i]);
         }
         printf("%4d : %.3e\n",k,sum);
         if(sum <= epsilon) break;
      }
      *numit = k+1;
      free(dx);
   }

Running on the same example as in the previous chapter goes much faster:

::

   $ time /tmp/gauss_seidel 1000
    0 : 1.264e+03
    1 : 3.831e+02
    2 : 6.379e+01
    3 : 1.394e+01
    4 : 3.109e+00
    5 : 5.800e-01
    6 : 1.524e-01
    7 : 2.521e-02
    8 : 7.344e-03
    9 : 1.146e-03
   10 : 3.465e-04
   11 : 5.419e-05
   computed 12 iterations    <----- 8407 with Jacobi
   error : 1.477e-05

   real    0m0.069s          <----- 0m42.411s
   user    0m0.063s          <----- 0m42.377s
   sys     0m0.005s          <----- 0m0.028s
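
To illustrate how ``run_gauss_seidel_method`` may be called,
below is a minimal test driver.
It is a sketch only: the diagonally dominant test system with exact solution
equal to the all-ones vector, the tolerance ``1.0e-4``,
and the bound on the number of iterations are illustrative assumptions
and need not match the test program of the previous chapter.

::

   #include <stdio.h>
   #include <stdlib.h>

   /* defined earlier in this section */
   void run_gauss_seidel_method
    ( int n, double **A, double *b,
      double epsilon, int maxit,
      int *numit, double *x );

   int main ( int argc, char *argv[] )
   {
      int n = (argc > 1) ? atoi(argv[1]) : 1000;
      int i, j, numit;
      /* illustrative test system: strictly diagonally dominant,
         with exact solution equal to the all-ones vector */
      double **A = (double**) calloc(n, sizeof(double*));
      double *b = (double*) calloc(n, sizeof(double));
      double *x = (double*) calloc(n, sizeof(double));

      for(i=0; i<n; i++)
      {
         A[i] = (double*) calloc(n, sizeof(double));
         for(j=0; j<n; j++) A[i][j] = 1.0;
         A[i][i] = n + 1.0;
         b[i] = 2.0*n;     /* row sum of A times the all-ones vector */
         x[i] = 0.0;       /* start the iteration at zero */
      }
      run_gauss_seidel_method(n, A, b, 1.0e-4, 2*n, &numit, x);
      printf("computed %d iterations\n", numit);

      double error = 0.0;  /* maximum deviation from the exact solution */
      for(i=0; i<n; i++)
      {
         double d = x[i] - 1.0;
         if(d < 0.0) d = -d;
         if(d > error) error = d;
      }
      printf("error : %.3e\n", error);

      for(i=0; i<n; i++) free(A[i]);
      free(A); free(b); free(x);

      return 0;
   }

Compiled together with the function above and run with the dimension
as command line argument, this driver prints the number of iterations
and the error in the computed solution.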

Parallel Gauss-Seidel with OpenMP
---------------------------------

The method of Jacobi is suitable for a strip partitioning of the (dense)
matrix: in a parallel distributed memory implementation,
every processor can keep its own portion of the solution vector
:math:`\bf x`.
The Gauss-Seidel method uses every new :math:`x_i` as soon as it is computed,
which leads to communication overhead on distributed memory computers.
In a parallel shared memory implementation, consider the following:

1. Threads compute inner products of matrix rows with :math:`\bf x`.

2. Each :math:`\Delta x_i` is updated in a critical section.

So, many threads compute one inner product.
For example, three threads, assuming :math:`n` is divisible by 3, compute:

.. math::

   \left[ \vphantom{\frac{3}{2}}
   a_{i, 1} \cdots a_{i, n/3} \left| \vphantom{\frac{2}{3}}
   a_{i, n/3+1} \cdots a_{i, 2n/3} \right|
   a_{i, 2n/3+1} \cdots a_{i, n} \right]
   \left[
   \begin{array}{c}
   x_1 \\ \vdots \\ x_{n/3} \\ \hline
   x_{n/3+1} \\ \vdots \\ x_{2n/3} \\ \hline
   x_{2n/3+1} \\ \vdots \\ x_{n}
   \end{array}
   \right]

Each thread has its own variable to accumulate its portion
of the inner product.  Using ``p`` threads:

::

   void run_gauss_seidel_method
    ( int p, int n, double **A, double *b,
      double epsilon, int maxit,
      int *numit, double *x )
   {
      double *dx;
      dx = (double*) calloc(n,sizeof(double));
      int i,j,k,id,jstart,jstop;
      int dnp = n/p;
      double dxi;

      for(k=0; k<maxit; k++)
      {
         double sum = 0.0;
         for(i=0; i<n; i++)
         {
            dx[i] = b[i];
            #pragma omp parallel \
               shared(A,x) \
               private(id,j,jstart,jstop,dxi)
            {
               id = omp_get_thread_num();
               jstart = id*dnp;
               jstop = jstart + dnp;
               dxi = 0.0;
               for(j=jstart; j<jstop; j++) /* partial inner product */
                  dxi += A[i][j]*x[j];
               #pragma omp critical        /* update of dx[i] */
                  dx[i] -= dxi;
            }
            dx[i] /= A[i][i]; x[i] += dx[i];
            sum += ( (dx[i] >= 0.0) ? dx[i] : -dx[i]);
         }
         printf("%4d : %.3e\n",k,sum);
         if(sum <= epsilon) break;
      }
      *numit = k+1;
      free(dx);
   }

The update instructions

::

   dx[i] /= A[i][i]; x[i] += dx[i];
   sum += ( (dx[i] >= 0.0) ? dx[i] : -dx[i]);

are executed after each parallel region.
This ensures the synchronization and the execution of the stop test:

::

   if(sum <= epsilon) break;

Observe that, although the entire matrix ``A`` is shared between all threads,
each thread needs only ``n/p`` columns of the matrix.
In the MPI version of the method of Jacobi,
entire rows of the matrix were distributed among the processors.
If we were to make a distributed memory version of the OpenMP code,
then we would distribute entire columns of the matrix ``A``
over the processors.

Running times obtained via the command ``time``
on a 12-core Intel X5690 at 3.47 GHz are in :numref:`tabtimegsomp`.

.. _tabtimegsomp:

.. table:: Times of a parallel Gauss-Seidel with OpenMP

   +-----+--------+------------+------------+--------+---------+
   | *p* |   *n*  |    real    |    user    |   sys  | speedup |
   +=====+========+============+============+========+=========+
   |   1 | 10,000 |     7.165s |     6.921s | 0.242s |         |
   +-----+--------+------------+------------+--------+---------+
   |     | 20,000 |    28.978s |    27.914s | 1.060s |         |
   +-----+--------+------------+------------+--------+---------+
   |     | 30,000 |  1m 6.491s |  1m 4.139s | 2.341s |         |
   +-----+--------+------------+------------+--------+---------+
   |   2 | 10,000 |     4.243s |     7.621s | 0.310s |   1.689 |
   +-----+--------+------------+------------+--------+---------+
   |     | 20,000 |    16.325s |    29.556s | 1.066s |   1.775 |
   +-----+--------+------------+------------+--------+---------+
   |     | 30,000 |    36.847s |  1m 6.831s | 2.324s |   1.805 |
   +-----+--------+------------+------------+--------+---------+
   |   5 | 10,000 |     2.415s |     9.440s | 0.420s |   2.967 |
   +-----+--------+------------+------------+--------+---------+
   |     | 20,000 |     8.403s |    32.730s | 1.218s |   3.449 |
   +-----+--------+------------+------------+--------+---------+
   |     | 30,000 |    18.240s | 1m 11.031s | 2.327s |   3.645 |
   +-----+--------+------------+------------+--------+---------+
   |  10 | 10,000 |     2.173s |    16.241s | 0.501s |   3.297 |
   +-----+--------+------------+------------+--------+---------+
   |     | 20,000 |     6.524s |    45.629s | 1.521s |   4.442 |
   +-----+--------+------------+------------+--------+---------+
   |     | 30,000 |    13.273s | 1m 29.687s | 2.849s |   5.010 |
   +-----+--------+------------+------------+--------+---------+
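
The speedup column of :numref:`tabtimegsomp` also gives an impression
of the efficiency, defined as the speedup divided by the number of threads
:math:`p`.  For example, for :math:`n = 30{,}000`:

.. math::

   \frac{1.805}{2} \approx 0.90, \quad
   \frac{3.645}{5} \approx 0.73, \quad
   \frac{5.010}{10} \approx 0.50,

so the efficiency decreases as :math:`p` grows, which may be attributed
in part to the synchronization at the critical section and at the end of
each parallel region, executed once for every component in every iteration.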

Solving the Heat Equation
-------------------------

We will be applying a time stepping method to the heat or diffusion equation

.. math::

   \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}
   = \frac{\partial u}{\partial t},

which models the temperature distribution :math:`u(x,y,t)`
evolving in time :math:`t` for :math:`(x,y)` in some domain.
Related Partial Differential Equations (PDEs) are

.. math::

   \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0
   \quad {\rm and} \quad
   \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}
   = f(x,y),

respectively called the Laplace and Poisson equations.

For the discretization of the derivatives,
consider that at a point :math:`(x_0,y_0,t_0)`, we have

.. math::

   \left. \frac{\partial u}{\partial x} \right|_{(x_0,y_0,t_0)}
   = \lim_{h \rightarrow 0} ~
   \underbrace{\frac{u(x_0+h,y_0,t_0) - u(x_0,y_0,t_0)}{h}}_{u_x(x_0,y_0,t_0)},

so for positive :math:`h \approx 0`,
:math:`\displaystyle u_x(x_0,y_0,t_0) \approx
\left. \frac{\partial u}{\partial x} \right|_{(x_0,y_0,t_0)}`.
For the second derivative we use the finite difference

.. math::

   \begin{array}{rcl}
   u_{xx}(x_0,y_0,t_0)
   & = & {\displaystyle \frac{1}{h} \left(
     \frac{u(x_0 + h,y_0,t_0) - u(x_0,y_0,t_0)}{h}
     - \frac{u(x_0,y_0,t_0) - u(x_0-h,y_0,t_0)}{h} \right)} \\
   & = & {\displaystyle
     \frac{u(x_0 + h,y_0,t_0) - 2 u(x_0,y_0,t_0) + u(x_0-h,y_0,t_0)}{h^2}}.
   \end{array}

Time stepping is then done along the formulas

.. math::

   \begin{array}{rcl}
   u_t(x_0,y_0,t_0) & = & {\displaystyle
     \frac{u(x_0,y_0,t_0+h) - u(x_0,y_0,t_0)}{h}} \\
   u_{xx}(x_0,y_0,t_0) & = & {\displaystyle
     \frac{u(x_0+h,y_0,t_0) - 2 u(x_0,y_0,t_0) + u(x_0-h,y_0,t_0)}{h^2}} \\
   u_{yy}(x_0,y_0,t_0) & = & {\displaystyle
     \frac{u(x_0,y_0+h,t_0) - 2 u(x_0,y_0,t_0) + u(x_0,y_0-h,t_0)}{h^2}}
   \end{array}

Then the equation
:math:`\displaystyle \frac{\partial u}{\partial t}
= \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}`
becomes

.. math::

   \begin{array}{rcl}
   u(x_0,y_0,t_0+h) & \!\! = \!\! & u(x_0,y_0,t_0) \\
   & \!\! + \!\! & {\displaystyle \frac{1}{h} ~ \left[ ~
     u(x_0+h,y_0,t_0) + u(x_0-h,y_0,t_0) \right.} \\
   & \!\! + \!\! & ~~~~ {\displaystyle \left.
     u(x_0,y_0+h,t_0) + u(x_0,y_0-h,t_0) - 4 u(x_0,y_0,t_0) ~ \right]}
   \end{array}

Locally, the error of this approximation is :math:`O(h^2)`.

The algorithm performs synchronous iterations on a grid.
For :math:`(x,y) \in [0,1] \times [0,1]`,
the division of :math:`[0,1]` in *n* equal subintervals,
with :math:`h = 1/n`, leads to a grid
:math:`(x_i = ih, y_j = jh)`, for :math:`i=0,1,\ldots,n`
and :math:`j=0,1,\ldots,n`.
For *t*, we use the same step size *h*: :math:`t_k = kh`.
Denote :math:`u_{i,j}^{(k)} = u(x_i,y_j,t_k)`, then

.. math::

   u_{i,j}^{(k+1)} = u_{i,j}^{(k)} + \frac{1}{h} ~ \left[ ~
   u_{i+1,j}^{(k)} + u_{i-1,j}^{(k)} + u_{i,j+1}^{(k)} + u_{i,j-1}^{(k)}
   - 4 u_{i,j}^{(k)} ~ \right].

:numref:`figuijstencil` shows the labeling of the grid points.

.. _figuijstencil:

.. figure:: ./figuijstencil.png
   :align: center

   In every step, we update :math:`u_{i,j}` based on
   :math:`u_{i-1,j}`, :math:`u_{i+1,j}`, :math:`u_{i,j-1}`,
   and :math:`u_{i,j+1}`.

We divide the grid in red and black points,
as in :numref:`figredblackpoints`.

.. _figredblackpoints:

.. figure:: ./figredblackpoints.png
   :align: center

   Organization of the grid :math:`u_{i,j}` in red and black points.

The computation is organized in two phases, sketched in code below:

1. In the first phase, update all black points simultaneously; and then

2. in the second phase, update all red points simultaneously.
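
To make the two phases concrete, below is a minimal serial sketch
of one time step, updating the grid in place.
The function name ``time_step``, the convention that a point is black
when :math:`i+j` is even, and the choice to keep the boundary values fixed
are assumptions for illustration; as in red-black relaxation,
the second phase uses the black values just computed in the first phase.

::

   /* One time step on an (n+1)-by-(n+1) grid u, updated in place
      in two phases: first the black points (i+j even), then the
      red points (i+j odd).  The boundary rows and columns of u
      are kept fixed.  The step size h equals 1/n. */

   void time_step ( int n, double h, double **u )
   {
      int i, j, phase;

      for(phase=0; phase<2; phase++)   /* 0: black points, 1: red points */
         for(i=1; i<n; i++)
            for(j=1; j<n; j++)
               if((i + j) % 2 == phase)
                  u[i][j] = u[i][j]
                     + (u[i+1][j] + u[i-1][j] + u[i][j+1] + u[i][j-1]
                        - 4.0*u[i][j])/h;
   }

Because no two points of the same color are direct neighbors,
all updates within one phase are independent
and may be distributed over threads or processes.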

.. index:: strip partition, square partition

We can decompose the domain in strips, but then the number of boundary
elements shared between two adjacent strips is proportional to :math:`n`.
To reduce the overlap, we partition in squares,
as shown in :numref:`figsquarepartition`.

.. _figsquarepartition:

.. figure:: ./figsquarepartition.png
   :align: center

   Partitioning of the grid in squares.

Then the boundary elements are proportional to :math:`n/\sqrt{p}`.

In :numref:`figsquarepartition`, two rows and two columns
are shared between two partitions.
To reduce the number of shared rows and columns to one,
we can take an odd number of rows and columns.
In the example of :numref:`figsquarepartition`,
instead of 12 rows and columns, we could take 11 or 13 rows and columns.
Then only the middle row and column are shared between the partitions.

Comparing communication costs, we make the following observations.
In a square partition, every square has 4 edges,
whereas a strip has only 2 edges.
For the communication cost, we multiply by 2,
because for every send there is a receive.
Comparing the communication cost for a strip partitioning

.. math::

   t_{\rm comm}^{\rm strip} = 4 \left( t_{\rm startup} + n t_{\rm data} \right)

to the communication cost for a square partitioning (for :math:`p \geq 9`)

.. math::

   t_{\rm comm}^{\rm square}
   = 8 \left( t_{\rm startup} + \frac{n}{\sqrt{p}} t_{\rm data} \right),

we see that a strip partition is best if the startup time is large
and if we have only very few processors.
If the startup time is low, then for :math:`p \geq 4`
a square partition starts to look better.

This subsection ends with some numerical considerations.
The discretization of the heat equation used here is the simplest one.

* The explicit forward difference method is conditionally stable:
  in order for the method to converge,
  the step size in time depends on the step size in space.

* Methods that are unconditionally stable are implicit
  and require the solution of a linear system in each time step.

Solving the Heat Equation with PETSc
------------------------------------

The acronym PETSc stands for Portable, Extensible Toolkit
for Scientific Computation.
PETSc provides data structures and routines for large-scale application codes
on parallel (and serial) computers, using MPI.
It supports Fortran, C, C++, Python, and MATLAB (serial),
and is free and open source, available at http://www.mcs.anl.gov/petsc.

Bibliography
------------

1. S. Balay, J. Brown, K. Buschelman, V. Eijkhout, W. Gropp, D. Kaushik,
   M. Knepley, L. Curfman McInnes, B. Smith, and H. Zhang.
   *PETSc Users Manual. Revision 3.2*.
   Mathematics and Computer Science Division,
   Argonne National Laboratory, September 2011.

2. Ronald F. Boisvert, L. A. Drummond, and Osni A. Marques.
   Introduction to the special issue on the
   Advanced CompuTational Software (ACTS) collection.
   *ACM Transactions on Mathematical Software* 31(3):281--281, 2005.

Exercises
---------

1. Take the running times of the OpenMP version of the method of
   Gauss-Seidel and compute the efficiency for each of the 9 cases.
   What can you conclude about the scalability?

2. Use MPI to write a parallel version of the method of Gauss-Seidel.
   Compare the speedups with the OpenMP version.

3. Run an example of the PETSc tutorials collection
   with an increasing number of processes to investigate the speedup.

4. Cellular automata (e.g., Conway's game of life) are synchronized
   computations.  Discuss a parallel implementation of Conway's game of life
   and illustrate your discussion with a computation.