Parallel Iterative Methods for Linear Systems
=============================================

We consider the method of Jacobi and introduce the ``MPI_Allgather``
command for the synchronization of the iterations.  In the analysis of
the communication and the computation cost, we determine the number of
processors which minimizes the total cost.

Jacobi Iterations
-----------------

We derive the formulas for Jacobi's method, starting from a fixed point
formula.  We want to solve :math:`A {\bf x} = {\bf b}` for *A* an
*n*-by-*n* matrix and :math:`\bf b` an *n*-dimensional vector,
for **very large** *n*.  Consider the splitting :math:`A = L + D + U`, where

* :math:`L = [\ell_{i,j}]`, :math:`\ell_{i,j} = a_{i,j}`, :math:`i > j`,
  :math:`\ell_{i,j} = 0`, :math:`i \leq j`.  *L* is lower triangular.

* :math:`D = [d_{i,j}]`, :math:`d_{i,i} = a_{i,i} \not= 0`,
  :math:`d_{i,j} = 0`, :math:`i \not= j`.  *D* is diagonal.

* :math:`U = [u_{i,j}]`, :math:`u_{i,j} = a_{i,j}`, :math:`i < j`,
  :math:`u_{i,j} = 0`, :math:`i \geq j`.  *U* is upper triangular.

Then we rewrite :math:`A {\bf x} = {\bf b}` as

.. math::

   \begin{array}{rcl}
     A {\bf x} = {\bf b}
     & \Leftrightarrow & (L + D + U) {\bf x} = {\bf b} \\
     & \Leftrightarrow & D {\bf x} = {\bf b} - L {\bf x} - U {\bf x} \\
     & \Leftrightarrow & D {\bf x} = D {\bf x} + {\bf b}
                         - L {\bf x} - U {\bf x} - D {\bf x} \\
     & \Leftrightarrow & D {\bf x} = D {\bf x} + {\bf b} - A {\bf x} \\
     & \Leftrightarrow & {\bf x} = {\bf x} + D^{-1} ( {\bf b} - A {\bf x} ).
   \end{array}

The fixed point formula
:math:`{\bf x} = {\bf x} + D^{-1} ( {\bf b} - A {\bf x} )`
is well defined if :math:`a_{i,i} \not= 0`, and it leads to the iteration

.. math::

   {\bf x}^{(k+1)} = {\bf x}^{(k)}
   + \underbrace{D^{-1} \left( {\bf b} - A {\bf x}^{(k)} \right)}_{\Delta
   {\bf x}}, \quad k = 0,1,\ldots

Writing the formula as an algorithm:

::

   Input: A, b, x(0), eps, N.
   Output: x(k), k is the number of iterations done.
   for k from 1 to N do
       dx := D**(-1) ( b - A x(k) )
       x(k+1) := x(k) + dx
       exit when (norm(dx) <= eps)

Counting the number of operations in the algorithm above, we have a cost
of :math:`O(N n^2)`, as each iteration costs :math:`O(n^2)` for the
product :math:`A {\bf x}^{(k)}`, if :math:`A` is dense.

.. topic:: Convergence of the Jacobi method

   The Jacobi method converges for strictly row-wise or column-wise
   diagonally dominant matrices, that is, if

   .. math::

      |a_{i,i}| > \sum_{j \not= i} |a_{i,j}|
      \quad {\rm or} \quad
      |a_{i,i}| > \sum_{j \not= i} |a_{j,i}|,
      \quad i=1,2,\ldots,n.

To run the algorithm above with :math:`p` processors:

* The :math:`n` rows of :math:`A` are distributed evenly
  (e.g.: :math:`p=4`):

  .. math::

     D \star
     \left[ \!\!\! \begin{array}{c}
     \Delta {\bf x}^{[0]} \\ \hline
     \Delta {\bf x}^{[1]} \\ \hline
     \Delta {\bf x}^{[2]} \\ \hline
     \Delta {\bf x}^{[3]}
     \end{array} \!\!\! \right]
     =
     \left[ \!\!\! \begin{array}{c}
     {\bf b}^{[0]} \\ \hline
     {\bf b}^{[1]} \\ \hline
     {\bf b}^{[2]} \\ \hline
     {\bf b}^{[3]}
     \end{array} \!\!\! \right]
     -
     \left[ \!\!\! \begin{array}{cccccc}
     & A^{[0,0]} & A^{[0,1]} & A^{[0,2]} & A^{[0,3]} & \\ \hline
     & A^{[1,0]} & A^{[1,1]} & A^{[1,2]} & A^{[1,3]} & \\ \hline
     & A^{[2,0]} & A^{[2,1]} & A^{[2,2]} & A^{[2,3]} & \\ \hline
     & A^{[3,0]} & A^{[3,1]} & A^{[3,2]} & A^{[3,3]} &
     \end{array} \!\!\! \right]
     \star
     \left[ \!\!\! \begin{array}{c}
     {\bf x}^{[0],(k)} \\ \hline
     {\bf x}^{[1],(k)} \\ \hline
     {\bf x}^{[2],(k)} \\ \hline
     {\bf x}^{[3],(k)}
     \end{array} \!\!\! \right]

* Synchronization is needed for the stopping criterion
  :math:`(|| \Delta {\bf x} || \leq \epsilon)`.

For :math:`|| \cdot ||`, we use the 1-norm
:math:`||\Delta {\bf x}||_1 = |\Delta x_1| + |\Delta x_2| + \cdots
+ |\Delta x_n|`.  The butterfly synchronizations are displayed
in :numref:`figbutjacobi`.

.. _figbutjacobi:

.. figure:: ./figbutjacobi.png
   :align: center

   Butterfly synchronization of a parallel Jacobi iteration
   with 4 processors.

The communication stages are as follows.
* At the start, every node must have :math:`{\bf x}^{(0)}`,
  :math:`\epsilon`, :math:`N`, a number of rows of :math:`A`,
  and the corresponding part of the right hand side :math:`{\bf b}`.

* After each update, the :math:`n/p` elements of :math:`{\bf x}^{(k+1)}`
  computed by each node must be scattered to all other nodes.

* The butterfly synchronization takes :math:`\log_2(p)` steps.
  The scattering of :math:`{\bf x}^{(k+1)}` can coincide with the
  butterfly synchronization.

The computation effort is :math:`O(n^2/p)` in each stage.

A Parallel Implementation with MPI
----------------------------------

For dimension :math:`n`, we consider the diagonally dominant system:

.. math::

   \left[ \begin{array}{cccc}
   n+1 & 1 & \cdots & 1 \\
   1 & n+1 & \cdots & 1 \\
   \vdots & \vdots & \ddots & \vdots \\
   1 & 1 & \cdots & n+1
   \end{array} \right]
   \left[ \begin{array}{c} x_1 \\ x_2 \\ \vdots \\ x_n \end{array} \right]
   =
   \left[ \begin{array}{c} 2 n \\ 2 n \\ \vdots \\ 2 n \end{array} \right].

The exact solution is :math:`x_i = 1`, for :math:`i=1,2,\ldots,n`.
We start the Jacobi iterations at :math:`{\bf x}^{(0)} = {\bf 0}`.
The parameters are :math:`\epsilon = 10^{-4}` and :math:`N = 2 n^2`.
A session where we run the program displays on screen the following:

::

   $ time /tmp/jacobi 1000
   0 : 1.998e+03
   1 : 1.994e+03
   ...
   8405 : 1.000e-04
   8406 : 9.982e-05
   computed 8407 iterations
   error : 4.986e-05

   real    0m42.411s
   user    0m42.377s
   sys     0m0.028s

C code to run Jacobi iterations is below.

::

   void run_jacobi_method
    ( int n, double **A, double *b, double epsilon, int maxit,
      int *numit, double *x );
   /*
    * Runs the Jacobi method for A*x = b.
    *
    * ON ENTRY :
    *   n        the dimension of the system;
    *   A        an n-by-n matrix A[i][i] /= 0;
    *   b        an n-dimensional vector;
    *   epsilon  accuracy requirement;
    *   maxit    maximal number of iterations;
    *   x        start vector for the iteration.
    *
    * ON RETURN :
    *   numit    number of iterations used;
    *   x        approximate solution to A*x = b.
    */

::

   void run_jacobi_method
    ( int n, double **A, double *b, double epsilon, int maxit,
      int *numit, double *x )
   {
      double *dx,*y;
      dx = (double*) calloc(n,sizeof(double));
      y = (double*) calloc(n,sizeof(double));
      int i,j,k;

      for(k=0; k < maxit; k++)
      {
         double sum = 0.0;
         for(i=0; i < n; i++)           /* compute dx and the new y */
         {
            dx[i] = b[i];
            for(j=0; j < n; j++)
               dx[i] -= A[i][j]*x[j];
            dx[i] /= A[i][i];
            y[i] = x[i] + dx[i];
            sum += ((dx[i] >= 0.0) ? dx[i] : -dx[i]);
         }
         for(i=0; i < n; i++) x[i] = y[i];   /* copy y to x */
         if(sum <= epsilon) break;           /* stop when converged */
      }
      *numit = k+1;
      free(dx); free(y);
   }

In the parallel version of ``run_jacobi_method``, every processor
updates only its own slice of :math:`n/p` consecutive components of
:math:`\bf x`, from index ``istart`` to ``istop``, and every iteration
is synchronized with two ``MPI_Allgather`` calls: one gathers the
updated components of :math:`\bf x` and one gathers the partial sums of
:math:`||\Delta {\bf x}||_1`.  The listing below is a sketch of this
version, assuming :math:`n` is divisible by :math:`p`.

::

   void run_jacobi_method
    ( int id, int p, int n, double **A, double *b,
      double epsilon, int maxit, int *numit, double *x )
   {
      double *dx,*y;
      dx = (double*) calloc(n,sizeof(double));
      y = (double*) calloc(n,sizeof(double));
      int i,j,k;
      double sums[p];            /* partial sums of the 1-norm */
      double mysum,total;
      int dnp = n/p;             /* number of rows per processor */
      int istart = id*dnp;       /* first row of this processor */
      int istop = istart + dnp;  /* one past the last row */

      for(k=0; k < maxit; k++)
      {
         mysum = 0.0;
         for(i=istart; i < istop; i++)   /* update our own slice */
         {
            dx[i] = b[i];
            for(j=0; j < n; j++)
               dx[i] -= A[i][j]*x[j];
            dx[i] /= A[i][i];
            y[i] = x[i] + dx[i];
            mysum += ((dx[i] >= 0.0) ? dx[i] : -dx[i]);
         }
         /* every processor sends its slice of y and
          * receives all slices of the new x */
         MPI_Allgather(&y[istart],dnp,MPI_DOUBLE,
                       x,dnp,MPI_DOUBLE,MPI_COMM_WORLD);
         /* gather the partial sums for the stopping criterion */
         MPI_Allgather(&mysum,1,MPI_DOUBLE,
                       sums,1,MPI_DOUBLE,MPI_COMM_WORLD);
         total = 0.0;
         for(i=0; i < p; i++) total += sums[i];
         if(total <= epsilon) break;
      }
      *numit = k+1;
      free(dx); free(y);
   }