Parallel Gaussian Elimination
=============================

LU and Cholesky Factorization
-----------------------------

Consider as given a square *n*-by-*n* matrix *A* and a corresponding
right hand side vector :math:`\bf b` of length *n*.
To solve an *n*-dimensional linear system :math:`A {\bf x} = {\bf b}`
we factor *A* as a product of two triangular matrices, :math:`A = L U`:

1. *L* is lower triangular, :math:`L = [\ell_{i,j}]`,
   :math:`\ell_{i,j} = 0` if :math:`j > i` and :math:`\ell_{i,i} = 1`.

2. *U* is upper triangular, :math:`U = [u_{i,j}]`,
   :math:`u_{i,j} = 0` if :math:`i > j`.

Solving :math:`A {\bf x} = {\bf b}` is then equivalent to solving
:math:`L(U {\bf x}) = {\bf b}` in two stages:

1. Forward substitution: :math:`L {\bf y} = {\bf b}`.

2. Backward substitution: :math:`U {\bf x} = {\bf y}`.

Factoring *A* costs :math:`O(n^3)` operations;
solving the two triangular systems costs :math:`O(n^2)`.
For numerical stability, we apply partial pivoting and compute
:math:`PA = LU`, where *P* is a permutation matrix.
The steps in the LU factorization of the matrix *A* are shown in
:numref:`figlufacsteps`.

.. _figlufacsteps:

.. figure:: ./figlufacsteps.png
   :align: center

   Steps in the LU factorization.

For column :math:`j = 1,2,\ldots,n-1` in *A* do

1. find the largest element :math:`a_{i,j}` in column :math:`j`
   (for :math:`i \geq j`);

2. if :math:`i \not= j`, then swap rows :math:`i` and :math:`j`;

3. for :math:`i = j+1, \ldots, n`, for :math:`k = j+1, \ldots, n` do

   :math:`\displaystyle a_{i,k} := a_{i,k}
   - \left( \frac{a_{i,j}}{a_{j,j}} \right) a_{j,k}`.

If :math:`A` is symmetric, :math:`A^T = A`, and positive definite:
:math:`\forall {\bf x} \not= {\bf 0}: {\bf x}^T A {\bf x} > 0`,
then it is better to compute a Cholesky factorization
:math:`A = L L^T`, where :math:`L` is a lower triangular matrix.
Because :math:`A` is positive definite, no pivoting is needed,
and we need about half as many operations as LU.
Pseudo code is listed below:

::

   for j from 1 to n do
       for k from 1 to j-1 do
           a[j][j] := a[j][j] - a[j][k]**2
       a[j][j] := sqrt(a[j][j])
       for i from j+1 to n do
           for k from 1 to j-1 do
               a[i][j] := a[i][j] - a[i][k]*a[j][k]
           a[i][j] := a[i][j]/a[j][j]

Let *A* be a symmetric, positive definite *n*-by-*n* matrix.
In preparation for a parallel implementation, we consider tiled matrices.
For tile size *b*, let :math:`n = p \times b` and consider

.. math::

   A =
   \left[
   \begin{array}{cccc}
      A_{1,1} & A_{2,1}^T & \cdots & A_{p,1}^T \\
      A_{2,1} & A_{2,2}   & \cdots & A_{p,2}^T \\
      \vdots  & \vdots    & \ddots & \vdots \\
      A_{p,1} & A_{p,2}   & \cdots & A_{p,p}
   \end{array}
   \right],

where :math:`A_{i,j}` is a *b*-by-*b* matrix.

A crude classification of memory hierarchies distinguishes between
registers (small), cache (medium), and main memory (large).
To reduce data movement, we want to keep data
in registers and cache as much as possible.
Pseudo code for a tiled Cholesky factorization is listed below.

::

   for k from 1 to p do
       DPOTF2(A[k][k], L[k][k])               # L[k][k] := Cholesky(A[k][k])
       for i from k+1 to p do
           DTRSM(L[k][k], A[i][k], L[i][k])   # L[i][k] := A[i][k]*L[k][k]**(-T)
       end for
       for i from k+1 to p do
           for j from k+1 to p do
               DGSMM(L[i][k], L[j][k], A[i][j])
               # A[i][j] := A[i][j] - L[i][k]*L[j][k]**T
           end for
       end for
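To make the tile operations concrete, below is a minimal sequential C
sketch of this tiled Cholesky factorization.
It is an illustration under stated assumptions, not PLASMA code:
the tile layout (tiles stored contiguously, tile row by tile row),
the accessor ``tile``, and the helpers ``potf2``, ``trsm``, and ``gsmm``
are stand-ins of our own for the DPOTF2, DTRSM, and DGSMM calls
of the pseudo code, and the update loop only visits tiles on
or below the diagonal because *A* is symmetric.

::

   #include <math.h>
   #include <stddef.h>

   /* Pointer to tile A[i][j] (0-based), assuming the p*p tiles of
      size b-by-b are stored contiguously, tile row by tile row. */
   double *tile(double *A, int p, int b, int i, int j) {
       return A + ((size_t)i * p + j) * b * b;
   }

   /* DPOTF2 stand-in: unblocked Cholesky of one b-by-b tile,
      in place, overwriting its lower triangle with L. */
   void potf2(double *a, int b) {
       for (int j = 0; j < b; j++) {
           for (int k = 0; k < j; k++)
               a[j*b+j] -= a[j*b+k] * a[j*b+k];
           a[j*b+j] = sqrt(a[j*b+j]);
           for (int i = j + 1; i < b; i++) {
               for (int k = 0; k < j; k++)
                   a[i*b+j] -= a[i*b+k] * a[j*b+k];
               a[i*b+j] /= a[j*b+j];
           }
       }
   }

   /* DTRSM stand-in: a := a * l^(-T) for a lower triangular tile l,
      solving one row of a at a time. */
   void trsm(const double *l, double *a, int b) {
       for (int i = 0; i < b; i++)
           for (int j = 0; j < b; j++) {
               for (int k = 0; k < j; k++)
                   a[i*b+j] -= a[i*b+k] * l[j*b+k];
               a[i*b+j] /= l[j*b+j];
           }
   }

   /* DGSMM stand-in: a := a - l1 * l2^T. */
   void gsmm(const double *l1, const double *l2, double *a, int b) {
       for (int i = 0; i < b; i++)
           for (int j = 0; j < b; j++)
               for (int k = 0; k < b; k++)
                   a[i*b+j] -= l1[i*b+k] * l2[j*b+k];
   }

   /* Tiled Cholesky, mirroring the pseudo code above. */
   void tiled_cholesky(double *A, int p, int b) {
       for (int k = 0; k < p; k++) {
           potf2(tile(A, p, b, k, k), b);
           for (int i = k + 1; i < p; i++)
               trsm(tile(A, p, b, k, k), tile(A, p, b, i, k), b);
           for (int i = k + 1; i < p; i++)
               for (int j = k + 1; j <= i; j++)
                   gsmm(tile(A, p, b, i, k), tile(A, p, b, j, k),
                        tile(A, p, b, i, j), b);
       }
   }

A real implementation would replace the triple loops
by optimized core BLAS routines; only the driver loop would remain.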
Blocked LU Factorization
------------------------

In deriving blocked formulations of LU, consider a 3-by-3 blocked matrix.
The optimal size of the blocks is machine dependent.

.. math::

   \left[
   \begin{array}{ccc}
      A_{1,1} & A_{1,2} & A_{1,3} \\
      A_{2,1} & A_{2,2} & A_{2,3} \\
      A_{3,1} & A_{3,2} & A_{3,3}
   \end{array}
   \right]
   =
   \left[
   \begin{array}{ccc}
      L_{1,1} &         & \\
      L_{2,1} & L_{2,2} & \\
      L_{3,1} & L_{3,2} & L_{3,3}
   \end{array}
   \right]
   \left[
   \begin{array}{ccc}
      U_{1,1} & U_{1,2} & U_{1,3} \\
              & U_{2,2} & U_{2,3} \\
              &         & U_{3,3}
   \end{array}
   \right]

Expanding the product on the right hand side and equating blocks
with the matrix on the left gives the defining formulas
for the blocked LU factorization.

.. math::

   \begin{array}{lll}
      A_{1,1} = L_{1,1} U_{1,1}
    & A_{1,2} = L_{1,1} U_{1,2}
    & A_{1,3} = L_{1,1} U_{1,3} \\
      A_{2,1} = L_{2,1} U_{1,1}
    & A_{2,2} = L_{2,1} U_{1,2} + L_{2,2} U_{2,2}
    & A_{2,3} = L_{2,1} U_{1,3} + L_{2,2} U_{2,3} \\
      A_{3,1} = L_{3,1} U_{1,1}
    & A_{3,2} = L_{3,1} U_{1,2} + L_{3,2} U_{2,2}
    & A_{3,3} = L_{3,1} U_{1,3} + L_{3,2} U_{2,3} + L_{3,3} U_{3,3} \\
   \end{array}

The formulas we are deriving give rise to the right looking LU.
We store the :math:`L_{i,j}`'s and :math:`U_{i,j}`'s in the original matrix:

.. math::

   \left[
   \begin{array}{ccc}
      A_{1,1} & A_{1,2} & A_{1,3} \\
      A_{2,1} & A_{2,2} & A_{2,3} \\
      A_{3,1} & A_{3,2} & A_{3,3}
   \end{array}
   \right]
   =
   \left[
   \begin{array}{ccc}
      L_{1,1} &   & \\
      L_{2,1} & I & \\
      L_{3,1} &   & I
   \end{array}
   \right]
   \left[
   \begin{array}{ccc}
      U_{1,1} & U_{1,2} & U_{1,3} \\
              & B_{2,2} & B_{2,3} \\
              & B_{3,2} & B_{3,3}
   \end{array}
   \right]

The matrices :math:`B_{i,j}` are obtained after the first block LU step.
To find :math:`L_{2,2}`, :math:`L_{3,2}`, and :math:`U_{2,2}` we use

.. math::

   \left\{
      \begin{array}{l}
         A_{2,2} = L_{2,1} U_{1,2} + L_{2,2} U_{2,2} \\
         A_{3,2} = L_{3,1} U_{1,2} + L_{3,2} U_{2,2}
      \end{array}
   \right.
   \quad {\rm and} \quad
   \left\{
      \begin{array}{l}
         A_{2,2} = L_{2,1} U_{1,2} + B_{2,2} \\
         A_{3,2} = L_{3,1} U_{1,2} + B_{3,2}
      \end{array}
   \right.

Eliminating the common expressions :math:`A_{2,2} - L_{2,1} U_{1,2}`
and :math:`A_{3,2} - L_{3,1} U_{1,2}` gives

.. math::

   \left\{
      \begin{array}{c}
         B_{2,2} = L_{2,2} U_{2,2} \\
         B_{3,2} = L_{3,2} U_{2,2}
      \end{array}
   \right.

Via LU on :math:`B_{2,2}` we obtain :math:`L_{2,2}` and :math:`U_{2,2}`.
Then: :math:`L_{3,2} := B_{3,2} U_{2,2}^{-1}`.
The formulas we derived are similar to the scalar case
and are called *right looking*.
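To make the right looking organization concrete, below is a minimal C
sketch of one sweep of blocked right looking LU *without* pivoting;
the name ``blocked_lu``, the row major layout, and the assumption
that *n* is a multiple of the block size ``nb`` are ours,
for illustration only.
A production code would add partial pivoting
and call BLAS kernels for the panel solves and the trailing update.

::

   /* Right looking blocked LU without pivoting on an n-by-n row major
      matrix, n a multiple of nb; L (unit diagonal) and U overwrite A. */
   void blocked_lu(double *A, int n, int nb) {
       for (int k = 0; k < n; k += nb) {
           int e = k + nb;          /* end of the current block */
           /* factor the diagonal block: A(k:e,k:e) = L[k][k]*U[k][k] */
           for (int j = k; j < e; j++)
               for (int i = j + 1; i < e; i++) {
                   A[i*n + j] /= A[j*n + j];        /* multiplier */
                   for (int c = j + 1; c < e; c++)
                       A[i*n + c] -= A[i*n + j] * A[j*n + c];
               }
           /* block column of L: solve L(i,k)*U[k][k] = A(i,k) */
           for (int i = e; i < n; i++)
               for (int j = k; j < e; j++) {
                   for (int c = k; c < j; c++)
                       A[i*n + j] -= A[i*n + c] * A[c*n + j];
                   A[i*n + j] /= A[j*n + j];
               }
           /* block row of U: solve L[k][k]*U(k,j) = A(k,j), unit diag */
           for (int j = e; j < n; j++)
               for (int i = k + 1; i < e; i++)
                   for (int c = k; c < i; c++)
                       A[i*n + j] -= A[i*n + c] * A[c*n + j];
           /* right looking update of the trailing submatrix */
           for (int i = e; i < n; i++)
               for (int j = e; j < n; j++)
                   for (int c = k; c < e; c++)
                       A[i*n + j] -= A[i*n + c] * A[c*n + j];
       }
   }

The four stages match the block formulas above:
the diagonal factorization gives :math:`L_{k,k}` and :math:`U_{k,k}`,
the block column computes :math:`L_{i,k} = A_{i,k} U_{k,k}^{-1}`,
the block row computes :math:`U_{k,j} = L_{k,k}^{-1} A_{k,j}`,
and the update subtracts :math:`L_{i,k} U_{k,j}` from :math:`A_{i,j}`.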
But we may organize the LU factorization differently,
as shown in :numref:`figrightleftlooking`.

.. _figrightleftlooking:

.. figure:: ./figrightleftlooking.png
   :align: center

   Right and left looking organization of tiled LU factorization.

What is good looking?  For data access, left looking is best.
We derive the left looking formulas, going from :math:`P_1 A`
to :math:`P_2 P_1 A`, where :math:`P_1` and :math:`P_2` are
the permutation matrices of the first and second block step:

.. math::

   \left[
   \begin{array}{ccc}
      L_{1,1} &   & \\
      L_{2,1} & I & \\
      L_{3,1} &   & I
   \end{array}
   \right]
   \left[
   \begin{array}{ccc}
      U_{1,1} & A_{1,2} & A_{1,3} \\
              & A_{2,2} & A_{2,3} \\
              & A_{3,2} & A_{3,3}
   \end{array}
   \right]
   \rightarrow
   \left[
   \begin{array}{ccc}
      L_{1,1} &         & \\
      L_{2,1} & L_{2,2} & \\
      L_{3,1} & L_{3,2} & I
   \end{array}
   \right]
   \left[
   \begin{array}{ccc}
      U_{1,1} & U_{1,2} & A_{1,3} \\
              & U_{2,2} & A_{2,3} \\
              &         & A_{3,3}
   \end{array}
   \right]

We keep the original :math:`A_{i,j}`'s and postpone updating to the right.

1. We get :math:`U_{1,2}` via :math:`A_{1,2} = L_{1,1} U_{1,2}`
   and compute :math:`U_{1,2} = L_{1,1}^{-1} A_{1,2}`.

2. To compute :math:`L_{2,2}` and :math:`L_{3,2}`, do
   :math:`\displaystyle \left[ \begin{array}{c} B_{2,2} \\ B_{3,2}
   \end{array} \right] = \left[ \begin{array}{c} A_{2,2} \\ A_{3,2}
   \end{array} \right] - \left[ \begin{array}{c} L_{2,1} \\ L_{3,1}
   \end{array} \right] U_{1,2}` and factor
   :math:`\displaystyle P_2 \left[ \begin{array}{c} B_{2,2} \\ B_{3,2}
   \end{array} \right] = \left[ \begin{array}{c} L_{2,2} \\ L_{3,2}
   \end{array} \right] U_{2,2}` as before.  Replace
   :math:`\displaystyle \left[ \begin{array}{c} A_{2,3} \\ A_{3,3}
   \end{array} \right] := P_2 \left[ \begin{array}{c} A_{2,3} \\ A_{3,3}
   \end{array} \right]` and
   :math:`\displaystyle \left[ \begin{array}{c} L_{2,1} \\ L_{3,1}
   \end{array} \right] := P_2 \left[ \begin{array}{c} L_{2,1} \\ L_{3,1}
   \end{array} \right]`.

Pseudo code for the tiled algorithm for LU factorization is below.

::

   for k from 1 to p do
       DGETF(A[k][k], L[k][k], U[k][k], P[k][k])
       for j from k+1 to p do
           DGESSM(A[k][j], L[k][k], P[k][k], U[k][j])
       for i from k+1 to p do
           DTSTRF(U[k][k], A[i][k], P[i][k])
           for j from k+1 to p do
               DSSSM(U[k][j], A[i][j], L[i][k], P[i][k])

where the function calls correspond to the following formulas:

* ``DGETF(A[k][k], L[k][k], U[k][k], P[k][k])`` corresponds to
  :math:`L_{k,k}, U_{k,k}, P_{k,k} := \mbox{LU}(A_{k,k})`.

* ``DGESSM(A[k][j], L[k][k], P[k][k], U[k][j])`` corresponds to
  :math:`U_{k,j} := L_{k,k}^{-1} P_{k,k} A_{k,j}`.

* ``DTSTRF(U[k][k], A[i][k], P[i][k])`` corresponds to
  :math:`\displaystyle U_{k,k}, L_{i,k}, P_{i,k} := \mbox{LU}
  \left( \left[ \begin{array}{c} U_{k,k} \\ A_{i,k}
  \end{array} \right] \right)`.

* ``DSSSM(U[k][j], A[i][j], L[i][k], P[i][k])`` corresponds to
  :math:`\displaystyle \left[ \begin{array}{c} U_{k,j} \\ A_{i,j}
  \end{array} \right] := L_{i,k}^{-1} P_{i,k}
  \left[ \begin{array}{c} U_{k,j} \\ A_{i,j} \end{array} \right]`.

The PLASMA Software Library
---------------------------

PLASMA stands for Parallel Linear Algebra Software
for Multicore Architectures.
The software is written in FORTRAN and C
and is designed for efficiency on homogeneous multicore processors
and on multi-socket systems of multicore processors.
It is built from a small set of sequential routines,
referred to as the *core BLAS*, and is free to download.

Capabilities and limitations: PLASMA can solve dense linear systems
and least squares problems.
Unlike LAPACK, PLASMA currently does not solve eigenvalue
or singular value problems and provides no support for band matrices.

The Basic Linear Algebra Subprograms (BLAS) come in three levels:

1. Level-1 BLAS: vector-vector operations, :math:`O(n)` cost,
   e.g.: inner products, norms, :math:`{\bf x} \pm {\bf y}`,
   :math:`\alpha {\bf x} + {\bf y}`.

2. Level-2 BLAS: matrix-vector operations, :math:`O(mn)` cost, e.g.:

   * :math:`{\bf y} = \alpha A {\bf x} + \beta {\bf y}`;

   * :math:`A = A + \alpha {\bf x} {\bf y}^T`, a rank one update;

   * :math:`{\bf x} = T^{-1} {\bf b}`, for :math:`T` a triangular matrix.

3. Level-3 BLAS: matrix-matrix operations, :math:`O(kmn)` cost, e.g.:

   * :math:`C = \alpha A B + \beta C`;

   * :math:`C = \alpha A A^T + \beta C`, a rank *k* update
     of a symmetric matrix;

   * :math:`B = \alpha T B`, for :math:`T` a triangular matrix;

   * :math:`B = \alpha T^{-1} B`, solving a linear system
     with many right hand sides.

The execution is asynchronous and graph driven.
We view a blocked algorithm as a Directed Acyclic Graph (DAG):
nodes are computational tasks performed in kernel subroutines,
and edges represent the dependencies among the tasks.
Given a DAG, tasks are scheduled asynchronously and independently,
as long as the dependencies imposed by the edges of the DAG
are not violated.
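As an illustration of graph driven execution, the sketch below
expresses the tiled Cholesky loop with OpenMP task dependences,
reusing the hypothetical ``tile``, ``potf2``, ``trsm``, and ``gsmm``
of the earlier sketch.
This is our illustration only: PLASMA schedules its tasks
with its own runtime (QUARK), not with OpenMP.
The ``depend`` clauses encode the edges of the DAG,
and the runtime may run any tasks whose dependencies are satisfied.

::

   /* compile with: cc -fopenmp, linking the kernels sketched earlier */
   double *tile(double *A, int p, int b, int i, int j);
   void potf2(double *a, int b);
   void trsm(const double *l, double *a, int b);
   void gsmm(const double *l1, const double *l2, double *a, int b);

   /* Tiled Cholesky as a task graph: each kernel call becomes a task
      and the depend clauses encode the edges of the DAG. */
   void tiled_cholesky_tasks(double *A, int p, int b) {
       #pragma omp parallel
       #pragma omp single
       for (int k = 0; k < p; k++) {
           double *akk = tile(A, p, b, k, k);
           #pragma omp task depend(inout: akk[0:b*b])
           potf2(akk, b);                      /* factor diagonal tile */
           for (int i = k + 1; i < p; i++) {
               double *aik = tile(A, p, b, i, k);
               #pragma omp task depend(in: akk[0:b*b]) \
                                depend(inout: aik[0:b*b])
               trsm(akk, aik, b);              /* panel solve */
           }
           for (int i = k + 1; i < p; i++) {
               double *aik = tile(A, p, b, i, k);
               double *aii = tile(A, p, b, i, i);
               #pragma omp task depend(in: aik[0:b*b]) \
                                depend(inout: aii[0:b*b])
               gsmm(aik, aik, aii, b);         /* diagonal tile update */
               for (int j = k + 1; j < i; j++) {
                   double *ajk = tile(A, p, b, j, k);
                   double *aij = tile(A, p, b, i, j);
                   #pragma omp task depend(in: aik[0:b*b], ajk[0:b*b]) \
                                    depend(inout: aij[0:b*b])
                   gsmm(aik, ajk, aij, b);     /* trailing update */
               }
           }
       }   /* implicit barrier at the end of the parallel region */
   }

Because the dependences are declared per tile, updates of step
:math:`k+1` may start as soon as the tiles they need are ready,
before all tasks of step :math:`k` have finished.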
A critical path in the DAG is a longest path of dependent tasks;
no matter how many cores are available,
the tasks on this path must run one after the other,
so its length bounds the parallel execution time from below.
The scheduling policy assigns higher priority
to those tasks that lie on the critical path.

The directory ``examples``
in ``/usr/local/plasma-installer_2.6.0/build/plasma_2.6.0``
contains ``example_cposv``, an example for the Cholesky factorization
of a positive definite matrix, in single precision complex arithmetic;
``example_dposv`` is the double precision real version.
Running ``make`` in the ``examples`` directory:

::

   [root@kepler examples]# make example_cposv
   gcc -O2 -DADD_ -I../include -I../quark \
       -I/usr/local/plasma-installer_2.6.0/install/include -c \
       example_cposv.c -o example_cposv.o
   gfortran example_cposv.o -o example_cposv -L../lib -lplasma \
       -lcoreblasqw -lcoreblas -lplasma -L../quark -lquark \
       -L/usr/local/plasma-installer_2.6.0/install/lib -lcblas \
       -L/usr/local/plasma-installer_2.6.0/install/lib -llapacke \
       -L/usr/local/plasma-installer_2.6.0/install/lib -ltmg -llapack \
       -L/usr/local/plasma-installer_2.6.0/install/lib -lrefblas \
       -lpthread -lm
   [root@kepler examples]#

A run of ``example_dposv``, the Cholesky factorization
on a matrix of dimension 10,000:

::

   [root@kepler examples]# time ./example_dposv
   -- PLASMA is initialized to run on 2 cores.
   ============
   Checking the Residual of the solution
   -- ||Ax-B||_oo/((||A||_oo||x||_oo+||B||_oo).N.eps) = 2.710205e-03
   -- The solution is CORRECT !
   -- Run of DPOSV example successful !

   real    1m22.271s
   user    2m42.534s
   sys     0m1.064s
   [root@kepler examples]#

The wall clock times for Cholesky on dimension 10,000
are listed in :numref:`tabtimescholesky`.

.. _tabtimescholesky:

.. table:: Wall clock times for Cholesky factorization.

   +--------+------------+---------+
   | #cores | real time  | speedup |
   +========+============+=========+
   |      1 | 2m 40.913s |         |
   +--------+------------+---------+
   |      2 | 1m 22.271s |    1.96 |
   +--------+------------+---------+
   |      4 |    42.621s |    3.78 |
   +--------+------------+---------+
   |      8 |    23.480s |    6.85 |
   +--------+------------+---------+
   |     16 |    12.647s |   12.72 |
   +--------+------------+---------+

Bibliography
------------

1. E. Agullo, J. Dongarra, B. Hadri, J. Kurzak, J. Langou, H. Ltaief,
   P. Luszczek, and A. YarKhan.
   *PLASMA Users' Guide.  Parallel Linear Algebra Software
   for Multicore Architectures.  Version 2.0.*

2. A. Buttari, J. Langou, J. Kurzak, and J. Dongarra.
   A class of parallel tiled linear algebra algorithms
   for multicore architectures.
   *Parallel Computing*, 35:38-53, 2009.

Exercises
---------

1. Write your own parallel shared memory version
   of the Cholesky factorization,
   using OpenMP, Pthreads, or the Intel TBB.

2. Derive right looking LU factorization formulas with pivoting,
   i.e., introducing permutation matrices *P*.
   First develop the formulas for a 3-by-3 block matrix
   and then generalize them into an algorithm
   for any *p*-by-*p* block matrix.

3. Take an appropriate example from the PLASMA installation
   to test the speedup of the multicore LU factorization routines.