Parallel Numerical Linear Algebra
=================================

QR Factorization
----------------

To solve an overconstrained linear system :math:`A {\bf x} = {\bf b}`
of *m* equations in *n* variables, where :math:`m > n`,
we factor :math:`A` as a product of two matrices, :math:`A = Q R`:

1. *Q* is an orthogonal matrix: :math:`Q^H Q = I`,
   the inverse of *Q* is its Hermitian transpose :math:`Q^H`;

2. *R* is upper triangular: :math:`R = [r_{i,j}]`,
   :math:`r_{i,j} = 0` if :math:`i > j`.

Solving :math:`A {\bf x} = {\bf b}` is equivalent to solving
:math:`Q(R {\bf x}) = {\bf b}`.
We multiply :math:`\bf b` by :math:`Q^T` and then apply backward
substitution to :math:`R {\bf x} = Q^T {\bf b}`.
The solution :math:`\bf x` minimizes the square of the residual
:math:`|| {\bf b} - A {\bf x} ||_2^2` (least squares).

.. topic:: Householder transformation

   For :math:`{\bf v} \not= 0`:
   :math:`\displaystyle H = I - 2\,\frac{{\bf v} {\bf v}^T}{{\bf v}^T {\bf v}}`
   is *a Householder transformation*.
   By its definition, :math:`H = H^T = H^{-1}`.
   Because of their geometrical interpretation,
   Householder transformations are also called Householder reflectors.

For some *n*-dimensional real vector :math:`\bf x`,
let :math:`{\bf e}_1 = (1,0,\ldots,0)^T`:

.. math::

   {\bf v} = {\bf x} - ||{\bf x}||_2 {\bf e}_1
   \quad \Rightarrow \quad
   H {\bf x} = ||{\bf x}||_2 {\bf e}_1.

With *H* we eliminate all entries below the first component of a column,
reducing *A* to an upper triangular matrix.
Householder QR writes *Q* as a product of Householder reflectors.
The LAPACK routine ``DGEQRF`` computes such a factorization
of an *m*-by-*n* matrix; in compact form,

.. math::

   Q = H_1 H_2 \cdots H_n = I - V T V^T,

where each :math:`H_i` is a Householder reflector with associated
vector :math:`{\bf v}_i` stored in the columns of *V*
and *T* is an upper triangular matrix.

For an *m*-by-*n* matrix :math:`A = [a_{i,j}]` with columns
:math:`{\bf a}_j`, the formulas are formatted as pseudo code
for a Householder QR factorization in :numref:`fighouseqrfactor`.

.. _fighouseqrfactor:

.. figure:: ./fighouseqrfactor.png
   :align: center

   Pseudo code for Householder QR factorization.
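The reflector is never formed as an explicit matrix: only dot products
with :math:`\bf v` are needed.
As an illustration (not part of the LAPACK or PLASMA code),
the C sketch below builds :math:`{\bf v} = {\bf x} - ||{\bf x}||_2 {\bf e}_1`
for a small test vector and applies
:math:`H = I - 2\, {\bf v} {\bf v}^T / ({\bf v}^T {\bf v})` to :math:`\bf x`;
the function names and the data are chosen for the example only.

::

   #include <stdio.h>
   #include <math.h>

   /* Euclidean norm of an n-vector. */
   static double norm2 ( int n, const double *x )
   {
      double s = 0.0;
      for(int i = 0; i < n; i++) s += x[i]*x[i];
      return sqrt(s);
   }

   /* Applies H = I - 2 v v^T / (v^T v) to y in place,
      using only dot products, never forming H. */
   static void apply_householder ( int n, const double *v, double *y )
   {
      double vtv = 0.0, vty = 0.0;
      for(int i = 0; i < n; i++) { vtv += v[i]*v[i]; vty += v[i]*y[i]; }
      const double beta = 2.0*vty/vtv;
      for(int i = 0; i < n; i++) y[i] -= beta*v[i];
   }

   int main ( void )
   {
      double x[4] = {3.0, 1.0, 5.0, 1.0};   /* illustrative data only */
      double v[4], y[4];
      const double nrm = norm2(4, x);

      for(int i = 0; i < 4; i++) v[i] = x[i];   /* v = x - ||x||_2 e_1 */
      v[0] -= nrm;
      for(int i = 0; i < 4; i++) y[i] = x[i];

      apply_householder(4, v, y);   /* expect H x = ||x||_2 e_1 */
      for(int i = 0; i < 4; i++) printf("%.15e\n", y[i]);

      return 0;
   }

For this test vector, :math:`||{\bf x}||_2 = 6` and the sketch prints
:math:`(6, 0, 0, 0)^T`, confirming the elimination below the first component.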
For a parallel implementation, we consider tiled matrices,
dividing the *m*-by-*n* matrix *A* into *b*-by-*b* blocks,
with :math:`m = b \times p` and :math:`n = b \times q`.
Then we consider *A* as a *p*-by-*q* block matrix:

.. math::

   A =
   \left[
   \begin{array}{cccc}
      A_{1,1} & A_{1,2} & \cdots & A_{1,q} \\
      A_{2,1} & A_{2,2} & \cdots & A_{2,q} \\
      \vdots  & \vdots  & \ddots & \vdots  \\
      A_{p,1} & A_{p,2} & \cdots & A_{p,q}
   \end{array}
   \right],

where each :math:`A_{i,j}` is a *b*-by-*b* matrix.
A crude classification of memory hierarchies distinguishes between
registers (small), cache (medium), and main memory (large).
To reduce data movement, we want to keep data in registers and cache
as much as possible.

To introduce QR factorization on a tiled matrix, consider for example

.. math::

   A =
   \left[
   \begin{array}{cc}
      A_{1,1} & A_{1,2} \\
      A_{2,1} & A_{2,2}
   \end{array}
   \right]
   = Q
   \left[
   \begin{array}{cc}
      R_{1,1} & R_{1,2} \\
      0       & R_{2,2}
   \end{array}
   \right].

We proceed in three steps.

1. Perform a QR factorization on the first block row:

   .. math::

      \begin{array}{l}
         A_{1,1} = Q_1 B_{1,1} \\
         (Q_1,B_{1,1}) := {\rm QR}(A_{1,1})
      \end{array}
      \quad {\rm and} \quad
      \begin{array}{l}
         A_{1,2} = Q_1 B_{1,2} \\
         B_{1,2} := Q_1^T A_{1,2}
      \end{array}

2. Couple the square blocks:

   .. math::

      (Q_2, R_{1,1}) := {\rm QR}
      \left( \left[
      \begin{array}{c} B_{1,1} \\ A_{2,1} \end{array}
      \right] \right)
      \quad {\rm and} \quad
      \left[
      \begin{array}{c} R_{1,2} \\ B_{2,2} \end{array}
      \right]
      := Q_2^T
      \left[
      \begin{array}{c} B_{1,2} \\ A_{2,2} \end{array}
      \right].

3. Perform another QR factorization:
   :math:`(Q_3,R_{2,2}) := {\rm QR}(B_{2,2})`.

Pseudo code for a tiled QR factorization is given below.

::

   for k from 1 to min(p,q) do
      DGEQRT(A[k][k], V[k][k], R[k][k], T[k][k])
      for j from k+1 to q do
         DLARFB(A[k][j], V[k][k], T[k][k], R[k][j])
      for i from k+1 to p do
         DTSQRT(R[k][k], A[i][k], V[i][k], T[i][k])
         for j from k+1 to q do
            DSSRFB(R[k][j], A[i][j], V[i][k], T[i][k])

where the function calls correspond to the following mathematical formulas:

* ``DGEQRT(A[k][k], V[k][k], R[k][k], T[k][k])`` corresponds to
  :math:`(V_{k,k}, R_{k,k}, T_{k,k}) := {\rm QR}(A_{k,k})`.

* ``DLARFB(A[k][j], V[k][k], T[k][k], R[k][j])`` corresponds to
  :math:`R_{k,j} := ( I - V_{k,k} T_{k,k} V_{k,k}^T ) A_{k,j}`.

* ``DTSQRT(R[k][k], A[i][k], V[i][k], T[i][k])`` corresponds to

  .. math::

     (V_{i,k},T_{i,k},R_{k,k}) := {\rm QR}\left( \left[
     \begin{array}{c} R_{k,k} \\ A_{i,k} \end{array}
     \right] \right).

* ``DSSRFB(R[k][j], A[i][j], V[i][k], T[i][k])`` corresponds to

  .. math::

     \left[
     \begin{array}{c} R_{k,j} \\ A_{i,j} \end{array}
     \right]
     := ( I - V_{i,k} T_{i,k} V_{i,k}^T)
     \left[
     \begin{array}{c} R_{k,j} \\ A_{i,j} \end{array}
     \right].
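To see how this loop nest unrolls into individual tile operations,
which is also the starting point for the DAG of the first exercise,
the following C sketch merely prints the sequence of kernel calls
for a given *p*-by-*q* tiling; the names match the pseudo code above,
but nothing is factored and the sketch is illustrative only.

::

   #include <stdio.h>

   /* Prints the kernel calls of the tiled QR loop nest
      for a p-by-q tiling, in the order of the pseudo code. */
   static void tiled_qr_schedule ( int p, int q )
   {
      const int kmax = (p < q) ? p : q;
      for(int k = 1; k <= kmax; k++)
      {
         printf("DGEQRT on tile (%d,%d)\n", k, k);
         for(int j = k+1; j <= q; j++)
            printf("DLARFB on tile (%d,%d) with (%d,%d)\n", k, j, k, k);
         for(int i = k+1; i <= p; i++)
         {
            printf("DTSQRT on tiles (%d,%d) and (%d,%d)\n", k, k, i, k);
            for(int j = k+1; j <= q; j++)
               printf("DSSRFB on tiles (%d,%d) and (%d,%d) with (%d,%d)\n",
                      k, j, i, j, i, k);
         }
      }
   }

   int main ( void )
   {
      tiled_qr_schedule(3, 3);   /* the 3-by-3 case of the first exercise */
      return 0;
   }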
The directory ``examples`` in
``/usr/local/plasma-installer_2.6.0/build/plasma_2.6.0``
contains ``example_dgeqrs``, an example for solving an overdetermined
linear system with a QR factorization.
Compiling on a MacBook Pro in the ``examples`` directory:

::

   $ sudo make example_dgeqrs
   gcc -O2 -DADD_ -I../include -I../quark -I/usr/local/plasma-installer_2.6.0/install/include -c example_dgeqrs.c -o example_dgeqrs.o
   gfortran example_dgeqrs.o -o example_dgeqrs -L../lib -lplasma -lcoreblasqw -lcoreblas -lplasma -L../quark -lquark -L/usr/local/plasma-installer_2.6.0/install/lib -llapacke -L/usr/local/plasma-installer_2.6.0/install/lib -ltmg -framework veclib -lpthread -lm
   $

On ``kepler`` we do the same.
To build the program in our own home directory, we define a makefile:

::

   PLASMA_DIR=/usr/local/plasma-installer_2.6.0/build/plasma_2.6.0

   example_dgeqrs:
           gcc -O2 -DADD_ -I$(PLASMA_DIR)/include -I$(PLASMA_DIR)/quark \
           -I/usr/local/plasma-installer_2.6.0/install/include \
           -c example_dgeqrs.c -o example_dgeqrs.o
           gfortran example_dgeqrs.o -o example_dgeqrs \
           -L$(PLASMA_DIR)/lib -lplasma -lcoreblasqw \
           -lcoreblas -lplasma -L$(PLASMA_DIR)/quark -lquark \
           -L/usr/local/plasma-installer_2.6.0/install/lib -lcblas \
           -L/usr/local/plasma-installer_2.6.0/install/lib -llapacke \
           -L/usr/local/plasma-installer_2.6.0/install/lib -ltmg -llapack \
           -L/usr/local/plasma-installer_2.6.0/install/lib \
           -lrefblas -lpthread -lm

Typing ``make example_dgeqrs`` will then compile and link.
Running ``example_dgeqrs`` with dimension 4000:

::

   $ time ./example_dgeqrs
   -- PLASMA is initialized to run on 1 cores.
   ============
   Checking the Residual of the solution
   -- ||Ax-B||_oo/((||A||_oo||x||_oo+||B||)_oo.N.eps) = 2.829153e-02
   -- The solution is CORRECT !
   -- Run of DGEQRS example successful !

   real    0m42.172s
   user    0m41.965s
   sys     0m0.111s
   $

Results of ``time example_dgeqrs``, for dimension 4000,
are listed in :numref:`tabexdgeqrsresults`.

.. _tabexdgeqrsresults:

.. table:: Timings on dgeqrs for n = 4000, with p threads (times in seconds).

   +----+--------+--------+-------+---------+
   | p  |  real  |  user  |  sys  | speedup |
   +====+========+========+=======+=========+
   |  1 | 42.172 | 41.965 | 0.111 |         |
   +----+--------+--------+-------+---------+
   |  2 | 21.495 | 41.965 | 0.154 |    1.96 |
   +----+--------+--------+-------+---------+
   |  4 | 11.615 | 43.322 | 0.662 |    3.63 |
   +----+--------+--------+-------+---------+
   |  8 |  6.663 | 46.259 | 1.244 |    6.33 |
   +----+--------+--------+-------+---------+
   | 16 |  3.872 | 46.793 | 2.389 |   10.89 |
   +----+--------+--------+-------+---------+

Conjugate Gradient and Krylov Subspace Methods
----------------------------------------------

We link the solving of linear systems to an optimization problem.
Let *A* be a symmetric positive definite matrix:
:math:`A^T = A` and :math:`{\bf x}^T A {\bf x} > 0`
for all :math:`{\bf x} \not= {\bf 0}`.
The minimum of

.. math::

   q({\bf x}) = \frac{1}{2} {\bf x}^T A {\bf x} - {\bf x}^T {\bf b}
   \mbox{ is at } A {\bf x} - {\bf b} = {\bf 0}.

For the exact solution :math:`\bf x` of :math:`A {\bf x} = {\bf b}`
and an approximation :math:`{\bf x}_k`,
let the error be :math:`{\bf e}_k = {\bf x}_k - {\bf x}`.  Then

.. math::

   \begin{array}{lcl}
      || {\bf e}_k ||_A^2 ~=~ {\bf e}_k^T A {\bf e}_k
      & = & ({\bf x}_k - {\bf x})^T A ({\bf x}_k - {\bf x}) \\
      & = & {\bf x}_k^T A {\bf x}_k - 2 {\bf x}_k^T A {\bf x}
            + {\bf x}^T A {\bf x} \\
      & = & {\bf x}_k^T A {\bf x}_k - 2 {\bf x}_k^T {\bf b} + c \\
      & = & 2 q({\bf x}_k) + c,
   \end{array}

where the constant :math:`c = {\bf x}^T A {\bf x}` does not depend
on :math:`{\bf x}_k`.
:math:`\Rightarrow` minimizing :math:`q({\bf x})` is the same as
minimizing the error.

The conjugate gradient method is related to the method of steepest
descent and is formulated in :numref:`figconjugategradients`.

.. _figconjugategradients:

.. figure:: ./figconjugategradients.png
   :align: center

   Mathematical definition of the conjugate gradient method.

For large sparse matrices, the conjugate gradient method is much
faster than a Cholesky factorization.

Krylov subspace methods
^^^^^^^^^^^^^^^^^^^^^^^

We consider spaces spanned by matrix-vector products.

* Input: right hand side vector :math:`\bf b` of dimension *n*,
  and some algorithm to evaluate :math:`A {\bf x}`.

* Output: an approximation for :math:`A^{-1} {\bf b}`.

As an example (:math:`n=4`), consider
:math:`{\bf y}_1 = {\bf b}`, :math:`{\bf y}_2 = A {\bf y}_1`,
:math:`{\bf y}_3 = A {\bf y}_2`, :math:`{\bf y}_4 = A {\bf y}_3`,
stored in the columns of
:math:`K = [{\bf y}_1 ~ {\bf y}_2 ~ {\bf y}_3 ~ {\bf y}_4]`.

.. math::

   \begin{array}{lcl}
      A K & = & [A {\bf y}_1 ~ A {\bf y}_2 ~ A {\bf y}_3 ~ A {\bf y}_4]
            ~=~ [{\bf y}_2 ~ {\bf y}_3 ~ {\bf y}_4 ~ A^4 {\bf y}_1 ] \\
          & = & K \left[
                \begin{array}{cccc}
                   0 & 0 & 0 & -c_1 \\
                   1 & 0 & 0 & -c_2 \\
                   0 & 1 & 0 & -c_3 \\
                   0 & 0 & 1 & -c_4
                \end{array}
                \right],
          \quad {\bf c} = - K^{-1} A^4 {\bf y}_1.
   \end{array}

:math:`\Rightarrow A K = K C`, where :math:`C` is a companion matrix.

Set :math:`K = [ {\bf b} ~~ A {\bf b} ~~ A^2 {\bf b} ~~ \cdots ~~ A^k {\bf b}]`
and compute :math:`K = QR`; then the columns of *Q* span the
*Krylov subspace*.  We look for an approximate solution of
:math:`A {\bf x} = {\bf b}` in the span of the columns of *Q*.
We can interpret this as a modified Gram-Schmidt method,
stopped after *k* projections of :math:`\bf b`.

The conjugate gradient and Krylov subspace methods rely mostly on
matrix-vector products.  Parallel implementations involve the
distribution of the matrix among the processors.
For general matrices, tiling scales best.
For specific sparse matrices, the patterns of the nonzero elements
determine the optimal distribution.
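To make the central role of the matrix-vector product explicit,
the following C sketch implements the unpreconditioned conjugate
gradient iteration on a small dense symmetric positive definite matrix.
It is an illustration only, independent of the PLASMA and PETSc codes
used in this lecture; the test matrix, the zero start vector,
and the tolerance are chosen for the example.

::

   #include <stdio.h>
   #include <math.h>

   #define N 4   /* dimension of the illustrative system */

   /* y = A*x for a dense matrix stored row by row */
   static void matvec ( const double A[N][N], const double *x, double *y )
   {
      for(int i = 0; i < N; i++)
      {
         y[i] = 0.0;
         for(int j = 0; j < N; j++) y[i] += A[i][j]*x[j];
      }
   }

   static double dot ( const double *x, const double *y )
   {
      double s = 0.0;
      for(int i = 0; i < N; i++) s += x[i]*y[i];
      return s;
   }

   int main ( void )
   {
      /* a symmetric positive definite test matrix and right hand side */
      const double A[N][N] = { { 4, 1, 0, 0 },
                               { 1, 4, 1, 0 },
                               { 0, 1, 4, 1 },
                               { 0, 0, 1, 4 } };
      const double b[N] = { 1, 2, 3, 4 };

      double x[N] = { 0, 0, 0, 0 };   /* start at zero */
      double r[N], p[N], Ap[N];

      matvec(A, x, r);                /* r = b - A*x */
      for(int i = 0; i < N; i++) r[i] = b[i] - r[i];
      for(int i = 0; i < N; i++) p[i] = r[i];
      double rr = dot(r, r);

      for(int k = 0; k < N && sqrt(rr) > 1.0e-12; k++)
      {
         matvec(A, p, Ap);            /* one matrix-vector product per step */
         const double alpha = rr/dot(p, Ap);
         for(int i = 0; i < N; i++) x[i] += alpha*p[i];
         for(int i = 0; i < N; i++) r[i] -= alpha*Ap[i];
         const double rrnew = dot(r, r);
         const double beta = rrnew/rr;
         for(int i = 0; i < N; i++) p[i] = r[i] + beta*p[i];
         rr = rrnew;
      }
      for(int i = 0; i < N; i++) printf("x[%d] = %.15e\n", i, x[i]);

      return 0;
   }

Each step costs one matrix-vector product plus a few vector updates,
which is exactly the operation that has to be distributed in a parallel
implementation.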
PETSc (see Lecture 20) provides a directory ``ksp`` with examples of
Krylov subspace methods.
We take ``ex3`` from
``/usr/local/petsc-3.4.3/src/ksp/ksp/examples/tests``.
To obtain a makefile, we compile in that examples directory and copy
the compilation and linking commands.
The makefile on kepler looks as follows:

::

   PETSC_DIR=/usr/local/petsc-3.4.3
   PETSC_ARCH=arch-linux2-c-debug

   ex3:
           $(PETSC_DIR)/$(PETSC_ARCH)/bin/mpicc -o ex3.o -c -fPIC -Wall \
           -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -g3 \
           -fno-inline -O0 -I/usr/local/petsc-3.4.3/include \
           -I/usr/local/petsc-3.4.3/arch-linux2-c-debug/include \
           -D__INSDIR__=src/ksp/ksp/examples/tests/ ex3.c
           $(PETSC_DIR)/$(PETSC_ARCH)/bin/mpicc -fPIC -Wall -Wwrite-strings \
           -Wno-strict-aliasing -Wno-unknown-pragmas -g3 -fno-inline -O0 \
           -o ex3 ex3.o \
           -Wl,-rpath,/usr/local/petsc-3.4.3/arch-linux2-c-debug/lib \
           -L/usr/local/petsc-3.4.3/arch-linux2-c-debug/lib -lpetsc \
           -Wl,-rpath,/usr/local/petsc-3.4.3/arch-linux2-c-debug/lib \
           -lflapack -lfblas -lX11 -lpthread -lm \
           -Wl,-rpath,/usr/gnat/lib/gcc/x86_64-pc-linux-gnu/4.7.4 \
           -L/usr/gnat/lib/gcc/x86_64-pc-linux-gnu/4.7.4 \
           -Wl,-rpath,/usr/gnat/lib64 -L/usr/gnat/lib64 \
           -Wl,-rpath,/usr/gnat/lib -L/usr/gnat/lib -lmpichf90 -lgfortran \
           -lm -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/4.4.7 \
           -L/usr/lib/gcc/x86_64-redhat-linux/4.4.7 -lm -lmpichcxx -lstdc++ \
           -ldl -lmpich -lopa -lmpl -lrt -lpthread -lgcc_s -ldl
           /bin/rm -f ex3.o

Running the example with 16, 8, 4, 2, and 1 processes produces the following:

::

   # time /usr/local/petsc-3.4.3/arch-linux2-c-debug/bin/mpiexec \
     -n 16 ./ex3 -m 200
   Norm of error 0.000622192 Iterations 162

   real    0m1.855s
   user    0m9.326s
   sys     0m4.352s
   # time /usr/local/petsc-3.4.3/arch-linux2-c-debug/bin/mpiexec \
     -n 8 ./ex3 -m 200
   Norm of error 0.000387934 Iterations 131

   real    0m2.156s
   user    0m10.363s
   sys     0m1.644s
   # time /usr/local/petsc-3.4.3/arch-linux2-c-debug/bin/mpiexec \
     -n 4 ./ex3 -m 200
   Norm of error 0.00030938 Iterations 105

   real    0m4.526s
   user    0m15.108s
   sys     0m1.123s
   # time /usr/local/petsc-3.4.3/arch-linux2-c-debug/bin/mpiexec \
     -n 2 ./ex3 -m 200
   Norm of error 0.000395339 Iterations 82

   real    0m20.606s
   user    0m37.036s
   sys     0m1.996s
   # time /usr/local/petsc-3.4.3/arch-linux2-c-debug/bin/mpiexec \
     -n 1 ./ex3 -m 200
   Norm of error 0.000338657 Iterations 83

   real    1m34.991s
   user    1m25.251s
   sys     0m9.579s

Bibliography
------------

1. S. Balay, J. Brown, K. Buschelman, V. Eijkhout, W. Gropp,
   D. Kaushik, M. Knepley, L. Curfman McInnes, B. Smith, and H. Zhang.
   *PETSc Users Manual. Revision 3.4*.
   Mathematics and Computer Science Division,
   Argonne National Laboratory, 15 October 2013.

2. E. Agullo, J. Dongarra, B. Hadri, J. Kurzak, J. Langou, H. Ltaief,
   P. Luszczek, and A. YarKhan.
   *PLASMA Users' Guide. Parallel Linear Algebra Software for
   Multicore Architectures. Version 2.0*.

3. A. Buttari, J. Langou, J. Kurzak, and J. Dongarra.
   A class of parallel tiled linear algebra algorithms for
   multicore architectures.
   *Parallel Computing* 35:38-53, 2009.

Exercises
---------

1. Take a 3-by-3 block matrix and apply the tiled QR algorithm
   step by step, indicating which block is processed in each step
   along with the type of operation.
   Draw a Directed Acyclic Graph linking steps that have to wait
   on each other.
   On a 3-by-3 block matrix, what is the number of cores needed to
   achieve the best speedup?

2. Using the ``time`` command to measure the speedup of
   ``example_dgeqrs`` also takes into account all operations to
   define the problem and to test the quality of the solution.
   Run experiments with modifications of the source code to obtain
   more accurate timings of the parallel QR factorization;
   a possible timing wrapper is sketched below.
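For the second exercise, one possible approach is to surround only the
stage of interest with a wall clock timer, as in the C sketch below.
Here ``run_factorization`` is a hypothetical stub standing in for the
factorization stage of ``example_dgeqrs.c``; it is not part of the
PLASMA example itself.

::

   #include <stdio.h>
   #include <sys/time.h>

   /* Replace the body of this stub with the computation to be timed,
      e.g. the QR factorization stage in example_dgeqrs.c. */
   static void run_factorization ( void )
   {
      /* ... the computation of interest goes here ... */
   }

   /* returns the wall clock time in seconds */
   static double wall_time ( void )
   {
      struct timeval tv;
      gettimeofday(&tv, NULL);
      return tv.tv_sec + 1.0e-6*tv.tv_usec;
   }

   int main ( void )
   {
      const double start = wall_time();
      run_factorization();            /* only this stage is timed */
      const double stop = wall_time();

      printf("elapsed wall clock time : %.3f seconds\n", stop - start);
      return 0;
   }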