Solving Triangular Systems
==========================

Triangular linear systems occur as the result of an LU or a QR decomposition.

Ill Conditioned Matrices and Quad Doubles
-----------------------------------------

Consider the 4-by-4 lower triangular matrix

.. math::

   L = \left[ \begin{array}{rrrr}
       1 &  0 &  0 & 0 \\
      -2 &  1 &  0 & 0 \\
      -2 & -2 &  1 & 0 \\
      -2 & -2 & -2 & 1
   \end{array} \right].

What we know from numerical analysis:

1. The condition number of a matrix magnifies roundoff errors.

2. The hardware double precision is
   :math:`2^{-52} \approx 2.2 \times 10^{-16}`.

3. We get no accuracy from condition numbers larger than :math:`10^{16}`.

An experiment in an interactive Julia session illustrates that
ill conditioned matrices occur already in modest dimensions.

::

   julia> using LinearAlgebra

   julia> A = ones(32,32);

   julia> D = Diagonal(A);

   julia> L = LowerTriangular(A);

   julia> LmD = L - D;

   julia> L2 = D - 2*LmD;

   julia> cond(L2)
   2.41631630569077e16

The condition number is estimated at :math:`2.4 \times 10^{16}`.

A floating-point number consists of a sign bit, an exponent, and a
fraction (also known as the mantissa).  Almost all microprocessors
follow the IEEE 754 standard.  GPU hardware supports 32-bit floats
(single precision) and, for compute capability :math:`\geq 1.3`,
also 64-bit floats (double precision).

Numerical analysis studies algorithms for continuous problems:
problems are analyzed for their sensitivity to errors in the input,
and algorithms for their propagation of roundoff errors.

Floating-point addition is *not* associative!
Parallel algorithms compute and accumulate the results in an order
that is different from their sequential versions.
For example, adding a sequence of positive numbers is more accurate
if the numbers are sorted in increasing order.

Instead of speedup, we can ask questions about *quality up*:

* If we can afford to keep the total running time constant,
  does a faster computer give us more accurate results?

* How many more processors do we need to guarantee an accurate result?

A quad double is an unevaluated sum of 4 doubles; it improves the
working precision from :math:`2.2 \times 10^{-16}`
to :math:`2.4 \times 10^{-63}`.
The software ``QDlib`` is presented in the paper
*Algorithms for quad-double precision floating point arithmetic*
by Y. Hida, X.S. Li, and D.H. Bailey, published in the
*15th IEEE Symposium on Computer Arithmetic*, pages 155-162, IEEE, 2001.

A quad double builds on ``double double`` arithmetic.
Some features of working with double doubles are:

* The least significant part of a double double can be interpreted
  as a compensation for the roundoff error
  (see the sketch following the Newton example below).

* Predictable overhead: working with double doubles is of the same
  cost as working with complex numbers.

Consider Newton's method to compute :math:`\sqrt{x}`,
as defined in the code below.

::

   #include <iostream>
   #include <iomanip>
   #include <qd/qd_real.h>
   using namespace std;

   qd_real newton ( qd_real x )
   {
      qd_real y = x;

      for(int i=0; i<10; i++)
         y -= (y*y - x)/(2.0*y);

      return y;
   }

The main program is as follows.

::

   int main ( int argc, char *argv[] )
   {
      cout << "give x : ";
      qd_real x; cin >> x;

      cout << setprecision(64);
      cout << "        x : " << x << endl;

      qd_real y = newton(x);
      cout << "  sqrt(x) : " << y << endl;

      qd_real z = y*y;
      cout << "sqrt(x)^2 : " << z << endl;

      return 0;
   }

If the program is in the file ``newton4sqrt.cpp``
and the makefile contains

::

   QD_ROOT=/usr/local/qd-2.3.13
   QD_LIB=/usr/local/lib

   newton4sqrt:
   	g++ -I$(QD_ROOT)/include newton4sqrt.cpp \
   		$(QD_LIB)/libqd.a -o /tmp/newton4sqrt

then we can create the executable simply by typing ``make newton4sqrt``.
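Before looking at a run of the Newton program, we illustrate the
compensation idea mentioned above.  Double double arithmetic builds on
error-free transformations such as the *fast two-sum* of Dekker.
The code below is a minimal sketch for illustration only; it is not
taken from ``QDlib`` and the name ``fast_two_sum`` is ours.

::

   #include <iostream>
   #include <iomanip>

   // error-free transformation: for |a| >= |b|, computes s = fl(a+b)
   // and err such that a + b = s + err holds exactly
   void fast_two_sum ( double a, double b, double& s, double& err )
   {
      s = a + b;
      err = b - (s - a);
   }

   int main ( void )
   {
      double s, err;

      fast_two_sum(1.0, 1.0e-17, s, err);

      std::cout << std::setprecision(17);
      std::cout << "   s : " << s << std::endl;   // prints 1, the 1.0e-17 is lost
      std::cout << " err : " << err << std::endl; // recovers the lost 1.0e-17

      return 0;
   }

The high part of a double double plays the role of ``s`` and the low
part the role of ``err``, so the roundoff error of the addition is
not lost.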
A run with the code for :math:`\sqrt{2}` is shown below.

::

   2.0000000000000000000000000000000000000000000000000000000000000000e+00
   1.5000000000000000000000000000000000000000000000000000000000000000e+00
   1.4166666666666666666666666666666666666666666666666666666666666667e+00
   1.4142156862745098039215686274509803921568627450980392156862745098e+00
   1.4142135623746899106262955788901349101165596221157440445849050192e+00
   1.4142135623730950488016896235025302436149819257761974284982894987e+00
   1.4142135623730950488016887242096980785696718753772340015610131332e+00
   1.4142135623730950488016887242096980785696718753769480731766797380e+00
   1.4142135623730950488016887242096980785696718753769480731766797380e+00
   residual : 0.0000e+00

On a Parallel Shared Memory Computer with OpenMP
------------------------------------------------

We will rewrite the formulas for forward substitution.
Expanding the matrix-vector product :math:`L {\bf y}` in
:math:`L {\bf y} = {\bf b}` leads to

.. math::

   \left\{
      \begin{array}{lcr}
         y_1 & = & b_1 \\
         \ell_{2,1} y_1 + y_2 & = & b_2 \\
         \ell_{3,1} y_1 + \ell_{3,2} y_2 + y_3 & = & b_3 \\
         & \vdots & \\
         \ell_{n,1} y_1 + \ell_{n,2} y_2 + \ell_{n,3} y_3 + \cdots
         + \ell_{n,n-1} y_{n-1} + y_n & = & b_n
      \end{array}
   \right.

and solving for the diagonal elements gives

.. math::

   \begin{array}{lcl}
      y_1 & = & b_1 \\
      y_2 & = & b_2 - \ell_{2,1} y_1 \\
      y_3 & = & b_3 - \ell_{3,1} y_1 - \ell_{3,2} y_2 \\
          & \vdots & \\
      y_n & = & b_n - \ell_{n,1} y_1 - \ell_{n,2} y_2 - \cdots
              - \ell_{n,n-1} y_{n-1}
   \end{array}

In rewriting the formulas, consider the case for :math:`n=5`.
Solving :math:`L {\bf y} = {\bf b}` for :math:`n = 5` happens
in the following steps:

1. :math:`{\bf y} := {\bf b}`

2. :math:`y_2 := y_2 - \ell_{2,1} \star y_1`,
   :math:`y_3 := y_3 - \ell_{3,1} \star y_1`,
   :math:`y_4 := y_4 - \ell_{4,1} \star y_1`,
   :math:`y_5 := y_5 - \ell_{5,1} \star y_1`

3. :math:`y_3 := y_3 - \ell_{3,2} \star y_2`,
   :math:`y_4 := y_4 - \ell_{4,2} \star y_2`,
   :math:`y_5 := y_5 - \ell_{5,2} \star y_2`

4. :math:`y_4 := y_4 - \ell_{4,3} \star y_3`,
   :math:`y_5 := y_5 - \ell_{5,3} \star y_3`

5. :math:`y_5 := y_5 - \ell_{5,4} \star y_4`

In general, this leads to the algorithm

.. math::

   \begin{array}{l}
      {\bf y} := {\bf b} \\
      \mbox{for } i \mbox{ from 2 to } n \mbox{ do} \\
      \quad \mbox{for } j \mbox{ from } i \mbox{ to } n \mbox{ do} \\
      \quad \quad y_j := y_j - \ell_{j,i-1} \star y_{i-1}
   \end{array}

Observe that all instructions in the :math:`j` loop are independent
of each other!

Considering the inner loop in the algorithm to solve
:math:`L {\bf y} = {\bf b}`, we distribute the update of
:math:`y_i, y_{i+1}, \ldots, y_n` among :math:`p` processors.
If :math:`n \gg p`, then we expect a close to optimal speedup.

For our parallel solver for triangular systems
(sketches of the code follow this list):

* For :math:`L = [ \ell_{i,j} ]`, we generate random numbers
  :math:`\ell_{i,j} \in [0,1]`.
  The exact solution :math:`\bf y` has :math:`y_i = 1`,
  for :math:`i=1,2,\ldots,n`.
  We compute the right hand side :math:`{\bf b} = L {\bf y}`.

* Even in small dimensions, the condition number may grow
  exponentially, so hardware double precision is insufficient.
  Therefore, we use quad double arithmetic.

* We use a straightforward OpenMP implementation.
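A minimal sketch of this setup is shown below.  It is an illustration
and not the original test program: the name ``test_setup`` is ours,
it assumes a unit diagonal as in the forward substitution formulas,
and it works with ``double`` instead of ``qd_real`` to remain self
contained.

::

   #include <cstdlib>

   // fills L as a random unit lower triangular matrix with
   // off-diagonal entries in [0,1] and computes b = L*y for the
   // exact solution y with all components equal to one
   void test_setup ( int n, double **L, double *b )
   {
      for(int i=0; i<n; i++)
      {
         for(int j=0; j<i; j++)
            L[i][j] = ((double) rand())/RAND_MAX;

         L[i][i] = 1.0;            // unit diagonal, as in the formulas

         b[i] = 0.0;               // b[i] is the sum of the entries on
         for(int j=0; j<=i; j++)   // row i, as all y[j] equal one
            b[i] = b[i] + L[i][j];
      }
   }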
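The :math:`j` loop of the algorithm maps directly onto an OpenMP work
sharing directive.  The sketch below is again illustrative, with
``double`` instead of ``qd_real`` and a function name of our choosing;
it compiles with ``g++ -fopenmp``.

::

   #include <omp.h>

   // solves L*y = b by forward substitution, for a unit lower
   // triangular matrix L, following the algorithm above
   void forward_substitution ( int n, double **L, double *b, double *y )
   {
      for(int i=0; i<n; i++) y[i] = b[i];   // y := b

      for(int i=1; i<n; i++)   // 0-based version of i from 2 to n
      {
         // the updates of y[i], ..., y[n-1] are independent,
         // so they are distributed among the threads
         #pragma omp parallel for
         for(int j=i; j<n; j++)
            y[j] = y[j] - L[j][i-1]*y[i-1];
      }
   }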
In solving random lower triangular systems,
relying on hardware doubles is problematic:

::

   $ time ./trisol 10
   last number : 1.0000000000000009e+00

   real    0m0.003s
   user    0m0.001s
   sys     0m0.002s

   $ time ./trisol 100
   last number : 9.9999999999974221e-01

   real    0m0.005s
   user    0m0.001s
   sys     0m0.002s

   $ time ./trisol 1000
   last number : 2.7244600009080568e+04

   real    0m0.036s
   user    0m0.025s
   sys     0m0.009s

For a matrix of quad doubles, allocating data in the main program
is done in the code snippet below.

::

   qd_real b[n],y[n];
   qd_real **L;

   L = (qd_real**) calloc(n,sizeof(qd_real*));
   for(int i=0; i<n; i++)
      L[i] = (qd_real*) calloc(n,sizeof(qd_real));

Bibliography
------------

* Y. Hida, X.S. Li, and D.H. Bailey.
  **Algorithms for quad-double precision floating point arithmetic.**
  In *15th IEEE Symposium on Computer Arithmetic*, pages 155-162.
  IEEE, 2001.

* N.J. Higham. **Stability of parallel triangular system solvers.**
  *SIAM J. Sci. Comput.*, 16(2):400-413, 1995.

* M. Joldes, J.-M. Muller, V. Popescu, and W. Tucker.
  **CAMPARY: Cuda Multiple Precision Arithmetic Library and
  Applications.**
  In *Mathematical Software -- ICMS 2016, the 5th International
  Conference on Mathematical Software*, pages 232-240.
  Springer-Verlag, 2016.

* M. Lu, B. He, and Q. Luo.
  **Supporting extended precision on graphics processors.**
  In A. Ailamaki and P.A. Boncz, editors, *Proceedings of the Sixth
  International Workshop on Data Management on New Hardware
  (DaMoN 2010), June 7, 2010, Indianapolis, Indiana*, pages 19-26, 2010.

* W. Nasri and Z. Mahjoub.
  **Optimal parallelization of a recursive algorithm for triangular
  matrix inversion on MIMD computers.**
  *Parallel Computing*, 27:1767-1782, 2001.

* J. Verschelde.
  **Least squares on GPUs in multiple double precision.**
  In *The 2022 IEEE International Parallel and Distributed Processing
  Symposium Workshops (IPDPSW)*, pages 828-837. IEEE, 2022.

* D. Viswanath and L.N. Trefethen.
  **Condition numbers of random triangular matrices.**
  *SIAM J. Matrix Anal. Appl.*, 19(2):564-581, 1998.

Exercises
---------

1. Write a parallel solver with OpenMP to solve
   :math:`U {\bf x} = {\bf y}`.
   Take for :math:`U` a matrix with random numbers in :math:`[0,1]`,
   and compute :math:`\bf y` so that all components of :math:`\bf x`
   equal one.  Test the speedup of your program, for large enough
   values of :math:`n` and a varying number of cores.

2. Describe a parallel solver for upper triangular systems
   :math:`U {\bf y} = {\bf b}` for distributed memory computers.
   Write a prototype implementation using MPI and discuss its
   scalability.

3. Consider a tiled lower triangular system
   :math:`L {\bf x} = {\bf b}`.