Solving Triangular Systems

Triangular linear systems occur as the result of an LU or a QR decomposition.

Ill Conditioned Matrices and Quad Doubles

Consider the 4-by-4 lower triangular matrix

\[\begin{split}L = \left[ \begin{array}{rrrr} 1 & 0 & 0 & 0 \\ -2 & 1 & 0 & 0 \\ -2 & -2 & 1 & 0 \\ -2 & -2 & -2 & 1 \end{array} \right].\end{split}\]

What we know from numerical analysis:

  1. The condition number of a matrix magnifies roundoff errors.

  2. The hardware double precision is \(2^{-52} \approx 2.2 \times 10^{-16}\).

  3. We get no accuracy from condition numbers larger than \(10^{16}\).

An experiment in an interactive Julia session illustrates that ill conditioned matrices already occur in modest dimensions.

julia> using LinearAlgebra

julia> A = ones(32,32);

julia> D = Diagonal(A);

julia> L = LowerTriangular(A);

julia> LmD = L - D;

julia> L2 = D - 2*LmD;

julia> cond(L2)
2.41631630569077e16

The condition number is estimated at \(2.4 \times 10^{16}\).
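
Combining this estimate with the facts above: the standard rule of thumb bounds the relative error of a computed solution by the condition number times the unit roundoff,

\[\mbox{cond}(L_2) \cdot 2^{-52} \approx 2.4 \times 10^{16} \cdot 2.2 \times 10^{-16} \approx 5.3,\]

so already for this 32-by-32 matrix not even the leading digit of a solution computed in hardware doubles can be trusted.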

A floating-point number consists of a sign bit, an exponent, and a fraction (also known as the mantissa). Almost all microprocessors follow the IEEE 754 standard. GPU hardware supports 32-bit floats (single precision) and, for compute capability \(\geq 1.3\), also 64-bit double floats.

Numerical analysis studies algorithms for continuous problems: it examines problems for their sensitivity to errors in the input, and algorithms for their propagation of roundoff errors.

Floating-point addition is not associative! Parallel algorithms compute and accumulate results in an order that differs from their sequential versions. For example, adding a sequence of numbers is more accurate if the numbers are sorted in increasing order.
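
A minimal illustration in C++, with numbers chosen only to make the absorption visible (they have nothing to do with the triangular solvers):

#include <cstdio>

int main ( void )
{
   double a = 1.0, b = 1.0e100, c = -1.0e100;
   // in exact arithmetic both sums equal 1.0,
   // but a is absorbed when it is added to b first
   printf("(a + b) + c : %.1f\n", (a + b) + c);   // prints 0.0
   printf("a + (b + c) : %.1f\n", a + (b + c));   // prints 1.0
   return 0;
}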

Instead of speedup, we can ask questions about quality up:

  • If we can afford to keep the total running time constant, does a faster computer give us more accurate results?

  • How many more processors do we need to guarantee a result?

A quad double is an unevaluated sum of four doubles; it improves the working precision from \(2.2 \times 10^{-16}\) to \(2.4 \times 10^{-63}\). The software QDlib is presented in the paper Algorithms for quad-double precision floating point arithmetic by Y. Hida, X.S. Li, and D.H. Bailey, published in the 15th IEEE Symposium on Computer Arithmetic, pages 155-162, IEEE, 2001. The software is available at <http://crd.lbl.gov/~dhbailey/mpdist>.

A quad double builds on the double double. Some features of working with double doubles are:

  • The least significant part of a double double can be interpreted as a compensation for the roundoff error, as illustrated in the sketch after this list.

  • Predictable overhead: working with double doubles costs about the same as working with complex numbers.
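
As a minimal sketch of that compensation, consider the error-free transformation of a sum. The function two_sum below is written out only for illustration; QDlib builds its double double and quad double operations on such transformations.

#include <cstdio>

// error-free transformation of a sum (Knuth):
// s is the rounded sum fl(a+b) and err the exact roundoff,
// so that a + b == s + err holds exactly
void two_sum ( double a, double b, double &s, double &err )
{
   s = a + b;
   double z = s - a;
   err = (a - (s - z)) + (b - z);
}

int main ( void )
{
   double s, err;
   two_sum(1.0, 1.0e-17, s, err);
   printf("s   : %.17e\n", s);     // 1.0e-17 is absorbed in s ...
   printf("err : %.17e\n", err);   // ... but held exactly in err
   return 0;
}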

Consider Newton’s method to compute \(\sqrt{x}\), as defined in the code below.

#include <iostream>
#include <iomanip>
#include <qd/qd_real.h>
using namespace std;

qd_real newton ( qd_real x )
{
   // applies ten Newton steps to y*y - x = 0,
   // starting at y = x, to approximate sqrt(x)
   qd_real y = x;
   for(int i=0; i<10; i++)
      y -= (y*y - x)/(2.0*y);
   return y;
}

The main program is as follows.

int main ( int argc, char *argv[] )
{
   cout << "give x : ";
   qd_real x; cin >> x;
   cout << setprecision(64);
   cout << "        x : " << x << endl;

   qd_real y = newton(x);
   cout << "  sqrt(x) : " << y << endl;

   qd_real z = y*y;
   cout << "sqrt(x)^2 : " << z << endl;

   return 0;
}

If the program is in the file newton4sqrt.cpp and the makefile contains

QD_ROOT=/usr/local/qd-2.3.13
QD_LIB=/usr/local/lib

newton4sqrt:
        g++ -I$(QD_ROOT)/include newton4sqrt.cpp \
              $(QD_LIB)/libqd.a -o /tmp/newton4sqrt

then we can create the executable simply by typing make newton4sqrt. The successive approximations in a run for \(\sqrt{2}\) are shown below.

2.0000000000000000000000000000000000000000000000000000000000000000e+00
1.5000000000000000000000000000000000000000000000000000000000000000e+00
1.4166666666666666666666666666666666666666666666666666666666666667e+00
1.4142156862745098039215686274509803921568627450980392156862745098e+00
1.4142135623746899106262955788901349101165596221157440445849050192e+00
1.4142135623730950488016896235025302436149819257761974284982894987e+00
1.4142135623730950488016887242096980785696718753772340015610131332e+00
1.4142135623730950488016887242096980785696718753769480731766797380e+00
1.4142135623730950488016887242096980785696718753769480731766797380e+00
residual : 0.0000e+00

On a Parallel Shared Memory Computer with OpenMP

We will rewrite the formulas for forward substitution. Expanding the matrix-vector product \(L {\bf y}\) in \(L {\bf y} = {\bf b}\) leads to

\[\begin{split}\left\{ \begin{array}{lcr} y_1 & = & b_1 \\ \ell_{2,1} y_1 + y_2 & = & b_2 \\ \ell_{3,1} y_1 + \ell_{3,2} y_2 + y_3 & = & b_3 \\ & \vdots & \\ \ell_{n,1} y_1 + \ell_{n,2} y_2 + \ell_{n,3} y_3 + \cdots + \ell_{n,n-1} y_{n-1} + y_n & = & b_n \end{array} \right.\end{split}\]

and solving for the diagonal elements gives

\[\begin{split}\begin{array}{lcl} y_1 & = & b_1 \\ y_2 & = & b_2 - \ell_{2,1} y_1 \\ y_3 & = & b_3 - \ell_{3,1} y_1 - \ell_{3,2} y_2 \\ & \vdots & \\ y_n & = & b_n - \ell_{n,1} y_1 - \ell_{n,2} y_2 - \cdots - \ell_{n,n-1} y_{n-1} \\ \end{array}\end{split}\]

To rewrite the formulas, consider the case \(n = 5\). Solving \(L {\bf y} = {\bf b}\) for \(n = 5\) proceeds in the following steps:

  1. \({\bf y} := {\bf b}\)

  2. \(y_2 := y_2 - \ell_{2,1} \star y_1\)

    \(y_3 := y_3 - \ell_{3,1} \star y_1\)

    \(y_4 := y_4 - \ell_{4,1} \star y_1\)

    \(y_5 := y_5 - \ell_{5,1} \star y_1\)

  3. \(y_3 := y_3 - \ell_{3,2} \star y_2\)

    \(y_4 := y_4 - \ell_{4,2} \star y_2\)

    \(y_5 := y_5 - \ell_{5,2} \star y_2\)

  4. \(y_4 := y_4 - \ell_{4,3} \star y_3\)

    \(y_5 := y_5 - \ell_{5,3} \star y_3\)

  5. \(y_5 := y_5 - \ell_{5,4} \star y_4\)

For general \(n\), the algorithm is

\[\begin{split}\begin{array}{l} {\bf y} := {\bf b} \\ \mbox{for } i \mbox{ from 2 to } n \mbox{ do} \\ \quad \mbox{for } j \mbox{ from } i \mbox{ to } n \mbox{ do} \\ \quad \quad y_j := y_j - \ell_{j,i-1} \star y_{i-1} \end{array}\end{split}\]

Observe that all instructions in the \(j\) loop are independent of each other! Considering the inner loop in the algorithm to solve \(L {\bf y} = {\bf b}\), we distribute the update of \(y_i, y_{i+1}, \ldots, y_n\) among \(p\) processors. If \(n \gg p\), then we expect a close-to-optimal speedup.

For our parallel solver for triangular systems:

  • For \(L = [ \ell_{i,j} ]\), we generate random numbers for \(\ell_{i,j} \in [0,1]\).

    The exact solution \(\bf y\): \(y_i = 1\), for \(i=1,2,\ldots,n\).

    We compute the right hand side \({\bf b} = L {\bf y}\).

  • Already in modest dimensions, the condition number may grow exponentially.

    Hardware double precision is insufficient. Therefore, we use quad double arithmetic.

  • We use a straightforward OpenMP implementation.

In solving random lower triangular systems, relying on hardware doubles is problematic:

$ time ./trisol 10
last number : 1.0000000000000009e+00

real   0m0.003s    user   0m0.001s    sys   0m0.002s

$ time ./trisol 100
last number : 9.9999999999974221e-01

real   0m0.005s    user   0m0.001s    sys   0m0.002s

$ time ./trisol 1000
last number : 2.7244600009080568e+04

real   0m0.036s    user   0m0.025s    sys   0m0.009s

For a matrix of quad doubles, allocating data in the main program is done in the code snippet below.

qd_real b[n],y[n];

qd_real **L;
L = (qd_real**) calloc(n,sizeof(qd_real*));
for(int i=0; i<n; i++)
   L[i] = (qd_real*) calloc(n,sizeof(qd_real));

srand(time(NULL));
random_triangular_system(n,L,b);

Generating a random triangular system happens through the function defined next.

void random_triangular_system
 ( int n, qd_real **L, qd_real *b )
// generates a random lower triangular L with ones on the diagonal
// and entries in [0,1] below it, and sets b to the row sums of L,
// so the exact solution of L*y = b has all its components equal to one
{
   for(int i=0; i<n; i++)
   {
      L[i][i] = 1.0;
      for(int j=0; j<i; j++)
      {
         double r = ((double) rand())/RAND_MAX;
         L[i][j] = qd_real(r);
      }
      for(int j=i+1; j<n; j++)
         L[i][j] = qd_real(0.0);
   }
   for(int i=0; i<n; i++)
   {
      b[i] = qd_real(0.0);
      for(int j=0; j<n; j++)
         b[i] = b[i] + L[i][j];
   }
}

Then the triangular system is solved by the following code.

void solve_triangular_system_swapped
 ( int n, qd_real **L, qd_real *b, qd_real *y )
// solves L*y = b by forward substitution, with the loops arranged
// so that all updates in the j loop are independent of each other
{
   for(int i=0; i<n; i++) y[i] = b[i];

   for(int i=1; i<n; i++)
   {
      for(int j=i; j<n; j++)
         y[j] = y[j] - L[j][i-1]*y[i-1];
   }
}

Using OpenMP, we add directives to create a parallel region and to declare which variables are shared and which are private.

void solve_triangular_system_swapped
 ( int n, qd_real **L, qd_real *b, qd_real *y )
{
   int j;

   for(int i=0; i<n; i++) y[i] = b[i];

   for(int i=1; i<n; i++)
   {
      #pragma omp parallel shared(L,y) private(j)
      {
         #pragma omp for
         for(j=i; j<n; j++)
            y[j] = y[j] - L[j][i-1]*y[i-1];
      }
   }
}

For dimension \(n = 8,000\) and a varying number \(p\) of cores, Table 21 summarizes runs of time ./trisol_qd_omp n p on a 12-core Intel X5690 at 3.47 GHz.

Table 21 Solving a triangular system with p cores.

 p    cpu time       real       user        sys
 1     21.240s    35.095s    34.493s     0.597s
 2     22.790s    25.237s    36.001s     0.620s
 4     22.330s    19.433s    35.539s     0.633s
 8     23.200s    16.726s    36.398s     0.611s
12     23.260s    15.781s    36.457s     0.626s

The serial part is the generation of the random numbers for \(L\) and the computation of \({\bf b} = L {\bf y}\). Recall Amdahl’s Law.

We can compute the serial time by subtracting, for \(p=1\), the cpu time spent in the solver from the real time: \(35.095 - 21.240 = 13.855\). For \(p = 12\), the time spent in the solver is \(15.781 - 13.855 = 1.926\), which compares well to \(21.240/12 = 1.770\).
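
As a worked check against Amdahl’s Law, using only the numbers in Table 21, the predicted total time on 12 cores is the serial time plus the ideally parallelized solver time,

\[13.855 + \frac{21.240}{12} = 13.855 + 1.770 = 15.625,\]

which is close to the observed 15.781 seconds; the large serial fraction limits the speedup to about \(35.095/15.781 \approx 2.2\).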

Accelerated Back Substitution

Consider a 3-by-3-tiled upper triangular system \(U {\bf x} = {\bf b}\).

\[\begin{split}U = \left[ \begin{array}{ccc} U_1 & A_{1,2} & A_{1,3} \\ & U_2 & A_{2,3} \\ & & U_3 \end{array} \right], \quad {\bf x} = \left[ \begin{array}{c} {\bf x}_1 \\ {\bf x}_2 \\ {\bf x}_3 \end{array} \right], \quad {\bf b} = \left[ \begin{array}{c} {\bf b}_1 \\ {\bf b}_2 \\ {\bf b}_3 \end{array} \right],\end{split}\]

where \(U_1\), \(U_2\), \(U_3\) are upper triangular, with nonzero diagonal elements.

Invert all diagonal tiles:

\[\begin{split}\left[ \begin{array}{ccc} U_1^{-1} & A_{1,2} & A_{1,3} \\ & U_2^{-1} & A_{2,3} \\ & & U_3^{-1} \end{array} \right].\end{split}\]
  • The inverse of an upper triangular matrix is upper triangular.

  • Solve an upper triangular system for each column of the inverse.

  • The columns of the inverse can be computed independently.

\(\Rightarrow\) Solve many smaller upper triangular systems in parallel.

After inverting the diagonal tiles, in the second stage, we solve \(U {\bf x} = {\bf b}\) for

\[\begin{split}U = \left[ \begin{array}{ccc} U_1^{-1} & A_{1,2} & A_{1,3} \\ & U_2^{-1} & A_{2,3} \\ & & U_3^{-1} \end{array} \right]\end{split}\]

in the following steps:

  1. \({\bf x}_3 := U_3^{-1} {\bf b}_3\),

  2. \({\bf b}_2 := {\bf b}_2 - A_{2,3} {\bf x}_3\), \({\bf b}_1 := {\bf b}_1 - A_{1,3} {\bf x}_3\),

  3. \({\bf x}_2 := U_2^{-1} {\bf b}_2\),

  4. \({\bf b}_1 := {\bf b}_1 - A_{1,2} {\bf x}_2\),

  5. \({\bf x}_1 := U_1^{-1} {\bf b}_1\).

Statements on the same line can be executed in parallel. In multiple double precision, several blocks of threads collaborate in the computation of one matrix-vector product.

The two stages are executed by three kernels.

Algorithm 1: Tiled Accelerated Back Substitution.

On input are the following:

  • \(N\) is the number of tiles,

  • \(n\) is the size of each tile,

  • \(U\) is an upper triangular \(Nn\)-by-\(Nn\) matrix,

  • \({\bf b}\) is a vector of size \(Nn\).

The output is \({\bf x}\), a vector of size \(Nn\), with \(U {\bf x} = {\bf b}\).

  1. Let \(U_1, U_2, \ldots, U_N\) be the diagonal tiles.

    The k-th thread solves \(U_i {\bf v}_k = {\bf e}_k\), computing the k-th column of \(U_i^{-1}\).

  2. For \(i=N, N-1, \ldots,1\) do

    1. \(n\) threads compute \({\bf x}_i = U_i^{-1} {\bf b}_i\);

    2. simultaneously update \({\bf b}_j\) with \({\bf b}_j - A_{j,i} {\bf x}_i\), \(j \in \{1,2,\ldots,i-1\}\) with \(i-1\) blocks of \(n\) threads.

A parallel execution could run in time proportional to \(N n\).
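
To make the two stages concrete, the sketch below is a minimal sequential version of Algorithm 1 in plain C++ with hardware doubles. The function names are illustrative; in the actual GPU code the column solves in tile_inverse and the updates of the \({\bf b}_j\) run in parallel over threads and blocks, in multiple double arithmetic.

#include <vector>
using std::vector;

// U is an (N*n)-by-(N*n) upper triangular matrix, stored row major;
// tile (I,J) occupies rows I*n .. I*n+n-1 and columns J*n .. J*n+n-1.

// stage 1 for one tile: computes the inverse of diagonal tile I,
// solving U_I v = e_k for every column k by back substitution
void tile_inverse ( int N, int n, const vector<double> &U, int I,
                    vector<double> &Uinv )
{
   const int dim = N*n, off = I*n;
   Uinv.assign(n*n, 0.0);
   for(int k=0; k<n; k++)                 // on the GPU: thread k
      for(int i=n-1; i>=0; i--)
      {
         double s = (i == k ? 1.0 : 0.0);
         for(int j=i+1; j<n; j++)
            s -= U[(off+i)*dim + (off+j)]*Uinv[j*n + k];
         Uinv[i*n + k] = s/U[(off+i)*dim + (off+i)];
      }
}

// stage 2: for I = N-1 down to 0, x_I := inv(U_I)*b_I and
// b_J := b_J - A_{J,I}*x_I for all J < I;
// b is passed by value because it is overwritten by the updates
void tiled_back_substitution
 ( int N, int n, const vector<double> &U,
   vector<double> b, vector<double> &x )
{
   const int dim = N*n;
   vector< vector<double> > inv(N);
   for(int I=0; I<N; I++) tile_inverse(N, n, U, I, inv[I]);

   x.assign(dim, 0.0);
   for(int I=N-1; I>=0; I--)
   {
      for(int r=0; r<n; r++)              // x_I = inv(U_I)*b_I
      {
         double s = 0.0;
         for(int c=0; c<n; c++) s += inv[I][r*n+c]*b[I*n+c];
         x[I*n+r] = s;
      }
      for(int J=0; J<I; J++)              // independent updates of b_J
         for(int r=0; r<n; r++)
         {
            double s = 0.0;
            for(int c=0; c<n; c++)
               s += U[(J*n+r)*dim + (I*n+c)]*x[I*n+c];
            b[J*n+r] -= s;
         }
   }
}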

For efficiency, we must stage the data right. A matrix \(U\) of multiple doubles is stored as \([U_1, U_2, \ldots, U_m]\),

  • \(U_1\) holds the most significant doubles of \(U\),

  • \(U_m\) holds the least significant doubles of \(U\).

Similarly, \({\bf b}\) is an array of \(m\) arrays \([{\bf b}_1, {\bf b}_2, \ldots, {\bf b}_m]\), sorted in the order of significance. In complex data, real and imaginary parts are stored separately.

The main advantages of this representation are twofold:

  • it facilitates the staggered application of multiple double arithmetic,

  • it benefits memory coalescing: adjacent threads in one block of threads read/write adjacent data in memory, avoiding bank conflicts; see the layout sketch after this list.
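
The sketch below shows this layout in plain C++ with illustrative names; in the GPU code the same arrangement lives in device memory, with one thread per vector entry.

#include <vector>

// a vector of n multiple doubles with m parts, stored by significance:
// part[0] holds the most significant doubles, part[m-1] the least
// significant ones; thread i of a block reads part[k][i], so
// adjacent threads access adjacent memory locations
struct multi_double_vector
{
   int n, m;
   std::vector< std::vector<double> > part;   // part[k] has length n

   multi_double_vector ( int dim, int parts )
    : n(dim), m(parts), part(parts, std::vector<double>(dim, 0.0)) {}
};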

In the experimental setup, about the input matrices:

  • Random numbers are generated for the input matrices.

  • Condition numbers of random triangular matrices almost surely grow exponentially [Viswanath and Trefethen, 1998].

  • In the standalone tests, the upper triangular matrices are the U factors of an LU factorization of a random matrix, computed by the host.

Two input parameters are set for every run:

  • The size of each tile is the number of threads in a block. The tile size is a multiple of 32.

  • The number of tiles equals the number of blocks. As the V100 has 80 streaming multiprocessors, the number of tiles is at least 80.

The flops in Table 22 and Table 23 are in Gigaflops; the times, as in Table 24, are in milliseconds.

Table 22 Back substitution in double double precision on the V100.

stage in Algorithm 1      64 × 80   128 × 80   256 × 80
invert diagonal tiles         1.2        9.3       46.3
multiply with inverses        1.7        3.3        8.9
back substitution             7.9        4.7       12.2
time spent by kernels         5.0       17.3       67.4
wall clock time              82.0      286.0      966.0
kernel time flops           190.6      318.7      525.1
wall clock flops             11.7       19.2       36.7

Table 23 Back substitution in quad double precision on the V100.

stage in Algorithm 1      64 × 80   128 × 80   256 × 80
invert diagonal tiles         6.2       38.3      137.4
multiply with inverses       12.2       23.8       63.1
back substitution            13.3       26.7      112.2
time spent by kernels        31.7       88.8      312.7
wall clock time             187.0      619.0     2268.0
kernel time flops           299.4      614.2     1122.3
wall clock flops             50.8       88.1      154.8

Consider the effect of doubling the dimension and doubling the precision.

  1. Doubling the dimension, we expect the time to quadruple.

  2. Doubling the precision: from double double to quad double, the cost multiplier is 11.7; from quad double to octo double, runs take 5.4 times longer.

Fig. 94 2-logarithms of the times on the V100 in three precisions.

In Fig. 94, the heights of the bars are closer to each other in higher dimensions.

The V100 has 80 multiprocessors; its theoretical peak performance is 1.68 times that of the P100.

The value of \(N\) is fixed at 80 and \(n\) runs from 32 to 256; see Fig. 95.

Fig. 95 kernel times in quad double precision on three GPUs.

In Fig. 95, observe the heights of the bars as the dimensions double and the relative performance of the three different GPUs.

Considering \(20480 = 320 \times 64 = 160 \times 128 = 80 \times 256\), we run back substitution in quad double precision on the V100 for three combinations of \(N\) and \(n\) with \(N \times n = 20480\). The results are summarized in Table 24.

Table 24 Back substitution in quad double precision.

stage in Algorithm 1     320 × 64   160 × 128   80 × 256
invert diagonal tiles        13.5        35.8      132.3
multiply with inverses       49.0        47.5       64.3
back substitution            84.6        91.7      112.3
time spent by kernels       147.1       175.0      308.9
wall clock time            2620.0      2265.0     2071.0
kernel time flops           683.0       861.1     1136.1
wall clock flops             38.3        66.5      169.5

All times in Table 24 are in milliseconds; the flops are in Gigaflops.

Bibliography

  • T.J. Dekker. A floating-point technique for extending the available precision. Numerische Mathematik, 18(3):224-242, 1971.

  • D. H. Heller. A survey of parallel algorithms in numerical linear algebra. SIAM Review, 20(4):740-777, 1978.

  • Y. Hida, X.S. Li, and D.H. Bailey. Algorithms for quad-double precision floating point arithmetic. In 15th IEEE Symposium on Computer Arithmetic, pages 155-162. IEEE, 2001. Software at <http://crd.lbl.gov/~dhbailey/mpdist>.

  • N. J. Higham. Stability of parallel triangular system solvers. SIAM J. Sci. Comput., 16(2):400-413, 1995.

  • M. Joldes, J.-M. Muller, V. Popescu, W. Tucker. CAMPARY: Cuda Multiple Precision Arithmetic Library and Applications. In Mathematical Software – ICMS 2016, the 5th International Conference on Mathematical Software, pages 232-240, Springer-Verlag, 2016.

  • M. Lu, B. He, and Q. Luo. Supporting extended precision on graphics processors. In A. Ailamaki and P.A. Boncz, editors, Proceedings of the Sixth International Workshop on Data Management on New Hardware (DaMoN 2010), June 7, 2010, Indianapolis, Indiana, pages 19-26, 2010. Software at <https://code.google.com/archive/p/gpuprec>.

  • W. Nasri and Z. Mahjoub. Optimal parallelization of a recursive algorithm for triangular matrix inversion on MIMD computers. Parallel Computing, 27:1767-1782, 2001.

  • J. Verschelde. Least squares on GPUs in multiple double precision. In The 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 828-837. IEEE, 2022. Code at <https://github.com/janverschelde/PHCpack/src/GPU>.

  • D. Viswanath and L. N. Trefethen. Condition numbers of random triangular matrices. SIAM J. Matrix Anal. Appl., 19(2):564-581, 1998.

Exercises

  1. Write a parallel solver with OpenMP to solve \(U {\bf x} = {\bf y}\).

    Take for \(U\) a matrix with random numbers in \([0,1]\), compute \(\bf y\) so all components of \(\bf x\) equal one. Test the speedup of your program, for large enough values of \(n\) and a varying number of cores.

  2. Describe a parallel solver for upper triangular systems \(U {\bf y} = {\bf b}\) for distributed memory computers. Write a prototype implementation using MPI and discuss its scalability.

  3. Consider a tiled lower triangular system \(L {\bf x} = {\bf b}\).