GPU Accelerated Newton's Method for Taylor Series
=================================================

The last application comes from computational algebra.

Problem Statement and Overview
------------------------------

Given are

1. a system of polynomials in several variables, where the coefficients
   are truncated power series in one variable; and

2. one vector of truncated power series as an approximate solution.

We assume that the problem is regular, that is: the series are Taylor series.

The difficulty of running Newton's method is determined by

1. :math:`N`, the number of polynomials in the system;

2. :math:`n`, the number of variables in each polynomial;

3. :math:`d`, the degree of truncation of the series; and

4. :math:`\epsilon`, the working precision.

The main questions concerning the scalability are below.

1. What are the dimensions :math:`(N, n, d, \epsilon)` to fully occupy the GPU?

2. What is the cost of evaluation and differentiation on a GPU
   compared to the overall cost of one Newton step?

3. Can GPU acceleration compensate for the cost overhead caused by
   multiple double arithmetic?  For which :math:`\epsilon`?

Working with truncated power series, computing modulo :math:`O(t^d)`,
is doing arithmetic over the ring of formal power series
:math:`{\mathbb C}[[t]]`.  Linearization is applied:
we consider :math:`{\mathbb C}^n[[t]]` instead of :math:`{\mathbb C}[[t]]^n`.
Instead of a vector of power series,
we consider a power series with vectors as coefficients.

What needs to be solved is :math:`{\bf A} {\bf x} = {\bf b}`,
:math:`{\bf A} \in {\mathbb C}^{n \times n}[[t]]`,
:math:`{\bf b}, {\bf x} \in {\mathbb C}^n[[t]]`.  For example:

.. math::

   \left[
   \begin{array}{cccc}
   A_0 &     &     &     \\
   A_1 & A_0 &     &     \\
   A_2 & A_1 & A_0 &     \\
   A_3 & A_2 & A_1 & A_0 \\
   \end{array}
   \right]
   \left[
   \begin{array}{c}
   {\bf x}_0 \\ {\bf x}_1 \\ {\bf x}_2 \\ {\bf x}_3
   \end{array}
   \right]
   =
   \left[
   \begin{array}{c}
   {\bf b}_0 \\ {\bf b}_1 \\ {\bf b}_2 \\ {\bf b}_3
   \end{array}
   \right]

linearizes :math:`{\bf A}(t) {\bf x}(t) = {\bf b}(t)`, with

.. math::

   {\bf A}(t) = A_0 + A_1 t + A_2 t^2 + A_3 t^3,

.. math::

   {\bf x}(t) = {\bf x}_0 + {\bf x}_1 t + {\bf x}_2 t^2 + {\bf x}_3 t^3,

.. math::

   {\bf b}(t) = {\bf b}_0 + {\bf b}_1 t + {\bf b}_2 t^2 + {\bf b}_3 t^3.

Data Staging for Evaluation and Differentiation
-----------------------------------------------

The convolution

.. math::

   \left( x_0 + x_1 t + \cdots + x_d t^d \right) \star
   \left( y_0 + y_1 t + \cdots + y_d t^d \right)

has coefficients

.. math::

   z_k = \sum_{i=0}^k x_i \cdot y_{k-i}, \quad k = 0,1,\ldots,d.

In the accelerated version, thread :math:`k` computes :math:`z_k`.
To avoid thread divergence, pad with zeros, e.g., for :math:`d=3`,
consider :numref:`figzeropadding`.

.. _figzeropadding:

.. figure:: ./figzeropadding.png
   :align: center

   Padding the second argument in a convolution with zeros introduces
   regularity: every thread performs the same number of operations.

The arrays :math:`X`, :math:`Y`, and :math:`Z` shown in
:numref:`figzeropadding` are in shared memory.
After loading the coefficients of the series into shared memory arrays,
thread :math:`k` computes

.. math::

   z_k = \sum_{i=0}^k x_i \cdot y_{k-i}, \quad k = 0,1,\ldots,d.

Pseudo code is listed below:

1. :math:`X_k := x_k`

2. :math:`Y_k := 0`

3. :math:`Y_{d+k} := y_k`

4. :math:`Z_k := X_0 \cdot Y_{d+k}`

5. for :math:`i` from 1 to :math:`d` do :math:`Z_k := Z_k + X_i \cdot Y_{d+k-i}`

6. :math:`z_k := Z_k`
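The pseudo code maps almost directly onto a CUDA kernel.
The sketch below assumes real double coefficients, one block per convolution,
and ``dim`` :math:`= d+1` threads per block;
the kernel name and the use of dynamic shared memory are choices made here
for illustration only (the actual code also handles complex data and
multiple double arithmetic).

.. code-block:: cuda

   __global__ void padded_convolution
    ( int dim, const double *x, const double *y, double *z )
   {
      extern __shared__ double shmem[];  // 4*dim doubles of shared memory
      double *X = shmem;                 // X[0..dim-1]
      double *Y = &shmem[dim];           // Y[0..2*dim-1], first dim entries zero
      double *Z = &shmem[3*dim];         // Z[0..dim-1]

      const int k = threadIdx.x;         // thread k computes z[k]

      X[k] = x[k];
      Y[k] = 0.0;                        // the padding with zeros
      Y[dim+k] = y[k];
      __syncthreads();

      double result = X[0]*Y[dim+k];
      for(int i=1; i<dim; i++)           // every thread does dim multiplications
         result = result + X[i]*Y[dim+k-i];

      Z[k] = result;
      z[k] = Z[k];
   }

   // launch for one convolution of series truncated at degree d = dim-1:
   // padded_convolution<<<1,dim,4*dim*sizeof(double)>>>(dim,x_d,y_d,z_d);

For :math:`i > k`, the index :math:`d+k-i` falls in the zero padded part of
:math:`Y`, so every thread runs the same loop without branching.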
Evaluating and differentiating one monomial
:math:`a~\!x_1 x_2 x_3 x_4 x_5` at a power series vector:

.. math::

   \begin{array}{lll}
   f_1 := a \star z_1   & b_1 := z_5 \star z_4 &                       \\
   f_2 := f_1 \star z_2 & b_2 := b_1 \star z_3 &                       \\
   f_3 := f_2 \star z_3 & b_3 := b_2 \star z_2 & c_1 := f_1 \star b_2  \\
   f_4 := f_3 \star z_4 & b_3 := b_3 \star a   & c_2 := f_2 \star b_1  \\
   f_5 := f_4 \star z_5 &                      & c_3 := f_3 \star z_5
   \end{array}

For example, :math:`c_2 = f_2 \star b_1 = a z_1 z_2 \star z_5 z_4`
is the derivative with respect to :math:`x_3`.

* The gradient is in :math:`(b_3, c_1, c_2, c_3, f_4)`
  and the value is in :math:`f_5`.

* Each :math:`\star` is a convolution between two truncated power series.

* Statements on the same line can be computed in parallel.

.. topic:: Cost of accelerated monomial evaluation and differentiation.

   Given sufficiently many blocks of threads, monomial evaluation and
   differentiation takes :math:`n` steps for :math:`n` variables.

Consider then the accelerated algorithmic differentiation of one polynomial,
as in the example below.

.. math::

   \begin{array}{rccccccc}
   p & = ~ a_0~+~ & a_1 x_1 x_3 x_6 & + & a_2 x_1 x_2 x_5 x_6 & + & a_3 x_2 x_3 x_4 \\
     & & f_{1,1} := a_1 \star z_1     & & f_{2,1} := a_2 \star z_1     & & f_{3,1} := a_3 \star z_2 \\
     & & f_{1,2} := f_{1,1} \star z_3 & & f_{2,2} := f_{2,1} \star z_2 & & f_{3,2} := f_{3,1} \star z_3 \\
     & & f_{1,3} := f_{1,2} \star z_6 & & f_{2,3} := f_{2,2} \star z_5 & & f_{3,3} := f_{3,2} \star z_4 \\
     & &                              & & f_{2,4} := f_{2,3} \star z_6 & & \\
     & & b_{1,1} := z_6 \star z_3     & & b_{2,1} := z_6 \star z_5     & & b_{3,1} := z_4 \star z_3 \\
     & & b_{1,1} := b_{1,1} \star a_1 & & b_{2,2} := b_{2,1} \star z_2 & & b_{3,1} := b_{3,1} \star a_3 \\
     & &                              & & b_{2,2} := b_{2,2} \star a_2 & & \\
     & & c_{1,1} := f_{1,1} \star z_6 & & c_{2,1} := f_{2,1} \star b_{2,1} & & c_{3,1} := f_{3,1} \star z_4 \\
     & &                              & & c_{2,2} := f_{2,2} \star z_6 & &
   \end{array}

The 21 convolutions are arranged below
(statements on the same line are executed in parallel):

.. math::

   \begin{array}{ccccccccc}
   f_{1,1} & b_{1,1} &         & f_{2,1} & b_{2,1} &         & f_{3,1} & b_{3,1} &         \\
   f_{1,2} & b_{1,1} & c_{1,1} & f_{2,2} & b_{2,2} & c_{2,1} & f_{3,2} & b_{3,1} & c_{3,1} \\
   f_{1,3} &         &         & f_{2,3} & b_{2,2} & c_{2,2} & f_{3,3} &         &         \\
           &         &         & f_{2,4} &         &         &         &         &         \\
   \end{array}

For efficient computation, we must take care of the staging of the data.
For the example, the 21 convolutions

.. math::

   \begin{array}{ccccccccc}
   f_{1,1} & b_{1,1} &         & f_{2,1} & b_{2,1} &         & f_{3,1} & b_{3,1} &         \\
   f_{1,2} & b_{1,1} & c_{1,1} & f_{2,2} & b_{2,2} & c_{2,1} & f_{3,2} & b_{3,1} & c_{3,1} \\
   f_{1,3} &         &         & f_{2,3} & b_{2,2} & c_{2,2} & f_{3,3} &         &         \\
           &         &         & f_{2,4} &         &         &         &         &         \\
   \end{array}

are stored after the power series coefficients
:math:`a_0`, :math:`a_1`, :math:`a_2`, :math:`a_3`,
and the input power series
:math:`z_1`, :math:`z_2`, :math:`z_3`, :math:`z_4`, :math:`z_5`, :math:`z_6`
in the array :math:`A` in :numref:`figdatastagingarray`.

.. _figdatastagingarray:

.. figure:: ./figdatastagingarray.png
   :align: center

   One array :math:`A` to stage all coefficients, the values of the variables,
   and the forward, backward, and cross products for the evaluation
   and the differentiation.
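To make the staging concrete, the sketch below computes start positions in
one flat array ``A`` laid out as in :numref:`figdatastagingarray`:
first the coefficient series, then the input series, and then, per monomial,
its forward, backward, and cross product series.
The type names and the function ``layout`` are hypothetical,
introduced only for this illustration; real double data is assumed.

.. code-block:: cuda

   #include <cstdio>
   #include <vector>

   struct MonomialSlots     // numbers of series one monomial needs in A
   {
      int nfwd;             // forward products f(i,1), f(i,2), ...
      int nbck;             // backward products b(i,1), ...
      int ncrs;             // cross products c(i,1), ...
   };

   struct MonomialOffsets   // start positions (in doubles) of those series in A
   {
      int fwd, bck, crs;
   };

   std::vector<MonomialOffsets> layout
    ( int dim, int nbrcff, int nbrvar,
      const std::vector<MonomialSlots> &mons, int *totalsize )
   {
      std::vector<MonomialOffsets> result(mons.size());
      int pos = dim*(nbrcff + nbrvar); // coefficients and input series come first

      for(size_t i=0; i<mons.size(); i++)
      {
         result[i].fwd = pos; pos = pos + dim*mons[i].nfwd;
         result[i].bck = pos; pos = pos + dim*mons[i].nbck;
         result[i].crs = pos; pos = pos + dim*mons[i].ncrs;
      }
      *totalsize = pos;
      return result;
   }

   int main ( void )
   {
      // the example polynomial p : monomials with 3, 4, and 3 variables
      std::vector<MonomialSlots> mons = {{3,1,1}, {4,2,2}, {3,1,1}};
      int size;                          // dim = 4 : series truncated at degree 3
      std::vector<MonomialOffsets> off = layout(4, 4, 6, mons, &size);

      for(size_t i=0; i<off.size(); i++)
         printf("monomial %d : f at %d, b at %d, c at %d\n",
                (int)i, off[i].fwd, off[i].bck, off[i].crs);
      printf("total size of A : %d doubles\n", size);

      return 0;
   }

A convolution job then only needs the start positions of its two inputs
and of its output in ``A``, which is how the jobs are defined next.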
Observe in :numref:`figdatastagingarray`:

* The arrows point at the start position of the input series and at the
  forward :math:`f_{i,j}`, backward :math:`b_{i,j}`, and cross products
  :math:`c_{i,j}` for every monomial :math:`i`.

* For complex and multiple double numbers, there are multiple data arrays:
  for the real and imaginary parts, and for each double in a multiple double.

One block of threads does one convolution.
It takes four kernel launches to compute the 21 convolutions

.. math::

   \begin{array}{ccccccccc}
   f_{1,1} & b_{1,1} &         & f_{2,1} & b_{2,1} &         & f_{3,1} & b_{3,1} &         \\
   f_{1,2} & b_{1,1} & c_{1,1} & f_{2,2} & b_{2,2} & c_{2,1} & f_{3,2} & b_{3,1} & c_{3,1} \\
   f_{1,3} &         &         & f_{2,3} & b_{2,2} & c_{2,2} & f_{3,3} &         &         \\
           &         &         & f_{2,4} &         &         &         &         &         \\
   \end{array}

respectively with 6, 9, 5, and 1 block of threads.

.. topic:: Cost of accelerated convolutions for polynomial evaluation and differentiation.

   Given sufficiently many blocks of threads, all convolutions for the
   polynomial evaluation and differentiation take :math:`m` steps,
   where :math:`m` is the largest number of variables in one monomial.

A convolution job :math:`j` is defined by the triplet
:math:`(t_1(j), t_2(j), t_3(j))`, where

* :math:`t_1(j)` and :math:`t_2(j)` are the locations in the array :math:`A`
  of the input, and

* :math:`t_3(j)` is the location in :math:`A` of the output of the convolution.

One block of threads does one addition of two power series.
An addition job :math:`j` is defined by the pair :math:`(t_1(j), t_2(j))`, where

* :math:`t_1(j)` is the location in the array :math:`A` of the input, and

* :math:`t_2(j)` is the location in the array :math:`A` of the update.

Jobs are defined recursively, following the tree summation algorithm.
The job index :math:`j` corresponds to the block index in the kernel launch.

.. topic:: Cost of accelerated polynomial evaluation and differentiation.

   Given sufficiently many blocks of threads, polynomial evaluation and
   differentiation takes :math:`m + \lceil \log_2(N) \rceil` steps,
   where :math:`m` is the largest number of variables in one monomial
   and :math:`N` is the number of monomials.
   For :math:`N \gg m`, an upper bound for the speedup is
   :math:`d N/\log_2(N)`, where :math:`d` is the degree at which
   the series are truncated.

The algorithmic differentiation was tested on five different GPUs,
with their specifications in :numref:`tabfiveGPUs`.
For the multiple double arithmetic, code generated by the CAMPARY software
(by M. Joldes, J.-M. Muller, V. Popescu, and W. Tucker, presented at ICMS 2016)
was used.

.. _tabfiveGPUs:

.. table:: specifications of five different GPUs.

   +------------------+------+-----+-----------+--------+------+
   | NVIDIA GPU       | CUDA | #MP | #cores/MP | #cores | GHz  |
   +==================+======+=====+===========+========+======+
   | Tesla C2050      | 2.0  | 14  | 32        | 448    | 1.15 |
   +------------------+------+-----+-----------+--------+------+
   | Kepler K20C      | 3.5  | 13  | 192       | 2496   | 0.71 |
   +------------------+------+-----+-----------+--------+------+
   | Pascal P100      | 6.0  | 56  | 64        | 3584   | 1.33 |
   +------------------+------+-----+-----------+--------+------+
   | Volta V100       | 7.0  | 80  | 64        | 5120   | 1.91 |
   +------------------+------+-----+-----------+--------+------+
   | GeForce RTX 2080 | 7.5  | 46  | 64        | 2944   | 1.10 |
   +------------------+------+-----+-----------+--------+------+

The peak double precision performance of the P100 is 4.7 TFLOPS.
At 7.9 TFLOPS, the V100 is 1.68 times faster than the P100.
To evaluate the algorithms, compare the ratios of the wall clock times
on the P100 over the V100 with the factor 1.68.

Three polynomials :math:`p_1`, :math:`p_2`, and :math:`p_3`
were used for testing.
For each polynomial:

* :math:`n` is the total number of variables,

* :math:`m` is the number of variables per monomial, and

* :math:`N` is the number of monomials (not counting the constant term).

The last two columns of :numref:`tabtestpolynomials` list the numbers
of convolution and addition jobs.

.. _tabtestpolynomials:

.. table:: characteristics of the test polynomials.

   +-------------+-----+-----+-------+--------+--------+
   |             | *n* | *m* | *N*   | #cnv   | #add   |
   +=============+=====+=====+=======+========+========+
   | :math:`p_1` | 16  | 4   | 1,820 | 16,380 | 9,084  |
   +-------------+-----+-----+-------+--------+--------+
   | :math:`p_2` | 128 | 64  | 128   | 24,192 | 8,192  |
   +-------------+-----+-----+-------+--------+--------+
   | :math:`p_3` | 128 | 2   | 8,128 | 24,256 | 24,256 |
   +-------------+-----+-----+-------+--------+--------+

Does the shape of the test polynomials influence the execution times?

In evaluating :math:`p_1` for degree :math:`d = 152` in deca double precision,
``cudaEventElapsedTime`` measures the times of the kernel launches.
The last line is the wall clock time for all convolution and addition kernels.
All units in :numref:`tabperf5GPUs` are milliseconds.

.. _tabperf5GPUs:

.. table:: performance of the first test polynomial.

   +-------------+----------+----------+---------+--------+----------+
   |             | C2050    | K20C     | P100    | V100   | RTX 2080 |
   +=============+==========+==========+=========+========+==========+
   | convolution | 12947.26 | 11290.22 | 1060.03 | 634.29 | 10002.32 |
   +-------------+----------+----------+---------+--------+----------+
   | addition    | 10.72    | 11.13    | 1.37    | 0.77   | 5.01     |
   +-------------+----------+----------+---------+--------+----------+
   | sum         | 12957.98 | 11301.35 | 1061.40 | 635.05 | 10007.34 |
   +-------------+----------+----------+---------+--------+----------+
   | wall clock  | 12964.00 | 11309.00 | 1066.00 | 640.00 | 10024.00 |
   +-------------+----------+----------+---------+--------+----------+

* The ratio :math:`12964/640 \approx 20.26` compares the V100
  with the oldest GPU, the C2050.

* Compare the ratio of the wall clock times of the P100 over the V100,
  :math:`1066/640 \approx 1.67`, with the ratio of the theoretical peak
  double precision performances of the V100 over the P100:
  :math:`7.9/4.7 \approx 1.68`.

In 1.066 seconds, the P100 did 16,380 convolutions and 9,084 additions.

* One addition in deca double precision requires 139 additions and
  258 subtractions of doubles, while one deca double multiplication
  requires 952 additions, 1743 subtractions, and 394 multiplications.

* One convolution of two series truncated at degree :math:`d` requires
  :math:`(d+1)^2` multiplications and :math:`d(d+1)` additions.

* One addition of two series truncated at degree :math:`d` requires
  :math:`d+1` additions.

So we have :math:`16,380 (d+1)^2` multiplications and
:math:`16,380\, d(d+1) + 9,084 (d+1)` additions,
all in deca double precision.

* One multiplication and one addition in deca double precision require
  respectively 3,089 and 397 double operations.

* Multiplied by 3,089, the :math:`16,380 (d+1)^2` multiplications amount to
  1,184,444,368,380 double operations; multiplied by 397, the
  :math:`16,380\, d(d+1) + 9,084 (d+1)` additions amount to
  151,782,283,404 double operations.

In total, in 1.066 seconds the P100 did 1,336,226,651,784 double float
operations, reaching a performance of 1.25 TFLOPS.

To examine the scalability, consider

1. the doubling of the precision, in double, double double, quad double,
   and octo double precision, for fixed degree 191; and

2. the doubling of the number of coefficients of the series,
   from 32 to 64, and from 64 to 128.

For increasing precision, the problem becomes more compute bound.
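Why is one deca double operation so expensive?
Multiple doubles represent a number as an unevaluated sum of doubles and
every operation is built from error-free transformations.
The sketch below shows the simplest case, the addition of two double doubles,
written with the classic two-sum; it illustrates the principle and is not
the CAMPARY generated code used in the experiments.

.. code-block:: cuda

   // a double double is an unevaluated sum hi + lo of two doubles

   __device__ void two_sum ( double a, double b, double *s, double *e )
   {
      *s = a + b;                         // floating-point sum
      double bb = *s - a;
      *e = (a - (*s - bb)) + (b - bb);    // exact rounding error of a + b
   }

   __device__ void quick_two_sum ( double a, double b, double *s, double *e )
   {
      *s = a + b;                         // assumes |a| >= |b|
      *e = b - (*s - a);
   }

   __device__ void dd_add
    ( double a_hi, double a_lo, double b_hi, double b_lo,
      double *c_hi, double *c_lo )
   {
      double s1, s2, t1, t2;

      two_sum(a_hi, b_hi, &s1, &s2);      // add the high parts
      two_sum(a_lo, b_lo, &t1, &t2);      // add the low parts
      s2 = s2 + t1;
      quick_two_sum(s1, s2, &s1, &s2);    // renormalize
      s2 = s2 + t2;
      quick_two_sum(s1, s2, c_hi, c_lo);  // renormalize again
   }

Already this double double addition takes 20 double additions and
subtractions; the counts grow quickly with the number of doubles,
up to the 397 and 3,089 double operations quoted above for one deca double
addition and multiplication.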
Wall clock times are shown in :numref:`figdoubles`
to illustrate the cost of doubling the precision.

.. _figdoubles:

.. figure:: ./figdoubles.png
   :align: center

   The 2-logarithm of the wall clock times to evaluate and differentiate
   :math:`p_1`, :math:`p_2`, :math:`p_3` in double (1d), double double (2d),
   quad double (4d), and octo double (8d) precision,
   for power series truncated at degree 191.

The effect of doubling the number of coefficients of the series
is shown in :numref:`figscale`.

.. _figscale:

.. figure:: ./figscale.png
   :align: center

   The 2-logarithm of the wall clock times to evaluate and differentiate
   :math:`p_1` in quad double (4d), penta double (5d), octo double (8d),
   and deca double (10d) precision,
   for power series of degrees 31, 63, and 127.

Accelerated Newton's Method
---------------------------

To solve

.. math::

   \left[
   \begin{array}{cccc}
   A_0 &     &     &     \\
   A_1 & A_0 &     &     \\
   A_2 & A_1 & A_0 &     \\
   A_3 & A_2 & A_1 & A_0 \\
   \end{array}
   \right]
   \left[
   \begin{array}{c}
   {\bf x}_0 \\ {\bf x}_1 \\ {\bf x}_2 \\ {\bf x}_3
   \end{array}
   \right]
   =
   \left[
   \begin{array}{c}
   {\bf b}_0 \\ {\bf b}_1 \\ {\bf b}_2 \\ {\bf b}_3
   \end{array}
   \right]

in parallel, the task graph to solve a triangular block Toeplitz system
is shown in :numref:`figtaskgraphQRtoeplitz`.

.. _figtaskgraphQRtoeplitz:

.. figure:: ./figtaskgraphQRtoeplitz.png
   :align: center

   A task graph for a triangular block Toeplitz system.

In the setup, we consider :math:`n` variables :math:`\bf x`,
an :math:`n`-by-:math:`n` exponent matrix :math:`A`,
and a vector :math:`{\bf b}(t)` of :math:`n` series of order :math:`O(t^d)`:

.. math::

   {\bf x}^A = {\bf b}(t) \quad \mbox{is a {\em monomial homotopy}.}

For example, for :math:`n=3` and :math:`{\bf x} = [x_1, x_2, x_3]`:

.. math::

   A = \left[
       \begin{array}{ccc}
       1 & 0 & 0 \\
       1 & 1 & 0 \\
       1 & 1 & 1
       \end{array}
       \right]
   \quad
   \left\{
   \begin{array}{lcl}
   x_1         & \!\!\!=\!\!\! & b_1(t) \\
   x_1 x_2     & \!\!\!=\!\!\! & b_2(t) \\
   x_1 x_2 x_3 & \!\!\!=\!\!\! & b_3(t)
   \end{array}
   \right.
   \quad
   \begin{array}{l}
   x_1(t) = \exp(\alpha_1 t) + O(t^d) \\
   x_2(t) = \exp(\alpha_2 t) + O(t^d) \\
   x_3(t) = \exp(\alpha_3 t) + O(t^d)
   \end{array}

where

.. math::

   \exp(\alpha t) + O(t^4) = 1 + \alpha t + \frac{\alpha^2}{2!} t^2
                               + \frac{\alpha^3}{3!} t^3 + O(t^4),

with :math:`\alpha \in [-1, -1 + \delta] \cup [1 - \delta, 1]`,
:math:`\delta > 0`, or :math:`|\alpha| = 1`
for random :math:`\alpha \in {\mathbb C}`.

To motivate the use of multiple double precision,
consider the following series expansion:

.. math::

   \exp(t) = \sum_{k=0}^{d-1} \frac{t^k}{k!} + O(t^d)
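The magnitude of the last coefficient, :math:`1/(d-1)!`, dictates the working
precision: it must remain representable relative to the leading coefficient 1,
so the unit roundoff (roughly :math:`2^{-52 p}` for a :math:`p`-double format)
must be smaller than this magnitude.
The small host-side check below (a sketch) prints these magnitudes
for a few truncation degrees.

.. code-block:: cuda

   #include <cstdio>

   int main ( void )
   {
      const int degrees[] = {7, 15, 23, 31, 47, 63, 95, 127};

      for(int i=0; i<8; i++)
      {
         int k = degrees[i];
         double coeff = 1.0;
         for(int j=1; j<=k; j++) coeff = coeff/j;   // coeff = 1/k!

         printf("k = %3d : 1/k! = %.1e\n", k, coeff);
      }
      return 0;
   }

Its output reproduces the second column of :numref:`tabexpfactorial`,
e.g. ``1/k!`` is about ``2.0e-04`` for ``k = 7``
and about ``3.3e-214`` for ``k = 127``.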
In :numref:`tabexpfactorial`, the magnitude of the last coefficient
is related to the required precision.

.. _tabexpfactorial:

.. table:: Recommended precision for factorial coefficients.

   +-----------+--------------+-----------------------+
   | :math:`k` | :math:`1/k!` | recommended precision |
   +===========+==============+=======================+
   | 7         | ``2.0e-004`` | double precision okay |
   +-----------+--------------+-----------------------+
   | 15        | ``7.7e-013`` | use double doubles    |
   +-----------+--------------+-----------------------+
   | 23        | ``3.9e-023`` | use double doubles    |
   +-----------+--------------+-----------------------+
   | 31        | ``1.2e-034`` | use quad doubles      |
   +-----------+--------------+-----------------------+
   | 47        | ``3.9e-060`` | use octo doubles      |
   +-----------+--------------+-----------------------+
   | 63        | ``5.0e-088`` | use octo doubles      |
   +-----------+--------------+-----------------------+
   | 95        | ``9.7e-149`` | need hexa doubles     |
   +-----------+--------------+-----------------------+
   | 127       | ``3.3e-214`` | need hexa doubles     |
   +-----------+--------------+-----------------------+

The monomial homotopies allow setting the scalability parameters.

1. Going from one to two columns of monomials:

   .. math::

      {\bf c}_1 {\bf x}^{A_1} + {\bf c}_2 {\bf x}^{A_2} = {\bf b}(t),

   for two :math:`n`-vectors :math:`{\bf c}_1` and :math:`{\bf c}_2`
   and two exponent matrices :math:`A_1` and :math:`A_2`.

2. For increasing dimensions: :math:`n=64, 128, 256, 512, 1024`.

3. For increasing orders :math:`d` in :math:`O(t^d)`:
   :math:`d = 1, 2, 4, 8, 16, 32, 64`.

4. For increasing precision: double, double double, quad double, octo double.

Doubling columns, dimensions, orders, and precision,
how much of the overhead can be compensated by GPU acceleration?

This application combines six different types of accelerated computations:

1. convolutions for evaluation and differentiation;

2. Householder QR;

3. :math:`Q^H {\bf b}` computations;

4. back substitutions to solve :math:`R {\bf x} = Q^H {\bf b}`;

5. updates :math:`{\bf b} = {\bf b} - A {\bf x}`; and

6. residual computations :math:`\| {\bf b} - A {\bf x} \|_1`.

Which of the six types occupies most of the time?

The computations are staggered.
In computing :math:`{\bf x}(t) = {\bf x}_0 + {\bf x}_1 t + {\bf x}_2 t^2 + \cdots + {\bf x}_{d-1} t^{d-1}`, observe:

1. Start with :math:`{\bf x}_0` correct to half its precision,
   otherwise Newton's method may not converge.

2. Increase :math:`d` in the order :math:`O(t^d)` gradually, e.g.:
   the new :math:`d` is :math:`d + 1 + d/2`,
   hoping (at best) for quadratic convergence.

3. Once :math:`{\bf x}_k` is correct, the corresponding :math:`{\bf b}_k = 0`,
   as :math:`{\bf b}_k` is obtained by evaluation,
   and then the update :math:`\Delta {\bf x}_k` should no longer be computed,
   because

   .. math::

      Q R \, \Delta {\bf x}_k = {\bf b}_k = {\bf 0}
      \quad \Rightarrow \quad \Delta {\bf x}_k = {\bf 0}.

This gives a criterion to stop the iterations.
A sketch of this staggered schedule follows the next two figures.

To see which type of computation occupies the largest portion of the time,
consider :numref:`figdata8d1024rd63` and :numref:`figdata8d2c1024rd63`.

.. _figdata8d1024rd63:

.. figure:: ./figdata8d1024rd63.png
   :align: center

   Six different types of accelerated computations, on the V100,
   on one column of monomials, triangular exponent matrix of ones,
   :math:`n=1024`, :math:`d = 64`, 24 steps in octo double precision.

.. _figdata8d2c1024rd63:

.. figure:: ./figdata8d2c1024rd63.png
   :align: center

   Six different types of accelerated computations, on the V100,
   on *two* columns of monomials, triangular exponent matrix of ones,
   :math:`n=1024`, :math:`d = 64`, 24 steps in octo double precision.
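Returning to the staggered computation of the series coefficients,
the sketch below shows only the control flow of a hypothetical driver:
how the truncation order grows with the rule :math:`d := d + 1 + d/2`.
The calls to the accelerated evaluation, QR, substitution, update,
and residual kernels are summarized by a comment.

.. code-block:: cuda

   #include <cstdio>

   int main ( void )
   {
      const int targetdeg = 64;  // the order O(t^d) wanted at the end
      int d = 1;                 // start with the leading coefficient x_0
      int step = 0;

      while(d < targetdeg)
      {
         d = d + 1 + d/2;        // gradual increase, at best quadratic
         if(d > targetdeg) d = targetdeg;
         step = step + 1;

         printf("step %2d : working with series of order O(t^%d)\n", step, d);
         // in the real driver : evaluate and differentiate at the current
         // series, solve the block Toeplitz system with QR and back
         // substitution, update x and b, and skip the coefficients x_k
         // for which b_k is already zero
      }
      return 0;
   }

With ``targetdeg`` equal to 64, the orders visited are
2, 4, 7, 11, 17, 26, 40, 61, and 64.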
Doubling the precision less than doubles the wall clock time
and increases the time spent by all kernels,
as illustrated by :numref:`figWall1024rd63`.

.. _figWall1024rd63:

.. figure:: ./figWall1024rd63.png
   :align: center

   On one column of monomials, triangular exponent matrix of ones,
   :math:`n=1024`, :math:`d=64`, 24 steps, 2-logarithms of times,
   in 4 different precisions, on the V100.

Teraflop performance of the convolutions is illustrated
in :numref:`figdataCNV8d1024rd63`.

.. _figdataCNV8d1024rd63:

.. figure:: ./figdataCNV8d1024rd63.png
   :align: center

   On one column of monomials, triangular exponent matrix of ones,
   :math:`n=1024`, performance of the evaluation and differentiation,
   in octo double precision, for increasing orders of the series.

After degree 40, teraflop performance is observed on the V100.

Bibliography
------------

* **Accelerated polynomial evaluation and differentiation at power series in multiple double precision.**
  In the *2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)*,
  pages 740-749. IEEE, 2021. ``arXiv:2101.10881``

* **Least squares on GPUs in multiple double precision.**
  In the *2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)*,
  pages 828-837. IEEE, 2022. ``arXiv:2110.08375``

* **GPU Accelerated Newton for Taylor Series Solutions of Polynomial Homotopies in Multiple Double Precision.**
  In the *Proceedings of the 26th International Workshop on Computer Algebra in Scientific Computing (CASC 2024)*,
  pages 328-348. Springer, 2024. ``arXiv:2301.12659``

The code is released under the GPL-3.0 License.