Parallel Numerical Integration ============================== The problem of numerical integeration is another illustration of an ideal parallel computation: the communication cost is constant and the amount of computational work increases as the number of function evaluations increases. The complexity of the function and the working precision is another factor which increases the computational cost. Numerical Integration --------------------- In numerical integration we consider the problem of approximating the definite integral of a function over a domain. By domain decomposition we naturally arrive at parallel algorithms. Let :math:`a < b` and consider :math:`a = c_0 < c_1 < \cdots < c_{p-2} < c_{p-1} = b`, then: .. math:: \int_a^b f(x) dx = \sum_{k=1}^{p} \int_{c_{k-1}}^{c_{k}} f(x) dx. We have *p* subintervals of :math:`[a,b]` and on each subinterval :math:`[c_{k-1},c_k]` we apply a :index:`quadrature` formula (weighted sum of function values): .. math:: \int_{c_{k-1}}^{c_k} f(x) dx \approx \sum_{j=1}^n w_j f(x_j) where the weights :math:`w_j` correspond to points :math:`x_j \in [c_{k-1},c_k]`. Let the domain :math:`D` be partitioned as :math:`\displaystyle \bigcup_{i=1}^n \Delta_i`: .. math:: \int_D f(x_1,x_2) d x_1 d x_2 = \sum_{k=1}^n \int_{\Delta_k} f(x_1,x_2) d x_1 d x_2. For a triangle :math:`\Delta`, an approximation of the integral of :math:`f` over :math:`\Delta` is to take the volume between the plane spanned by the function values at the corners of :math:`\Delta` the :math:`x_1 x_2`-plane. Finer domain decompositions of :math:`D` lead to more triangles :math:`\Delta`, more function evaluations, and more accurate approximations. Like Monte Carlo simulation, numerical integration is :index:`pleasingly parallel`. The function evaluations can be computed independently from each other. No communication between processors needed once the subdomains have been distributed. The size of all communication is small. On input we have the definition of the subdomain, and on return is one weighted sum of function values. To obtain highly accurate values when applying extrapolation on the :index:`trapezoidal rule` (so-called :index:`Romberg integration`), we use quad double arithmetic. A :index:`quad double` is an unevaluated sum of 4 doubles, improves working precision from :math:`2.2 \times 10^{-16}` to :math:`2.4 \times 10^{-63}`. A quad double builds on the ``double double``. The least significant part of a ``double double`` can be interpreted as a compensation for the roundoff error. Predictable overhead: working with ``double double`` is of the same cost as working with complex numbers. The QD library supports operator overloading in C++, as shown in the example code below. :: #include #include #include using namespace std; int main ( void ) { qd_real q("2"); cout << setprecision(64) << q << endl; for(int i=0; i<8; i++) { qd_real dq = (q*q - 2.0)/(2.0*q); q = q - dq; cout << q << endl; } cout << scientific << setprecision(4) << "residual : " << q*q - 2.0 << endl; return 0; Compiling and running the program could go as below. :: $ g++ -I/usr/local/qd-2.3.17/include qd4sqrt2.cpp /usr/local/lib/libqd.a -o /tmp/qd4sqrt2 $ /tmp/qd4sqrt2 2.0000000000000000000000000000000000000000000000000000000000000000e+00 1.5000000000000000000000000000000000000000000000000000000000000000e+00 1.4166666666666666666666666666666666666666666666666666666666666667e+00 1.4142156862745098039215686274509803921568627450980392156862745098e+00 1.4142135623746899106262955788901349101165596221157440445849050192e+00 1.4142135623730950488016896235025302436149819257761974284982894987e+00 1.4142135623730950488016887242096980785696718753772340015610131332e+00 1.4142135623730950488016887242096980785696718753769480731766797380e+00 1.4142135623730950488016887242096980785696718753769480731766797380e+00 residual : 0.0000e+00 $ Instead of typing in all arguments to ``g++`` we better work with a ``makefile`` which contains :: QD_ROOT=/usr/local/qd-2.3.17 QD_LIB=/usr/local/lib qd4sqrt2: g++ -I$(QD_ROOT)/include qd4sqrt2.cpp \ $(QD_LIB)/libqd.a -o qd4sqrt2 Then we can simply type ``make qd4sqrt2`` at the command prompt to build the executable ``qd4sqrt2``. Returning to the problem of approximating :math:`\pi` we consider :math:`\displaystyle \pi = \int_0^1 \frac{16 x - 16}{x^4 - 2 x^3 + 4 x - 4} dx`. We apply the composite Trapezoidal rule doubling in each step the number of subintervals of :math:`[0,1]`. Recycling the function evaluations, the next approximation requires as many function evaluations as in the previous step. To accelerate the convergence, extrapolate on the errors: .. math:: T[i][j] = \frac{T[i][j-1] 2^{2j} - T[i-1][j-1]}{2^{2j} - 1}, \quad T[i][0] = T\left(\frac{h}{2^i}\right). where .. math:: T(h) = \frac{h}{2}(f(a) + f(b)) + h \sum_{k=1}^{n-1} f(a+kh), \quad h = \frac{b-a}{n}. Running the program produces the following approximations for :math:`\pi`, improved with Romberg integration. :: $ /tmp/romberg4piqd Give n : 20 Trapezoidal rule : 2.0000000000000000000000000000000000000000000000000000000000000000e+00 -1.1e+00 2.8285714285714285714285714285714285714285714285714285714285714286e+00 -3.1e-01 3.0599849140217096656334342993859535399012339858804671938901787924e+00 -8.2e-02 3.1208799149782192753090608934264402720707593687963124445027369600e+00 -2.1e-02 3.1363921117379842496423260900707118017245006984692475608300990573e+00 -5.2e-03 3.1402910615222379836674014569717873479608997416673717997690893828e+00 -1.3e-03 3.1412671635291865246174301601168385100304285866900551637643442050e+00 -3.3e-04 3.1415112753058333508223495008044322351701491319859264242480853344e+00 -8.1e-05 3.1415723086580001130741506654555989016008550279324178879004173905e+00 -2.0e-05 3.1415875673342908067733429372872267595388334364564318736724256675e+00 -5.1e-06 3.1415913820245079343519112519075857885126411600435940845928207730e+00 -1.3e-06 3.1415923356983838054575966637759377218669127222763583195386129930e+00 -3.2e-07 3.1415925741169353735102088131178638490686580423772000985781988075e+00 -7.9e-08 3.1415926337215784280554756890689392085987403031281282606384169231e+00 -2.0e-08 3.1415926486227395143502815854605592727657670704677224154970568916e+00 -5.0e-09 3.1415926523480298060901422591278382431441533032555099184797306336e+00 -1.2e-09 3.1415926532793523802854924341737272725049748286771339948938222906e+00 -3.1e-10 3.1415926535121830239131040417347661807897041512988564129230237347e+00 -7.8e-11 3.1415926535703906848249303226263308523017737071795040403161676085e+00 -1.9e-11 3.1415926535849426000531946040370197046578880393737659063268947933e+00 -4.9e-12 Romberg integration : 2.0000000000000000000000000000000000000000000000000000000000000000e+00 -1.1e+00 3.1047619047619047619047619047619047619047619047619047619047619048e+00 -3.7e-02 3.1392801316880188260437414797616101138912788116649184217866669809e+00 -2.3e-03 3.1414830360866343425236605718165706514196020802703167069117496616e+00 -1.1e-04 3.1415911260349351404190698539313113170056259534276006755868536015e+00 -1.5e-06 3.1415926431831900702282508859066437449426241659436386656679964387e+00 -1.0e-08 3.1415926535646358543989993705490112789767840363176063755945965454e+00 -2.5e-11 3.1415926535897718115554422419526551887771361680030439892483820211e+00 -2.1e-14 3.1415926535897932322804851779277884529440555741904175076351878046e+00 -6.2e-18 3.1415926535897932384620617895102904628866329276580937738783309287e+00 -5.8e-22 3.1415926535897932384626433658715764722081660961614203175123699605e+00 -1.7e-26 3.1415926535897932384626433832793408245588305577598196922499346428e+00 -1.6e-31 3.1415926535897932384626433832795028837365190111401005283154516696e+00 -4.6e-37 3.1415926535897932384626433832795028841971690058832160185814954870e+00 -3.9e-43 3.1415926535897932384626433832795028841971693993750061822859693780e+00 -1.0e-49 3.1415926535897932384626433832795028841971693993751058209675539509e+00 -7.4e-57 3.1415926535897932384626433832795028841971693993751058209749445922e+00 -5.7e-65 3.1415926535897932384626433832795028841971693993751058209749445904e+00 -1.9e-63 3.1415926535897932384626433832795028841971693993751058209749445890e+00 -3.3e-63 3.1415926535897932384626433832795028841971693993751058209749445875e+00 -4.8e-63 elapsed time : 2.040 seconds The functions defining the composite Trapezoidal rule and Romberg integration in C++ are listed below. :: vector comptrap ( qd_real f ( qd_real x ), qd_real a, qd_real b, int n ) { vector t(n); qd_real h = b - a; t[0] = (f(a) + f(b))*h/2; for(int i=1, m=1; i& t ) { int n = t.size(); qd_real e[n][n]; int m = 0; for(int i=0; i #include #include #include #include "tbb/tbb.h" #include "tbb/blocked_range.h" #include "tbb/parallel_reduce.h" #include "tbb/task_scheduler_init.h" using namespace std; using namespace tbb; Then in the class definition, we start with the data attributes: the address to the start of the sequence of quad doubles and the sum of the sequence. The constructor copies the given address to the data attribute which stores the address of the sequence. The constructor initializes the sum to zero. :: class SumQuadDoubles { qd_real *data; public: qd_real sum; SumQuadDoubles ( qd_real *d ) : data(d), sum(0.0) {} The parallelism is defined via the ``blocked_range``, and by the operators to split and join. :: void operator() ( const blocked_range& r ) { qd_real s = qd_real(sum[0],sum[1],sum[2],sum[3]); // must accumulate ! qd_real *d = data; size_t end = r.end(); for(size_t i=r.begin(); i != end; ++i) s += d[i]; sum = qd_real(s[0],s[1],s[2],s[3]); } // the splitting constructor SumQuadDoubles ( SumQuadDoubles& x, split ) : data(x.data), sum(0.0) {} // the join method does the merge void join ( const SumQuadDoubles& x ) { sum += x.sum; } The class definition is used in the following function: :: qd_real ParallelSum ( qd_real *x, size_t n ) { SumQuadDoubles S(x); parallel_reduce(blocked_range(0,n), S); return S.sum; } The ``ParallelSum`` is called in the main program as follows: :: qd_real *d; d = (qd_real*)calloc(n,sizeof(qd_real)); for(int i=0; i. Exercises --------- 1. Make the OpenMP implementation of ``romberg4piqd_omp.cpp`` more general by prompting the user for a number of threads and then using those threads in the function ``comptrap``. Compare the speed up for 2, 4, 8, and 16 threads. 2. Write an elaborate description on the thread creation and synchronization issues with Pthreads to achieve a parallel version of ``romberg4piqd.cpp``. 3. Use the Intel Threading Building Blocks to write a parallel version of the composite trapezoidal rule in quad double arithmetic.