Parallel FFT and Sorting ======================== As our third application for parallel algorithms, we consider the Fast Fourier Transform (FFT). Instead of investigating the algorithmic aspects in detail, we introduce the FFT with an excellent software library FFTW which supports for OpenMP. Both the FFT and Sorting have a complexity of :math:`O(n \log(n))` where :math:`n` is the size of the input. The Fast Fourier Transform in Parallel -------------------------------------- A periodic function :math:`f(t)` can be written as a series of sinusoidal waveforms of various frequencies and amplitudes: the Fourier series. The Fourier transform maps :math:`f` from the time to the frequency domain. In discrete form: .. math:: F_k = \frac{1}{n} \sum_{j=0}^{n-1} f_j e^{-2 \pi i (jk/n)}, \quad k=0,1,\ldots,n-1, f_k = f(x_k). The Discrete Fourier Transform (DFT) maps a convolution into a componentwise product. The Fast Fourier Transform is an algorithm that reduces the cost of the DFT from :math:`O(n^2)` to :math:`O(n \log(n))`, for length :math:`n`. The are many applications, for example: signal and image processing. FFTW (the Fastest Fourier Transform in the West) is a library for the Discrete Fourier Transform (DFT), developed at MIT by Matteo Frigo and Steven G. Johnson available under the GNU GPL license at . FFTW received the 1999 J. H. Wilkinson Prize for Numerical Software. FFTW 3.3.3 supports MPI and comes with multithreaded versions: with Cilk, Pthreads and OpenMP are supported. Before ``make install``, do :: ./configure --enable-mpi --enable-openmp --enable-threads We illustrate the use of FFTW for signal processing. Consider :math:`f(t) = 2 \cos(4 \times 2 \pi t) + 5 \sin(10 \times 2 \pi t)`, for :math:`t \in [0,1]`. Its plot is in :numref:`figtestsignal`. .. _figtestsignal: .. figure:: ./figtestsignal.png :align: center A simple test signal. Compiling and running the code below gives the frequencies and amplitudes of the test signal. :: $ make fftw_use gcc fftw_use.c -o /tmp/fftw_use -lfftw3 -lm $ /tmp/fftw_use scanning through the output of FFTW... at 4 : (2.56000e+02,-9.63913e-14) => frequency 4 and amplitude 2.000e+00 at 10 : (-7.99482e-13,-6.40000e+02) => frequency 10 and amplitude 5.000e+00 $ Calling the FFTW software on a test signal: :: #include #include #include #include double my_signal ( double t ); /* * Defines a signal composed of a cosine of frequency 4 and amplitude 2, * a sine of frequency 10 and amplitude 5. */ void sample_signal ( double f ( double t ), int m, double *x, double *y ); /* * Takes m = 2^k samples of the signal f, returns abscisses in x and samples in y. */ int main ( int argc, char *argv[] ) { const int n = 256; double *x, *y; x = (double*)calloc(n,sizeof(double)); y = (double*)calloc(n,sizeof(double)); sample_signal(my_signal,n,x,y); int m = n/2+1; fftw_complex *out; out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex)*m); fftw_plan p; p = fftw_plan_dft_r2c_1d(n,y,out,FFTW_ESTIMATE); fftw_execute(p); printf("scanning through the output of FFTW...\n"); double tol = 1.0e-8; /* threshold on noise */ int i; for(i=0; i tol) { printf("at %d : (%.5e,%.5e)\n", i,out[i][0],out[i][1]); printf("=> frequency %d and amplitude %.3e\n", i,v); } } return 0; } Next is the running the OpenMP version of FFTW. For timing purposes, we define a program that takes in the parameters at the command line. The parameters are ``m`` and ``n``: We will run the fftw ``m`` times for dimension ``n``. The setup of our experiment is in the code below followed by the parallel version with OpenMP. Setup of the experiment to execute the fftw: :: #include #include #include #include #include void random_sequence ( int n, double *re, double *im ); /* * Returns a random sequence of n complex numbers * with real parts in re and imaginary parts in im. */ int main ( int argc, char *argv[] ) { int n,m; if(argc > 1) n = atoi(argv[1]); else printf("please provide dimension on command line\n"); m = (argc > 2) ? atoi(argv[2]) : 1; double *x, *y; x = (double*)calloc(n,sizeof(double)); y = (double*)calloc(n,sizeof(double)); random_sequence(n,x,y); fftw_complex *in,*out; in = (fftw_complex*) fftw_malloc(sizeof(fftw_complex)*n); out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex)*n); fftw_plan p; p = fftw_plan_dft_1d(n,in,out,FFTW_FORWARD,FFTW_ESTIMATE); clock_t tstart,tstop; tstart = clock(); int i; for(i=0; i #include int main ( int argc, char *argv[] ) { int t; /* number of threads */ t = (argc > 3) ? atoi(argv[3]) : 1; int okay = fftw_init_threads(); if(okay == 0) printf("error in thread initialization\n"); omp_set_num_threads(t); fftw_plan_with_nthreads(t); To define the compilation in the makefile, we must link with the OpenMP library version fftw3 and with the fftw3 library: :: fftw_timing_omp: gcc -fopenmp fftw_timing_omp.c -o /tmp/fftw_timing_omp \ -lfftw3_omp -lfftw3 -lm Results are in :numref:`tabfftwomptimes1` and :numref:`tabfftwomptimes2`. .. _tabfftwomptimes1: .. table:: Running times (with the unix command ``time``), for computing 1,000 DFTs for :math:`n = 100,000`, with *p* threads, where real = wall clock time, user = cpu time, sys = system time, obtained on two 8-core processors at 2.60 Ghz. +-----+--------+--------+--------+----------+ | *p* | real | user | sys | speed up | +=====+========+========+========+==========+ | 1 | 2.443s | 2.436s | 0.004s | 1 | +-----+--------+--------+--------+----------+ | 2 | 1.340s | 2.655s | 0.010s | 1.823 | +-----+--------+--------+--------+----------+ | 4 | 0.774s | 2.929s | 0.008s | 3.156 | +-----+--------+--------+--------+----------+ | 8 | 0.460s | 3.593s | 0.017s | 5.311 | +-----+--------+--------+--------+----------+ | 16 | 0.290s | 4.447s | 0.023s | 8.424 | +-----+--------+--------+--------+----------+ .. _tabfftwomptimes2: .. table:: Running times (with the unix command ``time``), for computing 1,000 DFTs for :math:`n = 1,000,000`, with *p* threads, where real = wall clock time, user = cpu time, sys = system time, obtained on two 8-core processors at 2.60 Ghz. +-----+---------+---------+--------+----------+ | *p* | real | user | sys | speed up | +=====+=========+=========+========+==========+ | 1 | 44.806s | 44.700s | 0.014s | 1 | +-----+---------+---------+--------+----------+ | 2 | 22.151s | 44.157s | 0.020s | 2.023 | +-----+---------+---------+--------+----------+ | 4 | 10.633s | 42.336s | 0.019s | 4.214 | +-----+---------+---------+--------+----------+ | 8 | 6.998s | 55.630s | 0.036s | 6.403 | +-----+---------+---------+--------+----------+ | 16 | 3.942s | 62.250s | 0.134s | 11.366 | +-----+---------+---------+--------+----------+ Comparing the speedups in :numref:`tabfftwomptimes1` with those in :numref:`tabfftwomptimes2`, we observe that speedups improve with a ten fold increase of :math:`n`. Bucket Sort for Distributed Memory ---------------------------------- Given are :math:`n` numbers, suppose all are in :math:`[0,1]`. The algorithm using :math:`p` buckets proceeds in two steps: 1. Partition numbers :math:`x` into :math:`p` buckets: .. math:: x \in [i/p,(i+1)/p[ \quad \Rightarrow \quad x \in (i+1)\mbox{th bucket}. 2. Sort all :math:`p` buckets. The cost to partition the numbers into :math:`p` buckets is :math:`O(n \log_2(p))`. Note: radix sort uses most significant bits to partition. In the best case: every bucket contains :math:`n/p` numbers. The cost of Quicksort is :math:`O(n/p \log_2(n/p))` per bucket. Sorting :math:`p` buckets takes :math:`O(n \log_2(n/p))`. The total cost is :math:`O(n ( \log_2(p) + \log_2(n/p) ))`. On :math:`p` processors, all nodes sort: 1. Root node distributes numbers: processor *i* gets *i*-th bucket. 2. Processor *i* sorts *i*-th bucket. 3. Root node collects sorted buckets from processors. Is it worth it? Recall: serial cost is :math:`n ( \log_2(p) + \log_2(n/p) )`. The cost of parallel algorithm is * :math:`n \log_2(p)` to place numbers into buckets, * :math:`n/p \log_2(n/p)` to sort buckets. .. math:: \mbox{speedup} = \frac{n ( \log_2(p) + \log_2(n/p) )} {n ( \log_2(p) + \log_2(n/p)/p )}, or equivalently, .. math:: \mbox{speedup} = \frac{1 + L}{1 + L/p} = \frac{1+L}{(p+L)/p} = \frac{p}{p+L} (1 + L), \quad L = \frac{\log_2(n/p)}{\log_2(p)}. Let us compare to quicksort. .. math:: \begin{array}{rcl} \mbox{speedup} & = & {\displaystyle \frac{n \log_2(n)}{n(\log_2(p) + n/p \log_2(n/p))}} \\ & = & {\displaystyle \frac{\log_2(n)}{\log_2(p) + 1/p (\log_2(n)-\log_2(p))}} \\ & = & {\displaystyle \frac{\log_2(n)}{1/p (\log_2(n) + (1 - 1/p) \log_2(p))}} \end{array} For example, consider: :math:`n = 2^{20}`, :math:`\log_2(n) = 20`, :math:`p = 2^2`, :math:`\log_2(p) = 2`, .. math:: \begin{array}{rcl} \mbox{speedup} & = & {\displaystyle \frac{20}{1/4 (20) + (1 - 1/4) 2}} \\ & = & {\displaystyle \frac{20}{5 + 3/2} = \frac{40}{13} \approx 3.08.} \end{array} Next the communication versus computation costs are examined. The scatter of $n$ data elements costs :math:`t_{\rm start~up} + n t_{\rm data}`, where :math:`t_{\rm data}` is the cost of sending 1 data element. For distributing and collecting of all buckets, the total communication time is .. math:: 2 p \left( t_{\rm start~up} + \frac{n}{p} t_{\rm data}\right). The computation/communication ratio is .. math:: \frac{(n \log_2(p) + n/p \log_2(n/p) ) t_{\rm compare} } {2 p \left( t_{\rm start~up} + \frac{n}{p} t_{\rm data}\right)} where :math:`t_{\rm compare}` is the cost for one comparison. We view this ratio for :math:`n \gg p`, for fixed :math:`p`, so: .. math:: \frac{n}{p} \log_2 \left( \frac{n}{p} \right) = \frac{n}{p} \left( \log_2(n) - \log_2(p) \right) \approx \frac{n}{p} \log_2(n). The ratio then becomes .. math:: \frac{n}{p} \log_2(n) t_{\rm compare} \gg 2n t_{\rm data}. Thus, :math:`\log_2(n)` must be sufficiently high... Quicksort for Shared Memory --------------------------- The code for a recursive algorithm to partition number is listed below: :: void quicksort ( double *v, int start, int end ) { if(start < end) { int pivot; partition(v,start,end,&pivot); quicksort(v,start,pivot-1); quicksort(v,pivot+1,end); } } where ``partition`` has the prototype: :: void partition ( double *v, int lower, int upper, int *pivot ); /* precondition: upper - lower > 0 * takes v[lower] as pivot and interchanges elements: * v[i] <= v[pivot] for all i < pivot, and * v[i] > v[pivot] for all i > pivot, * where lower <= pivot <= upper. */ A ``partition`` function is defined below. :: void partition ( double *v, int lower, int upper, int *pivot ) { double x = v[lower]; int up = lower+1; /* index will go up */ int down = upper; /* index will go down */ while(up < down) { while((up < down) && (v[up] <= x)) up++; while((up < down) && (v[down] > x)) down--; if(up == down) break; double tmp = v[up]; v[up] = v[down]; v[down] = tmp; } if(v[up] > x) up--; v[lower] = v[up]; v[up] = x; *pivot = up; } The functions ``partition`` and ``qsort`` are called in the main function as follows: :: int lower = 0; int upper = n-1; int pivot = 0; if(n > 1) partition(v,lower,upper,&pivot); if(pivot != 0) qsort((void*)v,(size_t)pivot, sizeof(double),compare); if(pivot != n) qsort((void*)&v[pivot+1],(size_t)(n-pivot-1), sizeof(double),compare); With OpenMP we define a parallel region: :: omp_set_num_threads(2); #pragma omp parallel { if(pivot != 0) qsort((void*)v,(size_t)pivot, sizeof(double),compare); if(pivot != n) qsort((void*)&v[pivot+1],(size_t)(n-pivot-1), sizeof(double),compare); } The output of runs on a dual core Mac OS X at 2.26 GHz are below: :: $ time /tmp/time_qsort 10000000 0 time elapsed : 4.0575 seconds real 0m4.299s user 0m4.229s sys 0m0.068s $ time /tmp/part_qsort_omp 10000000 0 pivot = 4721964 -> sorting the first half : 4721964 numbers -> sorting the second half : 5278035 numbers real 0m3.794s user 0m7.117s sys 0m0.066s Speed up: 4.299/3.794 = 1.133, or 13.3 percent, which is faster with one extra core. Radix Sort for GPU acceleration ------------------------------- Radix sort is introduced on an example of 16 4-bit numbers, with :numref:`figradixsort1` showing the first step. .. _figradixsort1: .. figure:: ./figradixsort1.png :align: center The numbers are placed in two buckets, along the last bit. Within each bucket, the original order is preserved. Steps 2, 3, and 4 are shown in :numref:`figradixsort2`, :numref:`figradixsort3`, and :numref:`figradixsort4` respectively. .. _figradixsort2: .. figure:: ./figradixsort2.png :align: center The numbers are placed in two buckets, along the second to last bit. Within each bucket, the original order is preserved. .. _figradixsort3: .. figure:: ./figradixsort3.png :align: center The numbers are placed in two buckets, along the second bit. Within each bucket, the original order is preserved. .. _figradixsort4: .. figure:: ./figradixsort4.png :align: center The numbers are placed in two buckets, along the first bit. By the invariant: *within each bucket, the original order is preserved*, the list is sorted. In a parallel implementation, on 16 numbers, work with 16 threads, where the *i*-th thread computes the position of the *i*-th input number in the output. The rules for this computation follow: * Mapping to a zero bucket: .. math:: \begin{array}{rcl} \mbox{destination of a zero} & = & \mbox{\#zeros before}, \\ & = & \mbox{\#keys before} - \mbox{\#ones before}, \\ & = & \mbox{key index} - \mbox{\#ones before}. \end{array} * Mapping to a one bucket: .. math:: \begin{array}{rcl} \mbox{destination of a one} & = & \mbox{\#zeros in total} + \mbox{\#ones before}, \\ & = & \mbox{\#keys in total} - \mbox{\#ones in total} + \mbox{\#ones before}, \\ & = & \mbox{input size} - \mbox{\#ones in total} + \mbox{\#ones before}. \end{array} The nontrivial computation is counting the number of keys before it that map to the 1 bucket. This is done by an *exclusive scan*, illustrated in :numref:`figradixsortscan`. .. _figradixsortscan: .. figure:: ./figradixsortscan.png :align: center Counting the output position with an :index:`exclusive scan`. The exclusive scan is similar to the prefix sum scan. One major inefficiency is the writing to the output, which has an access pattern that cannot be memory coalesced. A local sort in shared memory improves memory coalescing. Memory coalescing is also improved by thread coarsening, when each thread is assigned to multiple keys instead of one. GPU accelerated sorting is provided by the software Thrust. Chapter 13 in the fourth edition of *Programming Massively Parallel Programming* is entirely devoted to sorting. Bibliography ------------ 1. N. Bell and J. Hoberock: **Thrust: a productivity-oriented library for CUDA.** In *GPU Computing Gems Jade Edition.* Pages 359-371, Morgan Kaufmann, 2012. 2. Matteo Frigo and Steven G. Johnson: **The Design and Implementation of FFTW3**. *Proc. IEEE* 93(2): 216--231, 2005. 3. Edgar Solomonik and Laxmikant V. Kale: **Highly Scalable Parallel Sorting.** In the proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2010. 4. Mirko Rahn, Peter Sanders, and Johannes Singler: **Scalable Distributed-Memory External Sorting.** In the proceedings of the 26th IEEE International Conference on Data Engineering (ICDE), pages 685-688, IEEE, 2010. 5. Davide Pasetto and Albert Akhriev: **A Comparative Study of Parallel Sort Algorithms.** In SPLASH'11, the proceedings of the ACM international conference companion on object oriented programming systems languages and applications, pages 203-204, ACM 2011. Exercises --------- 1. Consider the fan out scatter and fan in gather operations and investigate how these operations will reduce the communication cost and improve the computation/communication ratio in bucket sort of :math:`n` numbers on :math:`p` processors. 2. Use Julia to implement radix sort in parallel with tasking. 3. Implement parallel radix sort on a GPU using Julia.