Parallel FFT and Sorting
========================

As our third application for parallel algorithms, we consider
the Fast Fourier Transform (FFT).  Instead of investigating the algorithmic
aspects in detail, we introduce the FFT with an excellent software
library FFTW which supports for OpenMP. 
Both the FFT and Sorting have a complexity of :math:`O(n \log(n))`
where :math:`n` is the size of the input.

The Fast Fourier Transform in Parallel
--------------------------------------

A periodic function :math:`f(t)`
can be written as a series of sinusoidal waveforms
of various frequencies and amplitudes: the Fourier series.
The Fourier transform maps :math:`f`
from the time to the frequency domain.
In discrete form:

.. math::
   F_k = \frac{1}{n} \sum_{j=0}^{n-1} f_j e^{-2 \pi i (jk/n)},
   \quad k=0,1,\ldots,n-1, f_k = f(x_k).

The Discrete Fourier Transform (DFT) maps a convolution
into a componentwise product.
The Fast Fourier Transform is an algorithm 
that reduces the cost of the DFT from :math:`O(n^2)`
to :math:`O(n \log(n))`, for length :math:`n`.
The are many applications, for example: signal and image processing.

FFTW (the Fastest Fourier Transform in the West)
is a library for the Discrete Fourier Transform (DFT),
developed at MIT by Matteo Frigo and Steven G. Johnson
available under the GNU GPL license at <http://www.fftw.org>.
FFTW received the 1999 J. H. Wilkinson Prize for Numerical Software.
FFTW 3.3.3 supports MPI and comes with multithreaded versions:
with Cilk, Pthreads and OpenMP are supported.
Before ``make install``, do

::

   ./configure --enable-mpi --enable-openmp --enable-threads

We illustrate the use of FFTW for signal processing.

Consider :math:`f(t) = 2 \cos(4 \times 2 \pi t) + 5 \sin(10 \times 2 \pi t)`,
for :math:`t \in [0,1]`.  Its plot is in :numref:`figtestsignal`.

.. _figtestsignal:

.. figure:: ./figtestsignal.png
    :align: center

    A simple test signal.

Compiling and running the code below
gives the frequencies and amplitudes of the test signal.

::

   $ make fftw_use
   gcc fftw_use.c -o /tmp/fftw_use -lfftw3 -lm
   $ /tmp/fftw_use
   scanning through the output of FFTW...
   at 4 : (2.56000e+02,-9.63913e-14)
   => frequency 4 and amplitude 2.000e+00
   at 10 : (-7.99482e-13,-6.40000e+02)
   => frequency 10 and amplitude 5.000e+00
   $ 


Calling the FFTW software on a test signal:

::

   #include <stdlib.h>
   #include <stdio.h>
   #include <math.h>
   #include <fftw3.h>

   double my_signal ( double t );
   /* 
    * Defines a signal composed of a cosine of frequency 4 and amplitude 2, 
    * a sine of frequency 10 and amplitude 5. */

   void sample_signal ( double f ( double t ), int m, double *x, double *y );
   /*
    * Takes m = 2^k samples of the signal f, returns abscisses in x and samples in y. */

   int main ( int argc, char *argv[] )
   {
      const int n = 256;
      double *x, *y;
      x = (double*)calloc(n,sizeof(double));
      y = (double*)calloc(n,sizeof(double));
      sample_signal(my_signal,n,x,y);

      int m = n/2+1;
      fftw_complex *out;
      out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex)*m);

      fftw_plan p;
      p = fftw_plan_dft_r2c_1d(n,y,out,FFTW_ESTIMATE);
      fftw_execute(p);
      printf("scanning through the output of FFTW...\n");
      double tol = 1.0e-8;  /* threshold on noise */
      int i;
      for(i=0; i<m; i++)
      {
         double v = fabs(out[i][0])/(m-1)
                  + fabs(out[i][1])/(m-1);
         if(v > tol)
         {
            printf("at %d : (%.5e,%.5e)\n",
                   i,out[i][0],out[i][1]);
            printf("=> frequency %d and amplitude %.3e\n",
                   i,v);
         }
      }
      return 0;
   }

Next is the running the OpenMP version of FFTW.
For timing purposes, we define a program that takes in
the parameters at the command line.
The parameters are ``m`` and ``n``:
We will run the fftw ``m`` times for dimension ``n``.
The setup of our experiment is in the code below
followed by the parallel version with OpenMP.

Setup of the experiment to execute the fftw:

::

   #include <stdlib.h>
   #include <stdio.h>
   #include <time.h>
   #include <math.h>
   #include <fftw3.h>

   void random_sequence ( int n, double *re, double *im );
   /* 
    * Returns a random sequence of n complex numbers
    * with real parts in re and imaginary parts in im. */

   int main ( int argc, char *argv[] )
   {
      int n,m;
      if(argc > 1)
         n = atoi(argv[1]);
      else
         printf("please provide dimension on command line\n");
      m = (argc > 2) ? atoi(argv[2]) : 1;
      double *x, *y;
      x = (double*)calloc(n,sizeof(double));
      y = (double*)calloc(n,sizeof(double));
      random_sequence(n,x,y);
      fftw_complex *in,*out;
      in = (fftw_complex*) fftw_malloc(sizeof(fftw_complex)*n);
      out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex)*n);
      fftw_plan p;
      p = fftw_plan_dft_1d(n,in,out,FFTW_FORWARD,FFTW_ESTIMATE);
      clock_t tstart,tstop;
      tstart = clock();
      int i;
      for(i=0; i<m; i++) fftw_execute(p);
      tstop = clock();
      printf("%d iterations took %.3f seconds\n",m,(tstop-tstart)/((double) CLOCKS_PER_SEC));
      return 0;
   }

Code to runt the OpenMP version of FFTW is below.

::

   #include <omp.h>
   #include <fftw3.h>

   int main ( int argc, char *argv[] )
   {
      int t;  /* number of threads */

      t = (argc > 3) ? atoi(argv[3]) : 1;

      int okay = fftw_init_threads();
      if(okay == 0)
         printf("error in thread initialization\n");
      omp_set_num_threads(t);
      fftw_plan_with_nthreads(t);

To define the compilation in the makefile, we must link
with the OpenMP library version fftw3 and with the fftw3 library:

::

   fftw_timing_omp:
           gcc -fopenmp fftw_timing_omp.c -o /tmp/fftw_timing_omp \
               -lfftw3_omp -lfftw3 -lm

Results are in :numref:`tabfftwomptimes1` and :numref:`tabfftwomptimes2`.

.. _tabfftwomptimes1:

.. table::
   Running times (with the unix command ``time``),
   for computing 1,000 DFTs for :math:`n = 100,000`, with *p* threads,
   where real = wall clock time, user = cpu time, sys = system time,
   obtained on two 8-core processors at 2.60 Ghz.

   +-----+--------+--------+--------+----------+
   | *p* | real   | user   | sys    | speed up |
   +=====+========+========+========+==========+
   | 1   | 2.443s | 2.436s | 0.004s | 1        |
   +-----+--------+--------+--------+----------+
   | 2   | 1.340s | 2.655s | 0.010s | 1.823    |
   +-----+--------+--------+--------+----------+
   | 4   | 0.774s | 2.929s | 0.008s | 3.156    |
   +-----+--------+--------+--------+----------+
   | 8   | 0.460s | 3.593s | 0.017s | 5.311    |
   +-----+--------+--------+--------+----------+
   | 16  | 0.290s | 4.447s | 0.023s | 8.424    |
   +-----+--------+--------+--------+----------+

.. _tabfftwomptimes2:

.. table::
   Running times (with the unix command ``time``),
   for computing 1,000 DFTs for :math:`n = 1,000,000`, with *p* threads,
   where real = wall clock time, user = cpu time, sys = system time,
   obtained on two 8-core processors at 2.60 Ghz.

   +-----+---------+---------+--------+----------+
   | *p* | real    |  user   | sys    | speed up |
   +=====+=========+=========+========+==========+
   | 1   | 44.806s | 44.700s | 0.014s | 1        |
   +-----+---------+---------+--------+----------+
   | 2   | 22.151s | 44.157s | 0.020s | 2.023    |
   +-----+---------+---------+--------+----------+
   | 4   | 10.633s | 42.336s | 0.019s | 4.214    |
   +-----+---------+---------+--------+----------+
   | 8   | 6.998s  | 55.630s | 0.036s | 6.403    |
   +-----+---------+---------+--------+----------+
   | 16  | 3.942s  | 62.250s | 0.134s | 11.366   |
   +-----+---------+---------+--------+----------+

Comparing the speedups in :numref:`tabfftwomptimes1`
with those in :numref:`tabfftwomptimes2`,
we observe that speedups improve with a ten fold increase of :math:`n`.

Bucket Sort for Distributed Memory
----------------------------------

Given are :math:`n` numbers, suppose all are in :math:`[0,1]`.

The algorithm using :math:`p` buckets proceeds in two steps:

1. Partition numbers :math:`x` into :math:`p` buckets:

   .. math::

      x \in [i/p,(i+1)/p[ \quad \Rightarrow \quad x \in (i+1)\mbox{th bucket}.

2. Sort all :math:`p` buckets.

The cost to partition the numbers 
into :math:`p` buckets is :math:`O(n \log_2(p))`.
Note: radix sort uses most significant bits to partition.

In the best case: every bucket contains :math:`n/p` numbers.
The cost of Quicksort is :math:`O(n/p \log_2(n/p))` per bucket.
Sorting :math:`p` buckets takes :math:`O(n \log_2(n/p))`.
The total cost is :math:`O(n ( \log_2(p) + \log_2(n/p) ))`.

On :math:`p` processors, all nodes sort:

1. Root node distributes numbers:
   processor *i* gets *i*-th bucket.

2. Processor *i* sorts *i*-th bucket.

3. Root node collects sorted buckets from processors.

Is it worth it?  
Recall: serial cost is :math:`n ( \log_2(p) + \log_2(n/p) )`.

The cost of parallel algorithm is

* :math:`n \log_2(p)` to place numbers into buckets,

* :math:`n/p \log_2(n/p)` to sort buckets.

.. math::

   \mbox{speedup} = \frac{n ( \log_2(p) + \log_2(n/p) )}
                         {n ( \log_2(p) + \log_2(n/p)/p )},

or equivalently,

.. math::

   \mbox{speedup} =
   \frac{1 + L}{1 + L/p} = \frac{1+L}{(p+L)/p} = \frac{p}{p+L} (1 + L),
   \quad L = \frac{\log_2(n/p)}{\log_2(p)}.

Let us compare to quicksort. 

.. math::

   \begin{array}{rcl}
   \mbox{speedup}
   & = & {\displaystyle \frac{n \log_2(n)}{n(\log_2(p) + n/p \log_2(n/p))}} \\
   & = & {\displaystyle \frac{\log_2(n)}{\log_2(p) + 1/p (\log_2(n)-\log_2(p))}} \\
   & = & {\displaystyle \frac{\log_2(n)}{1/p (\log_2(n) + (1 - 1/p) \log_2(p))}}
   \end{array}

For example, consider:
:math:`n = 2^{20}`, :math:`\log_2(n) = 20`, :math:`p = 2^2`,
:math:`\log_2(p) = 2`,

.. math::

   \begin{array}{rcl}
   \mbox{speedup} & = & {\displaystyle \frac{20}{1/4 (20) + (1 - 1/4) 2}} \\
                  & = & {\displaystyle \frac{20}{5 + 3/2} = \frac{40}{13} \approx 3.08.}
   \end{array}

Next the communication versus computation costs are examined.

The scatter of $n$ data elements costs
:math:`t_{\rm start~up} + n t_{\rm data}`,
where :math:`t_{\rm data}` is the cost of sending 1 data element.
For distributing and collecting of all buckets,
the total communication time is

.. math::

   2 p \left( t_{\rm start~up} + \frac{n}{p} t_{\rm data}\right).

The computation/communication ratio is

.. math::

   \frac{(n \log_2(p) + n/p \log_2(n/p) ) t_{\rm compare} }
        {2 p \left( t_{\rm start~up} + \frac{n}{p} t_{\rm data}\right)}

where :math:`t_{\rm compare}` is the cost for one comparison.

We view this ratio for :math:`n \gg p`,
for fixed :math:`p`, so:

.. math::

   \frac{n}{p} \log_2 \left( \frac{n}{p} \right)
   = \frac{n}{p} \left( \log_2(n) - \log_2(p) \right)
   \approx \frac{n}{p} \log_2(n).

The ratio then becomes 

.. math::

   \frac{n}{p} \log_2(n) t_{\rm compare} \gg 2n t_{\rm data}.

Thus, :math:`\log_2(n)` must be sufficiently high...

Quicksort for Shared Memory
---------------------------

The code for a recursive algorithm to partition number
is listed below:

::

   void quicksort ( double *v, int start, int end )
   {
      if(start < end)
      {
         int pivot;
         partition(v,start,end,&pivot);
         quicksort(v,start,pivot-1);
         quicksort(v,pivot+1,end);
      }
   }

where ``partition`` has the prototype:

::

   void partition
    ( double *v, int lower, int upper, int *pivot );
   /* precondition: upper - lower > 0
    * takes v[lower] as pivot and interchanges elements:
    * v[i] <= v[pivot] for all i < pivot, and
    * v[i] > v[pivot] for all i > pivot,
    * where lower <= pivot <= upper. */

A ``partition`` function is defined below.

::

   void partition
    ( double *v, int lower, int upper, int *pivot )
   {
      double x = v[lower];
      int up = lower+1;   /* index will go up */
      int down = upper;   /* index will go down */
      while(up < down)
      {
         while((up < down) && (v[up] <= x)) up++;
         while((up < down) && (v[down] > x)) down--;
         if(up == down) break;
         double tmp = v[up];
         v[up] = v[down]; v[down] = tmp;
      }
      if(v[up] > x) up--;
      v[lower] = v[up]; v[up] = x;
      *pivot = up;
   }

The functions ``partition`` and ``qsort`` 
are called in the main function as follows:

::

   int lower = 0;
   int upper = n-1;
   int pivot = 0;

   if(n > 1) partition(v,lower,upper,&pivot);
  
   if(pivot != 0)
      qsort((void*)v,(size_t)pivot,
            sizeof(double),compare);

   if(pivot != n)
      qsort((void*)&v[pivot+1],(size_t)(n-pivot-1),
            sizeof(double),compare);


With OpenMP we define a parallel region:

::

   omp_set_num_threads(2);
   #pragma omp parallel
   {
      if(pivot != 0)
         qsort((void*)v,(size_t)pivot,
               sizeof(double),compare);
      if(pivot != n)
         qsort((void*)&v[pivot+1],(size_t)(n-pivot-1),
               sizeof(double),compare);
   }

The output of runs on a dual core Mac OS X at 2.26 GHz are below:

::

   $ time /tmp/time_qsort 10000000 0
   time elapsed : 4.0575 seconds

   real    0m4.299s
   user    0m4.229s
   sys     0m0.068s

   $ time /tmp/part_qsort_omp 10000000 0
   pivot = 4721964
   -> sorting the first half : 4721964 numbers
   -> sorting the second half : 5278035 numbers

   real    0m3.794s
   user    0m7.117s
   sys     0m0.066s

Speed up: 4.299/3.794 = 1.133, or 13.3 percent,
which is faster with one extra core.

Radix Sort for GPU acceleration
-------------------------------

Radix sort is introduced on an example of 16 4-bit numbers,
with :numref:`figradixsort1` showing the first step.

.. _figradixsort1:

.. figure:: ./figradixsort1.png
    :align: center

    The numbers are placed in two buckets, along the last bit.
    Within each bucket, the original order is preserved.

Steps 2, 3, and 4 are shown in
:numref:`figradixsort2`, :numref:`figradixsort3`, and 
:numref:`figradixsort4` respectively.

.. _figradixsort2:

.. figure:: ./figradixsort2.png
    :align: center

    The numbers are placed in two buckets,
    along the second to last bit.
    Within each bucket, the original order is preserved.

.. _figradixsort3:

.. figure:: ./figradixsort3.png
    :align: center

    The numbers are placed in two buckets,
    along the second bit.
    Within each bucket, the original order is preserved.

.. _figradixsort4:

.. figure:: ./figradixsort4.png
    :align: center

    The numbers are placed in two buckets,
    along the first bit.  By the invariant:
    *within each bucket, the original order is preserved*,
    the list is sorted.

In a parallel implementation,
on 16 numbers, work with 16 threads, where the *i*-th thread
computes the position of the *i*-th input number in the output.
The rules for this computation follow:

* Mapping to a zero bucket:

  .. math::

     \begin{array}{rcl}
     \mbox{destination of a zero}
     & = & \mbox{\#zeros before}, \\
     & = & \mbox{\#keys before} - \mbox{\#ones before}, \\
     & = & \mbox{key index} - \mbox{\#ones before}.
     \end{array}

* Mapping to a one bucket:

  .. math::

     \begin{array}{rcl}
     \mbox{destination of a one}
     & = & \mbox{\#zeros in total} + \mbox{\#ones before}, \\
     & = & \mbox{\#keys in total} - \mbox{\#ones in total}
         + \mbox{\#ones before}, \\
     & = & \mbox{input size} - \mbox{\#ones in total}
         + \mbox{\#ones before}.
     \end{array}

The nontrivial computation is counting the number of keys
before it that map to the 1 bucket.
This is done by an *exclusive scan*,
illustrated in :numref:`figradixsortscan`.

.. _figradixsortscan:

.. figure:: ./figradixsortscan.png
    :align: center

    Counting the output position with an :index:`exclusive scan`.

The exclusive scan is similar to the prefix sum scan.

One major inefficiency is the writing to the output,
which has an access pattern that cannot be memory coalesced.
A local sort in shared memory improves memory coalescing.
Memory coalescing is also improved by thread coarsening,
when each thread is assigned to multiple keys instead of one.

GPU accelerated sorting is provided by the software Thrust.

Chapter 13 in the fourth edition of *Programming Massively
Parallel Programming* is entirely devoted to sorting.

Bibliography
------------

1. N. Bell and J. Hoberock:
   **Thrust: a productivity-oriented library for CUDA.**
   In *GPU Computing Gems Jade Edition.*  Pages 359-371, 
   Morgan Kaufmann, 2012.

2. Matteo Frigo and Steven G. Johnson: 
   **The Design and Implementation of FFTW3**. 
   *Proc. IEEE* 93(2): 216--231, 2005.

3. Edgar Solomonik and Laxmikant V. Kale:
   **Highly Scalable Parallel Sorting.**
   In the proceedings of the IEEE International Parallel
   and Distributed Processing Symposium (IPDPS), 2010.

4. Mirko Rahn, Peter Sanders, and Johannes Singler:
   **Scalable Distributed-Memory External Sorting.**
   In the proceedings of the 26th IEEE International Conference
   on Data Engineering (ICDE), pages 685-688, IEEE, 2010.

5. Davide Pasetto and Albert Akhriev:
   **A Comparative Study of Parallel Sort Algorithms.**
   In SPLASH'11, the proceedings of the ACM
   international conference companion on object oriented programming
   systems languages and applications, pages 203-204, ACM 2011.

Exercises
---------

1. Consider the fan out scatter and fan in gather operations
   and investigate how these operations will reduce the
   communication cost and improve the computation/communication ratio
   in bucket sort of :math:`n` numbers on :math:`p` processors.

2. Use Julia to implement radix sort in parallel with tasking.

3. Implement parallel radix sort on a GPU using Julia.