Parallel Sorting Algorithms

Sorting is one of the most fundamental problems in computing. On distributed memory computers, we study parallel bucket sort. On shared memory computers, we examine parallel quicksort.

Sorting in C and C++

In C we have the function qsort, declared in stdlib.h, which typically implements quicksort. The prototype is

void qsort ( void *base, size_t count, size_t size,
             int (*compar)(const void *element1, const void *element2) );

qsort sorts an array whose first element is pointed to by base and which contains count elements of the given size. The function compar returns

  • a negative integer if element1 \(<\) element2;
  • \(0\) if element1 \(=\) element2; or
  • a positive integer if element1 \(>\) element2.

We will apply qsort to sort a random sequence of doubles. Functions to generate an array of random numbers and to write the array are listed below.

#include <stdio.h>
#include <stdlib.h>

void random_numbers ( int n, double a[n] )
{
   int i;
   for(i=0; i<n; i++)
      a[i] = ((double) rand())/RAND_MAX;
}
void write_numbers ( int n, double a[n] )
{
   int i;
   for(i=0; i<n; i++) printf("%.15e\n", a[i]);
}

To apply qsort, we define the compare function:

int compare ( const void *e1, const void *e2 )
{
   double *i1 = (double*)e1;
   double *i2 = (double*)e2;
   return ((*i1 < *i2) ? -1 : (*i1 > *i2) ? +1 : 0);
}

Then we can call qsort in the function main() as follows:

double *a = (double*)calloc(n,sizeof(double));
random_numbers(n,a);
qsort((void*)a,(size_t)n,sizeof(double),compare);

We use the command line to enter the dimension and to toggle off the output. To measure the CPU time for sorting:

clock_t tstart,tstop;
tstart = clock();
qsort((void*)a,(size_t)n,sizeof(double),compare);
tstop = clock();
printf("time elapsed : %.4lf seconds\n",
       (tstop - tstart)/((double) CLOCKS_PER_SEC));

Some timings with qsort on a 3.47 GHz Intel Xeon are below:

$ time /tmp/time_qsort 1000000 0
time elapsed : 0.2100 seconds
real    0m0.231s
user    0m0.225s
sys     0m0.006s
$ time /tmp/time_qsort 10000000 0
time elapsed : 2.5700 seconds
real    0m2.683s
user    0m2.650s
sys     0m0.033s
$ time /tmp/time_qsort 100000000 0
time elapsed : 29.5600 seconds
real    0m30.641s
user    0m30.409s
sys     0m0.226s
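The almost-linear growth is visible in these timings: each tenfold increase of \(n\) multiplies the time by a factor close to the predicted \(10 \log_2(10 n)/\log_2(n)\):

\[\frac{2.57}{0.21} \approx 12.2 \mbox{ versus } \frac{10^7 \log_2(10^7)}{10^6 \log_2(10^6)} \approx 11.7, \qquad \frac{29.56}{2.57} \approx 11.5 \mbox{ versus } \frac{10^8 \log_2(10^8)}{10^7 \log_2(10^7)} \approx 11.4.\]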

Observe that \(O(n \log_2(n))\) grows almost linearly in \(n\).

In C++ we apply the sort of the Standard Template Library (STL); in particular, we use the STL container vector. Functions to generate vectors of random numbers and to write them are given next.

#include <iostream>
#include <iomanip>
#include <vector>
using namespace std;

vector<double> random_vector ( int n ); // returns a vector of n random doubles
void write_vector ( vector<double> v ); // writes the vector v

vector<double> random_vector ( int n )
{
   vector<double> v;
   for(int i=0; i<n; i++)
   {
      double r = (double) rand();
      r = r/RAND_MAX;
      v.push_back(r);
   }
   return v;
}
void write_vector ( vector<double> v )
{
   for(int i=0; i<v.size(); i++)
      cout << scientific << setprecision(15) << v[i] << endl;
}

To use the sort of the STL, we define the comparator as a function object, after including the algorithm header:

#include <algorithm>
struct less_than // defines "<"
{
   bool operator()(const double& a, const double& b)
   {
      return (a < b);
   }
};

In the main program, to sort the vector v we write

sort(v.begin(), v.end(), less_than());

Because operator< is predefined for double, the call sort(v.begin(), v.end()) produces the same result.

Timings of the STL sort on a 3.47 GHz Intel Xeon are below.

$ time /tmp/time_stl_sort 1000000 0
time elapsed : 0.36 seconds
real    0m0.376s
user    0m0.371s
sys     0m0.004s
$ time /tmp/time_stl_sort 10000000 0
time elapsed : 4.09 seconds
real    0m4.309s
user    0m4.275s
sys     0m0.033s
$ time /tmp/time_stl_sort 100000000 0
time elapsed : 46.5 seconds
real    0m48.610s
user    0m48.336s
sys     0m0.267s

Timings depend on the distribution of the input: different random sequences may cause the timings to fluctuate.

Bucket Sort for Distributed Memory

On distributed memory computers, we explain bucket sort. Given are \(n\) numbers; suppose all of them lie in \([0,1]\). The algorithm using \(p\) buckets proceeds in two steps:

  1. Partition the numbers into \(p\) buckets: \(x \in [i/p,(i+1)/p) \quad \Rightarrow \quad x\) goes into the \((i+1)\)-th bucket.
  2. Sort all \(p\) buckets.

The cost to partition the numbers into p buckets is \(O(n \log_2(p))\). Note: radix sort uses most significant bits to partition. In the best case: every bucket contains \(n/p\) numbers. The cost of Quicksort is \(O(n/p \log_2(n/p))\) per bucket. Sorting p buckets takes \(O(n \log_2(n/p))\). The total cost is \(O(n ( \log_2(p) + \log_2(n/p) ))\).

parallel bucket sort

On \(p\) processors, the parallel bucket sort proceeds in three steps:

  1. The root node distributes the numbers: processor \(i\) gets the \(i\)-th bucket.
  2. Processor \(i\) sorts the \(i\)-th bucket.
  3. The root node collects the sorted buckets from the processors.

Is it worth it? Recall that the serial cost is \(n ( \log_2(p) + \log_2(n/p) )\). The cost of the parallel algorithm consists of

  1. \(n \log_2(p)\) to place the numbers into the buckets, and
  2. \(n/p \log_2(n/p)\) to sort the buckets.

Then we compute the speedup:

\[\begin{split}\begin{array}{rcl} \mbox{speedup} & = & {\displaystyle \frac{n ( \log_2(p) + \log_2(n/p) )} {n ( \log_2(p) + \log_2(n/p)/p )}} \\ & = & {\displaystyle \frac{1 + L}{1 + L/p} = \frac{1+L}{(p+L)/p} = \frac{p}{p+L} (1 + L), \quad L = \frac{\log_2(n/p)}{\log_2(p)}}. \end{array}\end{split}\]
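For example, with \(n = 2^{20}\) and \(p = 2^2\), we get \(L = \log_2(2^{18})/\log_2(2^2) = 18/2 = 9\), so

\[\mbox{speedup} = \frac{1+9}{1+9/4} = \frac{10}{3.25} \approx 3.08.\]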

Comparing to quicksort, the speedup is

\[\begin{split}\begin{array}{rcl} \mbox{speedup} & = & {\displaystyle \frac{n \log_2(n)}{n \log_2(p) + (n/p) \log_2(n/p)}} \\ & = & {\displaystyle \frac{\log_2(n)}{\log_2(p) + (1/p) (\log_2(n)-\log_2(p))}} \\ & = & {\displaystyle \frac{\log_2(n)}{(1/p) \log_2(n) + (1 - 1/p) \log_2(p)}} \end{array}\end{split}\]

For example, \(n = 2^{20}, \log_2(n) = 20, p = 2^2, \log_2(p) = 2\), then

\[\begin{split}\begin{array}{rcl} \mbox{speed up} & = & {\displaystyle \frac{20}{1/4 (20) + (1 - 1/4) 2}} \\ & = & {\displaystyle \frac{20}{5 + 3/2} = \frac{40}{13} \approx 3.08.} \end{array}\end{split}\]

communication versus computation

The scatter of \(n\) data elements costs \(t_{\rm start~up} + n t_{\rm data}\), where \(t_{\rm data}\) is the cost of sending one data element. Distributing and collecting all buckets takes a total communication time of \(2 p \left( t_{\rm start~up} + \frac{n}{p} t_{\rm data}\right)\). The computation/communication ratio is

\[\frac{(n \log_2(p) + n/p \log_2(n/p) ) t_{\rm compare} } {2 p \left( t_{\rm start~up} + \frac{n}{p} t_{\rm data}\right)}\]

where \(t_{\rm compare}\) is the cost for one comparison. We view this ratio for \(n \gg p\), for fixed \(p\), so:

\[\frac{n}{p} \log_2 \left( \frac{n}{p} \right) = \frac{n}{p} \left( \log_2(n) - \log_2(p) \right) \approx \frac{n}{p} \log_2(n).\]

For the computation to outweigh the communication, we thus need \({\displaystyle \frac{n}{p} \log_2(n)\, t_{\rm compare} \gg 2 n\, t_{\rm data}}\), or equivalently \(\log_2(n) \gg 2\, p\, t_{\rm data}/t_{\rm compare}\). Thus \(\log_2(n)\) must be sufficiently large.
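For illustration, suppose (hypothetically) that sending one data element costs as much as ten comparisons, \(t_{\rm data} = 10\, t_{\rm compare}\), and take \(p = 4\). The requirement \((n/p) \log_2(n)\, t_{\rm compare} \gg 2 n\, t_{\rm data}\) then reads

\[\log_2(n) \gg 2\, p\, \frac{t_{\rm data}}{t_{\rm compare}} = 80,\]

so on such hardware the computation dominates only for impractically large \(n\); a fast interconnect (small \(t_{\rm data}\)) matters as much as the algorithm.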

Quicksort for Shared Memory

A recursive algorithm partitions the numbers:

void quicksort ( double *v, int start, int end ) {
   if(start < end) {
      int pivot;
      partition(v,start,end,&pivot);
      quicksort(v,start,pivot-1);
      quicksort(v,pivot+1,end);
   }
}

where partition has the prototype:

void partition ( double *v, int lower, int upper, int *pivot );
/* precondition: upper - lower > 0
 * takes v[lower] as pivot and interchanges elements:
 * v[i] <= v[pivot] for all i < pivot, and
 * v[i] > v[pivot] for all i > pivot,
 * where lower <= pivot <= upper. */

A definition for the partition function is listed next:

void partition ( double *v, int lower, int upper, int *pivot )
{
   double x = v[lower];
   int up = lower+1;   /* index will go up */
   int down = upper;   /* index will go down */
   while(up < down)
   {
      while((up < down) && (v[up] <= x)) up++;
      while((up < down) && (v[down] > x)) down--;
      if(up == down) break;
      double tmp = v[up];
      v[up] = v[down]; v[down] = tmp;
   }
   if(v[up] > x) up--;
   v[lower] = v[up]; v[up] = x;
   *pivot = up;
}

Then in main() we can apply partition and qsort to sort an array of n doubles:

int lower = 0;
int upper = n-1;
int pivot = 0;
if(n > 1) partition(v,lower,upper,&pivot);
if(pivot != 0) qsort((void*)v,(size_t)pivot,sizeof(double),compare);
if(pivot != n-1) qsort((void*)&v[pivot+1],(size_t)(n-pivot-1),sizeof(double),compare);

quicksort with OpenMP

A parallel region in main(), with each half in its own section so that the two calls to qsort run concurrently on different threads:

omp_set_num_threads(2);
#pragma omp parallel sections
{
   #pragma omp section
   if(pivot != 0) qsort((void*)v,(size_t)pivot,sizeof(double),compare);
   #pragma omp section
   if(pivot != n-1) qsort((void*)&v[pivot+1],(size_t)(n-pivot-1),sizeof(double),compare);
}

Running on dual core Mac OS X at 2.26 GHz gives the following timings:

$ time /tmp/time_qsort 10000000 0
time elapsed : 4.0575 seconds
real    0m4.299s
user    0m4.229s
sys     0m0.068s

$ time /tmp/part_qsort_omp 10000000 0
pivot = 4721964
-> sorting the first half : 4721964 numbers
-> sorting the second half : 5278035 numbers
real    0m3.794s
user    0m7.117s
sys     0m0.066s

Speed up: 4.299/3.794 = 1.133, or \(13.3\%\) faster with one extra core.

parallel sort with Intel TBB

At the top of the program, add the lines

#include "tbb/parallel_sort.h"

using namespace tbb;

To sort a number of random doubles:

int n = 1000000; /* the dimension, e.g. taken from the command line */
double *v = (double*)calloc(n,sizeof(double));
random_numbers(n,v);
parallel_sort(v, v+n);

The complete main program with the headers for the functions is below.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "tbb/parallel_sort.h"

using namespace tbb;

void random_numbers ( int n, double *v );
/* returns n random doubles in [0,1] */

void write_numbers ( double *v, int lower, int upper );
/* writes the numbers in v from lower to upper,
 * including upper */

int main ( int argc, char *argv[] )
{
   int n;
   if(argc > 1)
      n = atoi(argv[1]);
   else
   {
      printf("give n : ");
      scanf("%d",&n);
   }
   int vb = 1;
   if(argc > 2) vb = atoi(argv[2]);
   srand(time(NULL));
   double *v;
   v = (double*)calloc(n,sizeof(double));
   random_numbers(n,v);
   if(vb > 0)
   {
      printf("%d random numbers : \n",n);
      write_numbers(v,0,n-1);
   }
   parallel_sort(v, v+n);
   if(vb > 0)
   {
      printf("the sorted numbers :\n");
      write_numbers(v,0,n-1);
   }
   return 0;
}

A session with an interactive run to test the correctness goes as follows.

$ /tmp/tbb_sort 4 1
4 random numbers :
3.696845319912231e-01
7.545582678888730e-01
6.707372915329120e-01
3.402865237278335e-01
the sorted numbers :
3.402865237278335e-01
3.696845319912231e-01
6.707372915329120e-01
7.545582678888730e-01
$

We can time the parallel runs as follows.

$ time /tmp/tbb_sort 10000000 0

real    0m0.479s
user    0m4.605s
sys     0m0.168s
$ time /tmp/tbb_sort 100000000 0

real    0m4.734s
user    0m51.063s
sys     0m0.386s
$ time /tmp/tbb_sort 1000000000 0

real    0m47.400s
user    9m32.713s
sys     0m2.073s
$

Bibliography

  1. Edgar Solomonik and Laxmikant V. Kale: Highly Scalable Parallel Sorting. In the proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2010.
  2. Mirko Rahn, Peter Sanders, and Johannes Singler: Scalable Distributed-Memory External Sorting. In the proceedings of the 26th IEEE International Conference on Data Engineering (ICDE), pages 685-688, IEEE, 2010.
  3. Davide Pasetto and Albert Akhriev: A Comparative Study of Parallel Sort Algorithms. In SPLASH'11, the proceedings of the ACM international conference companion on object oriented programming systems languages and applications, pages 203-204, ACM, 2011.

Exercises

  1. Consider the fan out scatter and fan in gather operations and investigate how these operations will reduce the communication cost and improve the computation/communication ratio in bucket sort of n numbers on p processors.
  2. Instead of OpenMP, use Pthreads to run Quicksort on two cores.
  3. Instead of OpenMP, use the Intel Threading Building Blocks to run Quicksort on two cores.