Parallel Sorting Algorithms¶
Sorting is one of the most fundamental problems in computing. On distributed memory computers, we study parallel bucket sort. On shared memory computers, we examine parallel quicksort.
Sorting in C and C++¶
In C we have the function qsort, which implements quicksort.
The prototype is
void qsort ( void *base, size_t count, size_t size,
int (*compar)(const void *element1, const void *element2) );
qsort sorts an array of count elements, each of the given size, whose first element is pointed to by base.
The function compar returns an integer that is

- negative if element1 \(<\) element2;
- zero if element1 \(=\) element2; or
- positive if element1 \(>\) element2.
We will apply qsort to sort a random sequence of doubles.
Functions to generate an array of random numbers and to write
the array are listed below.
#include <stdio.h>
#include <stdlib.h>

void random_numbers ( int n, double a[n] )
{
   int i;
   for(i=0; i<n; i++) /* uniform in [0,1] */
      a[i] = ((double) rand())/RAND_MAX;
}

void write_numbers ( int n, double a[n] )
{
   int i;
   for(i=0; i<n; i++) printf("%.15e\n", a[i]);
}
To apply qsort, we define the compare function:
int compare ( const void *e1, const void *e2 )
{
double *i1 = (double*)e1;
double *i2 = (double*)e2;
return ((*i1 < *i2) ? -1 : (*i1 > *i2) ? +1 : 0);
}
Then we can call qsort in the function main() as follows:
double *a = (double*)calloc(n,sizeof(double));
random_numbers(n,a);
qsort((void*)a,(size_t)n,sizeof(double),compare);
We use the command line to enter the dimension and to toggle the output off. To measure the CPU time for sorting:
clock_t tstart,tstop;
tstart = clock();
qsort((void*)a,(size_t)n,sizeof(double),compare);
tstop = clock();
printf("time elapsed : %.4lf seconds\n",
(tstop - tstart)/((double) CLOCKS_PER_SEC));
Some timings with qsort on a 3.47 GHz Intel Xeon are below:
$ time /tmp/time_qsort 1000000 0
time elapsed : 0.2100 seconds
real 0m0.231s
user 0m0.225s
sys 0m0.006s
$ time /tmp/time_qsort 10000000 0
time elapsed : 2.5700 seconds
real 0m2.683s
user 0m2.650s
sys 0m0.033s
$ time /tmp/time_qsort 100000000 0
time elapsed : 29.5600 seconds
real 0m30.641s
user 0m30.409s
sys 0m0.226s
Observe that \(O(n \log_2(n))\) grows almost linearly in \(n\): increasing \(n\) from \(10^6\) to \(10^7\) multiplies \(\log_2(n)\) by only about \(23.3/19.9 \approx 1.17\), so the predicted time grows by a factor of about 11.7, close to the observed \(2.57/0.21 \approx 12.2\).
In C++ we apply the sort of the Standard Template Library (STL),
in particular, we use the STL container vector.
Functions to generate vectors of random numbers and to write them
are given next.
#include <iostream>
#include <iomanip>
#include <vector>
using namespace std;
vector<double> random_vector ( int n ); // returns a vector of n random doubles
void write_vector ( vector<double> v ); // writes the vector v
vector<double> random_vector ( int n )
{
   vector<double> v;
   for(int i=0; i<n; i++)
   {
      double r = (double) rand();
      r = r/RAND_MAX;
      v.push_back(r);
   }
   return v;
}

void write_vector ( vector<double> v )
{
   for(size_t i=0; i<v.size(); i++)
      cout << scientific << setprecision(15) << v[i] << endl;
}
To use the sort of the STL, we define the compare function,
including the algorithm header:
#include <algorithm>

struct less_than // defines "<"
{
   bool operator()(const double& a, const double& b) const
   {
      return (a < b);
   }
};
In the main program, to sort the vector v we write

sort(v.begin(), v.end(), less_than());

Because operator< is already defined for double, calling sort(v.begin(), v.end()) without the third argument produces the same order.
Timings of the STL sort on a 3.47 GHz Intel Xeon are below.
$ time /tmp/time_stl_sort 1000000 0
time elapsed : 0.36 seconds
real 0m0.376s
user 0m0.371s
sys 0m0.004s
$ time /tmp/time_stl_sort 10000000 0
time elapsed : 4.09 seconds
real 0m4.309s
user 0m4.275s
sys 0m0.033s
$ time /tmp/time_stl_sort 100000000 0
time elapsed : 46.5 seconds
real 0m48.610s
user 0m48.336s
sys 0m0.267s
Different distributions may cause timings to fluctuate.
Bucket Sort for Distributed Memory¶
On distributed memory computers, we explain bucket sort. Given \(n\) numbers, all assumed to lie in \([0,1]\), the algorithm using \(p\) buckets proceeds in two steps:

- Partition the numbers into \(p\) buckets: \(x \in [i/p, (i+1)/p) \quad \Rightarrow \quad x\) goes into the \((i+1)\)-th bucket, for \(i = 0, 1, \ldots, p-1\).
- Sort all \(p\) buckets.

The cost to partition the numbers into \(p\) buckets is \(O(n \log_2(p))\): selecting one of \(p\) ranges takes \(\log_2(p)\) comparisons per number. Note: radix sort uses the \(\log_2(p)\) most significant bits to partition. In the best case, every bucket contains \(n/p\) numbers, so quicksort costs \(O((n/p) \log_2(n/p))\) per bucket and sorting all \(p\) buckets takes \(O(n \log_2(n/p))\). The total cost is \(O(n ( \log_2(p) + \log_2(n/p) ))\).
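The two steps can be sketched in C as follows. This serial bucket_sort is one possible illustration of the cost analysis above, not code from the text; it uses a counting pass to lay out the buckets contiguously and then applies qsort within each bucket.

```c
/* Serial bucket sort sketch for n doubles in [0,1],
   using p buckets and qsort on each bucket. */
#include <stdlib.h>
#include <string.h>

int compare_doubles ( const void *e1, const void *e2 )
{
   double d1 = *(const double*)e1;
   double d2 = *(const double*)e2;
   return (d1 < d2) ? -1 : (d1 > d2) ? +1 : 0;
}

void bucket_sort ( int n, int p, double a[n] )
{
   double *bucket = (double*)malloc(n*sizeof(double));
   int *count = (int*)calloc(p,sizeof(int));
   int *start = (int*)calloc(p,sizeof(int));
   int *fill = (int*)calloc(p,sizeof(int));
   int i, b;

   for(i=0; i<n; i++)        /* count the bucket sizes */
   {
      b = (int)(a[i]*p);     /* x in [b/p,(b+1)/p) */
      if(b == p) b = p-1;    /* handle x == 1.0 */
      count[b]++;
   }
   for(b=1; b<p; b++)        /* offset of each bucket */
      start[b] = start[b-1] + count[b-1];
   for(i=0; i<n; i++)        /* place numbers into buckets */
   {
      b = (int)(a[i]*p);
      if(b == p) b = p-1;
      bucket[start[b] + fill[b]++] = a[i];
   }
   for(b=0; b<p; b++)        /* sort each bucket in place */
      qsort(&bucket[start[b]], count[b], sizeof(double),
            compare_doubles);
   memcpy(a, bucket, n*sizeof(double));
   free(bucket); free(count); free(start); free(fill);
}
```

In the best case of uniformly distributed numbers, every bucket receives about \(n/p\) elements; skewed inputs make the bucket sizes, and hence the per-bucket sorting times, unbalanced.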
parallel bucket sort¶
On \(p\) processors, the parallel algorithm proceeds in three steps:

- The root node distributes the numbers: processor \(i\) receives the \(i\)-th bucket.
- Processor \(i\) sorts the \(i\)-th bucket.
- The root node collects the sorted buckets from all processors.
Is it worth it? Recall that the serial cost is \(n ( \log_2(p) + \log_2(n/p) )\). The cost of the parallel algorithm is:
- \(n \log_2(p)\) to place numbers into buckets, and
- \(\frac{n}{p} \log_2(n/p)\) to sort the buckets.
Then the speedup over the serial bucket sort is

\[\frac{n \left( \log_2(p) + \log_2(n/p) \right)}{n \log_2(p) + \frac{n}{p} \log_2(n/p)}.\]

Because \(\log_2(p) + \log_2(n/p) = \log_2(n)\), comparing to quicksort gives the same ratio:

\[\frac{n \log_2(n)}{n \log_2(p) + \frac{n}{p} \log_2(n/p)}.\]

For example, with \(n = 2^{20}\), \(\log_2(n) = 20\), \(p = 2^2\), \(\log_2(p) = 2\), the speedup is

\[\frac{20\, n}{2\, n + \frac{n}{4} \cdot 18} = \frac{20}{6.5} \approx 3.08.\]
communication versus computation¶
The scatter of \(n\) data elements costs \(t_{\rm start~up} + n t_{\rm data}\), where \(t_{\rm data}\) is the cost of sending one data element. For distributing and collecting all buckets, the total communication time is

\[2 p \left( t_{\rm start~up} + \frac{n}{p} t_{\rm data} \right).\]

The computation/communication ratio is

\[\frac{\frac{n}{p} \log_2 \left( \frac{n}{p} \right) t_{\rm compare}}{2 p \left( t_{\rm start~up} + \frac{n}{p} t_{\rm data} \right)},\]

where \(t_{\rm compare}\) is the cost of one comparison. We view this ratio for \(n \gg p\), with \(p\) fixed, so \(\log_2(n/p) \approx \log_2(n)\) and the communication time is dominated by \(2 n t_{\rm data}\). For the computation to dominate the communication, we need

\[\frac{n}{p} \log_2(n) \, t_{\rm compare} \gg 2 n \, t_{\rm data}, \quad \hbox{i.e.} \quad \log_2(n) \gg 2 p \, \frac{t_{\rm data}}{t_{\rm compare}}.\]

Thus \(\log_2(n)\) must be sufficiently large relative to \(2 p\, t_{\rm data}/t_{\rm compare}\).
Bibliography¶
- Edgar Solomonik and Laxmikant V. Kale: Highly Scalable Parallel Sorting. In the proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2010.
- Mirko Rahn, Peter Sanders, and Johannes Singler: Scalable Distributed-Memory External Sorting. In the proceedings of the 26th IEEE International Conference on Data Engineering (ICDE), pages 685-688, IEEE, 2010.
- Davide Pasetto and Albert Akhriev: A Comparative Study of Parallel Sort Algorithms. In the proceedings of SPLASH'11, the ACM international conference companion on object oriented programming systems languages and applications, pages 203-204, ACM, 2011.
Exercises¶
- Consider the fan-out scatter and fan-in gather operations and investigate how these operations reduce the communication cost and improve the computation/communication ratio in bucket sort of n numbers on p processors.
- Instead of OpenMP, use Pthreads to run Quicksort on two cores.
- Instead of OpenMP, use the Intel Threading Building Blocks to run Quicksort on two cores.