Parallel Sorting Algorithms¶
Sorting is one of the most fundamental problems in computer science. On distributed memory computers, we study parallel bucket sort. On shared memory computers, we examine parallel quicksort.
Sorting in C and C++¶
In C we have the function qsort, which implements quicksort. The prototype is
void qsort ( void *base, size_t count, size_t size,
int (*compar)(const void *element1, const void *element2) );
qsort sorts an array whose first element is pointed to by base and which contains count elements, each of the given size. The function compar returns

- a negative value (e.g. \(-1\)) if element1 \(<\) element2;
- zero if element1 \(=\) element2; or
- a positive value (e.g. \(+1\)) if element1 \(>\) element2.
We will apply qsort to sort a random sequence of doubles. Functions to generate an array of random numbers and to write the array are listed below.
#include <stdio.h>
#include <stdlib.h>

void random_numbers ( int n, double a[n] )
/* fills a with n random doubles, uniform in [0,1] */
{
   int i;
   for(i=0; i<n; i++)
      a[i] = ((double) rand())/RAND_MAX;
}

void write_numbers ( int n, double a[n] )
/* writes the n doubles in a, one number per line */
{
   int i;
   for(i=0; i<n; i++) printf("%.15e\n", a[i]);
}
To apply qsort, we define the compare function:
int compare ( const void *e1, const void *e2 )
/* compares the two doubles pointed to by e1 and e2 */
{
   const double *i1 = (const double*)e1;
   const double *i2 = (const double*)e2;

   return ((*i1 < *i2) ? -1 : (*i1 > *i2) ? +1 : 0);
}
Then we can call qsort in the function main() as follows:
double *a = (double*)calloc(n,sizeof(double));
random_numbers(n,a);
qsort((void*)a,(size_t)n,sizeof(double),compare);
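A minimal sketch of a complete main program is below; the program name time_qsort matches the timings that follow, but the argument handling is our assumption:

#include <stdio.h>
#include <stdlib.h>

void random_numbers ( int n, double a[n] ); /* defined above */
void write_numbers ( int n, double a[n] );
int compare ( const void *e1, const void *e2 );

int main ( int argc, char *argv[] )
{
   if(argc < 2)
   {
      printf("usage : %s n [verbose]\n", argv[0]);
      return 1;
   }
   int n = atoi(argv[1]);                         /* the dimension */
   int verbose = (argc > 2) ? atoi(argv[2]) : 1;  /* 0 toggles off the output */

   double *a = (double*)calloc(n,sizeof(double));
   random_numbers(n,a);
   qsort((void*)a,(size_t)n,sizeof(double),compare);
   if(verbose) write_numbers(n,a);
   free(a);

   return 0;
}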
As in the sketch above, we use the command line to enter the dimension and to toggle off the output. To measure the CPU time for sorting, we use the clock() function of time.h:
clock_t tstart,tstop;
tstart = clock();
qsort((void*)a,(size_t)n,sizeof(double),compare);
tstop = clock();
printf("time elapsed : %.4lf seconds\n",
(tstop - tstart)/((double) CLOCKS_PER_SEC));
Some timings with qsort on a 3.47 GHz Intel Xeon are below:
$ time /tmp/time_qsort 1000000 0
time elapsed : 0.2100 seconds
real 0m0.231s
user 0m0.225s
sys 0m0.006s
$ time /tmp/time_qsort 10000000 0
time elapsed : 2.5700 seconds
real 0m2.683s
user 0m2.650s
sys 0m0.033s
$ time /tmp/time_qsort 100000000 0
time elapsed : 29.5600 seconds
real 0m30.641s
user 0m30.409s
sys 0m0.226s
Observe that the times grow almost linearly in \(n\): the \(O(n \log_2(n))\) cost is almost linear in \(n\), because the factor \(\log_2(n)\) grows slowly.
In C++ we apply the sort of the Standard Template Library (STL); in particular, we use the STL container vector. Functions to generate vectors of random numbers and to write them are given next.
#include <iostream>
#include <iomanip>
#include <vector>
#include <cstdlib>
using namespace std;
vector<double> random_vector ( int n ); // returns a vector of n random doubles
void write_vector ( vector<double> v ); // writes the vector v
vector<double> random_vector ( int n )
{
vector<double> v;
for(int i=0; i<n; i++)
{
double r = (double) rand();
r = r/RAND_MAX;
v.push_back(r);
}
return v;
}
void write_vector ( vector<double> v )
{
   for(size_t i=0; i<v.size(); i++)
      cout << scientific << setprecision(15) << v[i] << endl;
}
To use the sort of the STL, we define the compare function, after including the algorithm header:
#include <algorithm>
struct less_than // defines "<"
{
bool operator()(const double& a, const double& b)
{
return (a < b);
}
};
In the main program, to sort the vector v we write
sort(v.begin(), v.end(), less_than());
Timings of the STL sort on a 3.47 GHz Intel Xeon are below.
$ time /tmp/time_stl_sort 1000000 0
time elapsed : 0.36 seconds
real 0m0.376s
user 0m0.371s
sys 0m0.004s
$ time /tmp/time_stl_sort 10000000 0
time elapsed : 4.09 seconds
real 0m4.309s
user 0m4.275s
sys 0m0.033s
$ time /tmp/time_stl_sort 100000000 0
time elapsed : 46.5 seconds
real 0m48.610s
user 0m48.336s
sys 0m0.267s
Different distributions may cause timings to fluctuate.
Bucket Sort for Distributed Memory¶
On distributed memory computers, we explain bucket sort. Given are n numbers; suppose all are in \([0,1]\). The algorithm using p buckets proceeds in two steps:

- Partition the numbers: \(x \in [i/p, (i+1)/p) \quad \Rightarrow \quad x\) goes into the \((i+1)\)-th bucket.
- Sort all p buckets.

The cost to partition the numbers into p buckets is \(O(n \log_2(p))\). Note: radix sort uses the most significant bits to partition. In the best case, every bucket contains \(n/p\) numbers. The cost of quicksort is \(O(n/p \log_2(n/p))\) per bucket, so sorting all p buckets takes \(O(n \log_2(n/p))\). The total cost is \(O(n ( \log_2(p) + \log_2(n/p) ))\). A serial sketch of the two steps is given below.
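To make the two steps concrete, a minimal serial sketch follows; it assumes all numbers lie in \([0,1)\) and reuses the compare function defined earlier (the name bucket_sort is ours):

#include <stdlib.h>

int compare ( const void *e1, const void *e2 ); /* as defined above */

void bucket_sort ( int n, int p, double a[n] )
/* sorts the n doubles in a, all assumed in [0,1), using p buckets */
{
   double *bucket = (double*)calloc(n,sizeof(double)); /* buckets, stored consecutively */
   int *size = (int*)calloc(p,sizeof(int));  /* size[b] counts the numbers in bucket b */
   int *start = (int*)calloc(p,sizeof(int)); /* start[b] is the offset of bucket b */
   int *fill = (int*)calloc(p,sizeof(int));  /* fill[b] counts the placements in bucket b */
   int i,b;

   for(i=0; i<n; i++) size[(int)(p*a[i])]++; /* x in [b/p,(b+1)/p) belongs to bucket b */
   for(b=1; b<p; b++) start[b] = start[b-1] + size[b-1];
   for(i=0; i<n; i++)                        /* step 1 : place the numbers into buckets */
   {
      b = (int)(p*a[i]);
      bucket[start[b] + fill[b]++] = a[i];
   }
   for(b=0; b<p; b++)                        /* step 2 : sort every bucket */
      qsort((void*)&bucket[start[b]],(size_t)size[b],sizeof(double),compare);

   for(i=0; i<n; i++) a[i] = bucket[i]; /* the concatenated buckets are sorted */

   free(bucket); free(size); free(start); free(fill);
}

In this sketch the arithmetic \((int)(p \cdot x)\) makes the partition pass cost \(O(n)\); a tree of comparisons over the bucket boundaries gives the \(O(n \log_2(p))\) bound quoted above.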
parallel bucket sort¶
On p processors, the parallel algorithm runs in three steps (an MPI sketch follows the list):

- The root node distributes the numbers: processor i gets the i-th bucket.
- Processor i sorts the i-th bucket.
- The root node collects the sorted buckets from the processors.
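The MPI sketch below is written under assumptions, not as a definitive implementation: the dimension n is hard-coded, the numbers are generated at the root with the random_numbers function from before, and a guard keeps the bucket index in range when a value equals 1.0. MPI_Scatterv and MPI_Gatherv handle the varying bucket sizes.

#include <stdlib.h>
#include <mpi.h>

void random_numbers ( int n, double a[n] ); /* as defined above */
int compare ( const void *e1, const void *e2 );

int main ( int argc, char *argv[] )
{
   int myid,p,i,b;
   int n = 1000000; /* hard-coded for this sketch; read from the command line in practice */

   MPI_Init(&argc,&argv);
   MPI_Comm_rank(MPI_COMM_WORLD,&myid);
   MPI_Comm_size(MPI_COMM_WORLD,&p);

   double *data = NULL;                      /* at the root : the numbers, grouped per bucket */
   int *size = (int*)calloc(p,sizeof(int));  /* the size of each bucket */
   int *start = (int*)calloc(p,sizeof(int)); /* the offset of each bucket */

   if(myid == 0) /* step 1 : the root places the numbers into buckets */
   {
      double *a = (double*)calloc(n,sizeof(double));
      int *fill = (int*)calloc(p,sizeof(int));
      data = (double*)calloc(n,sizeof(double));
      random_numbers(n,a);
      for(i=0; i<n; i++) /* count the size of each bucket */
      {
         b = (int)(p*a[i]);
         if(b == p) b = p-1; /* guard for a[i] equal to 1.0 */
         size[b]++;
      }
      for(b=1; b<p; b++) start[b] = start[b-1] + size[b-1];
      for(i=0; i<n; i++) /* group the numbers per bucket */
      {
         b = (int)(p*a[i]);
         if(b == p) b = p-1;
         data[start[b] + fill[b]++] = a[i];
      }
      free(a); free(fill);
   }
   MPI_Bcast(size,p,MPI_INT,0,MPI_COMM_WORLD);
   MPI_Bcast(start,p,MPI_INT,0,MPI_COMM_WORLD);

   double *bucket = (double*)calloc(size[myid],sizeof(double));
   MPI_Scatterv(data,size,start,MPI_DOUBLE,   /* the root sends bucket i ... */
                bucket,size[myid],MPI_DOUBLE, /* ... to processor i */
                0,MPI_COMM_WORLD);

   qsort((void*)bucket,(size_t)size[myid],sizeof(double),compare); /* step 2 */

   MPI_Gatherv(bucket,size[myid],MPI_DOUBLE,  /* step 3 : the root collects */
               data,size,start,MPI_DOUBLE,    /* the sorted buckets, in order */
               0,MPI_COMM_WORLD);

   MPI_Finalize();
   return 0;
}

In the best case every bucket holds about \(n/p\) numbers, so every message in the scatter and the gather has length about \(n/p\), which matches the communication cost model below.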
Is it worth it? Recall that the serial cost is \(n ( \log_2(p) + \log_2(n/p) )\). The cost of the parallel algorithm is

- \(n \log_2(p)\) to place the numbers into buckets, and
- \(n/p \log_2(n/p)\) to sort the buckets.

Then the speedup over the serial bucket sort is

\[ S(p) = \frac{n \left( \log_2(p) + \log_2(n/p) \right)}{n \log_2(p) + \frac{n}{p} \log_2(n/p)}. \]

Comparing to quicksort, which costs \(n \log_2(n) = n ( \log_2(p) + \log_2(n/p) )\), the speedup is the same:

\[ S(p) = \frac{n \log_2(n)}{n \log_2(p) + \frac{n}{p} \log_2(n/p)}. \]

For example, for \(n = 2^{20}\), \(\log_2(n) = 20\), \(p = 2^2\), \(\log_2(p) = 2\):

\[ S(4) = \frac{20 \cdot 2^{20}}{2 \cdot 2^{20} + \frac{2^{20}}{4} \cdot 18} = \frac{20}{2 + 18/4} = \frac{20}{6.5} \approx 3.08. \]
communication versus computation¶
The scatter of \(n\) data elements costs \(t_{\rm start~up} + n t_{\rm data}\), where \(t_{\rm data}\) is the cost of sending one data element. For distributing and collecting all buckets, the total communication time is

\[ 2 p \left( t_{\rm start~up} + \frac{n}{p} t_{\rm data} \right). \]

The computation/communication ratio is

\[ \frac{\left( n \log_2(p) + \frac{n}{p} \log_2(n/p) \right) t_{\rm compare}}{2 p \left( t_{\rm start~up} + \frac{n}{p} t_{\rm data} \right)}, \]

where \(t_{\rm compare}\) is the cost of one comparison. We view this ratio for \(n \gg p\), for fixed \(p\), so \(\log_2(n/p) = \log_2(n) - \log_2(p) \approx \log_2(n)\) and the start up time becomes negligible. The ratio is then favorable when \(\displaystyle \frac{n}{p} \log_2(n) t_{\rm compare} \gg 2 n t_{\rm data}\), that is, when \(\log_2(n) \gg 2 p \, t_{\rm data} / t_{\rm compare}\). Thus \(\log_2(n)\) must be sufficiently large.
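As an illustration, with hypothetical machine constants \(p = 4\) and \(t_{\rm data} = 10 \, t_{\rm compare}\), the condition reads

\[ \log_2(n) \gg 2 p \, \frac{t_{\rm data}}{t_{\rm compare}} = 80, \]

so \(n\) would have to be enormous before computation dominates: the ratio grows only logarithmically in \(n\), which is why communication weighs heavily in parallel bucket sort.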
Exercises¶
- Consider the fan-out scatter and fan-in gather operations and investigate how these operations reduce the communication cost and improve the computation/communication ratio in a bucket sort of n numbers on p processors.
- Instead of OpenMP, use Pthreads to run Quicksort on two cores.
- Instead of OpenMP, use the Intel Threading Building Blocks to run Quicksort on two cores.