Parallel Sorting Algorithms
===========================

Sorting is one of the most fundamental computational problems.
On distributed memory computers, we study parallel bucket sort.
On shared memory computers, we examine parallel quicksort.

Sorting in C and C++
--------------------

In C we have the function ``qsort``, which implements quicksort.
Its prototype is

::

   void qsort ( void *base, size_t count, size_t size,
                int (*compar)(const void *element1,
                              const void *element2) );

The function ``qsort`` sorts an array whose first element is pointed to
by ``base`` and which contains ``count`` elements, each of the given
``size``.  The comparison function ``compar`` returns

* :math:`-1` if ``element1`` :math:`<` ``element2``;

* :math:`0` if ``element1`` :math:`=` ``element2``; or

* :math:`+1` if ``element1`` :math:`>` ``element2``.

We will apply ``qsort`` to sort a random sequence of doubles.
The functions to generate an array of random numbers, to write the
array, and to compare two doubles are listed below.

::

   void random_numbers ( int n, double a[n] )
   {
      int i;
      for(i=0; i<n; i++)
         a[i] = ((double) rand())/RAND_MAX;
   }

   void write_numbers ( int n, double a[n] )
   {
      int i;
      for(i=0; i<n; i++)
         printf("%.15e\n", a[i]);
   }

   int compare ( const void *e1, const void *e2 )
   {
      double *i1 = (double*)e1;
      double *i2 = (double*)e2;
      return ((*i1 < *i2) ? -1 : ((*i1 > *i2) ? +1 : 0));
   }

Then we can call ``qsort`` in the function ``main()`` as follows:

::

   double *a = (double*)calloc(n,sizeof(double));
   random_numbers(n,a);
   qsort((void*)a,(size_t)n,sizeof(double),compare);

We use the command line to enter the dimension
and to toggle off the output.
To measure the CPU time for sorting:

::

   clock_t tstart,tstop;

   tstart = clock();
   qsort((void*)a,(size_t)n,sizeof(double),compare);
   tstop = clock();
   printf("time elapsed : %.4lf seconds\n",
          (tstop - tstart)/((double) CLOCKS_PER_SEC));

Some timings with ``qsort`` on a 3.47GHz Intel Xeon are below:

::

   $ time /tmp/time_qsort 1000000 0
   time elapsed : 0.2100 seconds
   real    0m0.231s
   user    0m0.225s
   sys     0m0.006s

   $ time /tmp/time_qsort 10000000 0
   time elapsed : 2.5700 seconds
   real    0m2.683s
   user    0m2.650s
   sys     0m0.033s

   $ time /tmp/time_qsort 100000000 0
   time elapsed : 29.5600 seconds
   real    0m30.641s
   user    0m30.409s
   sys     0m0.226s

Observe that the :math:`O(n \log_2(n))` cost is almost linear in
:math:`n`: every tenfold increase of :math:`n` increases the time
by a factor only slightly larger than ten.
In C++ we apply the ``sort`` of the Standard Template Library (STL);
in particular, we use the STL container ``vector``.
Functions to generate vectors of random numbers
and to write them are given next.

::

   #include <cstdlib>
   #include <iostream>
   #include <vector>

   using namespace std;

   vector<double> random_vector ( int n );
   // returns a vector of n random doubles

   void write_vector ( vector<double> v );
   // writes the vector v

   vector<double> random_vector ( int n )
   {
      vector<double> v;

      for(int i=0; i<n; i++)
         v.push_back((double) rand()/RAND_MAX);

      return v;
   }

   void write_vector ( vector<double> v )
   {
      for(int i=0; i<v.size(); i++)
         cout << v[i] << endl;
   }

To use ``sort``, we include ``<algorithm>`` and define a comparator:

::

   #include <algorithm>

   struct less_than // defines "<"
   {
      bool operator()(const double& a, const double& b)
      {
         return (a < b);
      }
   };

In the main program, to sort the vector ``v`` we write

::

   sort(v.begin(), v.end(), less_than());

Timings of the STL ``sort`` on a 3.47GHz Intel Xeon are below.

::

   $ time /tmp/time_stl_sort 1000000 0
   time elapsed : 0.36 seconds
   real    0m0.376s
   user    0m0.371s
   sys     0m0.004s

   $ time /tmp/time_stl_sort 10000000 0
   time elapsed : 4.09 seconds
   real    0m4.309s
   user    0m4.275s
   sys     0m0.033s

   $ time /tmp/time_stl_sort 100000000 0
   time elapsed : 46.5 seconds
   real    0m48.610s
   user    0m48.336s
   sys     0m0.267s

Different distributions of the random numbers may cause the timings
to fluctuate.

Bucket Sort for Distributed Memory
----------------------------------

For distributed memory computers, we explain bucket sort.
Given are *n* numbers; suppose all are in :math:`[0,1]`.
The algorithm with *p* buckets proceeds in two steps:

1. Partition the numbers *x* into *p* buckets:
   :math:`x \in [i/p, (i+1)/p) \quad \Rightarrow \quad x \in (i+1)`-th bucket.

2. Sort all *p* buckets.

The cost to partition the numbers into *p* buckets is
:math:`O(n \log_2(p))`.
Note: radix sort uses the most significant bits to partition.

In the best case, every bucket contains :math:`n/p` numbers.
The cost of quicksort is :math:`O(n/p \log_2(n/p))` per bucket,
so sorting all *p* buckets takes :math:`O(n \log_2(n/p))`.
The total cost is :math:`O(n ( \log_2(p) + \log_2(n/p) ))`.

parallel bucket sort
^^^^^^^^^^^^^^^^^^^^

On *p* processors, the parallel bucket sort proceeds in three steps:
1. The root node distributes the numbers:
   processor *i* receives the *i*-th bucket.

2. Processor *i* sorts the *i*-th bucket.

3. The root node collects the sorted buckets from the processors.

Is it worth it?  Recall that the serial cost is
:math:`n ( \log_2(p) + \log_2(n/p) )`.
The cost of the parallel algorithm is

1. :math:`n \log_2(p)` to place the numbers into the buckets, and

2. :math:`\displaystyle \frac{n}{p} \log_2 \left( \frac{n}{p} \right)`
   to sort the buckets.

Then we compute the speedup:

.. math::

   \begin{array}{rcl}
      \mbox{speedup} & = &
      {\displaystyle \frac{n ( \log_2(p) + \log_2(n/p) )}
                          {n ( \log_2(p) + \log_2(n/p)/p )}} \\
      & = & {\displaystyle \frac{1 + L}{1 + L/p}
           = \frac{1+L}{(p+L)/p}
           = \frac{p}{p+L} (1 + L),
        \quad L = \frac{\log_2(n/p)}{\log_2(p)}}.
   \end{array}

Comparing to quicksort, the speedup is

.. math::

   \begin{array}{rcl}
      \mbox{speedup} & = &
      {\displaystyle \frac{n \log_2(n)}
         {n \log_2(p) + \frac{n}{p} \log_2 \left( \frac{n}{p} \right)}} \\
      & = & {\displaystyle \frac{\log_2(n)}
         {\log_2(p) + \frac{1}{p} ( \log_2(n) - \log_2(p) )}} \\
      & = & {\displaystyle \frac{\log_2(n)}
         {\frac{1}{p} \log_2(n)
          + \left( 1 - \frac{1}{p} \right) \log_2(p)}}.
   \end{array}

For example, with :math:`n = 2^{20}`, :math:`\log_2(n) = 20`,
:math:`p = 2^2`, and :math:`\log_2(p) = 2`, we obtain

.. math::

   \begin{array}{rcl}
      \mbox{speedup} & = &
      {\displaystyle \frac{20}{\frac{1}{4} (20)
                              + \left( 1 - \frac{1}{4} \right) 2}} \\
      & = & {\displaystyle \frac{20}{5 + 3/2}
           = \frac{40}{13} \approx 3.08}.
   \end{array}

communication versus computation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The scatter of :math:`n` data elements costs
:math:`t_{\rm start~up} + n t_{\rm data}`,
where :math:`t_{\rm data}` is the cost of sending one data element.
For distributing and collecting all the buckets,
the total communication time is

.. math::

   2 p \left( t_{\rm start~up} + \frac{n}{p} t_{\rm data} \right).

The computation/communication ratio is

.. math::

   \frac{\left( n \log_2(p)
         + \frac{n}{p} \log_2 \left( \frac{n}{p} \right) \right)
         t_{\rm compare}}
        {2 p \left( t_{\rm start~up} + \frac{n}{p} t_{\rm data} \right)},

where :math:`t_{\rm compare}` is the cost of one comparison.
We consider this ratio for :math:`n \gg p`, with :math:`p` fixed,
so that

.. math::

   \frac{n}{p} \log_2 \left( \frac{n}{p} \right)
   = \frac{n}{p} \left( \log_2(n) - \log_2(p) \right)
   \approx \frac{n}{p} \log_2(n).

Neglecting the start-up cost, the ratio then becomes

.. math::

   \frac{\frac{n}{p} \log_2(n) \, t_{\rm compare}}
        {2 n \, t_{\rm data}}
   = \frac{\log_2(n)}{2 p} \cdot
     \frac{t_{\rm compare}}{t_{\rm data}},

so :math:`\log_2(n)` must be sufficiently large for the computation
to dominate the communication.

Quicksort for Shared Memory
---------------------------

A recursive algorithm partitions the numbers:

::

   void quicksort ( double *v, int start, int end )
   {
      if(start < end)
      {
         int pivot;
         partition(v,start,end,&pivot);
         quicksort(v,start,pivot-1);
         quicksort(v,pivot+1,end);
      }
   }

where ``partition`` has the prototype

::

   void partition ( double *v, int lower, int upper, int *pivot );
   /* precondition: upper - lower > 0
    * takes v[lower] as pivot and interchanges elements:
    *   v[i] <= v[pivot] for all i < pivot, and
    *   v[i] > v[pivot] for all i > pivot,
    * where lower <= pivot <= upper.
    */

A definition of the ``partition`` function is listed next:

::

   void partition ( double *v, int lower, int upper, int *pivot )
   {
      double x = v[lower];
      int up = lower+1;   /* index will go up */
      int down = upper;   /* index will go down */

      while(up < down)
      {
         while((up < down) && (v[up] <= x)) up++;
         while((up < down) && (v[down] > x)) down--;
         if(up == down) break;
         double tmp = v[up];
         v[up] = v[down];
         v[down] = tmp;
      }
      if(v[up] > x) up--;
      v[lower] = v[up];
      v[up] = x;
      *pivot = up;
   }

Then in ``main()`` we can apply ``partition`` and ``qsort``
to sort an array of ``n`` doubles:

::

   int lower = 0;
   int upper = n-1;
   int pivot = 0;

   if(n > 1) partition(v,lower,upper,&pivot);
   if(pivot != 0)
      qsort((void*)v,(size_t)pivot,sizeof(double),compare);
   if(pivot != n)
      qsort((void*)&v[pivot+1],(size_t)(n-pivot-1),
            sizeof(double),compare);

quicksort with OpenMP
^^^^^^^^^^^^^^^^^^^^^

After the first call to ``partition``, the two remaining sorting jobs
are independent, so each can run in its own section of a parallel
region in ``main()``:

::

   omp_set_num_threads(2);
   #pragma omp parallel sections
   {
      #pragma omp section
      {
         if(pivot != 0)
            qsort((void*)v,(size_t)pivot,
                  sizeof(double),compare);
      }
      #pragma omp section
      {
         if(pivot != n)
            qsort((void*)&v[pivot+1],(size_t)(n-pivot-1),
                  sizeof(double),compare);
      }
   }

Running on a dual core Mac OS X at 2.26 GHz gives the following
timings:

::

   $ time /tmp/time_qsort 10000000 0
   time elapsed : 4.0575 seconds
   real    0m4.299s
   user    0m4.229s
   sys     0m0.068s

   $ time /tmp/part_qsort_omp 10000000 0
   pivot = 4721964
   -> sorting the first half : 4721964 numbers
   -> sorting the second half : 5278035 numbers
   real    0m3.794s
   user    0m7.117s
   sys     0m0.066s

The speedup is 4.299/3.794 = 1.133,
or :math:`13.3\%` faster with one extra core.

parallel sort with Intel TBB
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

At the top of the program, add the lines

::

   #include "tbb/parallel_sort.h"

   using namespace tbb;

To sort a number of random doubles:

::

   int n;
   double *v;

   v = (double*)calloc(n,sizeof(double));
   random_numbers(n,v);
   parallel_sort(v, v+n);

The complete main program, with the prototypes of the functions,
is listed below.
::

   #include <stdio.h>
   #include <stdlib.h>
   #include <time.h>
   #include "tbb/parallel_sort.h"

   using namespace tbb;

   void random_numbers ( int n, double *v );
   /* returns n random doubles in [0,1] */

   void write_numbers ( double *v, int lower, int upper );
   /* writes the numbers in v from lower to upper,
    * including upper */

   int main ( int argc, char *argv[] )
   {
      int n;

      if(argc > 1)
         n = atoi(argv[1]);
      else
      {
         printf("give n : ");
         scanf("%d",&n);
      }
      int vb = 1;
      if(argc > 2) vb = atoi(argv[2]);

      srand(time(NULL));

      double *v;
      v = (double*)calloc(n,sizeof(double));
      random_numbers(n,v);

      if(vb > 0)
      {
         printf("%d random numbers : \n",n);
         write_numbers(v,0,n-1);
      }
      parallel_sort(v, v+n);
      if(vb > 0)
      {
         printf("the sorted numbers :\n");
         write_numbers(v,0,n-1);
      }
      return 0;
   }

A session with an interactive run to test the correctness goes
as below.

::

   $ /tmp/tbb_sort 4 1
   4 random numbers :
   3.696845319912231e-01
   7.545582678888730e-01
   6.707372915329120e-01
   3.402865237278335e-01
   the sorted numbers :
   3.402865237278335e-01
   3.696845319912231e-01
   6.707372915329120e-01
   7.545582678888730e-01
   $

We can time the parallel runs as follows.

::

   $ time /tmp/tbb_sort 10000000 0
   real    0m0.479s
   user    0m4.605s
   sys     0m0.168s

   $ time /tmp/tbb_sort 100000000 0
   real    0m4.734s
   user    0m51.063s
   sys     0m0.386s

   $ time /tmp/tbb_sort 1000000000 0
   real    0m47.400s
   user    9m32.713s
   sys     0m2.073s
   $

Bibliography
------------

1. Edgar Solomonik and Laxmikant V. Kale:
   **Highly Scalable Parallel Sorting**.
   In the proceedings of the IEEE International Parallel and
   Distributed Processing Symposium (IPDPS), 2010.

2. Mirko Rahn, Peter Sanders, and Johannes Singler:
   **Scalable Distributed-Memory External Sorting**.
   In the proceedings of the 26th IEEE International Conference
   on Data Engineering (ICDE), pages 685-688, IEEE, 2010.

3. Davide Pasetto and Albert Akhriev:
   **A Comparative Study of Parallel Sort Algorithms**.
   In SPLASH'11, the proceedings of the ACM international conference
   companion on object oriented programming systems, languages,
   and applications, pages 203-204, ACM, 2011.

Exercises
---------

1. Consider the fan out scatter and fan in gather operations
   and investigate how these operations reduce the communication
   cost and improve the computation/communication ratio
   in a bucket sort of *n* numbers on *p* processors.

2. Instead of OpenMP, use Pthreads to run quicksort on two cores.

3. Instead of OpenMP, use the Intel Threading Building Blocks
   to run quicksort on two cores.