Load Balancing
==============

We distinguish between static and dynamic load balancing,
using the computation of the Mandelbrot set as an example.
For dynamic load balancing, we encounter the need for
nonblocking communications.  To check for incoming messages,
we use ``MPI_Iprobe``.

The Mandelbrot set
------------------

We consider computing the Mandelbrot set,
shown in :numref:`figmandel` as a grayscale plot.

.. _figmandel:

.. figure:: ./figmandel.png
   :align: center

   The Mandelbrot set.

The number :math:`n` of iterations ranges from 0 to 255.
The grayscales are plotted in reverse, as :math:`255 - n`.
Grayscales for different pixels are calculated independently.

The prototype and definition of the function ``iterate``
is in the code below.  We call ``iterate`` for all pixels (``x``, ``y``),
for ``x`` and ``y`` ranging over all rows and columns of a pixel matrix.
In our plot we compute 5,000 rows and 5,000 columns.

::

   int iterate ( double x, double y );
   /*
    * Returns the number of iterations for z^2 + c
    * to grow larger than 2, for c = x + i*y,
    * where i = sqrt(-1), starting at z = 0. */

   int iterate ( double x, double y )
   {
      double wx,wy,v,xx;
      int k = 0;

      wx = 0.0; wy = 0.0; v = 0.0;
      while ((v < 4) && (k++ < 254))
      {
         xx = wx*wx - wy*wy;
         wy = 2.0*wx*wy;
         wx = xx + x;
         wy = wy + y;
         v = wx*wx + wy*wy;
      }
      return k;
   }

In the code for ``iterate`` we count 6 multiplications on doubles,
3 additions, and 1 subtraction, or 10 floating-point operations
in each iteration of the while loop.
On a Mac OS X laptop with a 2.26 GHz Intel Core 2 Duo,
for a 5,000-by-5,000 matrix of pixels:

::

   $ time /tmp/mandelbrot
   Total number of iterations : 682940922

   real    0m15.675s
   user    0m14.914s
   sys     0m0.163s

The program performed :math:`682,940,922 \times 10` flops
in 15 seconds, or 455,293,948 flops per second.
With full optimization turned on, the time drops from 15 to 9 seconds.
After compilation with ``-O3``, the program performed
758,823,246 :index:`flops` per second.

::

   $ make mandelbrot_opt
   gcc -O3 -o /tmp/mandelbrot_opt mandelbrot.c

   $ time /tmp/mandelbrot_opt
   Total number of iterations : 682940922

   real    0m9.846s
   user    0m9.093s
   sys     0m0.163s

The input parameters of the program define the intervals
:math:`[a,b]` for :math:`x` and :math:`[c,d]` for :math:`y`,
as :math:`(x,y) \in [a,b] \times [c,d]`,
e.g.: :math:`[a,b] = [-2,+2] = [c,d]`.
The number :math:`n` of rows (and columns) in the pixel matrix
determines the resolution of the image
and the spacing between points:
:math:`\delta x = (b-a)/(n-1)`, :math:`\delta y = (d-c)/(n-1)`.
The output is a PostScript file.  PostScript is a standard format
that can be printed or viewed directly and allows for batch
processing in an environment without visualization capabilities.

Static Work Load Assignment
---------------------------

.. index:: static work load assignment, communication granularity

Static work load assignment means that the decision of which pixels
are computed by which processor is fixed *in advance*
(before the execution of the program) by some algorithm.
For the granularity of the communication, we have two extremes:

1. The matrix of grayscales is divided into *p* equal parts
   and each processor computes its part of the matrix.
   For example, dividing 5,000 rows among 5 processors gives
   each processor 1,000 rows.
   The communication happens after all computations are done:
   at the end, every processor sends its large submatrix to the root node.

2. The matrix of grayscales is distributed pixel by pixel.
   Entry :math:`(i,j)` of the *n*-by-*n* matrix is computed by
   the processor with label :math:`( i \times n + j ) ~{\rm mod}~ p`.
   The communication is completely interlaced with the computation
   (see the sketch after this list).
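A minimal sketch of the two owner computations is below.
The function names ``owner_block`` and ``owner_cyclic`` are ours
and do not occur in the program; the block version assumes
*n* is divisible by *p*.

::

   /* Extreme 1: block distribution by rows.
      With n divisible by p, processor q computes rows
      q*(n/p) up to (q+1)*(n/p)-1,
      so the owner of row i is i/(n/p). */
   int owner_block ( int i, int n, int p )
   {
      return i/(n/p);
   }

   /* Extreme 2: pixel-by-pixel cyclic distribution.
      Entry (i,j) of the n-by-n matrix is computed by
      the processor with label (i*n + j) mod p. */
   int owner_cyclic ( int i, int j, int n, int p )
   {
      return (i*n + j) % p;
   }

For example, with :math:`n = 5{,}000` and :math:`p = 5`,
``owner_block`` assigns rows 0 through 999 to processor 0,
while ``owner_cyclic`` scatters neighboring pixels over all 5 processors.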
In choosing the granularity between the two extremes,
consider the following:

1. The problem with all communication at the end is that the
   total cost is the computational cost plus the communication cost,
   because the communication is not interlaced with the computation.

2. The problem with the pixel-by-pixel distribution is that
   computing the grayscale of one pixel requires at most 255 iterations
   and may finish much sooner, so even in the most expensive case
   the processors are mostly busy handling send/recv operations.

As a compromise between the two extremes, we distribute the work load
along the rows.  Row :math:`i` is computed by
node :math:`1 + (i ~{\rm mod}~ (p-1))`.
The root node 0 distributes the row indices and collects the computed rows.

Static work load assignment with MPI
------------------------------------

Consider a manager/worker algorithm for static load assignment:
given :math:`n` jobs to be completed by :math:`p` processors,
:math:`n \gg p`, processor 0 is in charge of

1. distributing the jobs among the :math:`p-1` compute nodes; and

2. collecting the results from the :math:`p-1` compute nodes.

Assuming :math:`n` is a multiple of :math:`p-1`, let :math:`k = n/(p-1)`.
The manager executes the following algorithm:

::

   for i from 1 to k do
       for j from 1 to p-1 do
           send the next job to compute node j;
       for j from 1 to p-1 do
           receive result from compute node j.

The run of an example program is illustrated
by what is printed on screen:

::

   $ mpirun -np 3 /tmp/static_loaddist
   reading the #jobs per compute node...
   1
   sending 0 to 1
   sending 1 to 2
   node 1 received 0
   -> 1 computes b
   node 1 sends b
   node 2 received 1
   -> 2 computes c
   node 2 sends c
   received b from 1
   received c from 2
   sending -1 to 1
   sending -1 to 2
   The result : bc
   node 2 received -1
   node 1 received -1
   $

The main program is below, followed by the code for the worker
and for the manager.

::

   int main ( int argc, char *argv[] )
   {
      int i,p;

      MPI_Init(&argc,&argv);
      MPI_Comm_size(MPI_COMM_WORLD,&p);
      MPI_Comm_rank(MPI_COMM_WORLD,&i);
      if(i != 0)
         worker(i);
      else
      {
         printf("reading the #jobs per compute node...\n");
         int nbjobs;
         scanf("%d",&nbjobs);
         manager(p,nbjobs*(p-1));
      }
      MPI_Finalize();
      return 0;
   }

Following is the code for each worker.

::

   int worker ( int i )
   {
      int myjob;
      MPI_Status status;

      do
      {
         MPI_Recv(&myjob,1,MPI_INT,0,tag,
                  MPI_COMM_WORLD,&status);
         if(v == 1) printf("node %d received %d\n",i,myjob);
         if(myjob == -1) break;
         char c = 'a' + ((char)i);
         if(v == 1) printf("-> %d computes %c\n",i,c);
         if(v == 1) printf("node %d sends %c\n",i,c);
         MPI_Send(&c,1,MPI_CHAR,0,tag,MPI_COMM_WORLD);
      }
      while(myjob != -1);

      return 0;
   }
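Both the worker above and the manager below use a message tag ``tag``
and a verbose flag ``v`` that are defined at the top of the full program.
A minimal sketch of those global declarations follows; the values
assigned to ``tag`` and ``v`` are assumptions, as only their use
appears in the listings.

::

   #include <stdio.h>
   #include <mpi.h>

   #define tag 100   /* message tag for all sends and receives (value assumed) */

   int v = 1;        /* verbose flag: 1 prints the trace of the example run */

   int worker ( int i );           /* runs on the compute nodes */
   int manager ( int p, int n );   /* runs on the root node 0 */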
Following is the code for the manager.

::

   int manager ( int p, int n )
   {
      char result[n+1];
      int job = -1;
      int j;

      do
      {
         for(j=1; j<p; j++)      /* distribute the jobs */
         {
            if(++job >= n) break;
            int d = 1 + (job % (p-1));
            if(v == 1) printf("sending %d to %d\n",job,d);
            MPI_Send(&job,1,MPI_INT,d,tag,MPI_COMM_WORLD);
         }
         if(job >= n) break;
         for(j=1; j<p; j++)      /* collect the results */
         {
            char c;
            MPI_Status status;
            MPI_Recv(&c,1,MPI_CHAR,j,tag,
                     MPI_COMM_WORLD,&status);
            if(v == 1) printf("received %c from %d\n",c,j);
            result[job-p+1+j] = c;
         }
      } while(job < n);

      job = -1;                  /* -1 is the termination signal */
      for(j=1; j<p; j++)
      {
         if(v == 1) printf("sending -1 to %d\n",j);
         MPI_Send(&job,1,MPI_INT,j,tag,MPI_COMM_WORLD);
      }
      result[n] = '\0';
      printf("The result : %s\n",result);

      return 0;
   }
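In the Mandelbrot application, a job would be the index of one row
of pixels rather than a single character.  The sketch below shows how
a worker could fill one row of grayscales for a received row index,
using the function ``iterate`` and the spacing :math:`\delta x`,
:math:`\delta y` defined earlier; the name ``compute_row`` and its
argument list are an assumed illustration, not taken from the program.

::

   /* Fills row[0..n-1] with the grayscales 255 - iterate(x,y)
      for row i of the n-by-n pixel matrix over [a,b] x [c,d]. */
   void compute_row ( int i, int n,
                      double a, double b, double c, double d,
                      int *row )
   {
      double dx = (b-a)/(n-1);   /* delta x, the horizontal spacing */
      double dy = (d-c)/(n-1);   /* delta y, the vertical spacing */
      double y = c + i*dy;       /* every pixel in row i has this y */
      int j;

      for(j=0; j<n; j++)
         row[j] = 255 - iterate(a + j*dx, y);
   }

With the compromise described above, node
:math:`1 + (i ~{\rm mod}~ (p-1))` would call this for row :math:`i`
and send the :math:`n` grayscales back to the root,
instead of the single character that the toy program sends.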