Load Balancing
==============

We distinguish between static and dynamic load balancing,
using the computation of the Mandelbrot set as an example.
For dynamic load balancing, we encounter the need
for nonblocking communications.
To check for incoming messages, we use ``MPI_Iprobe``.

the Mandelbrot set
------------------

We consider computing the Mandelbrot set,
shown in :numref:`figmandel` as a grayscale plot.

.. _figmandel:

.. figure:: ./figmandel.png
   :align: center

   The Mandelbrot set.

The number :math:`n` of iterations ranges from 0 to 255.
The grayscales are plotted in reverse, as :math:`255 - n`.
Grayscales for different pixels are calculated independently.

The prototype and definition of the function ``iterate``
are in the code below.
We call ``iterate`` for all pixels (``x``, ``y``),
for ``x`` and ``y`` ranging over all rows and columns of a pixel matrix.
In our plot we compute 5,000 rows and 5,000 columns.

::

   int iterate ( double x, double y );
   /*
    * Returns the number of iterations for z^2 + c
    * to grow larger than 2, for c = x + i*y,
    * where i = sqrt(-1), starting at z = 0. */

   int iterate ( double x, double y )
   {
      double wx,wy,v,xx;
      int k = 0;

      wx = 0.0; wy = 0.0; v = 0.0;
      while ((v < 4) && (k++ < 254))
      {
         xx = wx*wx - wy*wy;
         wy = 2.0*wx*wy;
         wx = xx + x;
         wy = wy + y;
         v = wx*wx + wy*wy;
      }
      return k;
   }

In the code for ``iterate`` we count 6 multiplications on doubles,
3 additions, and 1 subtraction, so 10 floating-point operations
per iteration.
On a Mac OS X laptop with a 2.26 GHz Intel Core 2 Duo,
for a 5,000-by-5,000 matrix of pixels:

::

   $ time /tmp/mandelbrot
   Total number of iterations : 682940922

   real    0m15.675s
   user    0m14.914s
   sys     0m0.163s

The program performed :math:`682,940,922 \times 10` flops
in 15 seconds, or 455,293,948 flops per second.
Turning on full optimization reduces the time from 15 to 9 seconds.
After compilation with ``-O3``, the program performed
758,823,246 :index:`flops` per second.

::

   $ make mandelbrot_opt
   gcc -O3 -o /tmp/mandelbrot_opt mandelbrot.c

   $ time /tmp/mandelbrot_opt
   Total number of iterations : 682940922

   real    0m9.846s
   user    0m9.093s
   sys     0m0.163s

The input parameters of the program define the intervals
:math:`[a,b]` for :math:`x` and :math:`[c,d]` for :math:`y`,
as :math:`(x,y) \in [a,b] \times [c,d]`,
e.g.: :math:`[a,b] = [-2,+2] = [c,d]`.
The number :math:`n` of rows (and columns) in the pixel matrix
determines the resolution of the image
and the spacing between points:
:math:`\delta x = (b-a)/(n-1)`, :math:`\delta y = (d-c)/(n-1)`.
The output is a PostScript file:
a standard format, suitable to print or view directly,
which allows for batch processing in an environment
without visualization capabilities.

Granularity
-----------

.. index:: grain, granularity

.. topic:: Definition of Grain

   A *grain* is a sequence of computational steps
   for sequential execution on a single processor.

Depending on the grain size, we distinguish between

.. index:: fine granularity, coarse granularity

* small grain size: *fine granularity*, and
* large grain size: *coarse granularity*.

There is a tradeoff to make:

* Coarse granularity has little communication overhead,
  but may limit the amount of parallelism; while
* fine granularity promotes parallelism,
  but may lead to an excessive communication overhead.

Static Work Load Assignment
---------------------------

.. index:: static work load assignment, communication granularity

Static work load assignment means that the decision
of which pixels are computed by which processor is fixed *in advance*
(before the execution of the program) by some algorithm.
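To see what is being divided among the processors, it helps to look at
the structure of the sequential computation.
The sketch below shows the double loop over the pixel matrix;
it assumes the bounds ``a``, ``b``, ``c``, ``d``, the dimension ``n``,
and an *n*-by-*n* matrix ``grayscale`` are already defined,
and it is not the ``mandelbrot.c`` used for the timings above.

::

   /* sketch of the sequential computation of the grayscale matrix,
      assuming the bounds a, b, c, d, the dimension n,
      and an n-by-n matrix grayscale are defined elsewhere */
   long total = 0;                 /* total number of iterations */
   double dx = (b-a)/(n-1);        /* spacing in the x direction */
   double dy = (d-c)/(n-1);        /* spacing in the y direction */
   int i, j;

   for(i=0; i<n; i++)              /* i runs over the rows */
      for(j=0; j<n; j++)           /* j runs over the columns */
      {
         int k = iterate(a + j*dx, c + i*dy);
         total = total + k;
         grayscale[i][j] = 255 - k; /* grayscales plotted in reverse */
      }
   printf("Total number of iterations : %ld\n", total);

Every pass through the inner loop is independent of all others,
which is what the static schemes below exploit.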
For the granularity in the communication, we have two extremes:

1. The matrix of grayscales is divided into *p* equal parts
   and each processor computes its part of the matrix.
   For example: 5,000 rows among 5 processors,
   so each processor takes 1,000 rows.
   The communication happens after all calculations are done:
   at the end, all processors send their big submatrix to the root node.

2. The matrix of grayscales is distributed pixel by pixel.
   Entry :math:`(i,j)` of the *n*-by-*n* matrix is computed
   by the processor with label :math:`( i \times n + j ) ~{\rm mod}~ p`.
   The communication is completely interlaced with the computation.

In choosing the granularity between the two extremes:

1. Problem with all communication at the end:
   the total cost is the computational cost plus the communication cost,
   because the communication is not interlaced with the computation.

2. Problem with the pixel-by-pixel distribution:
   computing the grayscale of one pixel requires at most 255 iterations,
   but may finish much sooner.
   Even in the most expensive case, the work for one pixel is small
   relative to the cost of one message,
   so the processors may be mostly busy handling send/recv operations.

As a compromise between the two extremes,
we distribute the work load along the rows.
Row :math:`i` is computed by node :math:`1 + (i ~{\rm mod}~ (p-1))`.
The root node 0 distributes the row indices and collects the computed rows.

Static work load assignment with MPI
------------------------------------

Consider a manager/worker algorithm for static load assignment:
given :math:`n` jobs to be completed by :math:`p` processors,
:math:`n \gg p`.  Processor 0 is in charge of

1. distributing the jobs among the :math:`p-1` compute nodes; and

2. collecting the results from the :math:`p-1` compute nodes.

Assuming :math:`n` is a multiple of :math:`p-1`, let :math:`k = n/(p-1)`.
The manager executes the following algorithm:

::

   for i from 1 to k do
       for j from 1 to p-1 do
           send the next job to compute node j;
       for j from 1 to p-1 do
           receive result from compute node j.

The run of an example program is illustrated
by what is printed on screen:

::

   $ mpirun -np 3 /tmp/static_loaddist
   reading the #jobs per compute node...
   1
   sending 0 to 1
   sending 1 to 2
   node 1 received 0
   -> 1 computes b
   node 1 sends b
   node 2 received 1
   -> 2 computes c
   node 2 sends c
   received b from 1
   received c from 2
   sending -1 to 1
   sending -1 to 2
   The result : bc
   node 2 received -1
   node 1 received -1
   $

The main program is below,
followed by the code for the worker and for the manager.

::

   int main ( int argc, char *argv[] )
   {
      int i,p;

      MPI_Init(&argc,&argv);
      MPI_Comm_size(MPI_COMM_WORLD,&p);
      MPI_Comm_rank(MPI_COMM_WORLD,&i);
      if(i != 0)
         worker(i);
      else
      {
         printf("reading the #jobs per compute node...\n");
         int nbjobs; scanf("%d",&nbjobs);
         manager(p,nbjobs*(p-1));
      }
      MPI_Finalize();
      return 0;
   }

Following is the code for each worker.
The message tag ``tag`` and the verbose flag ``v``
are constants defined elsewhere in the program.

::

   int worker ( int i )
   {
      int myjob;
      MPI_Status status;

      do
      {
         MPI_Recv(&myjob,1,MPI_INT,0,tag,
                  MPI_COMM_WORLD,&status);
         if(v == 1) printf("node %d received %d\n",i,myjob);
         if(myjob == -1) break;
         char c = 'a' + ((char)i);
         if(v == 1) printf("-> %d computes %c\n",i,c);
         if(v == 1) printf("node %d sends %c\n",i,c);
         MPI_Send(&c,1,MPI_CHAR,0,tag,MPI_COMM_WORLD);
      } while(myjob != -1);

      return 0;
   }

Following is the code for the manager.

::

   int manager ( int p, int n )
   {
      char result[n+1];
      int job = -1;
      int j;

      do
      {
         for(j=1; j<p; j++)   /* distribute jobs */
         {
            if(++job >= n) break;
            int d = 1 + (job % (p-1));
            if(v == 1) printf("sending %d to %d\n",job,d);
            MPI_Send(&job,1,MPI_INT,d,tag,MPI_COMM_WORLD);
         }
         if(job >= n) break;
         for(j=1; j<p; j++)   /* collect results */
         {
            char c;
            MPI_Status status;
            MPI_Recv(&c,1,MPI_CHAR,j,tag,MPI_COMM_WORLD,&status);
            if(v == 1) printf("received %c from %d\n",c,j);
            result[job+j-(p-1)] = c;
         }
      } while(job < n);

      job = -1;                /* -1 is the termination signal */
      for(j=1; j<p; j++)
      {
         if(v == 1) printf("sending -1 to %d\n",j);
         MPI_Send(&job,1,MPI_INT,j,tag,MPI_COMM_WORLD);
      }
      result[n] = '\0';
      printf("The result : %s\n",result);
      return 0;
   }
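To connect this manager/worker pattern to the Mandelbrot set,
a worker can receive a row index instead of a job number
and return the row of grayscales.
The sketch below is one possible computation of a row;
the dimension ``n``, the corners ``a`` and ``c``,
and the spacings ``dx`` and ``dy`` are assumed to be known
to every worker, and ``iterate`` is the function listed earlier.

::

   void compute_row ( int i, int n, double a, double c,
                      double dx, double dy, int *row )
   {
      /* fills row with the grayscales of row i of the pixel matrix,
         calling iterate for every pixel in the row */
      int j;
      double y = c + i*dy;

      for(j=0; j<n; j++)
         row[j] = 255 - iterate(a + j*dx, y);
   }

In the worker, the received integer ``myjob`` is then the row index,
and instead of one character the worker sends ``n`` integers back
to the manager, for example with
``MPI_Send(row,n,MPI_INT,0,tag,MPI_COMM_WORLD)``.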
The same static work load assignment can also be coded with ``mpi4py``.
The main program dispatches on the rank of the process;
the setup of ``COMM``, ``rank``, ``size``, and the ``verbose`` flag
is assumed to be at the top of the script.

::

   from mpi4py import MPI

   COMM = MPI.COMM_WORLD
   rank = COMM.Get_rank()
   size = COMM.Get_size()
   verbose = True  # the verbose flag could also come from the command line

   # worker and manager are defined below
   if rank > 0:
       worker(rank, verbose)
   else:
       manager(size, size*(size-1), verbose)

Each worker receives a job and sends back a character
corresponding to the received number.

::

   def worker(i, verbose=True):
       if verbose:
           print('Hello from worker', i)
       while True:
           nbr = COMM.recv(source=0, tag=11)
           if verbose:
               print('-> worker', i, 'received', nbr)
           if nbr == -1:
               break
           chrnbr = chr(ord('a') + nbr)
           if verbose:
               print('-> worker', i, 'computed', chrnbr)
           COMM.send(chrnbr, dest=0, tag=11)

The manager distributes jobs, as defined below.

::

   def manager(npr, njobs, verbose=True):
       if verbose:
           print('Manager distributes', njobs, 'jobs to', npr-1, 'workers')
       result = ''
       jobcnt = 0
       while jobcnt < njobs:
           for i in range(1, npr):
               jobcnt = jobcnt + 1
               nbr = 1 + (jobcnt % (npr-1))
               if verbose:
                   print('-> manager sends job', jobcnt, 'to worker', i)
               COMM.send(nbr, dest=i, tag=11)
           for i in range(1, npr):
               data = COMM.recv(source=i, tag=11)
               if verbose:
                   print('-> manager received', data, 'from worker', i)
               result = result + data
       for i in range(1, npr):
           if verbose:
               print('-> manager sends -1 to worker', i)
           COMM.send(-1, dest=i, tag=11)
       print('the result :', result)
       print('number of characters :', len(result))
       print('       number of jobs :', njobs)

Dynamic Work Load Balancing
---------------------------

.. index:: dynamic work load balancing, job scheduling

Consider scheduling 8 jobs on 2 processors,
as in :numref:`figjobscheduling`.

.. _figjobscheduling:

.. figure:: ./figjobscheduling.png

   Scheduling 8 jobs on 2 processors.

In a worst case scenario with static job scheduling,
all the long jobs end up at one processor
and the short ones at the other,
creating an uneven work load.

In scheduling :math:`n` jobs on :math:`p` processors, :math:`n \gg p`,
node 0 manages the job queue
and nodes 1 to :math:`p-1` are compute nodes.
The manager executes the following algorithm:

::

   for j from 1 to p-1 do
       send job j-1 to compute node j;
   while not all jobs are done do
       if a node is done with a job then
           collect result from node;
           if there is still a job left to do then
               send next job to node;
           else send termination signal.

.. index:: nonblocking communication, MPI_Iprobe

To check for incoming messages, we use the nonblocking (or immediate)
command ``MPI_Iprobe``, explained in :numref:`tabiprobe`.

.. _tabiprobe:

.. table:: Syntax and arguments of MPI_Iprobe.

   +--------------------------------------------+
   | MPI_Iprobe(source,tag,comm,flag,status)    |
   +========+===================================+
   | source | rank of source or MPI_ANY_SOURCE  |
   +--------+-----------------------------------+
   | tag    | message tag or MPI_ANY_TAG        |
   +--------+-----------------------------------+
   | comm   | communicator                      |
   +--------+-----------------------------------+
   | flag   | address of logical variable       |
   +--------+-----------------------------------+
   | status | status object                     |
   +--------+-----------------------------------+

If ``flag`` is true on return, then a message is waiting
and can be received, and ``status`` contains the rank of its source.

The manager starts with distributing the first :math:`p-1` jobs,
as shown below.

::

   int manager ( int p, int n )
   {
      char result[n+1];
      int j;

      for(j=1; j<p; j++)   /* distribute the first p-1 jobs */
      {
         int jobnbr = j-1;
         if(v == 1) printf("sending %d to %d\n",jobnbr,j);
         MPI_Send(&jobnbr,1,MPI_INT,j,tag,MPI_COMM_WORLD);
      }
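The manager then enters a loop in which it polls the compute nodes
with ``MPI_Iprobe``.
The sketch below is one way the function above could continue;
it assumes the same constants ``tag`` and ``v``,
and it stores the received characters in their order of arrival.

::

      /* sketch of a polling loop continuing the manager above,
         assuming the constants tag and v, and n >= p-1 jobs in total */
      int nextjob = p-1;     /* index of the next job to send */
      int received = 0;      /* number of results received */
      int stopped = 0;       /* number of workers terminated */

      while(stopped < p-1)
      {
         for(j=1; j<p; j++)
         {
            int flag;
            MPI_Status status;
            MPI_Iprobe(j,tag,MPI_COMM_WORLD,&flag,&status);
            if(flag == 1)          /* worker j returned a result */
            {
               char c;
               MPI_Recv(&c,1,MPI_CHAR,j,tag,MPI_COMM_WORLD,&status);
               if(v == 1) printf("received %c from %d\n",c,j);
               result[received++] = c;
               if(nextjob < n)     /* still jobs left to do */
               {
                  MPI_Send(&nextjob,1,MPI_INT,j,tag,MPI_COMM_WORLD);
                  nextjob = nextjob + 1;
               }
               else                /* send the termination signal -1 */
               {
                  int term = -1;
                  MPI_Send(&term,1,MPI_INT,j,tag,MPI_COMM_WORLD);
                  stopped = stopped + 1;
               }
            }
         }
      }
      result[n] = '\0';
      printf("The result : %s\n",result);
      return 0;
   }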
The manager/worker algorithm for dynamic load balancing
can also be coded in Julia with the ``MPI.jl`` package.
The code assumes ``using MPI``, a call to ``MPI.Init()``,
and ``COMM = MPI.COMM_WORLD``.
The documentation strings of the manager and the worker
describe their actions.

::

   """
       function manager(p::Int, n::Int, verbose::Bool=true)

   The manager distributes n jobs to p-1 workers,
   where n >= p-1.
   """

   """
       function worker(i::Int, verbose::Bool=true)

   The i-th worker receives a number.
   The worker terminates if the number is -1,
   otherwise it sends to the manager
   the corresponding character following 'a'.
   """

The main program is defined next.

::

   """
       function main(verbose::Bool=true)

   runs a manager/worker dynamic load distribution.
   """
   function main(verbose::Bool=true)
       myid = MPI.Comm_rank(COMM)
       size = MPI.Comm_size(COMM)
       if myid == 0
           print("Give the number of jobs : ")
           line = readline(stdin)
           njobs = parse(Int, line)
       end
       MPI.Barrier(COMM)
       if myid == 0
           manager(size, njobs)
       else
           worker(myid)
       end
       MPI.Barrier(COMM)
   end

The function which defines the actions of each worker is given below.

::

   function worker(i::Int, verbose::Bool=true)
       println("Worker ", i, " says hello.")
       while true
           nbr = MPI.recv(COMM; source=0, tag=11)
           println("-> worker ", i, " received ", nbr)
           if nbr == -1
               break
           end
           chrnbr = Char(Int('a') + nbr)
           MPI.send(chrnbr, COMM; dest=0, tag=11)
       end
   end

The manager sends the first jobs and then enters a loop
to send the next jobs.  Observe the use of ``Iprobe``.

::

   function manager(p::Int, n::Int, verbose::Bool=true)
       if verbose
           println("Manager distributes ", n,
                   " jobs to ", p-1, " workers ...")
       end
       result = ""
       for j=1:p-1
           println("-> manager sends job ", j, " to worker ", j)
           MPI.send(j, COMM; dest=j, tag=11)
       end
       jobcnt = p-1   # sent already p-1 jobs
       done = 0       # counts workers that are done
       while done < p-1
           for i=1:p-1
               messageSent = MPI.Iprobe(COMM; source=i)
               if messageSent
                   data = MPI.recv(COMM; source=i)
                   println("-> manager received ", data, " from ", i)
                   result = string(result, data)
                   jobcnt = jobcnt + 1
                   if jobcnt > n
                       MPI.send(-1, COMM; dest=i, tag=11)
                       done = done + 1
                   else
                       nbr = 1 + (jobcnt % p)
                       MPI.send(nbr, COMM; dest=i, tag=11)
                   end
               end
           end
       end
       println("result : ", result)
       println("number of characters : ", length(result))
       println("       number of jobs : ", n)
   end

Scalability
-----------

To introduce load balancing, we applied the manager/worker model
to schedule jobs before the execution (static)
and during the execution (dynamic).
This model works well for a modest number of processors.
For thousands of processors, one single manager may no longer be capable
of obtaining a good load balance.
Obtaining an optimal load balance is an NP-complete problem.
The survey paper of Kwok and Ahmad describes several heuristics
to schedule jobs statically.

Nearest neighbor load balancing methods iteratively strive
to obtain a globally optimal work load distribution.
Among the deterministic algorithms are diffusion,
dimension exchange, and the gradient model.
The diffusion method was modeled by Cybenko
with the theory of linear systems.
Taking the topologies of the networks into account,
results from graph theory are applied in the convergence analysis,
as explained in the book of Xu and Lau.
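To make the idea of diffusion concrete, the sketch below simulates
one diffusion sweep on a ring of nodes: every node moves a fraction
``alpha`` of the load difference with each of its two neighbors.
The parameter ``alpha`` and the ring topology are illustrative choices,
not taken from the references below.

::

   /* simulates one diffusion sweep on a ring of p nodes :
      every node exchanges a fraction alpha of the load difference
      with each of its two neighbors */
   void diffusion_sweep ( int p, double alpha, double *load )
   {
      double newload[p];
      int i;

      for(i=0; i<p; i++)
      {
         double left = load[(i-1+p) % p];   /* load of left neighbor */
         double right = load[(i+1) % p];    /* load of right neighbor */
         newload[i] = load[i]
                    + alpha*(left - load[i])
                    + alpha*(right - load[i]);
      }
      for(i=0; i<p; i++) load[i] = newload[i];
   }

For suitable values of ``alpha``, repeated sweeps make the loads
converge to the average load; the rate of convergence depends on
the network topology, which is where the graph theoretical results enter.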
Bibliography
------------

1. Selim G. Akl. **Superlinear performance in real-time parallel computation.**
   *The Journal of Supercomputing*, 29(1):89--111, 2004.

2. George Cybenko. **Dynamic load balancing for distributed memory multiprocessors.**
   *Journal of Parallel and Distributed Computing*, 7:279--301, 1989.

3. Yu-Kwong Kwok and Ishfaq Ahmad. **Static scheduling algorithms for
   allocating directed task graphs to multiprocessors.**
   *ACM Computing Surveys*, 31(4):406--469, 1999.

4. C. McCreary and H. Gill. **Automatic determination of grain size
   for efficient parallel processing.**
   *Communications of the ACM*, 32(9):1073--1078, 1989.

5. Chengzhong Xu and Francis C.M. Lau. *Load Balancing in Parallel Computers.
   Theory and Practice.* Kluwer Academic Publishers, 1997.

Exercises
---------

1. Apply the manager/worker algorithm for static load assignment
   to the computation of the Mandelbrot set.
   What is the speedup for 2, 4, and 8 compute nodes?
   To examine the work load of every worker, use an array
   to store the total number of iterations computed by every worker.

2. Apply the manager/worker algorithm for dynamic load balancing
   to the computation of the Mandelbrot set.
   What is the speedup for 2, 4, and 8 compute nodes?
   To examine the work load of every worker, use an array
   to store the total number of iterations computed by every worker.

3. Compare the performance of static load assignment
   with dynamic load balancing for the Mandelbrot set.
   Compare both the speedups and the work loads of every worker.
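As a starting point for the first two exercises, one way to examine
the work load is to let every process accumulate the number of
iterations it computed and to gather those counts at the manager
when all jobs are done.
The helper below is a minimal sketch, not part of the programs above;
it assumes ``mycount`` is the total accumulated by the calling process
with statements such as ``mycount += iterate(x, y);``.

::

   /* gathers the per-process iteration counts at the manager
      and prints the work load of every worker;
      mycount is the count accumulated by the calling process */
   void print_workload ( int myrank, int p, long mycount )
   {
      long counts[p];   /* counts[0] stays 0 : the manager computes no pixels */
      int j;

      MPI_Gather(&mycount,1,MPI_LONG,counts,1,MPI_LONG,0,MPI_COMM_WORLD);

      if(myrank == 0)
         for(j=1; j<p; j++)
            printf("worker %d computed %ld iterations\n",j,counts[j]);
   }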