Using MPI

We illustrate the collective communication commands to scatter data and gather results. Point-to-point communication happens via a send and a recv (receive) command.

Scatter and Gather

Consider the addition of 100 numbers on a distributed memory 4-processor computer. For simplicity of coding, we sum the first one hundred positive integers and compute

\[S = \sum_{i=1}^{100} i.\]
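
By the closed formula \(S = n(n+1)/2\) with \(n = 100\), the expected result is \(S = 5050\), the value the parallel program checks at the end.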

A parallel algorithm to sum 100 numbers proceeds in four stages:

  1. distribute the 100 numbers evenly among the 4 processors;
  2. every processor sums 25 of the numbers;
  3. collect the 4 partial sums at the manager node; and
  4. add the 4 partial sums and print the result.

Scattering an array of 100 numbers over 4 processors and gathering the 4 partial sums at the root is displayed in Fig. 18.

_images/figscattergather.png

Fig. 18 Scattering data and gathering results.

The scatter and the gather are collective communication operations: every process in the communicator participates. The MPI commands to scatter and gather are MPI_Scatter and MPI_Gather, respectively.

The specifications of the MPI command to scatter data from one member to all members of a group are described in Table 2. The specifications of the MPI command to gather data from all members to one member in a group are listed in Table 3.

Table 2 Arguments of the MPI_Scatter command.
MPI_SCATTER(sendbuf,sendcount,sendtype,recvbuf,recvcount,recvtype,root,comm)
  sendbuf     address of send buffer
  sendcount   number of elements sent to each process
  sendtype    data type of send buffer elements
  recvbuf     address of receive buffer
  recvcount   number of elements in receive buffer
  recvtype    data type of receive buffer elements
  root        rank of sending process
  comm        communicator
Table 3 Arguments of the MPI_Gather command.
MPI_GATHER(sendbuf,sendcount,sendtype,recvbuf,recvcount,recvtype,root,comm)
  sendbuf     starting address of send buffer
  sendcount   number of elements in send buffer
  sendtype    data type of send buffer elements
  recvbuf     address of receive buffer
  recvcount   number of elements for any single receive
  recvtype    data type of receive buffer elements
  root        rank of receiving process
  comm        communicator

The code for parallel summation, in the program parallel_sum.c, illustrates the scatter and the gather.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define v 1 /* verbose flag, output if 1, no output if 0 */

int main ( int argc, char *argv[] )
{
   int myid,j,*data=NULL,tosum[25],sums[4]; /* sizes fixed for a run with 4 processes */

   MPI_Init(&argc,&argv);
   MPI_Comm_rank(MPI_COMM_WORLD,&myid);

   if(myid==0) /* manager allocates and initializes the data */
   {
      data = (int*)calloc(100,sizeof(int));
      for (j=0; j<100; j++) data[j] = j+1;
      if(v>0)
      {
         printf("The data to sum : ");
         for (j=0; j<100; j++) printf(" %d",data[j]); printf("\n");
      }
   }

   MPI_Scatter(data,25,MPI_INT,tosum,25,MPI_INT,0,MPI_COMM_WORLD);

   if(v>0) /* after the scatter, every node has 25 numbers to sum */
   {
      printf("Node %d has numbers to sum :",myid);
      for(j=0; j<25; j++) printf(" %d", tosum[j]);
      printf("\n");
   }
   sums[myid] = 0;
   for(j=0; j<25; j++) sums[myid] += tosum[j];
   if(v>0) printf("Node %d computes the sum %d\n",myid,sums[myid]);

   MPI_Gather(&sums[myid],1,MPI_INT,sums,1,MPI_INT,0,MPI_COMM_WORLD);

   if(myid==0) /* after the gather, sums contains the four sums */
   {
      printf("The four sums : ");
      printf("%d",sums[0]);
      for(j=1; j<4; j++) printf(" + %d", sums[j]);
      for(j=1; j<4; j++) sums[0] += sums[j];
      printf(" = %d, which should be 5050.\n",sums[0]);
   }
   MPI_Finalize();
   return 0;
}

Send and Recv

To illustrate point-to-point communication, we consider the problem of squaring the numbers in an array. An example of an input sequence is \(2, 4, 8, 16, \ldots\) with corresponding output sequence \(4, 16, 64, 256, \ldots\). Instead of squaring, we could apply a computationally expensive function \(y = f(x)\) to an array of values for \(x\). A session with the parallel code, run with 4 processes, goes as follows.

$ mpirun -np 4 /tmp/parallel_square
The data to square :  2 4 8 16
Node 1 will square 4
Node 2 will square 8
Node 3 will square 16
The squared numbers :  4 16 64 256
$

A parallel algorithm to square \(p\) numbers runs in three stages:

  1. The manager sends the \(p-1\) numbers \(x_1, x_2, \ldots, x_{p-1}\) to the workers: the \(i\)-th worker receives \(x_i\) in \(f\). The manager copies \(x_0\) to \(f\): \(f = x_0\).
  2. Every node (manager and all workers) squares \(f\).
  3. Every worker sends its \(f\) to the manager: the manager receives \(f\) from the \(i\)-th worker in \(x_i\), for \(i=1,2,\ldots,p-1\). The manager copies its own \(f\) to \(x_0\): \(x_0 = f\), and prints the results.

The MPI commands to perform point-to-point communication are MPI_Send and MPI_Recv. Table 4 lists the arguments of the blocking send operation and Table 5 explains the blocking receive operation.

Table 4 The MPI_SEND command.
MPI_SEND(buf,count,datatype,dest,tag,comm)
  buf        initial address of the send buffer
  count      number of elements in send buffer
  datatype   data type of each send buffer element
  dest       rank of destination
  tag        message tag
  comm       communicator
Table 5 The MPI_RECV command.
MPI_RECV(buf,count,datatype,source,tag,comm,status)
  buf        initial address of the receive buffer
  count      number of elements in receive buffer
  datatype   data type of each receive buffer element
  source     rank of source
  tag        message tag
  comm       communicator
  status     status object

The code of the parallel squaring program is below. Every MPI_Send is matched by an MPI_Recv. Observe that there are two loops: one loop is executed explicitly by the root, while the other, implicit loop over all processes arises because the mpiexec -n p command launches p copies of the program.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define v 1 /* verbose flag, output if 1, no output if 0 */
#define tag 100 /* tag for sending a number */

int main ( int argc, char *argv[] )
{
   int p,myid,i,f,*x;
   MPI_Status status;
   MPI_Init(&argc,&argv);
   MPI_Comm_size(MPI_COMM_WORLD,&p);
   MPI_Comm_rank(MPI_COMM_WORLD,&myid);
   if(myid == 0) /* the manager allocates and initializes x */
   {
      x = (int*)calloc(p,sizeof(int));
      x[0] = 2;
      for (i=1; i<p; i++) x[i] = 2*x[i-1];
      if(v>0)
      {
         printf("The data to square : ");
         for (i=0; i<p; i++) printf(" %d",x[i]); printf("\n");
      }
   }
   if(myid == 0) /* the manager copies x[0] to f */
   {             /* and sends the i-th element to the i-th processor */
      f = x[0];
      for(i=1; i<p; i++) MPI_Send(&x[i],1,MPI_INT,i,tag,MPI_COMM_WORLD);
   }
   else /* every worker receives its f from root */
   {
      MPI_Recv(&f,1,MPI_INT,0,tag,MPI_COMM_WORLD,&status);
      if(v>0) printf("Node %d will square %d\n",myid,f);
   }
   f *= f;       /* every node does the squaring */
   if(myid == 0) /* the manager receives f in x[i] from processor i */
      for(i=1; i<p; i++)
         MPI_Recv(&x[i],1,MPI_INT,i,tag,MPI_COMM_WORLD,&status);
   else     /* every worker sends f to the manager */
      MPI_Send(&f,1,MPI_INT,0,tag,MPI_COMM_WORLD);
   if(myid == 0) /* the manager prints results */
   {
      x[0] = f;
      printf("The squared numbers : ");
      for(i=0; i<p; i++) printf(" %d",x[i]); printf("\n");
   }
   MPI_Finalize();
   return 0;
}

Reducing the Communication Cost

The wall time refers to the time elapsed on the clock that hangs on the wall, that is, the real time, which measures everything, not just the time the processors were busy. To measure the communication cost, we can run our parallel program without any computations. MPI_Wtime() returns a double containing the elapsed wall clock time in seconds since some arbitrary time in the past. An example of its use is below.

double startwtime,endwtime,totalwtime;
startwtime = MPI_Wtime();
/* code to be timed */
endwtime = MPI_Wtime();
totalwtime = endwtime - startwtime;
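
As an illustration, the sketch below wraps the scatter, the local sums, and the gather of the 4-process summation between two calls to MPI_Wtime, with the root reporting the elapsed time. This is only a sketch, not a program from the text; the MPI_Barrier, which synchronizes the processes so that all clocks start together, is an addition not used elsewhere in this section, and the run is assumed to use 4 processes, as before.

#include <stdio.h>
#include <mpi.h>

int main ( int argc, char *argv[] )
{
   int myid,j,data[100],tosum[25],sums[4];
   double startwtime,endwtime;

   MPI_Init(&argc,&argv);
   MPI_Comm_rank(MPI_COMM_WORLD,&myid);
   if(myid == 0) /* the manager initializes the data */
      for(j=0; j<100; j++) data[j] = j+1;

   MPI_Barrier(MPI_COMM_WORLD); /* synchronize so all clocks start together */
   startwtime = MPI_Wtime();

   MPI_Scatter(data,25,MPI_INT,tosum,25,MPI_INT,0,MPI_COMM_WORLD);
   sums[myid] = 0;
   for(j=0; j<25; j++) sums[myid] += tosum[j];
   MPI_Gather(&sums[myid],1,MPI_INT,sums,1,MPI_INT,0,MPI_COMM_WORLD);

   endwtime = MPI_Wtime();
   if(myid == 0) /* the manager reports the elapsed wall clock time */
      printf("elapsed wall clock time : %.3e seconds\n",endwtime-startwtime);

   MPI_Finalize();
   return 0;
}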

A lot of time in a parallel program can be spent on communication. Broadcasting to 8 processors sequentially takes 7 stages: the root sends the data to each of the other 7 processors in turn. A fan out broadcast needs only 3 stages. Fig. 19 illustrates the sequential and the fan out broadcast.

_images/figseqfanbroadcast.png

Fig. 19 Sequential (left) versus fan out (right) broadcast.

The story of Fig. 19 can be told as follows. Consider the distribution of a pile of 8 pages among 8 people. We can do this in three stages:

  1. Person 0 splits the pile, keeps 4 pages, and hands 4 pages to person 1.

  2. Person 0 splits its pile, keeps 2 pages, and hands 2 pages to person 2.

    Person 1 splits its pile, keeps 2 pages, and hands 2 pages to person 3.

  3. Person 0 splits its pile, keeps 1 page, and hands 1 page to person 4.

    Person 1 splits its pile, keeps 1 page, and hands 1 page to person 5.

    Person 2 splits its pile, keeps 1 page, and hands 1 page to person 6.

    Person 3 splits its pile, keeps 1 page, and hands 1 page to person 7.

Already from this simple example, we observe the pattern needed to formalize the algorithm. At stage \(k\), for \(k = 1, 2, 3\), processor \(i\) sends to the processor with identification number \(i + 2^{k-1}\).

The algorithm for fan out broadcast has a short description, shown below.

Algorithm: at step k, the first 2**(k-1) processors have the data, and execute:

for j from 0 to 2**(k-1) - 1 do
    processor j sends the data to processor j + 2**(k-1);
    processor j + 2**(k-1) receives the data from processor j.

The cost to broadcast one item is \(O(p)\) for a sequential broadcast and \(O(\log_2(p))\) for a fan out broadcast. The cost to scatter \(n\) items is \(O(p \times n/p)\) for a sequential scatter and \(O(\log_2(p) \times n/p)\) for a fan out scatter.
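
The text gives only the pseudocode of the fan out broadcast. As a sketch, not a program from the text, the fan out broadcast of a single integer could be coded with the MPI_Send and MPI_Recv commands of the previous section as follows; the broadcast value 5050 and the tag are chosen here only for illustration.

#include <stdio.h>
#include <mpi.h>

#define tag 42 /* message tag for the broadcast */

int main ( int argc, char *argv[] )
{
   int myid,p,offset,data = 0;
   MPI_Status status;

   MPI_Init(&argc,&argv);
   MPI_Comm_size(MPI_COMM_WORLD,&p);
   MPI_Comm_rank(MPI_COMM_WORLD,&myid);

   if(myid == 0) data = 5050; /* only the root has the data at the start */

   /* at step k the offset equals 2**(k-1), that is: 1, 2, 4, ... */
   for(offset = 1; offset < p; offset *= 2)
   {
      if((myid < offset) && (myid + offset < p)) /* already has the data : send */
         MPI_Send(&data,1,MPI_INT,myid+offset,tag,MPI_COMM_WORLD);
      else if((myid >= offset) && (myid < 2*offset)) /* receives at this step */
         MPI_Recv(&data,1,MPI_INT,myid-offset,tag,MPI_COMM_WORLD,&status);
   }
   printf("Node %d has the data %d\n",myid,data);

   MPI_Finalize();
   return 0;
}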

Point-to-Point Communication with MPI for Python

In MPI for Python we call the methods send and recv for point-to-point communication. Process 0 sends DATA to process 1:

MPI.COMM_WORLD.send(DATA, dest=1, tag=2)

Every send must have a matching recv. For the script to continue, process 1 must do

DATA = MPI.COMM_WORLD.recv(source=0, tag=2)

For send and recv, mpi4py serializes Python objects with pickle. Alternatively, the user can declare the MPI types explicitly, as is done further below for numpy arrays.

What appears on screen when running the Python script is below.

$ mpiexec -n 2 python mpi4py_point2point.py
0 sends {'a': 7, 'b': 3.14} to 1
1 received {'a': 7, 'b': 3.14} from 0

The script mpi4py_point2point.py is below.

from mpi4py import MPI

COMM = MPI.COMM_WORLD
RANK = COMM.Get_rank()

if RANK == 0:
    DATA = {'a': 7, 'b': 3.14}
    COMM.send(DATA, dest=1, tag=11)
    print(RANK, 'sends', DATA, 'to 1')
elif RANK == 1:
    DATA = COMM.recv(source=0, tag=11)
    print(RANK, 'received', DATA, 'from 0')

With mpi4py we can either rely on Python’s dynamic typing or declare types explicitly when processing numpy arrays. To sum an array of numbers, we distribute the numbers among the processes, and every process computes the sum of its slice. The sums of the slices are then sent to process 0, which computes the total sum. What appears on screen when the script runs is below; the code of the script follows afterwards.

$ mpiexec -n 10 python mpi4py_parallel_sum.py
0 has data [0 1 2 3 4 5 6 7 8 9] sum = 45
2 has data [20 21 22 23 24 25 26 27 28 29] sum = 245
3 has data [30 31 32 33 34 35 36 37 38 39] sum = 345
4 has data [40 41 42 43 44 45 46 47 48 49] sum = 445
5 has data [50 51 52 53 54 55 56 57 58 59] sum = 545
1 has data [10 11 12 13 14 15 16 17 18 19] sum = 145
8 has data [80 81 82 83 84 85 86 87 88 89] sum = 845
9 has data [90 91 92 93 94 95 96 97 98 99] sum = 945
7 has data [70 71 72 73 74 75 76 77 78 79] sum = 745
6 has data [60 61 62 63 64 65 66 67 68 69] sum = 645
total sum = 4950

The code for the script mpi4py_parallel_sum.py follows.

from mpi4py import MPI
import numpy as np

COMM = MPI.COMM_WORLD
RANK = COMM.Get_rank()
SIZE = COMM.Get_size()
N = 10

if RANK == 0:
    DATA = np.arange(N*SIZE, dtype='i')
    for i in range(1, SIZE):
        SLICE = DATA[i*N:(i+1)*N]
        COMM.Send([SLICE, MPI.INT], dest=i)
    MYDATA = DATA[0:N]
else:
    MYDATA = np.empty(N, dtype='i')
    COMM.Recv([MYDATA, MPI.INT], source=0)

S = sum(MYDATA)
print(RANK, 'has data', MYDATA, 'sum =', S)

SUMS = np.zeros(SIZE, dtype='i')
if RANK > 0:
    COMM.send(S, dest=0)
else:
    SUMS[0] = S
    for i in range(1, SIZE):
        SUMS[i] = COMM.recv(source=i)
    print('total sum =', sum(SUMS))

Recall that Python is case sensitive: the distinction between Send and send, and between Recv and recv, is important. In particular, COMM.send and COMM.recv take general Python objects without type declarations, whereas COMM.Send and COMM.Recv operate on buffers with explicit type declarations.

Exercises

  1. Adjust the parallel summation to work for \(p\) processors where the dimension \(n\) of the array is a multiple of \(p\).
  2. Use C or Python to rewrite the program to sum 100 numbers using MPI_Send and MPI_Recv instead of MPI_Scatter and MPI_Gather.
  3. Use C or Python to rewrite the program to square \(p\) numbers using MPI_Scatter and MPI_Gather.
  4. Show that a hypercube network topology has enough direct connections between processors for a fan out broadcast.