Using MPI
=========

We illustrate the collective communication commands to scatter data
and to gather results.  Point-to-point communication happens via
a send and a recv (receive) command.

Scatter and Gather
------------------

Consider the addition of 100 numbers on a distributed memory
4-processor computer.  For simplicity of coding, we sum the first
one hundred positive integers and compute

.. math::

   S = \sum_{i=1}^{100} i.

A parallel algorithm to sum 100 numbers proceeds in four stages:

1. Distribute the 100 numbers evenly among the 4 processors.

2. Every processor sums 25 numbers.

3. Collect the 4 partial sums to the manager node.

4. Add the 4 partial sums and print the result.

Scattering an array of 100 numbers over 4 processors and gathering
the partial sums computed by the 4 processors to the root is
displayed in :numref:`figscattergather`.

.. _figscattergather:

.. figure:: ./figscattergather.png
   :align: center

   Scattering data and gathering results.

The scatter and the gather are of the :index:`collective communication`
type, as every process in the universe participates in these operations.
The MPI commands to :index:`scatter` and :index:`gather` are respectively
``MPI_Scatter`` and ``MPI_Gather``.
The specifications of the MPI command to scatter data from one member
to all members of a group are described in :numref:`tabmpiscatter`.
The specifications of the MPI command to gather data from all members
to one member in a group are listed in :numref:`tabmpigather`.

.. index:: MPI_SCATTER

.. _tabmpiscatter:

.. table:: Arguments of the ``MPI_Scatter`` command.

   +------------------------------------------------------------------------------+
   | MPI_SCATTER(sendbuf,sendcount,sendtype,recvbuf,recvcount,recvtype,root,comm)  |
   +===========+==================================================================+
   | sendbuf   | address of send buffer                                           |
   +-----------+------------------------------------------------------------------+
   | sendcount | number of elements sent to each process                          |
   +-----------+------------------------------------------------------------------+
   | sendtype  | data type of send buffer elements                                |
   +-----------+------------------------------------------------------------------+
   | recvbuf   | address of receive buffer                                        |
   +-----------+------------------------------------------------------------------+
   | recvcount | number of elements in receive buffer                             |
   +-----------+------------------------------------------------------------------+
   | recvtype  | data type of receive buffer elements                             |
   +-----------+------------------------------------------------------------------+
   | root      | rank of sending process                                          |
   +-----------+------------------------------------------------------------------+
   | comm      | communicator                                                     |
   +-----------+------------------------------------------------------------------+

.. index:: MPI_GATHER

.. _tabmpigather:

.. table:: Arguments of the ``MPI_Gather`` command.

   +-----------------------------------------------------------------------------+
   | MPI_GATHER(sendbuf,sendcount,sendtype,recvbuf,recvcount,recvtype,root,comm)  |
   +===========+=================================================================+
   | sendbuf   | starting address of send buffer                                 |
   +-----------+-----------------------------------------------------------------+
   | sendcount | number of elements in send buffer                               |
   +-----------+-----------------------------------------------------------------+
   | sendtype  | data type of send buffer elements                               |
   +-----------+-----------------------------------------------------------------+
   | recvbuf   | address of receive buffer                                       |
   +-----------+-----------------------------------------------------------------+
   | recvcount | number of elements for any single receive                       |
   +-----------+-----------------------------------------------------------------+
   | recvtype  | data type of receive buffer elements                            |
   +-----------+-----------------------------------------------------------------+
   | root      | rank of receiving process                                       |
   +-----------+-----------------------------------------------------------------+
   | comm      | communicator                                                    |
   +-----------+-----------------------------------------------------------------+
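
Building and running such programs depends on the MPI installation.
As a minimal sketch, assuming the compiler wrapper ``mpicc`` and the
launcher ``mpirun`` of a typical installation (such as MPICH or Open MPI),
and with the name of the executable chosen freely, the program
``parallel_sum.c`` listed below could be compiled and run
with 4 processes as

::

   $ mpicc parallel_sum.c -o parallel_sum
   $ mpirun -np 4 ./parallel_sum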
The code for parallel summation, in the program ``parallel_sum.c``,
illustrates the scatter and the gather.

::

   #include <stdio.h>
   #include <stdlib.h>
   #include <mpi.h>
   #define v 1 /* verbose flag, output if 1, no output if 0 */

   int main ( int argc, char *argv[] )
   {
      int myid,j,*data,tosum[25],sums[4];

      MPI_Init(&argc,&argv);
      MPI_Comm_rank(MPI_COMM_WORLD,&myid);
      if(myid==0) /* manager allocates and initializes the data */
      {
         data = (int*)calloc(100,sizeof(int));
         for (j=0; j<100; j++) data[j] = j+1;
         if(v>0)
         {
            printf("The data to sum : ");
            for (j=0; j<100; j++) printf(" %d",data[j]);
            printf("\n");
         }
      }
      MPI_Scatter(data,25,MPI_INT,tosum,25,MPI_INT,0,MPI_COMM_WORLD);
      if(v>0) /* after the scatter, every node has 25 numbers to sum */
      {
         printf("Node %d has numbers to sum :",myid);
         for(j=0; j<25; j++) printf(" %d", tosum[j]);
         printf("\n");
      }
      sums[myid] = 0;
      for(j=0; j<25; j++) sums[myid] += tosum[j];
      if(v>0) printf("Node %d computes the sum %d\n",myid,sums[myid]);
      MPI_Gather(&sums[myid],1,MPI_INT,sums,1,MPI_INT,0,MPI_COMM_WORLD);
      if(myid==0) /* after the gather, sums contains the four sums */
      {
         printf("The four sums : ");
         printf("%d",sums[0]);
         for(j=1; j<4; j++) printf(" + %d", sums[j]);
         for(j=1; j<4; j++) sums[0] += sums[j];
         printf(" = %d, which should be 5050.\n",sums[0]);
      }
      MPI_Finalize();
      return 0;
   }

Send and Recv
-------------

To illustrate :index:`point-to-point communication`, we consider
the problem of squaring numbers in an array.  An example of an input
sequence is :math:`2, 4, 8, 16, \ldots` with corresponding output
sequence :math:`4, 16, 64, 256, \ldots`.  Instead of squaring,
we could apply an expensive function :math:`y = f(x)` to an array
of values for :math:`x`.

A session with the parallel code on 4 processes runs as

::

   $ mpirun -np 4 /tmp/parallel_square
   The data to square : 2 4 8 16
   Node 1 will square 4
   Node 2 will square 8
   Node 3 will square 16
   The squared numbers : 4 16 64 256
   $

A parallel algorithm to square :math:`p` numbers runs in three stages:

1. The manager sends the :math:`p-1` numbers
   :math:`x_1, x_2, \ldots, x_{p-1}` to the workers.
   Every worker receives one number:
   the :math:`i`-th worker receives :math:`x_i` in :math:`f`.
   The manager copies :math:`x_0` to :math:`f`: :math:`f = x_0`.

2. Every node (manager and all workers) squares :math:`f`.

3. Every worker sends :math:`f` to the manager.
   The manager receives :math:`x_i` from the :math:`i`-th worker,
   for :math:`i=1,2,\ldots,p-1`.
   The manager copies :math:`f` to :math:`x_0`: :math:`x_0 = f`,
   and prints the results.
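
Each stage that moves data relies on one matched pair of a send and
a receive.  The sketch below (with the value, the message tag, and the
variable names chosen arbitrarily, and assuming a run with at least
two processes) shows the smallest such exchange: process 0 sends one
integer and process 1 receives it.

::

   #include <stdio.h>
   #include <mpi.h>

   int main ( int argc, char *argv[] )
   {
      int myid,n;
      MPI_Status status;

      MPI_Init(&argc,&argv);
      MPI_Comm_rank(MPI_COMM_WORLD,&myid);
      if(myid == 0)      /* process 0 sends one integer to process 1 */
      {
         n = 42;
         MPI_Send(&n,1,MPI_INT,1,100,MPI_COMM_WORLD);
      }
      else if(myid == 1) /* process 1 receives the integer from process 0 */
      {
         MPI_Recv(&n,1,MPI_INT,0,100,MPI_COMM_WORLD,&status);
         printf("Process 1 received %d\n",n);
      }
      MPI_Finalize();
      return 0;
   }

The destination, the source, and the message tag in the two calls must
match, as detailed in the tables that follow.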
The MPI commands to perform point-to-point communication are
``MPI_Send`` and ``MPI_Recv``.
The syntax of the :index:`blocking send` operation is given in
:numref:`tabmpisend` and :numref:`tabmpirecv` explains
the :index:`blocking receive` operation.

.. index:: MPI_SEND

.. _tabmpisend:

.. table:: The ``MPI_SEND`` command.

   +--------------------------------------------------+
   | MPI_SEND(buf,count,datatype,dest,tag,comm)        |
   +==========+=======================================+
   | buf      | initial address of the send buffer    |
   +----------+---------------------------------------+
   | count    | number of elements in send buffer     |
   +----------+---------------------------------------+
   | datatype | data type of each send buffer element |
   +----------+---------------------------------------+
   | dest     | rank of destination                   |
   +----------+---------------------------------------+
   | tag      | message tag                           |
   +----------+---------------------------------------+
   | comm     | communicator                          |
   +----------+---------------------------------------+

.. index:: MPI_RECV

.. _tabmpirecv:

.. table:: The ``MPI_RECV`` command.

   +-----------------------------------------------------+
   | MPI_RECV(buf,count,datatype,source,tag,comm,status)  |
   +==========+==========================================+
   | buf      | initial address of the receive buffer    |
   +----------+------------------------------------------+
   | count    | number of elements in receive buffer     |
   +----------+------------------------------------------+
   | datatype | data type of each receive buffer element |
   +----------+------------------------------------------+
   | source   | rank of source                           |
   +----------+------------------------------------------+
   | tag      | message tag                              |
   +----------+------------------------------------------+
   | comm     | communicator                             |
   +----------+------------------------------------------+
   | status   | status object                            |
   +----------+------------------------------------------+

The code for the parallel squaring is below.
Every ``MPI_Send`` is matched by a ``MPI_Recv``.
Observe that there are two loops in the code.
One loop is explicitly executed by the root.
The other loop is implicit, as it is executed
by the ``mpiexec -n p`` command.

::

   #include <stdio.h>
   #include <stdlib.h>
   #include <mpi.h>
   #define v 1     /* verbose flag, output if 1, no output if 0 */
   #define tag 100 /* tag for sending a number */

   int main ( int argc, char *argv[] )
   {
      int p,myid,i,f,*x;
      MPI_Status status;

      MPI_Init(&argc,&argv);
      MPI_Comm_size(MPI_COMM_WORLD,&p);
      MPI_Comm_rank(MPI_COMM_WORLD,&myid);
      if(myid == 0) /* the manager allocates and initializes x */
      {
         x = (int*)calloc(p,sizeof(int));
         x[0] = 2;
         for (i=1; i<p; i++) x[i] = 2*x[i-1];
         if(v>0)
         {
            printf("The data to square : ");
            for (i=0; i<p; i++) printf(" %d",x[i]);
            printf("\n");
         }
      }
      if(myid == 0) /* the manager copies x[0] to f */
      {             /* and sends x[i] to processor i */
         f = x[0];
         for(i=1; i<p; i++)
            MPI_Send(&x[i],1,MPI_INT,i,tag,MPI_COMM_WORLD);
      }
      else /* every worker receives its number in f */
      {
         MPI_Recv(&f,1,MPI_INT,0,tag,MPI_COMM_WORLD,&status);
         if(v>0) printf("Node %d will square %d\n",myid,f);
      }
      f *= f;       /* every node does the squaring */
      if(myid == 0) /* the manager receives f in x[i] from processor i */
         for(i=1; i<p; i++)
            MPI_Recv(&x[i],1,MPI_INT,i,tag,MPI_COMM_WORLD,&status);
      else          /* every worker sends f to the manager */
         MPI_Send(&f,1,MPI_INT,0,tag,MPI_COMM_WORLD);
      if(myid == 0) /* the manager prints the squared numbers */
      {
         x[0] = f;
         printf("The squared numbers : ");
         for(i=0; i<p; i++) printf(" %d",x[i]);
         printf("\n");
      }
      MPI_Finalize();
      return 0;
   }

MPI for Python
--------------

MPI is also available in Python, via the mpi4py package of Dalcin, Paz,
and Storti, listed in the bibliography at the end of this section.
The script below performs the parallel summation of the first 100
positive integers with the point-to-point commands ``COMM.send``
and ``COMM.recv``: every process sums its own range of numbers,
the workers send their partial sums to the manager,
and the manager receives the partial sums, adds them up,
and prints the total.

::

   from mpi4py import MPI

   COMM = MPI.COMM_WORLD
   SIZE = COMM.Get_size()
   RANK = COMM.Get_rank()

   N = 100               # the numbers 1, 2, ..., N are summed
   NBR = N/SIZE          # every process sums NBR numbers
   START = RANK*NBR + 1  # first number summed by this process
   S = sum(range(START, START + NBR))

   SUMS = [0 for i in range(SIZE)]
   if(RANK > 0):
       COMM.send(S, dest=0)
   else:
       SUMS[0] = S
       for i in range(1, SIZE):
           SUMS[i] = COMM.recv(source=i)
       print 'total sum =', sum(SUMS)

Recall that Python is case sensitive and that the distinction between
``Send`` and ``send``, and between ``Recv`` and ``recv``, is important.
In particular, ``COMM.send`` and ``COMM.recv`` have no type declarations,
whereas ``COMM.Send`` and ``COMM.Recv`` have type declarations.
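
As a small sketch of the typed variants, assuming NumPy is installed
and a run with at least two processes, the exchange below sends a
buffer of four integers from process 0 to process 1 with ``COMM.Send``
and ``COMM.Recv``, which operate on typed buffers such as NumPy arrays
instead of pickled Python objects.

::

   from mpi4py import MPI
   import numpy as np

   COMM = MPI.COMM_WORLD
   RANK = COMM.Get_rank()

   DATA = np.zeros(4, dtype='i')  # buffer of four C integers
   if(RANK == 0):
       DATA = np.array([2, 4, 8, 16], dtype='i')
       COMM.Send([DATA, MPI.INT], dest=1)    # typed send of the buffer
   elif(RANK == 1):
       COMM.Recv([DATA, MPI.INT], source=0)  # receive into the buffer
       print 'process 1 received', DATA

For large NumPy arrays, the typed ``Send`` and ``Recv`` avoid the
pickling overhead of the lowercase variants.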
Bibliography
------------

1. L. Dalcin, R. Paz, and M. Storti.
   **MPI for Python**.
   *Journal of Parallel and Distributed Computing*, 65:1108--1115, 2005.

2. M. Snir, S. Otto, S. Huss-Lederman, D. Walker, and J. Dongarra.
   *MPI - The Complete Reference Volume 1, The MPI Core*.
   Massachusetts Institute of Technology, second edition, 1998.

Exercises
---------

1. Adjust the parallel summation to work for :math:`p` processors,
   where the dimension :math:`n` of the array is a multiple of :math:`p`.

2. Use C or Python to rewrite the program to sum 100 numbers
   using ``MPI_Send`` and ``MPI_Recv`` instead of
   ``MPI_Scatter`` and ``MPI_Gather``.

3. Use C or Python to rewrite the program to square :math:`p` numbers
   using ``MPI_Scatter`` and ``MPI_Gather``.

4. Show that a hypercube network topology has enough direct connections
   between processors for a fan out broadcast.