Basics of MPI
=============

Programming distributed memory parallel computers happens through
message passing.  In this lecture we give basic examples of using
the Message Passing Interface, in C or Python.

One Single Program Executed by all Nodes
----------------------------------------

A parallel program is a collection of concurrent processes.
A process (also called a job or task) is a sequence of instructions.
Usually, there is a 1-to-1 map between processes and processors.
If there are more processes than processors,
then the processes are executed in a time sharing environment.

We use the :index:`SPMD` model: :index:`Single Program, Multiple Data`.
Every node executes the same program.
Every node has a unique identification number (id)
--- the root node has number zero ---
and code can be executed depending on the id.
In a :index:`manager/worker model`, the root node is the manager
and the other nodes are workers.

The letters MPI stand for Message Passing Interface.
MPI is a standard specification for interprocess communication
for which several implementations exist.
When programming in C, we include the header

::

   #include <mpi.h>

to use the functionality of MPI.
``Open MPI`` is an open source implementation of all features of MPI-2.
In this lecture we use MPI in simple interactive programs,
as ``mpicc`` and ``mpirun`` are available on laptop computers.

Our first parallel program is ``mpi_hello_world``.
We use a ``makefile`` to compile, and then run with 3 processes.
Instead of ``mpirun -np 3`` we can also use ``mpiexec -n 3``.

::

   $ make mpi_hello_world
   mpicc mpi_hello_world.c -o /tmp/mpi_hello_world
   $ mpirun -np 3 /tmp/mpi_hello_world
   Hello world from processor 0 out of 3.
   Hello world from processor 1 out of 3.
   Hello world from processor 2 out of 3.
   $

To pass arguments to the :index:`MCA` modules
(MCA stands for Modular Component Architecture)
we can call ``mpirun -np`` (or ``mpiexec -n``)
with the option ``--mca``, such as

::

   mpirun --mca btl tcp,self -np 4 /tmp/mpi_hello_world

MCA modules have a direct impact on MPI programs
because they allow tunable parameters to be set at run time, such as

* which BTL communication device driver to use,

* what parameters to pass to that BTL, etc.

Note: BTL = Byte Transfer Layer.

The code of the program ``mpi_hello_world.c`` is listed below.

::

   #include <stdio.h>
   #include <mpi.h>

   int main ( int argc, char *argv[] )
   {
      int i,p;

      MPI_Init(&argc,&argv);
      MPI_Comm_size(MPI_COMM_WORLD,&p);
      MPI_Comm_rank(MPI_COMM_WORLD,&i);

      printf("Hello world from processor %d out of %d.\n",i,p);

      MPI_Finalize();

      return 0;
   }

Initialization, Finalization, and the Universe
----------------------------------------------

Let us look at some MPI constructions that are part of any program
that uses MPI.  Consider the beginning and the end of the program.

::

   #include <mpi.h>

   int main ( int argc, char *argv[] )
   {
      MPI_Init(&argc,&argv);

      MPI_Finalize();

      return 0;
   }

.. index:: MPI_Init, MPI_COMM_WORLD, MPI_Finalize, MPI_Comm_rank, MPI_Comm_size

``MPI_Init`` processes the command line arguments.
The value of ``argc`` is the number of arguments at the command line
and ``argv`` contains the arguments as strings of characters.
The first argument, ``argv[0]``, is the name of the program.
The cleaning up of the environment is done by ``MPI_Finalize()``.

``MPI_COMM_WORLD`` is a predefined named constant,
a handle to refer to the universe of *p* processors
with labels from 0 to :math:`p-1`.
The number of processors is returned by ``MPI_Comm_size``
and ``MPI_Comm_rank`` returns the label (rank) of a node.
For example:

::

   int i,p;

   MPI_Comm_size(MPI_COMM_WORLD,&p);
   MPI_Comm_rank(MPI_COMM_WORLD,&i);
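The rank is what lets one and the same program behave differently
on different nodes.  The program below is a minimal sketch
(not one of the programs of this lecture) of how a test on the rank
divides the work between the manager and the workers.

::

   #include <stdio.h>
   #include <mpi.h>

   int main ( int argc, char *argv[] )
   {
      int i,p;

      MPI_Init(&argc,&argv);
      MPI_Comm_size(MPI_COMM_WORLD,&p);
      MPI_Comm_rank(MPI_COMM_WORLD,&i);

      if(i == 0)     /* only the root node executes this branch */
         printf("The manager supervises %d workers.\n",p-1);
      else           /* all other nodes execute this branch */
         printf("Worker %d of %d reports for duty.\n",i,p-1);

      MPI_Finalize();

      return 0;
   }

Running this sketch with ``mpirun -np 3`` would print one manager line
and two worker lines, in an unpredictable order.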
Broadcasting Data
-----------------

Many parallel programs follow a manager/worker model.
In a *broadcast* the same data is sent to all nodes.
A broadcast is an example of a :index:`collective communication`.
In a *collective communication*, all nodes participate
in the communication.

As an example, we :index:`broadcast` an integer.
The node with id 0 (the manager) prompts for an integer.
The integer is *broadcast* over the network,
so the number is sent to all processors in the universe.
Every worker node prints the number to screen.
The typical application of broadcasting an integer
is the broadcast of the dimension of the data
before sending the data itself.

The compiling and running of the program goes as follows:

::

   $ make broadcast_integer
   mpicc broadcast_integer.c -o /tmp/broadcast_integer
   $ mpirun -np 3 /tmp/broadcast_integer
   Type an integer number...
   123
   Node 1 writes the number n = 123.
   Node 2 writes the number n = 123.
   $

.. index:: MPI_Bcast

The command ``MPI_Bcast`` executes the broadcast.
An example of the ``MPI_Bcast`` command:

::

   int n;
   MPI_Bcast(&n,1,MPI_INT,0,MPI_COMM_WORLD);

There are five arguments:

1. the address of the element(s) to broadcast;

2. the number of elements that will be broadcast;

3. the type of the elements;

4. the id of the root node that broadcasts the data; and

5. the universe.

The full source listing of the program is shown below.

::

   #include <stdio.h>
   #include <mpi.h>

   void manager ( int* n );
   /* code executed by the manager node 0,
    * prompts the user for an integer number n */

   void worker ( int i, int n );
   /* code executed by the i-th worker node,
    * who will write the integer number n to screen */

   int main ( int argc, char *argv[] )
   {
      int myid,numbprocs,n;

      MPI_Init(&argc,&argv);
      MPI_Comm_size(MPI_COMM_WORLD,&numbprocs);
      MPI_Comm_rank(MPI_COMM_WORLD,&myid);

      if (myid == 0) manager(&n);

      MPI_Bcast(&n,1,MPI_INT,0,MPI_COMM_WORLD);

      if (myid != 0) worker(myid,n);

      MPI_Finalize();

      return 0;
   }

   void manager ( int* n )
   {
      printf("Type an integer number... \n");
      scanf("%d",n);
   }

   void worker ( int i, int n )
   {
      printf("Node %d writes the number n = %d.\n",i,n);
   }
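Because the count and the type are explicit arguments,
the same ``MPI_Bcast`` call handles other kinds of data.
Below is a hedged sketch (not one of the programs of this lecture;
the buffer size of 80 characters is an arbitrary choice)
that broadcasts a string typed in at the manager node.

::

   #include <stdio.h>
   #include <mpi.h>

   int main ( int argc, char *argv[] )
   {
      char buffer[80];   /* the same fixed size on every node */
      int myid;

      MPI_Init(&argc,&argv);
      MPI_Comm_rank(MPI_COMM_WORLD,&myid);

      if(myid == 0)      /* only the manager reads the string */
      {
         printf("Type a string of at most 79 characters...\n");
         scanf("%79s",buffer);
      }
      /* every node, the root included, makes the same call */
      MPI_Bcast(buffer,80,MPI_CHAR,0,MPI_COMM_WORLD);

      if(myid != 0)
         printf("Node %d received the string %s.\n",myid,buffer);

      MPI_Finalize();

      return 0;
   }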
Moving Data from Manager to Workers
-----------------------------------

Often we want to broadcast an array of doubles.
The situation before broadcasting the dimension :math:`n` to all nodes
on a 4-processor distributed memory computer
is shown at the top left of :numref:`figbroadcastdata`.
After the broadcast of the dimension, each node *must* allocate space
to hold as many doubles as the dimension.

.. _figbroadcastdata:

.. figure:: ./figbroadcastdata.png
   :align: center

   On the schematic of a distributed memory 4-processor computer,
   the top displays the situation before and after
   the broadcast of the dimension.
   After the broadcast of the dimension,
   each worker node allocates space for the array of doubles.
   The bottom two pictures display the situation
   before and after the broadcast of the array of doubles.

We go through the code step by step.
First we write the headers and the subroutine declarations.
We include ``stdlib.h`` for :index:`memory allocation`.

::

   #include <stdio.h>
   #include <stdlib.h>
   #include <mpi.h>

   void define_doubles ( int n, double *d );
   /* defines the values of the n doubles in d */

   void write_doubles ( int myid, int n, double *d );
   /* node with id equal to myid writes the n doubles in d */

The main function starts by broadcasting the dimension.

::

   int main ( int argc, char *argv[] )
   {
      int myid,numbprocs,n;
      double *data;

      MPI_Init(&argc,&argv);
      MPI_Comm_size(MPI_COMM_WORLD,&numbprocs);
      MPI_Comm_rank(MPI_COMM_WORLD,&myid);

      if (myid == 0)
      {
         printf("Type the dimension ...\n");
         scanf("%d",&n);
      }
      MPI_Bcast(&n,1,MPI_INT,0,MPI_COMM_WORLD);

The main program continues, allocating memory.
It is very important that *every* node performs the memory allocation.

::

      data = (double*)calloc(n,sizeof(double));

      if (myid == 0) define_doubles(n,data);

      MPI_Bcast(data,n,MPI_DOUBLE,0,MPI_COMM_WORLD);

      if (myid != 0) write_doubles(myid,n,data);

      MPI_Finalize();

      return 0;
   }

It is good programming practice to separate the code
that does not involve any MPI activity into subroutines.
The two subroutines are defined below.

::

   void define_doubles ( int n, double *d )
   {
      int i;

      printf("defining %d doubles ...\n", n);
      for(i=0; i < n; i++) d[i] = (double)i;
   }

   void write_doubles ( int myid, int n, double *d )
   {
      int i;

      printf("Node %d writes %d doubles : \n", myid,n);
      for(i=0; i < n; i++) printf("%lf\n",d[i]);
   }
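The listing above assumes that ``calloc`` always succeeds.
As a hedged sketch (not part of the lecture's program),
the allocation step in the main function could be guarded as below.
``MPI_Abort`` is the standard MPI call to end all processes
in a communicator; the error code 911 is an arbitrary choice.

::

      data = (double*)calloc(n,sizeof(double));
      if(data == NULL)    /* the allocation failed on this node */
      {
         printf("Node %d failed to allocate %d doubles.\n",myid,n);
         MPI_Abort(MPI_COMM_WORLD,911);  /* 911 is an arbitrary code */
      }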
MPI for Python
--------------

``MPI for Python`` provides bindings of MPI for Python,
allowing any Python program to exploit multiple processors.
It is available at ``http://code.google.com/p/mpi4py``,
with a manual by Lisandro Dalcin: *MPI for Python*.
The current release 2.0.0 dates from July 2016.

.. index:: point-to-point communication, collective communication

The object oriented interface closely follows the MPI-2 C++ bindings
and supports point-to-point and collective communications
of any picklable Python object,
as well as numpy arrays and builtin bytes and strings.
``mpi4py`` gives the standard MPI *look and feel* in Python scripts
to develop parallel programs.
Often, only a small part of the code needs the efficiency
of a compiled language.
Python handles memory, errors, and user interaction.

Our first script is again a *hello world*, shown below.

::

   from mpi4py import MPI

   SIZE = MPI.COMM_WORLD.Get_size()
   RANK = MPI.COMM_WORLD.Get_rank()
   NAME = MPI.Get_processor_name()

   MESSAGE = "Hello from %d of %d on %s." \
       % (RANK, SIZE, NAME)
   print MESSAGE

Programs that run with MPI are executed with ``mpiexec``.
To run ``mpi4py_hello_world.py`` with 3 processes:

::

   $ mpiexec -n 3 python mpi4py_hello_world.py
   Hello from 2 of 3 on asterix.local.
   Hello from 0 of 3 on asterix.local.
   Hello from 1 of 3 on asterix.local.
   $

Three Python interpreters are launched.
Each interpreter executes the script, printing the hello message.

Let us consider again the basic MPI concepts and commands.
``MPI.COMM_WORLD`` is a predefined intracommunicator.
An intracommunicator is a group of processes.
All processes within an intracommunicator have a unique number.
Methods of the intracommunicator ``MPI.COMM_WORLD`` are
``Get_size()``, which returns the number of processes,
and ``Get_rank()``, which returns the rank of the executing process.
Even though every process runs the same script,
the test ``if MPI.COMM_WORLD.Get_rank() == i:``
allows us to specify particular code for the *i*-th process.
``MPI.Get_processor_name()`` returns the name of the calling processor.

A collective communication involves every process
in the intracommunicator.
A broadcast is a collective communication in which one process
sends the same data to all processes
and all processes receive the same data.
In ``mpi4py``, a broadcast is done with the ``bcast`` method.
An example:

::

   $ mpiexec -n 3 python mpi4py_broadcast.py
   0 has data {'pi': 3.1415926535897, 'e': 2.7182818284590}
   1 has data {'pi': 3.1415926535897, 'e': 2.7182818284590}
   2 has data {'pi': 3.1415926535897, 'e': 2.7182818284590}
   $

To pass arguments to the MCA modules, we call ``mpiexec`` as
``mpiexec --mca btl tcp,self -n 3 python mpi4py_broadcast.py``.

The script ``mpi4py_broadcast.py`` below
performs a broadcast of a Python dictionary.

::

   from mpi4py import MPI

   COMM = MPI.COMM_WORLD
   RANK = COMM.Get_rank()

   if(RANK == 0):
       DATA = {'e' : 2.7182818284590451,
               'pi' : 3.1415926535897931 }
   else:
       DATA = None   # DATA must be defined on every process
   DATA = COMM.bcast(DATA, root=0)
   print RANK, 'has data', DATA

Bibliography
------------

1. L. Dalcin, R. Paz, and M. Storti.
   **MPI for Python.**
   *Journal of Parallel and Distributed Computing*, 65:1108-1115, 2005.

2. M. Snir, S. Otto, S. Huss-Lederman, D. Walker, and J. Dongarra.
   *MPI - The Complete Reference, Volume 1: The MPI Core.*
   Massachusetts Institute of Technology, second edition, 1998.

Exercises
---------

0. Visit ``http://www.mpi-forum.org/docs/`` and look at the MPI book,
   available at ``http://www.netlib.org/utk/papers/mpi-book/mpi-book.html``.

1. Adjust hello world so that after you type in your name once,
   when prompted by the manager node,
   every node salutes you, using the name you typed in.

2. We measure the wall clock time using ``time mpirun``
   in the broadcasting of an array of doubles.
   To avoid typing in the dimension *n*,
   either define *n* as a constant in the program
   or redirect the input from a file that contains *n*.
   For an increasing number of processes and increasing *n*,
   investigate how the wall clock time grows.