Basics of MPI
=============

Programming distributed memory parallel computers happens through
message passing.  In this lecture we give basic examples of using
the Message Passing Interface, in C or Python.

One Single Program Executed by all Nodes
----------------------------------------

A parallel program is a collection of concurrent processes.
A process (also called a job or task) is a sequence of instructions.
Usually, there is a 1-to-1 map between processes and processors.
If there are more processes than processors,
then the processes are executed in a time sharing environment.

We use the :index:`SPMD` model: :index:`Single Program, Multiple Data`.
Every node executes the same program.
Every node has a unique identification number (id)
--- the root node has number zero ---
and code can be executed depending on the id.
In a :index:`manager/worker model`, the root node is the manager
and the other nodes are workers.

The letters MPI stand for Message Passing Interface.
MPI is a standard specification for interprocess communication
for which several implementations exist.
When programming in C, we include the header

::

   #include <mpi.h>

to use the functionality of MPI.
``Open MPI`` is an open source implementation of all features of MPI-2.
In this lecture we use MPI in simple interactive programs,
as ``mpicc`` and ``mpirun`` are available on laptop computers.

Our first parallel program is ``mpi_hello_world``.
We use a ``makefile`` to compile, and then run with 3 processes.
Instead of ``mpirun -np 3`` we can also use ``mpiexec -n 3``.

::

   $ make mpi_hello_world
   mpicc mpi_hello_world.c -o /tmp/mpi_hello_world
   $ mpirun -np 3 /tmp/mpi_hello_world
   Hello world from processor 0 out of 3.
   Hello world from processor 1 out of 3.
   Hello world from processor 2 out of 3.
   $

To pass arguments to the :index:`MCA` modules
(MCA stands for Modular Component Architecture)
we can call ``mpirun -np`` (or ``mpiexec -n``)
with the option ``--mca``, such as

::

   mpirun --mca btl tcp,self -np 4 /tmp/mpi_hello_world

MCA modules have a direct impact on MPI programs
because they allow tunable parameters to be set at run time, such as

* which BTL communication device driver to use,

* what parameters to pass to that BTL, etc.

Note: BTL = Byte Transfer Layer.

The code of the program ``mpi_hello_world.c`` is listed below.

::

   #include <stdio.h>
   #include <mpi.h>

   int main ( int argc, char *argv[] )
   {
      int i,p;

      MPI_Init(&argc,&argv);
      MPI_Comm_size(MPI_COMM_WORLD,&p);
      MPI_Comm_rank(MPI_COMM_WORLD,&i);

      printf("Hello world from processor %d out of %d.\n",i,p);

      MPI_Finalize();

      return 0;
   }

Initialization, Finalization, and the Universe
----------------------------------------------

Let us look at some MPI constructions that are part of any program
that uses MPI.  Consider the beginning and the end of the program.

::

   #include <mpi.h>

   int main ( int argc, char *argv[] )
   {
      MPI_Init(&argc,&argv);

      MPI_Finalize();

      return 0;
   }

.. index:: MPI_Init, MPI_COMM_WORLD, MPI_Finalize, MPI_Comm_rank, MPI_Comm_size

``MPI_Init`` processes the command line arguments.
The value of ``argc`` is the number of arguments at the command line
and ``argv`` contains the arguments as strings of characters.
The first argument, ``argv[0]``, is the name of the program.
The cleaning up of the environment is done by ``MPI_Finalize()``.

``MPI_COMM_WORLD`` is a predefined named constant,
a handle to refer to the universe of *p* processors
with labels from 0 to :math:`p-1`.
The number of processors is returned by ``MPI_Comm_size``
and ``MPI_Comm_rank`` returns the label (rank) of a node.
For example:

::

   int i,p;

   MPI_Comm_size(MPI_COMM_WORLD,&p);
   MPI_Comm_rank(MPI_COMM_WORLD,&i);
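The rank is what lets one and the same program behave differently
on different nodes.  The program below is a minimal sketch
(not one of the programs of this lecture) of how a test on the rank
divides the work between the manager and the workers.

::

   #include <stdio.h>
   #include <mpi.h>

   int main ( int argc, char *argv[] )
   {
      int i,p;

      MPI_Init(&argc,&argv);
      MPI_Comm_size(MPI_COMM_WORLD,&p);
      MPI_Comm_rank(MPI_COMM_WORLD,&i);

      if(i == 0)     /* only the root node executes this branch */
         printf("The manager supervises %d workers.\n",p-1);
      else           /* all other nodes execute this branch */
         printf("Worker %d of %d reports for duty.\n",i,p-1);

      MPI_Finalize();

      return 0;
   }

Running this sketch with ``mpirun -np 3`` would print one manager line
and two worker lines, in an unpredictable order.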
Broadcasting Data
-----------------

Many parallel programs follow a manager/worker model.
In a *broadcast* the same data is sent to all nodes.
A broadcast is an example of a :index:`collective communication`.
In a *collective communication*, all nodes participate
in the communication.

As an example, we :index:`broadcast` an integer.
The node with id 0 (the manager) prompts for an integer.
The integer is *broadcast* over the network,
so the number is sent to all processors in the universe.
Every worker node prints the number to screen.
The typical application of broadcasting an integer
is the broadcast of the dimension of the data
before sending the data itself.

The compiling and running of the program goes as follows:

::

   $ make broadcast_integer
   mpicc broadcast_integer.c -o /tmp/broadcast_integer
   $ mpirun -np 3 /tmp/broadcast_integer
   Type an integer number...
   123
   Node 1 writes the number n = 123.
   Node 2 writes the number n = 123.
   $

.. index:: MPI_Bcast

The command ``MPI_Bcast`` executes the broadcast.
An example of the ``MPI_Bcast`` command:

::

   int n;
   MPI_Bcast(&n,1,MPI_INT,0,MPI_COMM_WORLD);

There are five arguments:

1. the address of the element(s) to broadcast;

2. the number of elements that will be broadcast;

3. the type of the elements;

4. the id of the root node that broadcasts the data; and

5. the universe.

The full source listing of the program is shown below.

::

   #include <stdio.h>
   #include <mpi.h>

   void manager ( int* n );
   /* code executed by the manager node 0,
    * prompts the user for an integer number n */

   void worker ( int i, int n );
   /* code executed by the i-th worker node,
    * who will write the integer number n to screen */

   int main ( int argc, char *argv[] )
   {
      int myid,numbprocs,n;

      MPI_Init(&argc,&argv);
      MPI_Comm_size(MPI_COMM_WORLD,&numbprocs);
      MPI_Comm_rank(MPI_COMM_WORLD,&myid);

      if (myid == 0) manager(&n);

      MPI_Bcast(&n,1,MPI_INT,0,MPI_COMM_WORLD);

      if (myid != 0) worker(myid,n);

      MPI_Finalize();

      return 0;
   }

   void manager ( int* n )
   {
      printf("Type an integer number... \n");
      scanf("%d",n);
   }

   void worker ( int i, int n )
   {
      printf("Node %d writes the number n = %d.\n",i,n);
   }
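Because the count and the type are explicit arguments,
the same ``MPI_Bcast`` call handles other kinds of data.
Below is a hedged sketch (not one of the programs of this lecture;
the buffer size of 80 characters is an arbitrary choice)
that broadcasts a string typed in at the manager node.

::

   #include <stdio.h>
   #include <mpi.h>

   int main ( int argc, char *argv[] )
   {
      char buffer[80];   /* the same fixed size on every node */
      int myid;

      MPI_Init(&argc,&argv);
      MPI_Comm_rank(MPI_COMM_WORLD,&myid);

      if(myid == 0)      /* only the manager reads the string */
      {
         printf("Type a string of at most 79 characters...\n");
         scanf("%79s",buffer);
      }
      /* every node, the root included, makes the same call */
      MPI_Bcast(buffer,80,MPI_CHAR,0,MPI_COMM_WORLD);

      if(myid != 0)
         printf("Node %d received the string %s.\n",myid,buffer);

      MPI_Finalize();

      return 0;
   }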
Moving Data from Manager to Workers
-----------------------------------

Often we want to broadcast an array of doubles.
The situation before broadcasting the dimension :math:`n` to all nodes
on a 4-processor distributed memory computer
is shown at the top left of :numref:`figbroadcastdata`.
After the broadcast of the dimension, each node *must* allocate space
to hold as many doubles as the dimension.

.. _figbroadcastdata:

.. figure:: ./figbroadcastdata.png
   :align: center

   On the schematic of a distributed memory 4-processor computer,
   the top displays the situation before and after
   the broadcast of the dimension.
   After the broadcast of the dimension,
   each worker node allocates space for the array of doubles.
   The bottom two pictures display the situation
   before and after the broadcast of the array of doubles.

We go through the code step by step.
First we write the headers and the subroutine declarations.
We include ``stdlib.h`` for :index:`memory allocation`.

::

   #include <stdio.h>
   #include <stdlib.h>
   #include <mpi.h>

   void define_doubles ( int n, double *d );
   /* defines the values of the n doubles in d */

   void write_doubles ( int myid, int n, double *d );
   /* node with id equal to myid writes the n doubles in d */

The main function starts by broadcasting the dimension.

::

   int main ( int argc, char *argv[] )
   {
      int myid,numbprocs,n;
      double *data;

      MPI_Init(&argc,&argv);
      MPI_Comm_size(MPI_COMM_WORLD,&numbprocs);
      MPI_Comm_rank(MPI_COMM_WORLD,&myid);

      if (myid == 0)
      {
         printf("Type the dimension ...\n");
         scanf("%d",&n);
      }
      MPI_Bcast(&n,1,MPI_INT,0,MPI_COMM_WORLD);

The main program continues, allocating memory.
It is very important that *every* node performs the memory allocation.

::

      data = (double*)calloc(n,sizeof(double));

      if (myid == 0) define_doubles(n,data);

      MPI_Bcast(data,n,MPI_DOUBLE,0,MPI_COMM_WORLD);

      if (myid != 0) write_doubles(myid,n,data);

      MPI_Finalize();

      return 0;
   }

It is good programming practice to separate the code
that does not involve any MPI activity into subroutines.
The two subroutines are defined below.

::

   void define_doubles ( int n, double *d )
   {
      int i;

      printf("defining %d doubles ...\n", n);
      for(i=0; i < n; i++) d[i] = (double)i;
   }

   void write_doubles ( int myid, int n, double *d )
   {
      int i;

      printf("Node %d writes %d doubles : \n", myid,n);
      for(i=0; i < n; i++) printf("%lf\n",d[i]);
   }
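The listing above assumes that ``calloc`` always succeeds.
As a hedged sketch (not part of the lecture's program),
the allocation step in the main function could be guarded as below.
``MPI_Abort`` is the standard MPI call to end all processes
in a communicator; the error code 911 is an arbitrary choice.

::

      data = (double*)calloc(n,sizeof(double));
      if(data == NULL)    /* the allocation failed on this node */
      {
         printf("Node %d failed to allocate %d doubles.\n",myid,n);
         MPI_Abort(MPI_COMM_WORLD,911);  /* 911 is an arbitrary code */
      }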
MPI for Python
--------------

``MPI for Python`` provides bindings of MPI for Python,
allowing any Python program to exploit multiple processors.
It is available at ``http://code.google.com/p/mpi4py``,
with a manual by Lisandro Dalcin: *MPI for Python*.
The current release 2.0.0 dates from July 2016.

.. index:: point-to-point communication, collective communication

The object oriented interface closely follows the MPI-2 C++ bindings
and supports point-to-point and collective communications
of any picklable Python object,
as well as numpy arrays and builtin bytes and strings.
``mpi4py`` gives the standard MPI *look and feel* in Python scripts
to develop parallel programs.
Often, only a small part of the code needs the efficiency
of a compiled language.
Python handles memory, errors, and user interaction.

Our first script is again a *hello world*, shown below.

::

   from mpi4py import MPI

   SIZE = MPI.COMM_WORLD.Get_size()
   RANK = MPI.COMM_WORLD.Get_rank()
   NAME = MPI.Get_processor_name()

   MESSAGE = "Hello from %d of %d on %s." \
       % (RANK, SIZE, NAME)
   print MESSAGE

Programs that run with MPI are executed with ``mpiexec``.
To run ``mpi4py_hello_world.py`` with 3 processes:

::

   $ mpiexec -n 3 python mpi4py_hello_world.py
   Hello from 2 of 3 on asterix.local.
   Hello from 0 of 3 on asterix.local.
   Hello from 1 of 3 on asterix.local.
   $

Three Python interpreters are launched.
Each interpreter executes the script, printing the hello message.

Let us consider again the basic MPI concepts and commands.
``MPI.COMM_WORLD`` is a predefined intracommunicator.
An intracommunicator is a group of processes.
All processes within an intracommunicator have a unique number.
Methods of the intracommunicator ``MPI.COMM_WORLD`` are
``Get_size()``, which returns the number of processes,
and ``Get_rank()``, which returns the rank of the executing process.
Even though every process runs the same script,
the test ``if MPI.COMM_WORLD.Get_rank() == i:``
allows us to specify particular code for the *i*-th process.
``MPI.Get_processor_name()`` returns the name of the calling processor.

A collective communication involves every process
in the intracommunicator.
A broadcast is a collective communication in which one process
sends the same data to all processes
and all processes receive the same data.
In ``mpi4py``, a broadcast is done with the ``bcast`` method.
An example:

::

   $ mpiexec -n 3 python mpi4py_broadcast.py
   0 has data {'pi': 3.1415926535897, 'e': 2.7182818284590}
   1 has data {'pi': 3.1415926535897, 'e': 2.7182818284590}
   2 has data {'pi': 3.1415926535897, 'e': 2.7182818284590}
   $

To pass arguments to the MCA modules, we call ``mpiexec`` as
``mpiexec --mca btl tcp,self -n 3 python mpi4py_broadcast.py``.

The script ``mpi4py_broadcast.py`` below
performs a broadcast of a Python dictionary.

::

   from mpi4py import MPI

   COMM = MPI.COMM_WORLD
   RANK = COMM.Get_rank()

   if(RANK == 0):
       DATA = {'e' : 2.7182818284590451,
               'pi' : 3.1415926535897931 }
   else:
       DATA = None   # DATA must be defined on every process
   DATA = COMM.bcast(DATA, root=0)
   print RANK, 'has data', DATA

Bibliography
------------

1. L. Dalcin, R. Paz, and M. Storti.
   **MPI for Python.**
   *Journal of Parallel and Distributed Computing*, 65:1108-1115, 2005.

2. M. Snir, S. Otto, S. Huss-Lederman, D. Walker, and J. Dongarra.
   *MPI - The Complete Reference, Volume 1: The MPI Core.*
   Massachusetts Institute of Technology, second edition, 1998.

Exercises
---------

0. Visit ``http://www.mpi-forum.org/docs/`` and look at the MPI book,
   available at ``http://www.netlib.org/utk/papers/mpi-book/mpi-book.html``.

1. Adjust hello world so that after you type in your name once,
   when prompted by the manager node,
   every node salutes you, using the name you typed in.

2. We measure the wall clock time using ``time mpirun``
   in the broadcasting of an array of doubles.
   To avoid typing in the dimension *n*,
   either define *n* as a constant in the program
   or redirect the input from a file that contains *n*.
   For an increasing number of processes and increasing *n*,
   investigate how the wall clock time grows.