Disk Based Parallelism
======================

We covered distributed memory and shared memory parallelism
as two separate technologies.
Adding a software layer on top of message passing,
as the software ``Roomy`` does, provides the parallel programmer
with a vast random access memory.

The Disk is the new RAM
-----------------------

When is random access memory not large enough?
In the context of this lecture, we consider big data applications.
In applications with huge volumes of data,
the bottleneck is not the number of arithmetical operations.
*The disk is the new RAM.*

``Roomy`` is a software library that allows us to treat disk
as random access memory.
One application of ``Roomy`` is the determination of the minimal
number of moves needed to solve Rubik's cube.

In *parallel disk-based computation*, instead of random access memory,
we use disks as the main working memory of a computation.
This gives much more space for the same price.
There are at least two performance issues for which we give solutions.

1. Bandwidth: the bandwidth of a disk is roughly 50 times less
   than that of RAM (100 MB/s versus 5 GB/s).
   The solution is to use many disks in parallel.

2. Latency: even worse, the latency of a disk is many orders of magnitude
   worse than that of RAM.
   The solution to avoid latency penalties is to use streaming access.

As an example we take the pancake sorting problem.
Given a stack of *n* numbered pancakes, at the left of
:numref:`figpancakesort`, a spatula can reverse the order of the
top *k* pancakes, for :math:`2 \leq k \leq n`.

.. _figpancakesort:

.. figure:: ./figpancakesort.png
   :align: center

   Sorting a pile of pancakes with flips.

The sort in :numref:`figpancakesort` happened with 4 flips:

1. k = 2
2. k = 4
3. k = 3
4. k = 2

The question we ask is the following:
how many flips (prefix reversals) are sufficient to sort?
To understand this question we introduce the pancake sorting graph.
We generate all permutations using the flips,
as in :numref:`fig4permutations`.

.. _fig4permutations:

.. figure:: ./fig4permutations.png
   :align: center

   Permutations generated by flips.

There are 3 stacks that require 1 flip.
As :math:`4! = 24` and :math:`24 = 1 + 3 + 6 + 11 + 3`
(one stack needs 0 flips, 3 stacks need 1 flip, 6 need 2 flips,
11 need 3 flips, and 3 need 4 flips),
at most 4 flips are needed for a stack of 4 pancakes.

A breadth first search is done by the program ``pancake``;
its output is displayed below:

::

   $ mpiexec -np 2 ./pancake
   Sun Mar 4 13:24:44 2012: BEGINNING 11 PANCAKE BFS
   Sun Mar 4 13:24:44 2012: Initial one-bit RoomyArray constructed
   Sun Mar 4 13:24:44 2012: Level 0 done: 1 elements
   Sun Mar 4 13:24:44 2012: Level 1 done: 10 elements
   Sun Mar 4 13:24:44 2012: Level 2 done: 90 elements
   Sun Mar 4 13:24:44 2012: Level 3 done: 809 elements
   Sun Mar 4 13:24:44 2012: Level 4 done: 6429 elements
   Sun Mar 4 13:24:44 2012: Level 5 done: 43891 elements
   Sun Mar 4 13:24:45 2012: Level 6 done: 252737 elements
   Sun Mar 4 13:24:49 2012: Level 7 done: 1174766 elements
   Sun Mar 4 13:25:04 2012: Level 8 done: 4126515 elements
   Sun Mar 4 13:25:45 2012: Level 9 done: 9981073 elements
   Sun Mar 4 13:26:48 2012: Level 10 done: 14250471 elements
   Sun Mar 4 13:27:37 2012: Level 11 done: 9123648 elements
   Sun Mar 4 13:27:53 2012: Level 12 done: 956354 elements
   Sun Mar 4 13:27:54 2012: Level 13 done: 6 elements
   Sun Mar 4 13:27:54 2012: Level 14 done: 0 elements
   Sun Mar 4 13:27:54 2012: ONE-BIT PANCAKE BFS DONE
   Sun Mar 4 13:27:54 2012: Elapsed time: 3 m, 10 s, 536 ms, 720 us

How many flips are needed to sort a stack of 11 pancakes?
There are
:math:`11 \times 10 \times 9 \times 8 \times 7 \times 6 \times 5 \times 4 \times 3 \times 2 = 39,916,800`
different stacks of 11 pancakes.
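The breadth first search itself is easy to sketch for a small stack.
Below is a minimal, in-memory C program, written for this text
(it is not the ``pancake`` program used above and uses neither Roomy nor MPI),
that enumerates all stacks of 4 pancakes by flips,
starting from the sorted stack.
Its level counts reproduce :math:`1 + 3 + 6 + 11 + 3 = 24`.

::

   #include <stdio.h>
   #include <string.h>

   #define N 4          /* number of pancakes in the stack */
   #define MAXCODE 4445 /* stacks encode to at most 4 decimal digits */

   /* encodes a stack p[0..N-1], top pancake first, as one integer */
   int encode ( const int *p )
   {
      int i, code = 0;
      for(i=0; i<N; i++) code = 10*code + p[i];
      return code;
   }

   /* reverses the order of the top k pancakes */
   void flip ( int *p, int k )
   {
      int i, j, t;
      for(i=0, j=k-1; i<j; i++, j--)
      {
         t = p[i]; p[i] = p[j]; p[j] = t;
      }
   }

   int main ( void )
   {
      int visited[MAXCODE];
      int cur[24][N], next[24][N];  /* at most 4! = 24 stacks per level */
      int stack[N], sorted[N] = {1, 2, 3, 4};
      int s, k, code, ncur, nnext, level = 0;

      memset(visited, 0, sizeof(visited));
      memcpy(cur[0], sorted, sizeof(sorted));
      ncur = 1;
      visited[encode(sorted)] = 1;

      while(ncur > 0)
      {
         printf("Level %d done: %d elements\n", level, ncur);
         nnext = 0;
         for(s=0; s<ncur; s++)      /* expand every stack in this level */
            for(k=2; k<=N; k++)     /* apply every possible flip */
            {
               memcpy(stack, cur[s], sizeof(stack));
               flip(stack, k);
               code = encode(stack);
               if(!visited[code])   /* keep only stacks not seen before */
               {
                  visited[code] = 1;
                  memcpy(next[nnext++], stack, sizeof(stack));
               }
            }
         memcpy(cur, next, nnext*sizeof(cur[0]));
         ncur = nnext;
         level = level + 1;
      }
      return 0;
   }

The run of ``pancake`` does the same kind of search for 11 pancakes,
with the permutations stored one bit at a time in a ``RoomyArray``;
the table below collects the level counts of that run.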
.. _tabdatabfs:

.. table:: Number of flips to sort 11 pancakes.

   +--------+----------+
   | #moves | #stacks  |
   +========+==========+
   |      0 |        1 |
   +--------+----------+
   |      1 |       10 |
   +--------+----------+
   |      2 |       90 |
   +--------+----------+
   |      3 |      809 |
   +--------+----------+
   |      4 |     6429 |
   +--------+----------+
   |      5 |    43891 |
   +--------+----------+
   |      6 |   252737 |
   +--------+----------+
   |      7 |  1174766 |
   +--------+----------+
   |      8 |  4126515 |
   +--------+----------+
   |      9 |  9981073 |
   +--------+----------+
   |     10 | 14250471 |
   +--------+----------+
   |     11 |  9123648 |
   +--------+----------+
   |     12 |   956354 |
   +--------+----------+
   |     13 |        6 |
   +--------+----------+
   |  total | 39916800 |
   +--------+----------+

So we find that at most 13 moves are needed
for a stack of 11 pancakes.

Roomy: A System for Space Limited Computations
----------------------------------------------

``Roomy`` is a new programming model that extends a programming language
with transparent disk-based computing support.
It is an open source C/C++ library implementing this programming language
extension, available at SourceForge,
written by Daniel Kunkle in the lab of Gene Cooperman.
Roomy has been applied to problems in computational group theory
and to large enumerations; for example, it was used to show that
Rubik's cube can be solved in 26 moves or less.

Consider for example a cluster of 50 nodes with 200GB of disk space
per computer, which gives 10TB of disk space in total.
The bandwidth of one disk is 100MB/s,
so the bandwidth of 50 disks in parallel is 5GB/s.
In this way, the 10TB of disk space can be treated as RAM.
Some caveats remain.
Disk latency remains limiting; the solution is to use old-fashioned RAM
as cache: if disk is the new RAM, then RAM is the new cache.
Moreover, computations must be restructured to emphasize local access
over network access.

The Roomy programming model

1. provides basic data structures: arrays, lists, and hash tables,
   and transparently distributes these data structures across many disks;

2. performs operations on that data in parallel;

3. immediately processes streaming access operators; and

4. delays processing random access operators until they can be performed
   efficiently in batch, for example, by collecting and sorting updates
   to an array (see the sketch after the predicate example below).

We introduce some programming concepts of ``Roomy``.
The first programming construct we consider is ``map``.

::

   RoomyArray* ra;
   RoomyList* rl;

   // function to map over ra
   void mapFunc ( uint64 i, void* val )
   {
      RoomyList_add(rl,val);
   }

   int main ( int argc, char **argv )
   {
      Roomy_init(&argc,&argv);
      ra = RoomyArray_makeBytes("array",sizeof(uint64),100);
      rl = RoomyList_make("list",sizeof(uint64));
      RoomyArray_map(ra,mapFunc); // execute map
      RoomyList_sync(rl);         // synchronize list for delayed adds
      Roomy_finalize();
   }

The next programming construct is ``reduce``.
Computing for example the sum of squares of the elements
in a ``RoomyList``:

::

   RoomyList* rl; // elements of type int

   // add the square of an element to the sum
   void mergeElt ( int* sum, int* element )
   {
      *sum += (*element) * (*element);
   }

   // compute the sum of two partial answers
   void mergeResults ( int* sum1, int* sum2 )
   {
      *sum1 += *sum2;
   }

   int sum = 0;
   RoomyList_reduce(rl,&sum,sizeof(int),mergeElt,mergeResults);

Another programming construct is the predicate.
Predicates count the number of elements in a data structure
that satisfy a Boolean function.

::

   RoomyList* rl;

   // predicate: return 1 if the element is greater than 42
   uint8 predFunc ( int* val )
   {
      return (*val > 42) ? 1 : 0;
   }

   RoomyList_attachPredicate(rl,predFunc);
   uint64 gt42 = RoomyList_predicateCount(rl,predFunc);
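To make the fourth point of the Roomy programming model concrete,
the sketch below (plain C, without Roomy, with made up sizes and updates)
mimics what a delayed random access operator does:
updates arrive in arbitrary order, are collected in a buffer,
sorted by index, and then applied in one sequential sweep over the array.
On disk, that sweep is a streaming access
and avoids the latency penalty of random access.

::

   #include <stdio.h>
   #include <stdlib.h>

   #define ARRAY_SIZE 16   /* stands in for a huge disk based array */
   #define NUM_UPDATES 8   /* number of delayed updates in the buffer */

   /* one delayed update : add value to the slot at index */
   typedef struct { int index; int value; } update;

   /* compares two updates by index, for sorting the buffer */
   int by_index ( const void *a, const void *b )
   {
      return ((const update*)a)->index - ((const update*)b)->index;
   }

   int main ( void )
   {
      int i;
      int array[ARRAY_SIZE] = {0};
      update buffer[NUM_UPDATES] =   /* updates arrive in random order */
      {
         {9, 1}, {2, 5}, {9, 2}, {0, 3}, {7, 1}, {2, 1}, {15, 4}, {0, 2}
      };

      /* sort the buffered updates by index ... */
      qsort(buffer, NUM_UPDATES, sizeof(update), by_index);

      /* ... and apply them in one sequential sweep over the array */
      for(i=0; i<NUM_UPDATES; i++)
         array[buffer[i].index] += buffer[i].value;

      for(i=0; i<ARRAY_SIZE; i++)
         printf("array[%d] = %d\n", i, array[i]);

      return 0;
   }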
The next programming construct we illustrate is
*permutation multiplication*.
For permutations X, Y, and Z, each stored as an array of length N,
we compute Z = X*Y as ``for i=0 to N-1: Z[i] = Y[X[i]]``.

::

   RoomyArray *X, *Y, *Z;

   // access X[i]
   void accessX ( uint64 i, uint64* x_i )
   {
      RoomyArray_access(Y,*x_i,&i,accessY);
   }

   // access Y[X[i]]
   void accessY ( uint64 x_i, uint64* y_x_i, uint64* i )
   {
      RoomyArray_update(Z,*i,y_x_i,setZ);
   }

   // set Z[i] = Y[X[i]]
   void setZ ( uint64 i, uint64* z_i, uint64* y_x_i, uint64* z_i_NEW )
   {
      *z_i_NEW = *y_x_i;
   }

   RoomyArray_map(X,accessX); // access X[i]
   RoomyArray_sync(Y);        // access Y[X[i]]
   RoomyArray_sync(Z);        // set Z[i] = Y[X[i]]

``Roomy`` supports set operations as a programming construct.
To convert a list into a set:

::

   RoomyList *A;             // can contain duplicates
   RoomyList_removeDupes(A); // A is now a set

Union of two sets :math:`A \cup B`:

::

   RoomyList *A, *B;
   RoomyList_addAll(A,B);
   RoomyList_removeDupes(A);

Difference of two sets :math:`A \setminus B`:

::

   RoomyList *A, *B;
   RoomyList_removeAll(A,B);

The intersection :math:`A \cap B` is implemented as
:math:`(A \cup B) \setminus (A \setminus B) \setminus (B \setminus A)`.

The last programming construct we consider is breadth-first search.
Initializing the search:

::

   // lists of all elements, the current, and the next level
   RoomyList* all = RoomyList_make("allLev",eltSize);
   RoomyList* cur = RoomyList_make("lev0",eltSize);
   RoomyList* next = RoomyList_make("lev1",eltSize);

   // function to produce the next level from the current one
   void genNext ( T elt )
   {
      // user defined code to compute the neighbors of elt ...
      for nbr in neighbors      // pseudocode for the loop over neighbors
         RoomyList_add(next,nbr);
   }

   // add the start element
   RoomyList_add(all,startElt);
   RoomyList_add(cur,startElt);

Performing the search is shown below:

::

   // generate levels until no new states are found
   while(RoomyList_size(cur))
   {
      // generate the next level from the current one
      RoomyList_map(cur,genNext);
      RoomyList_sync(next);
      // detect duplicates within the next level
      RoomyList_removeDupes(next);
      // detect duplicates from previous levels
      RoomyList_removeAll(next,all);
      // record the new elements
      RoomyList_addAll(all,next);
      // rotate the levels
      RoomyList_destroy(cur);
      cur = next;
      next = RoomyList_make(levName,eltSize);
   }

Hadoop and the Map/Reduce model
-------------------------------

To process big data with parallel and distributed computing,
we could use a relational database such as MySQL,
but relational databases do not scale well to such volumes of data.
Hadoop has become the de facto standard for processing big data.
Hadoop is used by Facebook to store photos,
by LinkedIn to generate recommendations,
and by Amazon to generate search indices.
Hadoop works by connecting many different computers,
while hiding the complexity from the user:
it appears as if we work with one giant computer.

Hadoop uses a model called Map/Reduce.
The goal of the Map/Reduce model is to help with writing
efficient parallel programs.
Common data processing tasks, such as filtering, merging,
and aggregating data, and many (but not all) machine learning
algorithms fit into Map/Reduce.
There are two steps:

* Map: tasks read in the input data as a set of records,
  process the records, and send groups of similar records to reducers.
  The mapper extracts a key from each input record.
  Hadoop will then route all records with the same key
  to the same reducer.

* Reduce: tasks read in a set of related records,
  process the records, and write the results.
  The reducer iterates through all results for the same key,
  processing the data and writing out the results.

The strength of this model is that the map and reduce steps
run well in parallel.
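The classic illustration of this model is counting words.
The sketch below is plain C written for this text, not code from Hadoop:
one program acts as a mapper or as a reducer depending on its
first argument.
The mapper emits one ``word<TAB>1`` record per word of its input;
the reducer, fed records sorted by key, sums the counts per word.
The program name ``wordcount`` and the splitting of the input on
whitespace are choices made only for this sketch.

::

   /* wordcount.c : run as "./wordcount map" or "./wordcount reduce" */
   #include <stdio.h>
   #include <string.h>

   /* mapper : reads text from standard input and
      emits one "word<TAB>1" record per word */
   void map_step ( void )
   {
      char word[256];
      while(scanf("%255s", word) == 1)
         printf("%s\t1\n", word);
   }

   /* reducer : reads "word<TAB>count" records sorted by word
      and writes the total count for each word */
   void reduce_step ( void )
   {
      char word[256], prev[256] = "";
      long count, total = 0;

      while(scanf("%255s %ld", word, &count) == 2)
      {
         if(strcmp(word, prev) != 0) /* a new key : flush the previous one */
         {
            if(prev[0] != '\0') printf("%s\t%ld\n", prev, total);
            strcpy(prev, word);
            total = 0;
         }
         total = total + count;
      }
      if(prev[0] != '\0') printf("%s\t%ld\n", prev, total);
   }

   int main ( int argc, char *argv[] )
   {
      if(argc > 1 && strcmp(argv[1], "map") == 0)
         map_step();
      else if(argc > 1 && strcmp(argv[1], "reduce") == 0)
         reduce_step();
      else
         fprintf(stderr, "usage : %s map|reduce\n", argv[0]);
      return 0;
   }

On one machine, the pipeline
``./wordcount map < input.txt | sort | ./wordcount reduce``
simulates what Hadoop does across many machines:
the reducer may rely on its input arriving sorted by key
because Hadoop sorts the mapper output before handing it to the reducers,
the role played here by ``sort``.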
Consider for example the prediction of user behavior.
The question we ask is:
how likely is a user to purchase an item from a website?
Suppose we have already computed (maybe using Map/Reduce)
a set of variables describing each user:
the most common locations, the number of pages viewed,
and the number of purchases made in the past.

One method to calculate a forecast is to use random forests.
Random forests work by calculating a set of regression trees
and then averaging them together to create a single model.
It can be time consuming to fit the random trees to the data,
but each new tree can be calculated independently.
One way to tackle this problem is to use a set of map tasks
to generate the random trees,
and then to send the models to a single reducer task,
which averages the trees and produces the final model.

Bibliography
------------

1. Daniel Kunkle. Roomy: a C/C++ library for parallel disk-based
   computation, 2010.

2. Daniel Kunkle and Gene Cooperman. Harnessing parallel disks to solve
   Rubik's cube. *Journal of Symbolic Computation* 44:872-890, 2009.

3. Daniel Kunkle and Gene Cooperman. Solving Rubik's Cube: Disk is the
   new RAM. *Communications of the ACM* 51(4):31-33, 2008.

4. Garry Turkington. Hadoop Beginner's Guide.

Exercises
---------

1. Watch the YouTube Google tech talk of Gene Cooperman on the application
   of disk-based parallelism to Rubik's cube.

2. Read the paper of Daniel Kunkle and Gene Cooperman that was published
   in the Journal of Symbolic Computation, see the bibliography.