Work Stealing
=============

Work stealing is an alternative to load balancing.
In parallel shared memory computing, we apply the work crew model.
We distinguish between static and dynamic work assignment:

1. Static: jobs are assigned before the execution of the program.
   Each worker has *its own queue of jobs* to process.

   :math:`+` ideal speedup if the jobs are evenly distributed,

   :math:`-` unbalanced if one worker gets all the long jobs.

2. Dynamic: jobs are assigned during the execution of the program.
   All workers process *the same queue of jobs*.

   :math:`+` the size of each job is taken into account,

   :math:`-` synchronization overhead may dominate for small jobs
   and when there are many workers.

Tasks are much lighter than threads.
Starting and terminating a task is much faster than starting and
terminating a thread; and a thread has its own process id and its
own resources, whereas a task is typically a small routine.

In scheduling threads on processors, we distinguish between work
sharing and work stealing.  In work sharing, the scheduler attempts
to migrate threads to under-utilized processors in order to
distribute the work.  In work stealing, under-utilized processors
attempt to steal threads from other processors.
A small sketch of the stealing protocol follows below;
the Julia simulation is developed in the next section.
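Before turning to the Julia simulation, the following minimal Python
sketch illustrates the stealing protocol under assumptions of our own:
the queue layout, the job sizes, and the function ``worker`` are all
hypothetical.  Each worker drains its own deque from the front, and an
idle worker pops jobs from the back of the other deques.  Because of
Python's global interpreter lock, the sketch shows only the protocol,
not a parallel speedup.

::

    import threading
    from collections import deque

    nt = 4
    # one deque of jobs per worker; a job is just a number of work units
    # the loads are unbalanced on purpose
    queues = [deque([1] * (5 * (i + 1))) for i in range(nt)]
    locks = [threading.Lock() for _ in range(nt)]
    done = [0] * nt  # work units completed by each worker

    def worker(i):
        # drain the own queue from the front,
        # then steal from the back of every other queue
        for q in [i] + [j for j in range(nt) if j != i]:
            while True:
                with locks[q]:
                    if not queues[q]:
                        break
                    job = queues[q].popleft() if q == i else queues[q].pop()
                done[i] += job  # stand-in for executing the job

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(nt)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(done)

Typically the totals in ``done`` differ from the initial loads of
5, 10, 15, and 20 units: workers that finish early take over jobs
of the busier ones.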
Work Stealing Simulated by a Julia Program
------------------------------------------

Work stealing is a hybrid between static and dynamic work assignment:

1. Each worker starts with its own queue.

2. An idle worker will work on the jobs of other queues.

The main benefit over dynamic work assignment is that the
synchronization overhead occurs only at the end of the execution.

The setup of the Julia simulation is defined below.

1. As many queues as the number of threads are generated:

   * even indexed queues have small jobs,

   * odd indexed queues have large jobs.

   This generates unbalanced job queues to test the work stealing.

2. The :math:`i`-th worker starts processing the :math:`i`-th job queue.

3. Every queue has an index to the current job.
   In Julia, this index is of type ``Atomic{Int}``,
   for mutually exclusive access.

4. After the :math:`i`-th worker is done with its :math:`i`-th job
   queue, it searches for jobs over all :math:`j`-th queues,
   for :math:`j \not= i`.

The Julia code to make the job queues is listed below.

::

    using Base.Threads

    nt = nthreads()
    nbr = 10 # number of jobs in each queue

    # allocate memory for all job queues
    jobs = [zeros(nbr) for i=1:nt]

    # every worker generates its own job queue
    # even indexed queues have light work loads
    @threads for i=1:nt
        if i % 2 == 0
            jobs[i] = rand((1, 2, 3), nbr)
        else
            jobs[i] = rand((4, 5, 6), nbr)
        end
        println("Worker ", threadid(), " has jobs ", jobs[i],
                " ", sum(jobs[i]))
    end

The output of a run of the program with four threads is shown below.
Each number in the job queue represents the time each job takes.

::

    $ julia -t 4 worksteal.jl
    Worker 1 has jobs [6.0, 6.0, 6.0, ... , 5.0] 53.0
    Worker 3 has jobs [4.0, 4.0, 5.0, ... , 5.0] 48.0
    Worker 4 has jobs [3.0, 2.0, 3.0, ... , 3.0] 24.0
    Worker 2 has jobs [2.0, 2.0, 2.0, ... , 2.0] 14.0

The ``...`` represents numbers omitted for brevity.
The last number on each line is the sum of the times of the jobs.
Workers 2 and 4 clearly have lighter loads than workers 1 and 3.

In the code below, every worker starts processing its own queue:

::

    jobidx = [Atomic{Int}(1) for i=1:nt]

    @threads for i=1:nt
        while true
            myjob = atomic_add!(jobidx[i], 1)
            if myjob > length(jobs[i])
                break
            end
            println("Worker ", threadid(), " spends ", jobs[i][myjob],
                    " seconds", " on job ", myjob, " ...")
            sleep(jobs[i][myjob])
            jobs[i][myjob] = threadid()
        end

Observe the use of ``Atomic{Int}`` for the indices.
The statement ``myjob = atomic_add!(jobidx[i], 1)``

* returns the current value of ``jobidx[i]`` and then increments it,

* as one atomic operation, as if executed in a critical section.

Then the code continues, still inside the loop over all threads:
the idle threads steal work.

::

        println("Worker ", threadid(), " will steal jobs ...")
        more2steal = true
        while more2steal
            more2steal = false
            for j=1:threadid()-1
                myjob = atomic_add!(jobidx[j], 1)
                if myjob <= length(jobs[j])
                    println("Worker ", threadid(), " spends ",
                            jobs[j][myjob], " seconds",
                            " on job ", myjob, " of ", j, " ...")
                    sleep(jobs[j][myjob])
                    jobs[j][myjob] = threadid()
                end
                more2steal = more2steal || (myjob < length(jobs[j]))
            end
            for j=threadid()+1:nt
                # is similar to the previous loop
            end
        end
    end

An example of a part of an output is shown below.

::

    Worker 4 spends 1.0 seconds on job 7 ...
    Worker 2 will steal jobs ...
    Worker 2 spends 4.0 seconds on job 4 of 1 ...
    Worker 4 spends 3.0 seconds on job 8 ...
    Worker 1 spends 4.0 seconds on job 5 ...
    Worker 4 spends 3.0 seconds on job 9 ...
    Worker 2 spends 4.0 seconds on job 5 of 3 ...
    Worker 3 spends 5.0 seconds on job 6 ...
    Worker 4 spends 3.0 seconds on job 10 ...
    Worker 1 spends 6.0 seconds on job 6 ...
    Worker 3 spends 6.0 seconds on job 7 ...
    Worker 4 will steal jobs ...
    Worker 4 spends 6.0 seconds on job 7 of 1 ...
    Worker 1 spends 4.0 seconds on job 8 ...

Worker 2 is done first and takes job 4 of worker 1.
Worker 1 then continues with job 5.
When worker 4 is done, it takes job 7 of worker 1.
Worker 1 then continues with job 8.

To conclude, we make the following observations:

* Implementing a work crew with work stealing is not much more
  complicated than dynamic load balancing.

* The idle workers start at the first queue and then progress
  linearly, which may be good if the first queue contains
  all the important jobs.

* In an alternative work stealing scheme, idle workers would
  start in the queue of their immediate neighbors.

Multithreading in Python with Numba
-----------------------------------

Numba is an open-source JIT compiler that translates a subset of
Python and NumPy into fast machine code using LLVM, via the llvmlite
Python package.  It offers a range of options for parallelising
Python code for CPUs and GPUs, often with only minor code changes.
Numba was started by Travis Oliphant in 2012 and is under active
development at ``https://github.com/numba/numba``.
To use it, do ``pip install numba``.

The example below (copied from the Wikipedia page on Numba)
works on Windows.

::

    import numba
    import random

    @numba.jit
    def monte_carlo_pi(n_samples: int) -> float:
        """
        Applies Monte Carlo to estimate pi.
        """
        acc = 0
        for i in range(n_samples):
            x = random.random()
            y = random.random()
            if (x**2 + y**2) < 1.0:
                acc += 1
        return 4.0 * acc / n_samples

    p = monte_carlo_pi(1000000)
    print(p)
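The ``@numba.jit`` decorator above compiles the function, but the
loop still runs on a single thread.  A multithreaded variant is
sketched below, assuming Numba's documented ``parallel=True`` option
with ``numba.prange``; the name ``monte_carlo_pi_parallel`` is ours.

::

    import random

    import numba
    from numba import prange

    @numba.jit(nopython=True, parallel=True)
    def monte_carlo_pi_parallel(n_samples: int) -> float:
        """
        Applies Monte Carlo to estimate pi,
        distributing the samples over the threads.
        """
        acc = 0
        for i in prange(n_samples):  # prange marks the loop as parallel
            x = random.random()      # per-thread random states
            y = random.random()
            if (x**2 + y**2) < 1.0:
                acc += 1             # Numba treats this as a reduction
        return 4.0 * acc / n_samples

    p = monte_carlo_pi_parallel(1000000)
    print(p)

The number of threads can be controlled with the environment
variable ``NUMBA_NUM_THREADS``.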
Multithreading in Python with Parsl
-----------------------------------

Parsl stands for Parallel Scripting in Python.
Parsl provides an intuitive, pythonic way of parallelizing codes
by annotating *apps*: Python functions or external applications
that run concurrently.  Parsl works seamlessly with Jupyter notebooks.
Write once, run anywhere: from laptops to supercomputers.
To use it, do ``pip install parsl``.

The example below (copied from the Parsl user guide) was executed
on WSL, the Windows Subsystem for Linux, with Ubuntu 22.04.

::

    from parsl import python_app
    import parsl

    parsl.load()

    # Map function that returns double the input integer
    @python_app
    def app_double(x):
        return x*2

    # Reduce function that returns the sum of a list
    @python_app
    def app_sum(inputs=()):
        return sum(inputs)

    # Create a list of integers
    items = range(0,4)

    # Map phase: apply the double *app* function to each item in list
    mapped_results = []
    for i in items:
        x = app_double(i)
        mapped_results.append(x)

    # Reduce phase: apply the sum *app* function to the set of results
    total = app_sum(inputs=mapped_results)

    print(total.result())
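Every call of an app returns a future immediately; the work itself
runs concurrently, and ``result()`` blocks until the value is ready.
The short sketch below, with a hypothetical app ``sleepy`` of our
own, makes the concurrency visible.

::

    import time

    import parsl
    from parsl import python_app

    parsl.load()

    @python_app
    def sleepy(t):
        import time  # apps import their modules themselves
        time.sleep(t)
        return t

    wallclock = time.time()
    futures = [sleepy(2), sleepy(2)]         # both apps start at once
    results = [f.result() for f in futures]  # block until both finish
    print(results, "in", time.time() - wallclock, "seconds")

If the two apps indeed run concurrently, the elapsed time is close
to two seconds instead of four.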
the Intel Threading Building Blocks (TBB)
-----------------------------------------

The Intel TBB is a library that helps you leverage multicore
performance *without having to be a threading expert*.
The advantage of Intel TBB is that it works at a higher level than
raw threads, yet does not require exotic languages or compilers.
The library differs from others in the following ways:

* TBB enables you to specify logical parallelism instead of threads;

* TBB targets threading for performance;

* TBB is compatible with other threading packages;

* TBB emphasizes scalable, data parallel programming;

* TBB relies on generic programming (e.g.: the use of the STL in C++).

The code is open source, free to download at
<http://threadingbuildingblocks.org/>.
The TBB task scheduler uses *work stealing* for load balancing.

Our first C++ program with TBB, similar to our previous
*Hello world!* programs, is below.
A ``class`` in C++ is like a ``struct`` in C for holding data
attributes and functions (called methods).

::

    #include "tbb/tbb.h"
    #include <cstdio>

    using namespace tbb;

    class say_hello
    {
       const char* id;

       public:

          say_hello(const char* s) : id(s) { }

          void operator( ) ( ) const
          {
             printf("hello from task %s\n",id);
          }
    };

    int main( )
    {
       task_group tg;

       tg.run(say_hello("1")); // spawn 1st task and return
       tg.run(say_hello("2")); // spawn 2nd task and return
       tg.wait( );             // wait for tasks to complete
    }

The ``run`` method spawns the task immediately, but does not block
the calling task, so control returns immediately.
To wait for the child tasks to finish, the calling task calls ``wait``.
Observe the syntactic simplicity of ``task_group``.
When running the code, we see on screen:

::

    $ ./hello_task_group
    hello from task 2
    hello from task 1
    $

using the parallel_for
----------------------

Consider the following problem of raising complex numbers
to a large power.

* Input: :math:`n \in {\mathbb Z}_{>0}`, :math:`d \in {\mathbb Z}_{>0}`,
  :math:`{\bf x} \in {\mathbb C}^n`.

* Output: :math:`{\bf y} \in {\mathbb C}^n`, :math:`y_k = x_k^d`,
  for :math:`k=1,2,\ldots,n`.

Let us first develop the serial program.
To avoid overflow, we take complex numbers on the unit circle.
In C++, complex numbers are defined as a template class.
To instantiate the class ``complex`` with the type ``double``,
we first declare the type ``dcmplx``.
Random complex numbers are generated as
:math:`e^{2 \pi i \theta} = \cos(2 \pi \theta) + i \sin(2 \pi \theta)`,
for random :math:`\theta \in [0,1]`.

::

    #include <cstdlib>
    #include <cmath>
    #include <complex>

    using namespace std;

    typedef complex<double> dcmplx;

    dcmplx random_dcmplx ( void );
    // generates a random complex number
    // on the complex unit circle

    dcmplx random_dcmplx ( void )
    {
       int r = rand();
       double d = ((double) r)/RAND_MAX;
       double e = 2*M_PI*d;
       dcmplx c(cos(e),sin(e));
       return c;
    }

We next define the function to write arrays.
Observe the local declaration ``int i`` in the ``for`` loop,
the scientific formatting, and the methods ``real()`` and ``imag()``.

::

    #include <iostream>
    #include <iomanip>

    void write_numbers ( int n, dcmplx *x );
    // writes the array of n complex numbers in x

    void write_numbers ( int n, dcmplx *x )
    {
       for(int i=0; i<n; i++)
          cout << scientific << setprecision(4)
               << "x[" << i << "] = ( " << x[i].real()
               << " , " << x[i].imag() << ")\n";
    }

The function ``compute_powers`` raises each number to the given
degree by repeated multiplication.
The main program takes the dimension, the degree, and the verbose
level from the command line.

::

    void compute_powers ( int n, dcmplx *x, dcmplx *y, int d );
    // for arrays x and y of length n, y[i] equals x[i]**d

    void compute_powers ( int n, dcmplx *x, dcmplx *y, int d )
    {
       for(int i=0; i<n; i++)
       {
          dcmplx z(1.0,0.0);
          for(int j=0; j<d; j++) z = z*x[i];
          y[i] = z;
       }
    }

    int main ( int argc, char *argv[] )
    {
       int v = 1; // verbose if > 0
       if(argc > 3) v = atoi(argv[3]);

       int dim; // get the dimension
       if(argc > 1)
          dim = atoi(argv[1]);
       else
       {
          cout << "how many numbers ? "; cin >> dim;
       }
       // fix the seed for comparisons
       srand(20120203); //srand(time(0));

       dcmplx r[dim];
       for(int i=0; i<dim; i++) r[i] = random_dcmplx();
       if(v > 0) write_numbers(dim,r);

       int deg; // get the degree
       if(argc > 1)
          deg = atoi(argv[2]);
       else
       {
          cout << "give the power : "; cin >> deg;
       }
       dcmplx s[dim];
       compute_powers(dim,r,s,deg);
       if(v > 0) write_numbers(dim,s);

       return 0;
    }

To speed up the computations, ``parallel_for`` is used.
We first illustrate the speedup that can be obtained
with a parallel version of the code.

::

    $ time /tmp/powers_serial 1000 1000000 0

    real    0m17.456s
    user    0m17.451s
    sys     0m0.001s

    $ time /tmp/powers_tbb 1000 1000000 0

    real    0m1.579s
    user    0m18.540s
    sys     0m0.010s

The speedup: :math:`\displaystyle \frac{17.456}{1.579} = 11.055`
with 12 cores.

The class ``ComputePowers`` is defined below.

::

    class ComputePowers
    {
       dcmplx *const c; // numbers on input
       int d;           // degree
       dcmplx *result;  // output

       public:

          ComputePowers(dcmplx x[], int deg, dcmplx y[])
             : c(x), d(deg), result(y) { }

          void operator() ( const blocked_range<size_t>& r ) const
          {
             for(size_t i=r.begin(); i!=r.end(); ++i)
             {
                dcmplx z(1.0,0.0);
                for(int j=0; j < d; j++) z = z*c[i];
                result[i] = z;
             }
          }
    };

We next explain the use of ``tbb/blocked_range.h``.
A ``blocked_range`` represents a half open range :math:`[i,j)`
that can be recursively split.

::

    #include "tbb/blocked_range.h"

    template<typename Value> class blocked_range

    void operator() ( const blocked_range<size_t>& r ) const
    {
       for(size_t i=r.begin(); i!=r.end(); ++i)

Calling the ``parallel_for`` happens as follows:

::

    #include "tbb/tbb.h"
    #include "tbb/blocked_range.h"
    #include "tbb/parallel_for.h"
    #include "tbb/task_scheduler_init.h"

    using namespace tbb;

Two lines change in the main program:

::

    task_scheduler_init init(task_scheduler_init::automatic);

    parallel_for(blocked_range<size_t>(0,dim),
                 ComputePowers(r,deg,s));

using the parallel_reduce
-------------------------

We consider the summation of integers as an application
of work stealing.
:numref:`figworkstealing` and :numref:`figworkstealing2`
are taken from the Intel TBB tutorial.

.. _figworkstealing:

.. figure:: ./figworkstealing.png
   :align: center

   An application of work stealing.

.. _figworkstealing2:

.. figure:: ./figworkstealing2.png
   :align: center

   What if no worker is available?

The definition of the class to sum a sequence of integers is below.

::

    class SumIntegers
    {
       int *data;

       public:

          int sum;

          SumIntegers ( int *d ) : data(d), sum(0) {}

          void operator() ( const blocked_range<size_t>& r )
          {
             int s = sum; // must accumulate !
             int *d = data;
             size_t end = r.end();
             for(size_t i=r.begin(); i != end; ++i) s += d[i];
             sum = s;
          }
          // the splitting constructor
          SumIntegers ( SumIntegers& x, split ) :
             data(x.data), sum(0) {}
          // the join method does the merge
          void join ( const SumIntegers& x ) { sum += x.sum; }
    };

    int ParallelSum ( int *x, size_t n )
    {
       SumIntegers S(x);

       parallel_reduce(blocked_range<size_t>(0,n), S);

       return S.sum;
    }

The main program is below:

::

    int main ( int argc, char *argv[] )
    {
       int n;
       if(argc > 1)
          n = atoi(argv[1]);
       else
       {
          cout << "give n : "; cin >> n;
       }
       int *d;
       d = (int*)calloc(n,sizeof(int));
       for(int i=0; i<n; i++) d[i] = i+1;

       int s = ParallelSum(d,n);
       cout << "the sum of the first " << n
            << " integers is " << s << endl;

       return 0;
    }
Bibliography
------------

1. **Intel Threading Building Blocks Tutorial**, available via
   <http://threadingbuildingblocks.org/>.

2. Robert D. Blumofe and Charles E. Leiserson:
   **Scheduling Multithreaded Computations by Work-Stealing**.
   In the Proceedings of the 35th Annual IEEE Symposium on
   Foundations of Computer Science (FOCS 1994), pages 356-368.

3. Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro,
   Joseph James Gebis, Parry Husbands, Kurt Keutzer,
   David A. Patterson, William Lester Plishker, John Shalf,
   Samuel Webb Williams, and Katherine A. Yelick:
   **The Landscape of Parallel Computing Research:
   A View from Berkeley**.
   Technical Report No. UCB/EECS-2006-183, EECS Department,
   University of California, Berkeley, December 18, 2006.

Exercises
---------

1. A permanent is similar to a determinant, but without the
   alternating signs: for an :math:`n`-by-:math:`n` matrix :math:`A`,
   :math:`\mbox{per}(A) = \sum_{\sigma} \prod_{i=1}^n a_{i,\sigma(i)}`,
   where :math:`\sigma` runs over all permutations of :math:`n`
   indices.  Develop a task-based parallel program to compute the
   permanent of a 0/1 matrix.
   Why is work stealing appropriate for this problem?

2. Modify the ``hello world!`` program with TBB so that the user is
   first prompted for a name.  Two tasks are spawned and they use
   the given name in their greeting.

3. Modify ``powers_tbb.cpp`` so that the :math:`i`-th entry is
   raised to the power :math:`d-i`.  In this way, not all entries
   require the same work load.  Run the modified program and compare
   the speedup to check the performance of the automatic task
   scheduler.