Cluster Analysis
================

One goal of data mining is to discover patterns.
Another is to correlate data.
For example, could there be a link between homework grades
and exam performance?

Random Clustered Data
---------------------

Before we look at actual data,
we will first generate our own data,
using random number generators.
The data we generate will be arranged in clusters.
Each cluster is determined by 3 parameters:

1. the centroid is given by coordinates :math:`(x,y)`,

2. the number :math:`n` of points in the cluster,

3. the radius is the largest distance of any point
   in the cluster to the centroid.

.. _figclusdata0:

.. figure:: ./figclusdata0.png
    :align: center

    Three typical, randomly generated clusters.

The script ``makedata.py`` will

1. prompt the user for the number of clusters,
   either then typical clusters are generated,
   or for each cluster, the user is prompted
   for centroid, number of points, and radius;

2. visualize the data via turtle graphics;

3. prompt the user for a file name to save the data.

The function ``cluster`` belows generates random points.

::

   from random import random
   from math import cos, sin, pi, sqrt

   def cluster(nbr, ctr, rad):
       """
       Returns a list of nbr points with
       integer coordinates, in a disk with
       center ctr and radius rad.
       We use the polar representation of the points.
       """
       points = []
       points.append(ctr)
       for _ in range(1, nbr):
           rpt = rad*random()  # radius of the point
           apt = 2*pi*random() # angle of the point
           xpt = ctr[0] + round(rpt*cos(apt))
           ypt = ctr[1] + round(rpt*sin(apt))
           points.append((xpt, ypt))
       return points

We generate typical data with the following function.

::

   from random import randint

   def typical_data():
       """
       Generates random numbers for (nbr, ctr, rad).
       """
       data = []
       ndc = int(input("Give the number of clusters : "))
       for i in range(ndc):
           nbr = randint(10, 100)
           ctr = (randint(-200, 200), randint(-200, 200))
           rad = randint(10, 50)
           data.append(cluster(nbr, ctr, rad))
       return data

To visualize the data, we need to compute
for each cluster:
the centroid for each list; and
the largest distance of each point
in the list to the centroid.
The script ``makedata.py`` provides functions
to compute centroids and radii, for use in the
implementation of the K-means algorithm.
The last two lines of the script ``makedata.py``:

::

   if __name__ == "__main__":
       main()

allow to use the script ``makedata.py`` as a module.

With a list comprehension, we can give a one line definition
of the function ``centroids``.

::

   def centroids(data):
       """
       Given in data a list of lists of points,
       returns the list of centroids.
       """
       return [centroid(points) for points in data]

The specification for the ``centroid`` is below:

::

   def centroid(points):
       """
       Given a list points of tuples (x, y),
       returns the centroid of the points.
       """

Simple Turtle Graphics
----------------------

For our visualization, we use ``turtle``,
a builtin module in Python,
which is simple to use.

The turtle commands we will use are in :numref:`tabturtlecommands`.

.. _tabturtlecommands:

.. table:: turtle commands

   +--------------------------+-------------------------------------+
   | turtle command           | what the command does               |
   +==========================+=====================================+
   | ``import turtle``        | import the module                   |
   +--------------------------+-------------------------------------+
   | ``turtle.tracer(False)`` | for faster drawing                  | 
   +--------------------------+-------------------------------------+
   | ``turtle.up()``          | put the pen up                      |
   +--------------------------+-------------------------------------+
   | ``turtle.down()``        | put the pen down                    |
   +--------------------------+-------------------------------------+
   | ``turtle.goto(x,y)``     | goto position ``(x,y)``             |
   +--------------------------+-------------------------------------+
   | ``turtle.circle(r)``     | draws a circle of radius ``r``      |
   +--------------------------+-------------------------------------+
   | ``turtle.color(p,f)``    | sets pen ``p`` and fill color ``f`` |
   +--------------------------+-------------------------------------+

The visualization of the data points with turtle
is defined by the following function.

::

   def show_data(data):
       """
       For each point in data, draws
       a circle of radius 2 pixels,
       centered at the data point.
       """
       tracer(False)
       for points in data:
           for pnt in points:
               turtle.up()
               turtle.goto(pnt[0], pnt[1]-2)
               turtle.down()
               turtle.circle(2)
       turtle.up()
       turtle.goto(0, 0)

The centroids and their radii are visualized by
the following function.

::

   def show_centroids(ctr, rad):
       """
       Given on input in ctr a list of
       centroids and in rad the corresponding
       radii, draws the disks.
       """
       for k in range(len(ctr)):
           cpt = ctr[k]
           rd2 = rad[k]+2
           turtle.up()
           turtle.goto(cpt[0], cpt[1]-rd2)
           turtle.down()
           turtle.circle(rd2)
       turtle.up()

We will save the randomly generated data to file.
This data will then be read by the script which does
the cluster analysis.
The saving of the data to file is done by the 
function ``save_to_file`` below.

::

   def save_to_file(data):
       """
       Prompts the user for a file name
       and writes the data to file.
       """
       name = input("give a file name : ")
       datafile = open(name, 'w')
       datafile.write(str(data))
       datafile.close()

Observe the string conversion with ``str``.
We save the string representations of the data to file.

Reading Data From File
----------------------

We start a new script ``kmeans.py``
which will do the following:

1. prompt the user for a file name,

2. read in the data, flatten and shuffle,

3. run three iterations of the K-means Algorithm,
   showing the results with turtle.

To make our script user friendly, we have to be careful about
the user input.  The user may give a name which does not correspond
to the right file, or the file may contain wrong data.
With the exception handling mechanisms we can handle those errors.

We read files typically line after line,
with a repeat - until loop, shown in :numref:`figflowrepeat`.

.. _figflowrepeat:

.. figure:: ./figflowrepeat.png
    :align: center

    Flowchart of a repeat - until loop.

The pseudo code for a repeat - until loop is below:

::

   repeat
      try to open a file with given name
   until success

This can be coded in Python as

::

   while True:
       try:
           file = open(name, 'r')
       except:
           print "please try again ..."

This repeat - until code,
with the exception handler in the try - except block,
is applied in the function below.

::

   def read_data():
       """
       Prompts the user for a name of a file,
       tries to open the file and to read a list.
       If all goes well, the list is returned.
       """
       data = []
       while True:
           name = input("give a file name : ")
           try:
               datafile = open(name, 'r')
               data = read_list(datafile)
               datafile.close()
               return data
           except IOError:
               print('could not open file ' + name)
               print('please try again')
       return data

The data is stored in a list on file.
The function below processes the string read from file
and returns the data in the list.

::

   def read_list(file):
       """
       Attempts to read a list from file
       and will return this list.
       Returns the empty list if exceptions occur.
       """
       try:
           line = file.readline()
           data = eval(line)
           if isinstance(data, list):
               return data
           else:
               print(line)
               print("no list on file...")
               return []
       except IOError:
           print("exception occurred during reading...")
           return []

The K-Means Algorithm
---------------------

The input statement with the problem specification
for the k-means algorithm is as follows:

.. math::

   \begin{array}{ll}
   Input: & L \mbox{ a list of points;} \\
          & k \mbox{ the number of centroids.}
   \end{array}

The algorithm proceeds as follows:

1. :math:`C :=` a random selection of :math:`k` points of :math:`L`;

2. repeat

   1.  for each :math:`p` in :math:`L`:
       if :math:`p` is closest to :math:`C_i`,
       then put :math:`p` in the *i*-th cluster;

   2. recompute the centroids for each of the *k* clusters;

   3. show the clusters

   until the clusters are stable.

The running of the algorithm is illustrated in
a sequence of pictures, starting with the input in :numref:`figclusdata1`
and going through a sequence of three steps,
in :numref:`figclusdata2`, :numref:`figclusdata3`, and :numref:`figclusdata4`.

.. _figclusdata1:

.. figure:: ./figclusdata1.png
    :align: center

    Input of the k-means algorithm.

.. _figclusdata2:

.. figure:: ./figclusdata2.png
    :align: center

    First run of the k-means algorithm,
    the clusters are shown in red.

.. _figclusdata3:

.. figure:: ./figclusdata3.png
    :align: center

    Second run of the k-means algorithm,
    the clusters are shown in green.

.. _figclusdata4:

.. figure:: ./figclusdata4.png
    :align: center

    Third run of the k-means algorithm,
    the clusters are shown in blue.
   
List Comprehensions
-------------------

In the classification of the data we apply list comprehensions.

The function ``distance(p,q)`` returns the
Euclidean distance between the points p and q.

::

   def classify_point(ctr, pnt):
       """
       Returns the index of the point q in ctr,
       closest to ctr.
       """
       dis = [distance(pnt, q) for q in ctr]
       minval = min(dis)
       return dis.index(minval)

For any list ``L`` and element ``e``,
we have ``k = L.index(e): L[k] == e``.

Now we can formulate the code for the main ``kmeans`` function.
To select the first centroids:
``random.sample(L,k)`` returns ``k`` unique random
elements from ``L``.

::

   def kmeans(L, k):
       """
       Applies the K-means algorithm 3 times to divide
       the data in L into k clusters.
       """
       C0 = sample(L, k) # k unique random elements from L
       K0 = classify_data(L, k, C0)
       C1 = centroids(K0)
       R1 = radii(K0, C1)
       turtle.color('red', 'red')
       show_centroids(C1, R1)
       input("hit enter to continue ...")

In the second iteration, we classify the points
with the newly calculated centroids:

::

    # code for kmeans continued
       K1 = classify_data(L, k, C1)
       C2 = centroids(K1)
       R2 = radii(K1, C2)
       turtle.color('green', 'green')
       show_centroids(C2, R2)
       input("hit enter to continue ...")

Then the third run continues

::

    # code for kmeans continued
       K2 = classify_data(L, k, C2)
       C3 = centroids(K2)
       R3 = radii(K2, C3)
       turtle.color('blue', 'blue')
       show_centroids(C3, R3)

Data From The Web
-----------------

We can process downloaded data, downloaded onto a file,
but we can also read data from the web with a Python script.

The US government collects lots of data.
The US Geological Survey provides data sets
at <http://earthquake.usgs.gov>
about recent earthquakes.

In our application, we are interested in raw data.
We can download comma separated values files
with Python scripts.
With ``urlopen.request`` of ``urllib``:
web page :math:`\sim` file.
We download a comma separated values (csv) file:

::

   from urllib.request import urlopen

   SITE = "http://earthquake.usgs.gov"
   URL = SITE + "/earthquakes/feed/v1.0/summary/all_month.csv"

   INFILE = urlopen(URL)
   OUTFILE = open('quakes.csv', 'w')
   while True:
       LINE = INFILE.readline()
       if LINE == b'': # b stands for bytes
           break
       OUTFILE.write(LINE.decode()) # bytes to string
   INFILE.close()
   OUTFILE.close()

The start of ``quakes.csv`` looks like

::

   time,latitude,longitude,depth,mag,magType, ...
   2016-01-24T23:48:09.720Z,33.3605,-116.386,11.27,1.04,ml, ...
   2016-01-24T22:53:02.000Z,59.3473,-153.1837,100.7,2.4,ml, ...
   2016-01-24T22:52:58.460Z,36.6545,-98.4441,12.36,3.1,mb_lg, ...
   2016-01-24T22:34:14.280Z,33.5026667,-116.497,13.23,0.98,ml, ...
   ...

The coordinates of interest are latitude and longitude.
The plot of the coordinates of all quakes of magnitude of 5.0
or greater is in :numref:`figearth`.

.. _figearth:

.. figure:: ./figearth.png
    :align: center

    All earth quakes of magnitude 5.0 or greater.

The script to process the data in the comma separated valued file
is below.

::

   from makedata import show_data
   DATAFILE = "quakes.csv"

   def select_data(mag):
       """
       Opens the file and returns list of
       longitudes and latitudes of quakes of
       magnitude larger than or equal to mag.
       Because latitude runs vertically,
       latitude is y and longitude is x.
       """
       result = []
       file = open(DATAFILE, 'r')
       line = file.readline()
       while True:
           line = file.readline()
           if line == '':
               break
           data = line.split(',')
           try:
               dtm = float(data[4])
           except:
               dtm = 0.0
           if dtm >= mag:
               pnt = (float(data[2]), float(data[1]))
               result.append(pnt)
       return result

For visualization with turtle we round 
and convert the coordinates to integer values.

Exercises
---------

1. Instead of turtle graphics, use a Canvas widget
   of Tkinter to show the data points.

2. One stop criterion for the K-means algorithm is
   to check whether no points change list after
   classifying from one iteration to the next.
   Give Python code of an implementation of the
   K-means algorithm that uses this stop criterion.
   You may ignore the visualization part.

3. Apply the K-means algorithm to ``quakes.csv``.
   Experiment with various numbers of clusters.

4. With the method ``bgpic`` of turtle we can set
   the background picture to a world map.
   Download a ``"worldmap.gif"`` file and rescale
   so the data from quakes.csv is mapped right.