October 16

Lecture Overview

K-Means Clustering

In k-means clustering, each point we are trying to classify is a tuple of \(d\) real numbers (a point in \(\mathbb{R}^d\)). We use the distance between two points to measure how similar they are, and we want to classify or cluster the points into \(k\) clusters so that points within the same cluster are similar (or close in distance) and points in different clusters are far apart. To formulate this mathematically, to each cluster we assign a centroid, which is the average of the points in the cluster (so, roughly, the middle or center of the cluster). We then want each point to be assigned to the cluster with the nearest centroid. (Wikipedia states this as "minimize the within-cluster sum of squares". The sum of squares is the squared distance from each point to its cluster's centroid, so we are trying to minimize the distance between each point and its centroid, i.e. assign each point to the closest centroid.) Note this is somewhat complicated because as points change which cluster they lie in, the centroids will move, since each centroid is the average of the points in its cluster. So changing the classification of one point could then cause other points to be reclassified. Mathematically, we define the optimal clustering as the assignment of points to clusters that minimizes this sum over all possible assignments. This image is a pretty good picture of clustering, where the circles (somewhat hard to see, the green one is easiest) are the centroids and the black lines are halfway between the centroids.
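To state this precisely, the quantity being minimized can be written as follows; the notation here is just for this formula: \(S_1, \dots, S_k\) are the clusters and \(\mu_i\) is the centroid of cluster \(S_i\).

\[
\min_{S_1, \dots, S_k} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2,
\qquad \text{where } \mu_i = \frac{1}{|S_i|} \sum_{x \in S_i} x.
\]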

Implementation

I will say something about it later this semester, but no one knows how to efficiently solve exactly the problem I stated in the previous paragraph. In fact, most people believe that no efficient solution exists. Our current knowledge just lets us obtain an approximation in the following way: we compute an initial clustering and then repeatedly refine it until not many points are reclassified in each refinement step. Then we claim this is a good clustering and output it. We have no guarantee that this is the best possible clustering (since no one knows how to find it). But intuitively, if a refinement step changes the classification of only a few points, it is a pretty good clustering. This is called a heuristic. I have a visualisation of this below which hopefully helps you understand it further.

So our strategy is as follows:

1. Guess an initial set of \(k\) centroids (for example, pick \(k\) of the input points at random).
2. Classify each point by assigning it to the cluster whose centroid is nearest.
3. Recompute each centroid as the average of the points now assigned to its cluster.
4. Repeat steps 2 and 3 until only a few points change classification in a step.

Once only a few points change classification, the centroids also won't move very far, since each centroid is the average of its cluster's points. So our hope is that we have arrived at a good clustering, and in many applications this does indeed work. But there are also situations where this technique does not work, which is one reason clustering is such a big field: those datasets need other techniques.

Each of the tasks identified above then becomes a python function. My implementation of two-dimensional clustering is day22_cluster.py.
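To give a feel for what those functions might look like, here is a minimal self-contained sketch of the heuristic. The names distance, closest_centroid, update_centroids, and kmeans are just for this sketch and need not match the names in day22_cluster.py, which you should read for the real thing.

```python
import math
import random

def distance(p, q):
    """Euclidean distance between two points (tuples of floats)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def closest_centroid(point, centroids):
    """Index of the centroid nearest to point."""
    return min(range(len(centroids)), key=lambda i: distance(point, centroids[i]))

def update_centroids(points, assignment, k):
    """Recompute each centroid as the average of the points assigned to it."""
    centroids = []
    for i in range(k):
        cluster = [p for p, a in zip(points, assignment) if a == i]
        if cluster:   # average the points coordinate by coordinate
            centroids.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
        else:         # an empty cluster: restart its centroid at a random point
            centroids.append(random.choice(points))
    return centroids

def kmeans(points, k, min_changes=1, max_steps=100):
    """Heuristic k-means: refine until few points change clusters."""
    centroids = random.sample(points, k)            # initial guess for the centroids
    assignment = [closest_centroid(p, centroids) for p in points]
    for _ in range(max_steps):
        centroids = update_centroids(points, assignment, k)
        new_assignment = [closest_centroid(p, centroids) for p in points]
        changes = sum(a != b for a, b in zip(assignment, new_assignment))
        assignment = new_assignment
        if changes < min_changes:                   # fewer than min_changes points moved: stop refining
            break
    return centroids, assignment
```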

Here is some test code, day22_test.py, which helps visualize the clustering process.
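If you want to experiment with plotting the result yourself, something along these lines works. This is a sketch assuming matplotlib is installed and using the kmeans sketch above, not the actual contents of day22_test.py.

```python
import matplotlib.pyplot as plt

def plot_clusters(points, centroids, assignment):
    """Scatter-plot each cluster in its own colour and mark each centroid with an X."""
    colors = ["red", "green", "blue", "orange", "purple"]
    for i, centroid in enumerate(centroids):
        cluster = [p for p, a in zip(points, assignment) if a == i]
        if cluster:
            xs, ys = zip(*cluster)
            plt.scatter(xs, ys, color=colors[i % len(colors)], s=15)
        plt.scatter(centroid[0], centroid[1], color="black", marker="x", s=100)
    plt.show()

# Example: centroids, assignment = kmeans(points, 3); plot_clusters(points, centroids, assignment)
```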

Exercises

Run my clustering code on the Iris Flower Data Set for the "Sepal length" and "Petal length" columns. You can download the flower dataset in CSV form here. You will have to write a python program to:

1. read the CSV file,
2. build a list of points from the "Sepal length" and "Petal length" columns, and
3. run the clustering code on that list of points.

A sketch of how this might look is below.
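For the sketch I am assuming the file is saved as iris.csv, that sepal length is the first column and petal length the third, and that there is a header row; check the CSV you download, since its exact layout may differ. The kmeans call refers to the sketch above (or substitute the function from day22_cluster.py).

```python
import csv

def read_iris(filename="iris.csv"):
    """Return (points, species): each point is a (sepal length, petal length)
    tuple of floats, and species holds the last column of each row."""
    points, species = [], []
    with open(filename, newline="") as f:
        reader = csv.reader(f)
        next(reader)                          # skip the header row
        for row in reader:
            points.append((float(row[0]), float(row[2])))
            species.append(row[-1])           # actual subspecies, last column
    return points, species

points, species = read_iris()
centroids, assignment = kmeans(points, 3)     # three subspecies, so k = 3
```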

You don't have to do it for homework, but note that the CSV also contains the actual subspecies in the last column. We call this training data, because we can cluster as you did above, ignoring the subspecies, and then after clustering we can test whether our clusters actually match the input subspecies. This lets us determine "how good" our k-means clustering actually is. Think about how you could change your python code so that after running kmeans you print out how many points were misclassified. (Hint: you have the classified_points dictionary and the CSV data and just need to combine them back together.)
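One way to do that check, sketched with the read_iris and kmeans names from the sketches above rather than the classified_points dictionary in my code, is to count a point as misclassified when its species is not the majority species of the cluster it landed in.

```python
from collections import Counter

def count_misclassified(assignment, species):
    """Count points whose species differs from the majority species
    of the cluster they were assigned to."""
    misclassified = 0
    for i in set(assignment):
        in_cluster = [s for a, s in zip(assignment, species) if a == i]
        majority = Counter(in_cluster).most_common(1)[0][0]
        misclassified += sum(s != majority for s in in_cluster)
    return misclassified

print(count_misclassified(assignment, species), "points misclassified")
```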