October 16

Lecture Overview

K-Means Clustering

In k-means clustering, each point we are trying to classify is a tuple of \(d\) real numbers (a point in \(\mathbb{R}^d\)). We use the distance between two points to measure how similar they are, and we want to classify or cluster the points into \(k\) clusters so that points within the same cluster are similar (or close in distance) and points in different clusters are far apart. To formulate this mathematically, to each cluster we assign a centroid, which is the average of the points in the cluster (so, roughly, the middle or center of the cluster). We then want each point to be assigned to the cluster with the nearest centroid. (Wikipedia states this as "minimize the within-cluster sum of squares". The sum of squares is the squared distance from each point to its cluster's centroid, so we are trying to minimize the distance between each point and its centroid, i.e. assign each point to the closest centroid.) Note this is somewhat complicated because as points change which cluster they lie in, the centroids will move, since each centroid is the average of the points in its cluster. So changing the classification of one point could then cause other points to be reclassified. Mathematically, we define the optimal clustering as the assignment of points to clusters that minimizes this sum over all possible assignments. This image is a pretty good picture of clustering, where the circles (somewhat hard to see, the green one is easiest) are the centroids and the black lines are halfway between the centroids.
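To state this precisely, the quantity being minimized can be written as follows; the notation here is just for this formula: \(S_1, \dots, S_k\) are the clusters and \(\mu_i\) is the centroid of cluster \(S_i\).

\[
\min_{S_1, \dots, S_k} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2,
\qquad \text{where } \mu_i = \frac{1}{|S_i|} \sum_{x \in S_i} x.
\]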

Implementation

I will say something about it later this semester, but no one knows how to efficiently solve exactly the problem I stated in the previous paragraph. In fact, most people believe that no efficient solution exists. Our current knowledge just lets us obtain an approximation in the following way: we compute an initial clustering and then repeatedly refine it until not many points are reclassified in each refinement step. Then we claim this is a good clustering and output it. We have no guarantee that this is the best possible clustering (since no one knows how to find it). But intuitively, if a refinement step changes the classification of only a few points, it is a pretty good clustering. This is called a heuristic. I have a visualisation of this below which hopefully helps you understand it further.

So our strategy is as follows:

1. Guess an initial set of \(k\) centroids (for example, pick \(k\) of the input points at random).
2. Classify each point by assigning it to the cluster whose centroid is nearest.
3. Recompute each centroid as the average of the points now assigned to its cluster.
4. Repeat steps 2 and 3 until only a few points change classification in a step.

Once only a few points change classification, the centroids also won't move very far, since each centroid is the average of its cluster's points. So our hope is that we have arrived at a good clustering, and in many applications this does indeed work. But there are also situations where this technique does not work, which is one reason clustering is such a big field: those datasets need other techniques.

Each of the tasks identified above then becomes a python function. My implementation of two-dimensional clustering is day22_cluster.py.
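To give a feel for what those functions might look like, here is a minimal self-contained sketch of the heuristic. The names distance, closest_centroid, update_centroids, and kmeans are just for this sketch and need not match the names in day22_cluster.py, which you should read for the real thing.

```python
import math
import random

def distance(p, q):
    """Euclidean distance between two points (tuples of floats)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def closest_centroid(point, centroids):
    """Index of the centroid nearest to point."""
    return min(range(len(centroids)), key=lambda i: distance(point, centroids[i]))

def update_centroids(points, assignment, k):
    """Recompute each centroid as the average of the points assigned to it."""
    centroids = []
    for i in range(k):
        cluster = [p for p, a in zip(points, assignment) if a == i]
        if cluster:   # average the points coordinate by coordinate
            centroids.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
        else:         # an empty cluster: restart its centroid at a random point
            centroids.append(random.choice(points))
    return centroids

def kmeans(points, k, min_changes=1, max_steps=100):
    """Heuristic k-means: refine until few points change clusters."""
    centroids = random.sample(points, k)            # initial guess for the centroids
    assignment = [closest_centroid(p, centroids) for p in points]
    for _ in range(max_steps):
        centroids = update_centroids(points, assignment, k)
        new_assignment = [closest_centroid(p, centroids) for p in points]
        changes = sum(a != b for a, b in zip(assignment, new_assignment))
        assignment = new_assignment
        if changes < min_changes:                   # fewer than min_changes points moved: stop refining
            break
    return centroids, assignment
```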

Here is some test code, day22_test.py, which helps visualize the clustering process.
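If you want to experiment with plotting the result yourself, something along these lines works. This is a sketch assuming matplotlib is installed and using the kmeans sketch above, not the actual contents of day22_test.py.

```python
import matplotlib.pyplot as plt

def plot_clusters(points, centroids, assignment):
    """Scatter-plot each cluster in its own colour and mark each centroid with an X."""
    colors = ["red", "green", "blue", "orange", "purple"]
    for i, centroid in enumerate(centroids):
        cluster = [p for p, a in zip(points, assignment) if a == i]
        if cluster:
            xs, ys = zip(*cluster)
            plt.scatter(xs, ys, color=colors[i % len(colors)], s=15)
        plt.scatter(centroid[0], centroid[1], color="black", marker="x", s=100)
    plt.show()

# Example: centroids, assignment = kmeans(points, 3); plot_clusters(points, centroids, assignment)
```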

Exercises

Run my clustering code on the Iris Flower Data Set for the "Sepal length" and "Petal length" columns. You can download the flower dataset in CSV form here. You will have to write a python program to:

1. read the CSV file,
2. build a list of points from the "Sepal length" and "Petal length" columns, and
3. run the clustering code on that list of points.

A sketch of how this might look is below.
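For the sketch I am assuming the file is saved as iris.csv, that sepal length is the first column and petal length the third, and that there is a header row; check the CSV you download, since its exact layout may differ. The kmeans call refers to the sketch above (or substitute the function from day22_cluster.py).

```python
import csv

def read_iris(filename="iris.csv"):
    """Return (points, species): each point is a (sepal length, petal length)
    tuple of floats, and species holds the last column of each row."""
    points, species = [], []
    with open(filename, newline="") as f:
        reader = csv.reader(f)
        next(reader)                          # skip the header row
        for row in reader:
            points.append((float(row[0]), float(row[2])))
            species.append(row[-1])           # actual subspecies, last column
    return points, species

points, species = read_iris()
centroids, assignment = kmeans(points, 3)     # three subspecies, so k = 3
```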

You don't have to do it for homework, but note that the CSV also contains the actual subspecies in the last column. We call this training data, because we can cluster as you did above, ignoring the subspecies, and then after clustering we can test whether our clusters actually match the input subspecies. This lets us determine "how good" our k-means clustering actually is. Think about how you could change your python code so that after running kmeans you print out how many points were misclassified. (Hint: you have the classified_points dictionary and the CSV data and just need to combine them back together.)
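One way to do that check, sketched with the read_iris and kmeans names from the sketches above rather than the classified_points dictionary in my code, is to count a point as misclassified when its species is not the majority species of the cluster it landed in.

```python
from collections import Counter

def count_misclassified(assignment, species):
    """Count points whose species differs from the majority species
    of the cluster they were assigned to."""
    misclassified = 0
    for i in set(assignment):
        in_cluster = [s for a, s in zip(assignment, species) if a == i]
        majority = Counter(in_cluster).most_common(1)[0][0]
        misclassified += sum(s != majority for s in in_cluster)
    return misclassified

print(count_misclassified(assignment, species), "points misclassified")
```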