October 19

Lecture Overview

Machine learning is an area of MCS aimed at classifying and understanding large amounts of data. This page gives a good overview, so I won't repeat it. See also Wikipedia. The clustering we talked about last time is an example of unsupervised learning.

The scikit-learn project provides a large number of Python functions to assist with machine learning. For example, instead of writing our own clustering code as I did last time, I could have used one of the clustering modules from this page. Here is an interesting example that uses k-means (the algorithm we discussed last time) to reduce the number of colors needed to represent an image (and therefore store the image using less space). There are quite a number of other interesting examples.
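To give a flavor of what that color-reduction example does, here is a minimal sketch of k-means color quantization with scikit-learn. The filename "photo.png" and the choice of 16 colors are my own assumptions for illustration, not values from the linked example.

```python
# Sketch: use k-means to reduce an image to 16 representative colors.
import numpy as np
from sklearn.cluster import KMeans
from matplotlib import pyplot as plt

image = plt.imread("photo.png")              # shape (height, width, channels); filename is assumed
pixels = image.reshape(-1, image.shape[-1])  # one row per pixel

# Cluster the pixels into 16 groups; each cluster center is a representative color.
kmeans = KMeans(n_clusters=16, n_init=10).fit(pixels)

# Replace every pixel by the center of its cluster, then view the result.
quantized = kmeans.cluster_centers_[kmeans.labels_]
plt.imshow(quantized.reshape(image.shape))
plt.show()
```

Storing only the 16 center colors plus one small index per pixel is what saves space compared to storing a full color value for every pixel.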

Supervised Learning

In supervised learning, we have as input a collection of data points along with their classifications. We want to use this data to train our program or method so that when future, unclassified points are given to us, we can use the previous training to predict the classification. For example, consider the Iris Flower Data Set. This data set already labels each point with a species, so we can use it as training data. Then, when future flowers are picked, we can measure the flower and predict the species.
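Here is a minimal sketch of setting up this kind of training data, using scikit-learn's built-in copy of the Iris Flower Data Set. Splitting off a test set is not something we discussed in lecture; it is just a convenient way to keep some labeled points aside to check our predictions later.

```python
# Sketch: load labeled iris data and hold some points back for testing.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target   # flower measurements and species labels

# Keep 25% of the labeled flowers aside to check predictions later.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
print(X_train.shape, y_train.shape)
```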

One simple method of supervised learning is called support vector machines (SVMs). I don't expect you to understand anything from that Wikipedia page, but in brief, an SVM uses the training data to pick a line which best separates the various classified points. This page shows this on the iris flower dataset (look at the image in the upper left). The SVM picked the lines which best separate the data: even though many of the red and orange points end up on the "wrong" side, these are the lines that put as many points as possible on the correct side. Now when future points come in, we can use the colored regions to predict each species. More complicated SVMs can split the training points apart with more than just lines (one of the images uses polynomials of degree 3, i.e. the program found the best degree-three polynomial boundaries separating the data).
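As a rough sketch of how this looks in code, here is an SVM fit to the iris data with scikit-learn's SVC class. The linear kernel looks for straight-line separators, and the degree-3 polynomial kernel allows the curved boundaries mentioned above; the particular train/test split is my own choice for illustration.

```python
# Sketch: fit SVMs with linear and degree-3 polynomial boundaries to the iris data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)   # straight-line separators
poly_svm = SVC(kernel="poly", degree=3).fit(X_train, y_train)  # degree-3 polynomial separators

# Use the held-out flowers to see how well each fitted boundary predicts species.
print("linear accuracy:", linear_svm.score(X_test, y_test))
print("degree-3 accuracy:", poly_svm.score(X_test, y_test))
```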

I also went through this example, which uses SVMs to classify hand-written digits.
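A minimal sketch in the spirit of that digits example: train an SVM on part of the digit images and predict the rest. The 50/50 split and the gamma value are assumptions made for this sketch.

```python
# Sketch: classify hand-written digits with an SVM.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()   # 8x8 grayscale images, each flattened to 64 features
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.5, random_state=0)

classifier = SVC(gamma=0.001).fit(X_train, y_train)   # gamma chosen for illustration
print("first few predictions:", classifier.predict(X_test[:10]))
print("accuracy:", classifier.score(X_test, y_test))
```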

Exercises