Since I plan to spend a fair amount of time on this blog talking about machine learning, I thought it would make sense to give a basic introduction to what in the world machine learning is. From the outside, people seem to think it’s some kind of magic. In this post, I will give a brief introduction to the principles in machine learning, highlighting the way machine learning researchers frame problems and how the algorithms they develop work to solve these problems.

The field is separated into two main categories: supervised learning and unsupervised learning.

Supervised learning is when we are given input vectors (X’s) and associated results (Y’s), which could be categories (classification problems) or numbers (regression problems). Given this information, we then seek to develop a model whereby, given a new vector (X), we can determine the associated result (Y).

Sorry if that was too far out there. Maybe we’re given a vector of a user’s preferences in music and we’re trying to classify him as a fan of “rock” or “rap.” Maybe we’re trying to look at some market variables and tell whether it’s going to go up or down.

In unsupervised problems, we are only given the input vectors (X). Our goal, then, is to “cluster” (or find logical groupings) of these vectors subject to some constraints. That’s not as important to what I do, so I’m going to leave it for another time.

**Supervised learning**

A vector is simply a list of numbers. For simplicity’s sake, we’ll start with a problem where the input vectors (X) are two-dimensional, since these vectors can easily be visualized on an x-y plane. However, it is important to remember that these algorithms can work on input vectors of any dimension.

Consider the following “toy problem.” We are given two-dimensional input vectors (X), and associated Y values which can be either “square” or “triangle.” Then, given a new two-dimensional point (X), depicted below as a green circle, we want to decide whether it corresponds to “square” or “triangle”:

This image demonstrates how perhaps the simplest of supervised learning algorithms, known as the k-nearest neighbor algorithm (or *KNN*) works. KNN simply looks at the new point, and finds the closest points in the training set (the nearest neighbors) in order to decide how to classify the new point. The inner circle demonstrates that, when we use the closest 3 points (k = 3) the algorithm will predict “triangle.” However, the outer circle shows that, when we select k = 5, it will predict “square.”

Naturally, since the Euclidean distance function used by KNN can work with vectors of any dimension, this principle can be applied to input vectors of any size (although they wouldn’t be quite as easy to visualize).

This is the fundamental problem of supervised learning: we are given examples (input vectors X and their associated outcomes Y), and we look for a way to train a model to make predictions about new points.

Although KNN does it in a fairly simple way, other algorithms tackle the problem from other directions.

Another example is an algorithm called a linear support vector machine. This algorithm relies on the intuition of maximum margin to solve supervised learning problems. Consider the following 2-dimensional example, here training an algorithm to separate between black and white Y values:

The linear SVM draws a line between the two sets in such a way that it accomplishes two things:

- The line is as far as possible from the closest training points of each class (maximum margin); and
- As few as possible points are on the “wrong side” of the line.

The training points along the dotted line are called the “support vectors”—the line is drawn so as to maximize its distance from these support vectors.

The linear SVM has one serious flaw: in order for it to work, the two classes must be separable by a simple line. That doesn’t occur very often. In the above KNN demonstration, for example, the squares weren’t linearly separable from the triangles. In that case, the linear SVM would fail miserably.

**The Kernel Trick**

The way we get around this problem is by using the kernel trick, which I believe is one of the most beautiful pieces of mathematics I’ve ever seen.

The kernel trick works as follows. We have our original points (X’s), which are 2-dimensional and sit in the x-y plane. As it turns out, there are functions (called kernel functions) that take those original 2-dimensional points and project them (according to some rules) into a much higher dimensional space. In fact, using the most popular kernel (called the Gaussian radial basis function kernel), it takes points of any dimension and projects them into an infinite-dimensional space (believe it or not, this is possible). Of course, computers can’t use infinite dimensional points. But, these kernel functions have special properties that allow us to compute various operations (mainly the “dot” product) on these infinite-dimensional vectors without ever actually calculating the infinite-dimensional vector itself. Since SVMs use dot-products for doing their analysis, they can essentially use a kernel function to project the 2-dimensional points into an infinite-dimensional space in which the two classes are linearly separable, and train the algorithm based on where these points lie in that infinite-dimensional space. When an SVM is used this way, it is called a Kernel Support Vector Machine.

This concept can be confusing at first. To get an idea of how this works in practice, you can play around with this SVM applet, which will allow you to use different data sets and different kernels and see how the SVM reacts.

KSVMs are truly revolutionary. They can accomplish a variety of tasks without any tweaking at all, including handwriting, speech and image recognition. I also have applied KSVMs to some problems in the financial sector, although their main selling point is that they can make good generalizations with small amounts of training examples. Since we have plenty of training examples, KSVMs may not be the right choice for us.

**Conclusions and further reading**

For further reading, I suggest taking a look at Andrew Moore’s tutorials, which I have found to be very helpful. Andrew Moore is a well-known AI researcher from CMU. Mainly, I suggest taking a look at his tutorials on Decision Trees, Gaussian Mixture Models, K-Means Clustering and Support Vector Machines. For a broad look at the field, his Intro to AI tutorial might be helpful.

Hopefully I’ve introduced the basic concept of machine learning. We come up with training examples, which are just a list of numbers tagged with a category/outcome. Then, given a new list of numbers, we are trying to classify it or predict the outcome associated with it. Thus, the data we feed these algorithms is of paramount importance. The real key is to find a group of data that make our outcomes separable in the high-dimensional space in which these training examples live—then, we simply have to write an algorithm that can learn how to separate between them.

Naturally, this is a bit of a simplification. These algorithms rarely work “out of the box,” and it requires a good understanding of all the internals to figure out which algorithm to use, and how specifically to implement it such as to maximize the chances of success. The programmer has to determine which kernel to use (if any), how to use the data to make it as useful as possible to the algorithm, and how to fine-tune the parameters that each of these algorithms take.

%d bloggers like this:

This is a good blog post, but I think it would be better if you defined KNN before you used it.

Comment by Peter — December 31, 2009 @ 6:48 pm

Updated. Thanks!

Comment by awwthor — January 1, 2010 @ 7:08 pm

great post, it would be nice if there was a quick intro on feature extraction, since it’s so important to ML.

KNN = k-nearest node

I recommend reading about another classifier, RVKDE, which is has the same idea as KNN, except we put more weight on the closer neighbors. It also runs faster than SVM O(n log n) vs O(n2), with similar performance.

Comment by Lee — December 31, 2009 @ 7:33 pm

Thanks, I am just starting my non-layman trek into these subjects with an AI class in the Spring :D. I am looking forward to more informative posts like this.

As previous commenter said, defining terms such as KNN would be helpful (essential, even); embedding a Wikipedia link as reference would be even better. Of course it all depends on your target audience.

Cheers.

Comment by Harlan Iverson — December 31, 2009 @ 7:40 pm

If you don’t mind I have a couple of questions that always bug me when reading Machine Learning and AI articles.

When was the last time you had a discussion with a neuroscientist? Or do you follow the neuroscience literature?

Ian

Comment by Ian Danforth — December 31, 2009 @ 7:41 pm

Hi Ian,

I haven’t followed neuroscience literature particularly closely but have always been interested in it. Feel free to drop me an email (jason at awwthor.com) and I’d be happy to try and help you out.

Comment by awwthor — December 31, 2009 @ 7:56 pm

Cool blog post!

The support vector machines applet you linked to wouldn’t run on my machine (chrome/win7), but I found an alternative through google to play with, here:

http://svm.dcs.rhbnc.ac.uk/pagesnew/GPat.shtml

Thanks for the intro!

Comment by manthrax — December 31, 2009 @ 7:50 pm

Interesting post, I would have to agree with Peter about defining KNN’s although a quick google define:KNN does help. I have been looking at Artificial Neural Networks (ANN) for a few weeks now and I would say that this helps me understand it a smidget better but it creates many more questions for me, but an excellent starting-mid point, thank you for the post!

Comment by Tofu — December 31, 2009 @ 8:01 pm

Thanks! Signing up for email updates from your blog.

Comment by Ethan — December 31, 2009 @ 8:06 pm

I think it was thousand foot deeper ,

Comment by sethu1986 — December 31, 2009 @ 8:09 pm

What software did you use to make the diagrams in your post ?

Comment by blackberryoctopus — December 31, 2009 @ 8:44 pm

This was a trip down memory lane for me. I did my senior design project on finding the best smoothing paramaters for the PNN Probabalistic Neural Network and GRNN General Regression Neural Network (For those of you who want it defined :P ) The KNN sounds a bit more advanced than the models that I was working on. Good luck with your research.

Comment by Dave — December 31, 2009 @ 9:15 pm

Thanks for the post! I’m a student at CMU and I’m interested in applying ML to trading as well. I’m looking forward to updates.

Comment by gus — January 1, 2010 @ 12:07 am

Author forgot to mention reinforcement learning, that is different from both supervised and unsupervised learning.

Comment by yulia — January 1, 2010 @ 12:22 am

[…] A Thousand Foot View of Machine Learning Since I plan to spend a fair amount of time on this blog talking about machine learning, I thought it would make sense […] […]

Pingback by Top Posts — WordPress.com — January 1, 2010 @ 12:33 am

I hope you continue your blog. I’ve been fascinated by issues in machine learning since the early 80’s, and created the early curriculum for an AI program at my university back then. However, I took a path that led me away from these topics for the decades since.

Look forward to reading more.

Comment by Ted — January 1, 2010 @ 1:25 am

Might be a good idea to mention semi-supervised learning and reinforcement learning as well.

Comment by Abhishek Ghose — January 1, 2010 @ 8:56 am

I hope to see more, very well written. As an applications person, machine intelligence is out of my depth, but not my grasp. Keep it up.

Comment by Conrad Braam — January 1, 2010 @ 10:26 am

Great article, though it would have been better if you have defined KNN before, as one of the commenters has posted here. I came along fascinated by the title ‘a thousand foot view of machine learning’ so I expected it written to a non-techie audience that arent students without assumed base knowledge of stuff like KNN.

Comment by Mohan Arun L — January 1, 2010 @ 11:46 am