Monday, September 7, 2009

Principal Components: Are They A Data Mining Technique?

Principal components have been mentioned in passing several times in previous posts. However, I have not ever talked specifically about them, and their relationship to data mining in general.

What are principal components? There are two common definitions that I do not find particularly insightful. I repeat them here, mostly to illustrate the distance from important mathematical ideas and their application. The first definition is that the principal components are the eigenvectors of the covariance matrix of the variables. The eigenwhats of the what? Knowing enough German to understand that "eigen" means something like "inherent" does not really help in understanding this. An explanation of this -- with lots of mathematical symbols -- is available on Wikipedia. (And, it is not surprising that the inventor of covariance Karl Pearson also invented principal component analysis.)

The second definition (which is equivalent to the first) starts by imagining the data as points in space. Off all the possible lines in the space, the first principal component is the line that maximizes the variance of the points projected on the line (and also goes through the centroid of the data points). Points, lines, projections, centroids, variance -- that also sounds a bit academic. (By the way, for the seriously mathematically inclined, here are pointers to how these defintions are the same.)

I prefer a third, less commonly touted definition, which also assumes that data is spread out as points in space. Of all possible lines in space, the first principal component is the one that minimizes the square of the distance from each data point to the line. Hey, you may be asking, "isn't this the same as the ordinary least squares regression line?" The reason why I like this approach is because it compares principal components to something that almost everyone is familiar with -- the best-fit line. And that provides an opportunity to compare and contrast and learn.

The first difference between the two is both subtle and important. The best-fit line only looks at the distance from each data point to the line along one dimension; that is, the line minimizes the sum of the squares of the differences along the target dimension ("y"). The first principal component is looking at the sum of the squares of the overall distance. The "distance" in this case is the length of the shortest vector that connects each point to the line. In general, the best fit line and the first principal component, are not the same (and I'm curious if the angle between them might be useful). A little known factoid about best fit lines is worth dropping in here. Given a set of data points (x, y), the best fit line that fits y = f(x) is different from the best fit line that fits x = f(y). And the first principal component fits "between" these lines in some sense.

There is a corollary to this. For a best-fit line, one dimension is special, the "y" dimension, because that is how the distance is measured. This is typically the target dimension for a model, the values we want to predict. For the first principal component, there is no special dimension. Hence, principal components are most useful when applied only to input variables without the target. A major difference from best-fit lines.

For me, it makes intuitive sense that the line that best fits input values would be useful for analysis. And, it makes intuitive sense in a way that the eigen-whatevers of some matrix do not intuitively say "useful" or even that the line that maximizes the variance does not say "useful". Even though all are doing the same thing, some ways of explaining the concept seem more intuitive and applicable to data analysis.

Another difference from the best fit line involves what statisticians call residuals -- that is, the difference from each of the original data points to the corresponding point on the line. For a best-fit line, the residuals are simply numbers, the difference between the original "y" and the "y" on the line. For the first principal component, the residuals are vectors -- the vectors that connect each point perpendicularly to the line. These vectors can be plotted in space. And, given a bunch of points in space, we can calculate the principal component for them. This is the second principal component. And these have residuals, and the process can keep going, for a while, yielding the third principal component, and so on.

The first principal component and the second principal component have a very particular property; they are orthogonal to each other, which means that they meet at a right angle. In fact, all principal components are orthogonal to each other, and orthogonality is a good thing when working with input values for data. So, it is tempting to replace the data with the first few principal components. It is not only tempting, but this is often a successful way to reduce the number of variables used for analysis.

By the way, there are not an infinite number of principal components. The number of principal components is the dimensionality of the original data points -- which is never more than the number of variables that define each point.

There is much more to say about principal components. The original question asked whether they are part of data mining. I have never been particularly proud of what is and what is not data mining -- I'm happy to include anything useful for data analysis under the heading. Unlike other techniques, though, principal components are not a fancy method for building predictive or descriptive models. Instead, they are part of the arsenal of tools available for managing and massaging input variables to maximize their utility.

PedroCGD said...

Dear Gordon,
In your opinion, what's the big difference and definition for in-sample and out-sample data?

And Training and Testing apply to in-sample and Validation for out-sample?

Are you imagine any supervised model development without out-sample test?

Regards,
Pedro
www.pedrocgd.blogspot.com

September 9, 2009 8:25 PM
Anonymous said...

Principal Components Analysis is also
some short of continuous k-means:

http://portal.acm.org/citation.cfm?id=1015408

September 18, 2009 4:13 AM