The method of principal components regression has achieved new prominence in machine learning, data reduction, and forecasting over the last decade.
It’s highly relevant in the era of Big Data, because it facilitates analyzing “fat” or wide databases, meaning databases with more predictors than observations. So you might have ten years of monthly data on sales, but 1000 potential predictors. Your database would then be 120 by 1001 (120 monthly observations, and 1000 predictors plus the sales variable), obeying the convention of stating the number of rows first and the number of columns second.
I’ve made several points about principal components in recent posts, and I want to provide further links and resources here.
There are four possible threads to this discussion: (1) the bottom line, meaning what you get when you do principal components, with examples; (2) an intermediate conceptual discussion of how taking the first few principal components is justified, how this reduces the dimensionality of the dataset, and how it works; (3) the underlying theory, basically matrix analysis and applied math; and (4) software, meaning how R, SPSS, MATLAB, and so forth do principal components.
The Bottom Line
In terms of forecasting, a lot of research over the past decade has focused on “many predictors” and reducing the dimensionality of “fat” databases. Key names are James Stock, Mark Watson, and Bai.
Stock and Watson have a white paper that has been updated several times, which is available in PDF format:
stock watson generalized shrinkage June _2012.pdf
They write in the June 2012 update,
We find that, for most macroeconomic time series, among linear estimators the DFM forecasts make efficient use of the information in the many predictors by using only a small number of estimated factors. These series include measures of real economic activity and some other central macroeconomic series, including some interest rates and monetary variables. For these series, the shrinkage methods with estimated parameters fail to provide mean squared error improvements over the DFM. For a small number of series, the shrinkage forecasts improve upon DFM forecasts, at least at some horizons and by some measures, and for these few series, the DFM might not be an adequate approximation. Finally, none of the methods considered here help much for series that are notoriously difficult to forecast, such as exchange rates, stock prices, or price inflation.
Here DFM refers to dynamic factor models, essentially principal components models which utilize PCs for lagged data.
Note also that this type of autoregressive or classical time series approach does not work well, in Stock and Watson’s judgment, for “series that are notoriously difficult to forecast, such as exchange rates, stock prices, or price inflation.”
Presumably, these series are closer to being random walks in some configuration.
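To make the dynamic factor idea above a bit more concrete, here is a minimal sketch of a factor-based forecast. It is not Stock and Watson’s estimator. It is a simplified illustration on simulated data, and every name and parameter in it (T, p, k, h, the AR(1) factors, the coefficient values) is an assumption made for the example. The factors are estimated as the first k principal components of the standardized predictors, and the target h periods ahead is then regressed on the current factors and the current value of the target.

```python
import numpy as np

rng = np.random.default_rng(4)
T, p, k, h = 120, 200, 2, 1   # periods, predictors, factors, horizon (all assumed)

# Simulated data: k persistent (AR(1)) latent factors drive all p predictors.
F = np.zeros((T, k))
for t in range(1, T):
    F[t] = 0.8 * F[t - 1] + rng.normal(size=k)
X = F @ rng.normal(size=(k, p)) + rng.normal(size=(T, p))
y = F @ np.array([1.0, 0.5]) + 0.2 * rng.normal(size=T)

# Estimate the factors as the first k principal components of standardized X.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
F_hat = U[:, :k] * s[:k]

# Regress y at time t+h on a constant, the estimated factors at t, and y at t.
design = np.column_stack([np.ones(T - h), F_hat[:-h], y[:-h]])
target = y[h:]
beta, *_ = np.linalg.lstsq(design, target, rcond=None)

# Forecast h periods past the end of the sample from the last observed row.
x_last = np.concatenate([[1.0], F_hat[-1], [y[-1]]])
forecast = float(x_last @ beta)
print(forecast)
```

Note that the factors here are estimated once on the full sample, which is fine for illustration but would need to be redone recursively in an honest out-of-sample exercise.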
Intermediate Level Concepts
Essentially, you can take any bundle of data and compute the principal components. If you mean-center and (in most cases) standardize the data, the principal components divide up the variance of this data, with each component’s share proportional to the size of its associated eigenvalue. The associated eigenvectors can be used to transform the data into an equivalent set of orthogonal vectors of the same size. Really, the principal components operate to change the basis of the data, transforming it into an equivalent representation, but one in which all the variables have zero correlation with each other.
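The steps just described can be sketched in a few lines of NumPy. This is a minimal illustration on simulated data, with the sample size, number of variables, and the single common driver all assumed for the example. It standardizes the data, takes the eigen-decomposition of the correlation matrix, and checks that the transformed data has zero correlation across columns.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 observations on 5 variables sharing a common driver.
n, p = 200, 5
common = rng.normal(size=(n, 1))
X = common @ rng.normal(size=(1, p)) + 0.5 * rng.normal(size=(n, p))

# Mean-center and standardize each column.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Eigen-decomposition of the sample correlation matrix.
R = (Z.T @ Z) / (n - 1)
eigvals, eigvecs = np.linalg.eigh(R)

# eigh returns eigenvalues in ascending order; put the largest first.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Change of basis: the scores have the same shape as Z.
scores = Z @ eigvecs

# Share of total variance attributed to each component.
share = eigvals / eigvals.sum()

# Covariance of the scores is diagonal, with the eigenvalues on the diagonal.
score_cov = np.cov(scores, rowvar=False)
print(np.round(share, 3))
```

Because the data is standardized, the eigenvalues sum to the number of variables, so each share is just the eigenvalue divided by p.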
The Wikipedia article on principal components is useful, but there is no getting around the fact that principal components can only really be understood with matrix algebra.
Often you see a diagram showing a cloud of points distributed around a line passing through the origin of a coordinate system, but at an acute angle to those coordinates.
This illustrates dimensionality reduction with principal components. If we express all these points in terms of this rotated set of coordinates, one of these coordinates – the signal – captures most of the variation in the data. Projections of the datapoints onto the second principal component, therefore, account for much less variance.
Principal component regression characteristically specifies only the first few principal components in the regression equation, knowing that, typically, these explain the largest portion of the variance in the data.
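A bare-bones version of principal component regression might look like the following. This is a sketch under assumptions, not a reference implementation. The helper names pcr_fit and pcr_predict are hypothetical, the data is simulated with the 120 by 1000 shape used earlier as the example, and the number of retained components k is simply taken as known.

```python
import numpy as np

def pcr_fit(X, y, k):
    """Regress y on the first k principal components of standardized X."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0, ddof=1)
    Z = (X - mu) / sigma
    # Rows of Vt are the principal directions, ordered by variance explained.
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    V_k = Vt[:k].T
    scores = Z @ V_k
    design = np.column_stack([np.ones(len(y)), scores])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return {"mu": mu, "sigma": sigma, "V_k": V_k, "coef": coef}

def pcr_predict(model, X_new):
    """Project new rows onto the stored directions and apply the regression."""
    Z_new = (X_new - model["mu"]) / model["sigma"]
    scores = Z_new @ model["V_k"]
    return model["coef"][0] + scores @ model["coef"][1:]

# Simulated "fat" data: 120 observations, 1000 predictors, 3 latent factors.
rng = np.random.default_rng(1)
n, p, k = 120, 1000, 3
factors = rng.normal(size=(n, k))
X = factors @ rng.normal(size=(k, p)) + rng.normal(size=(n, p))
y = factors @ np.array([1.0, -0.5, 0.25]) + 0.1 * rng.normal(size=n)

# Fit on the first 100 rows, predict the last 20.
model = pcr_fit(X[:100], y[:100], k)
y_hat = pcr_predict(model, X[100:])
print(np.corrcoef(y_hat, y[100:])[0, 1])
```

An ordinary regression on all 1000 predictors is not identified with only 100 training rows, while the regression on three component scores is a small, well-posed problem.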
It’s also noteworthy that some researchers are talking about “targeted” principal components. The first few principal components account for the largest, the next largest, and so on, amounts of variance in the data. However, the “data” in this context does not include the information we have on the target variable. Targeted principal components therefore involves first computing the simple correlations between the target variable and all the potential predictors, then ordering these potential predictors from highest to lowest correlation. Then, by one means or another, you establish a cutoff, below which you exclude weak potential predictors from the data matrix you use to compute the principal components. This is an interesting approach which makes sense, and testing it with a variety of examples seems in order.
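One way the targeting step could be sketched is shown below. Two choices here are assumptions rather than anything prescribed above: predictors are ranked by the absolute value of their correlation with the target, and the cutoff is a fixed count (the hypothetical parameter keep). The function name targeted_components and the simulated data are likewise made up for the illustration.

```python
import numpy as np

def targeted_components(X, y, keep, k):
    """Screen predictors by correlation with y, then take k principal components."""
    n = len(y)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    y_std = (y - y.mean()) / y.std(ddof=1)
    # Simple correlation of each predictor with the target.
    corr = Z.T @ y_std / (n - 1)
    # Rank by absolute correlation (an assumption) and keep the top `keep`.
    ranked = np.argsort(-np.abs(corr))
    selected = np.sort(ranked[:keep])
    # Principal components of the screened data matrix only.
    U, s, Vt = np.linalg.svd(Z[:, selected], full_matrices=False)
    scores = U[:, :k] * s[:k]
    return selected, scores

# Simulated data: only the first 50 of 1000 predictors carry the signal f.
rng = np.random.default_rng(2)
n, p = 120, 1000
f = rng.normal(size=n)
X = rng.normal(size=(n, p))
X[:, :50] += f[:, None]
y = f + 0.1 * rng.normal(size=n)

selected, scores = targeted_components(X, y, keep=50, k=2)
print(np.mean(selected < 50))
```

In practice the screening and the component extraction would both need to be done inside any cross-validation loop, since the screening step uses the target.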
The Underlying Theory
I don’t think it is possible to do much better than to watch Andrew Ng at Stanford in Lectures 14 and 15. I recommend skipping to 17:09 (seventeen minutes and nine seconds) into Lecture 14, where Ng begins the exposition of principal components. He winds up this lecture with a fascinating illustration of high-dimensional principal component analysis applied to recognizing or categorizing faces in photographs. Lecture 15 is also very useful, especially as it highlights the role of the Singular Value Decomposition (SVD) in calculating principal components for fat databases.
Lecture 14 http://www.youtube.com/watch?v=ey2PE5xi9-A
Lecture 15 http://www.youtube.com/watch?v=QGd06MTRMHs
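As a rough illustration of why the SVD matters for fat databases (this is a sketch on simulated data, not a transcription of anything in the lectures), the snippet below computes principal components of a 120 by 1000 matrix without ever forming the 1000 by 1000 covariance matrix. The dimensions and variable names are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 120, 1000                     # far more columns than rows
X = rng.normal(size=(n, p))
Z = X - X.mean(axis=0)               # mean-center each column

# Thin SVD: U is n x n, s has n entries, Vt is n x p.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)

# Nonzero eigenvalues of the p x p covariance matrix, obtained from s alone.
eig_from_svd = s**2 / (n - 1)

# Component scores: U scaled by the singular values equals Z projected on Vt.
scores = U * s

# Cross-check against the small n x n matrix Z Z' / (n - 1),
# which shares the same nonzero eigenvalues.
gram_eig = np.linalg.eigvalsh(Z @ Z.T / (n - 1))[::-1]
print(eig_from_svd[:3], gram_eig[:3])
```

With n rows there can be at most n components with nonzero variance (one fewer after centering), so the thin SVD returns everything there is to find.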
Software
Let me promise a post specifically on software for principal components later. This post probably will also mention something about the differences in working with fat and thin data matrices.

