Kernel ridge regression (KRR) is an interesting and apparently promising technique for business forecasting with “fat” databases. It is intrinsically a Big Data methodology, accommodating many predictors as well as nonlinearity.
Kernel ridge regression, however, is shrouded in a high degree of mathematical complexity. While this is certainly not window-dressing, it can obscure the fact that kernel ridge regression is no different from ordinary ridge regression on transformations of the regressors, except for an algebraic trick to improve computational efficiency.
This post develops a computational example of kernel ridge regression in an Excel spreadsheet to show the above point – kernel ridge regression is no different from ordinary ridge regression…except for an algebraic trick.
Most applications of KRR have been in the area of machine learning, especially optical character recognition.
To date, the primary business or economic forecasting application involves a well-known “fat” macroeconomic database. After updating this database, researchers from the Tinbergen Institute and Erasmus University develop KRR models which outperform principal component models in out-of-sample forecasts of key macroeconomic variables, such as real industrial production and employment.
Several white papers are key to this effort to apply KRR to business forecasting, including,
This research holds out great promise for KRR, concluding in “Nonlinear Forecasting …” that,
The empirical application to forecasting four key U.S. macroeconomic variables — production, income, sales, and employment — shows that kernel-based methods are often preferable to, and always competitive with, well-established autoregressive and principal-components-based methods. Kernel techniques also outperform previously proposed extensions of the standard PC-based approach to accommodate nonlinearity.
Computing a Kernel Ridge Regression
To understand KRR, it is helpful to begin with the formula for ridge regression,

β̂ = (XᵀX + λI)⁻¹Xᵀy

Here, X is the data matrix, Xᵀ is the transpose of X, λ is the conditioning factor, I is the identity matrix, and y is a vector of values of the dependent or target variable. The “beta-hats” are estimated β’s or coefficient values in the conventional linear regression equation,
y = β₁x₁ + β₂x₂ + … + βNxN
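For readers who prefer to check the algebra outside a spreadsheet, here is a minimal numpy sketch of the ridge regression formula (the function name ridge_beta is my own illustration, not part of the original spreadsheet example):

```python
import numpy as np

def ridge_beta(X, y, lam):
    """Ordinary ridge regression: beta-hat = (X'X + lambda*I)^(-1) X'y."""
    k = X.shape[1]                       # number of explanatory variables
    A = X.T @ X + lam * np.eye(k)        # X'X plus the conditioning term
    return np.linalg.solve(A, X.T @ y)   # solve the system rather than invert explicitly
```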
A recent post in this blog used this formula for a highly correlated and multicollinear pair of variables x₁ and x₂, showing that ridge regression provides values for the beta-hats which are much closer to the true values than standard or ordinary least squares (OLS) regression.
Now let us consider ridge regression with a different set of data.
By construction, the true relationship is
y = 2x₁ + 5x₂ + 0.25x₁x₂ + 0.5x₁² + 1.5x₂² + 0.5x₁x₂² + 0.4x₁²x₂ + 0.2x₁³ + 0.3x₂³
Now, unlike the post with multicollinear data, this example has more explanatory variables (10) than observations on the dependent variable. Thus, it is impossible to estimate the β’s by OLS. But, interestingly, we can make a stab at estimating the coefficients of the true relationship between y and the explanatory variables with the ridge regression formula.
The data matrix X is 6 rows by 10 columns, since there are six observations or cases and ten explanatory variables. Thus, the transpose Xᵀ is a 10 by 6 matrix. Accordingly, the product XᵀX is a 10 by 10 matrix, so we face a 10 by 10 inverse after the conditioning factor times the identity matrix is added to XᵀX.
The ridge regression formula above, therefore, gives us estimates for the ten beta-hats, as indicated in the following chart, using a λ or conditioning coefficient of 0.005.
The red bars indicate the true coefficient values, and the blue bars are the beta-hats estimated by the ridge regression formula.
As you can see, ridge regression does get into the zone on these ten coefficients, but with only six observations the estimates are very approximate.
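Here is a sketch of the same calculation in numpy. The values of x₁ and x₂ are hypothetical stand-ins (the spreadsheet data are not reproduced in this post), and the columns of X are the nine polynomial terms that appear in the true relationship written above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical values for x1 and x2 -- six observations, as in the spreadsheet example
x1 = rng.uniform(0.0, 1.0, 6)
x2 = rng.uniform(0.0, 1.0, 6)

# Columns of X: the polynomial transformations appearing in the true relationship above
X = np.column_stack([x1, x2, x1*x2, x1**2, x2**2, x1*x2**2, x1**2*x2, x1**3, x2**3])

# True coefficient values, in the same column order
true_beta = np.array([2, 5, 0.25, 0.5, 1.5, 0.5, 0.4, 0.2, 0.3])
y = X @ true_beta

# Ridge regression with conditioning coefficient lambda = 0.005
lam = 0.005
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Compare true coefficients (left column) with the ridge beta-hats (right column)
print(np.round(np.column_stack([true_beta, beta_hat]), 3))
```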
This is a longer post, so I have decided to complete the specifics of the “algebraic trick” in a follow-up post.
In the meantime, the Peter Exterkate paper on nonlinear predictors has a good matrix derivation which readers may want to peruse, before seeing it implemented in a spreadsheet.
Note that in order to estimate the ten coefficients by ordinary ridge regression, we had to invert the 10 by 10 matrix XᵀX + λI.
Question: when we implement the following rewrite of the ridge regression formula, what will be the dimension of the matrix we have to invert?
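For readers who want to peek ahead, here is a small numpy check of the standard dual-form identity behind the trick, (XᵀX + λI)⁻¹Xᵀy = Xᵀ(XXᵀ + λI)⁻¹y, using random data of the same 6 by 10 shape. This is my own illustration of the standard identity; the follow-up post will work through the spreadsheet version.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 10))   # 6 observations, 10 explanatory variables
y = rng.normal(size=6)
lam = 0.005

# Original form: invert a 10 by 10 matrix
beta_primal = np.linalg.solve(X.T @ X + lam * np.eye(10), X.T @ y)

# Rewritten (dual) form: invert a 6 by 6 matrix instead
beta_dual = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(6), y)

print(np.allclose(beta_primal, beta_dual))   # same beta-hats either way
```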