So how do you do ridge regression in Excel? How does ridge regression improve coefficient estimates in the case of ill-conditioned data matrices?
Here is an Excel-based example of how ridge regression “cures” effects of multicollinearity.
I created a three-variable system, using variables x1, x2, and their product x1x2. The generating equation is

y = 3x1 + 2x2 + 0.5x1x2 + e

where e is the error term, distributed normally with zero mean and standard deviation 10.
The relevant Excel function is e = norminv(rand(), 0, 10), where rand() is the canned Excel random number function.
I used the Excel random number function x1 = rand() to generate 80 instances of the variable x1. Then, I let x2 = x1 + rand()/100.
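The same data-generating process can be sketched in Python with NumPy (the seed and variable names are my own choices; NumPy's uniform and normal generators stand in for Excel's rand() and norminv()):

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility

n = 80
x1 = rng.random(n)                  # analogue of Excel's rand()
x2 = x1 + rng.random(n) / 100       # x2 nearly collinear with x1
x1x2 = x1 * x2
e = rng.normal(0, 10, n)            # analogue of norminv(rand(), 0, 10)
y = 3 * x1 + 2 * x2 + 0.5 * x1x2 + e

X = np.column_stack([x1, x2, x1x2])  # the 80-row by 3-column data matrix
```

Because x2 differs from x1 by at most 0.01, the first two columns of X are almost identical, which is what produces the multicollinearity below.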
When I ran an ordinary least squares (OLS) regression with this data (again, 80 observations, 3 explanatory variables), I got the following result, for one realization of this system.
This is definitely an example of multicollinearity. The Excel regression procedure cannot distinguish x1 from x2 effectively. The estimated coefficients of x1 and x2 are radically different from those in the generating equation above, even though the coefficient for x1x2 is close to its value in the generating equation.
The relevant formula is

β̂ = (XᵀX + λI)⁻¹Xᵀy

where, as is conventional, the coefficients to be estimated are termed the "beta's," and "beta-hat" (β̂) here is the vector of beta estimates produced by the ridge regression.
In this example, X is the 80-row by 3-column data matrix, the columns containing the values of x1, x2, and x1x2. Xᵀ is the transpose of the data matrix, a 3 by 80 matrix. Accordingly, the product XᵀX is a 3 by 3 matrix.
The matrix I is a 3 by 3 identity matrix, all zeros except for 1's on the main diagonal, running from top left to bottom right of the square matrix. λ is a constant multiplied into each of these 1's, which conditions XᵀX.
The whole expression (XᵀX + λI)⁻¹ is then a 3 by 3 matrix, which multiplies into the product Xᵀy, where y is the 80 by 1 column vector of observations on the dependent or target variable, so that Xᵀy is a 3 by 1 vector. Accordingly, (XᵀX + λI)⁻¹Xᵀy is a 3 by 1 vector.
The list of coefficient or parameter estimates, then, is this 3 by 1 vector of estimated coefficients for the three explanatory variables.
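Outside Excel, this closed-form estimate is a few lines of NumPy. The sketch below (the function name ridge is mine) solves the linear system rather than explicitly forming the inverse, which is numerically preferable but algebraically identical:

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimate: solve (X'X + lam*I) beta = X'y for beta."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```

With lam = 0 this reduces to ordinary least squares on a full-rank X, so you can check it against Excel's built-in regression before turning the conditioning factor on.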
How to Choose the Conditioning Factor λ
I’m going to punt on methods of choosing the conditioning factor, but the link above notes,
[The] optimal regularization parameter α [λ] is usually unknown and often in practical problems is determined by an ad hoc method. A possible approach relies on the Bayesian interpretation described above. Other approaches include the discrepancy principle, cross-validation, L-curve method, restricted maximum likelihood and unbiased predictive risk estimator.
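As one illustration of the cross-validation approach mentioned in that list, a bare-bones k-fold search over a grid of candidate λ values might look like the following sketch (the helper name cv_lambda, the fold count, and the grid are my own choices, not from the quoted source):

```python
import numpy as np

def cv_lambda(X, y, lambdas, k=5, seed=0):
    """Pick lambda minimizing k-fold cross-validated squared prediction error."""
    n, p = X.shape
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    errors = []
    for lam in lambdas:
        sse = 0.0
        for fold in folds:
            train = np.setdiff1d(idx, fold)
            # ridge fit on the training folds only
            beta = np.linalg.solve(X[train].T @ X[train] + lam * np.eye(p),
                                   X[train].T @ y[train])
            # squared prediction error on the held-out fold
            sse += np.sum((y[fold] - X[fold] @ beta) ** 2)
        errors.append(sse)
    return lambdas[int(np.argmin(errors))]
```

The idea is simply to score each candidate λ by how well the resulting fit predicts data it was not estimated on, and keep the winner.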
How to Set Up the Spreadsheet
I used two Excel matrix functions: mmult(array1, array2) and minverse(array).
So, for example, the matrix XᵀX is formed as
=MMULT(H3:CI5, D3:F82)
Here the range H3:CI5 contains Xᵀ and the range D3:F82 contains the 80-row by 3-column data matrix X.
Note that the way you set this up in Excel is first to write the formula in the top left cell of the range where the result matrix will go. Then, highlight the full output range. Then, press F2 and enter the formula as an array formula by pressing Ctrl+Shift+Enter simultaneously. The matrix values will populate the range you have indicated.
Here is a screenshot of some of the other computations.
λ is in cell M11, and is set here equal to 0.25.
Switching back from formula view to value view in the spreadsheet, here are the coefficient estimates produced by this ridge regression:
So, if you are trying to assess the impact of one or the other of these variables, this approach does a much better job of getting you into the right zone than standard regression – at the cost, it must be added, of introducing some bias into the parameter estimates.
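That bias-variance trade-off is easy to see by simulation: across repeated draws from the generating equation, ridge estimates of the x1 coefficient cluster far more tightly than the wildly swinging OLS estimates, at the price of being pulled away from the true value of 3. A sketch (the seed, the 200 replications, and λ = 0.25 are my choices):

```python
import numpy as np

rng = np.random.default_rng(0)
ols_b1, ridge_b1 = [], []
for _ in range(200):
    x1 = rng.random(80)
    x2 = x1 + rng.random(80) / 100           # near-collinear pair
    X = np.column_stack([x1, x2, x1 * x2])
    y = X @ np.array([3.0, 2.0, 0.5]) + rng.normal(0, 10, 80)
    # x1 coefficient under OLS and under ridge with lambda = 0.25
    ols_b1.append(np.linalg.lstsq(X, y, rcond=None)[0][0])
    ridge_b1.append(np.linalg.solve(X.T @ X + 0.25 * np.eye(3), X.T @ y)[0])

# The ridge estimates vary far less across realizations than the OLS ones.
print("OLS sd:", np.std(ols_b1), "ridge sd:", np.std(ridge_b1))
```

The standard deviation of the ridge estimates across realizations is dramatically smaller, which is exactly the variance reduction purchased with the bias.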
This, of course, is a toy example against the backdrop of the kinds of problems faced with Big Data. The number of observations here is significantly greater than the number of explanatory variables. Fat data matrices – more columns than rows – can also be approached with ridge regression, but they can entail computational problems, not to mention the limitations of an Excel spreadsheet.
However, these issues lead naturally into the next topic I want to explore computationally – which is kernel ridge regression.