# The Problem of Many Predictors – Ridge Regression and Kernel Ridge Regression

You might imagine that there is an iron law of ordinary least squares (OLS) regression – the number of observations on the dependent (target) variable and associated explanatory variables must be greater than the number of explanatory variables (regressors).

Ridge regression is one way to circumvent this requirement – to estimate, say, the values of p regression coefficients when there are only N < p training sample observations.

This is very helpful in all sorts of situations.

Instead of viewing many predictors as a variable selection problem (selecting a small enough subset of the p explanatory variables which are the primary drivers), data mining operations can just use all the potential explanatory variables, if the object is primarily predicting the value of the target variable. Note, however, that ridge regression exploits the tradeoff between bias and variance – producing biased coefficient estimates with lower variance than OLS (if, in fact, OLS can be applied).

A nice application was developed by Edward Malthouse some years back. Malthouse used ridge regression for direct marketing scoring models (search and you will find a downloadable PDF). These are targeting models to identify customers for offers, so the response to a mailing is maximized. A nice application, but pre-social media in its emphasis on the postal service.

In any case, Malthouse’s ridge regressions provided superior targeting capabilities. Also, since the final list was the object, rather than information about the various effects of drivers, ridge regression could be accepted as a technique without much worry about the bias introduced in the individual parameter estimates.

Matrix Solutions for Ordinary and Ridge Regression Parameters

Before considering spreadsheets, let’s highlight the similarity between the matrix solutions for OLS and ridge regression. Readers can skip this section to consider the worked spreadsheet examples.

Suppose we have data which consists of N observations or cases on a target variable y and vector of explanatory variables x,

$$
\begin{array}{ccccc}
y_1 & x_{11} & x_{12} & \cdots & x_{1p} \\
y_2 & x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \vdots & & \vdots \\
y_N & x_{N1} & x_{N2} & \cdots & x_{Np}
\end{array}
$$

Here $y_i$ is the ith observation on the target variable, and $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ are the associated values of the p (potential) explanatory variables, $i = 1, 2, \ldots, N$.

So we are interested in estimating the parameters of a relationship $Y = f(X_1, X_2, \ldots, X_p)$.

Assuming f(.) is a linear relationship, we search for the values of the p+1 parameters $(\beta_0, \beta_1, \ldots, \beta_p)$ which minimize the sum of squared errors $\sum (y - f(x))^2$ over the data – or sometimes over a subset called the training data, so we can generate out-of-sample tests of model performance.

Following Hastie, Tibshirani, and Friedman, the residual sum of squares (RSS) can be expressed,

$$RSS(\beta) = \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2$$

The solution to this least squares error minimization problem can be stated in a matrix formula,

$$\hat{\beta} = (X^T X)^{-1} X^T Y$$

where X is the data matrix and $X^T$ denotes the transpose of X.
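As a quick illustration of the OLS formula, here is a minimal sketch in Python with NumPy (the data, the random seed, and the coefficient values are invented for the example – the original post works in Excel):

```python
import numpy as np

# Simulated data: N = 20 observations on p = 3 explanatory variables
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + rng.normal(scale=0.01, size=20)

# OLS solution: beta-hat = (X'X)^(-1) X'Y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta_hat)  # close to [2.0, -1.0, 0.5]
```

In production code one would use `np.linalg.solve` or `np.linalg.lstsq` rather than forming the explicit inverse, but the inverse mirrors the textbook formula.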

Now ridge regression involves creating a penalty in the minimization of the squared errors designed to force down the absolute size of the regression coefficients. Thus, the minimization problem is

$$\min_{\beta} \; \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$

This also can be solved analytically in a closed matrix formula, similar to that for OLS –

$$\hat{\beta}_{ridge} = (X^T X + \lambda I)^{-1} X^T Y$$

Here λ is a penalty or conditioning factor, and I is the identity matrix. This conditioning factor λ, it should be noted, is usually determined by cross-validation – holding back some sample data and testing the impact of various values of λ on the goodness of fit of the overall relationship on this holdout or test data.
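Both pieces – the closed-form ridge estimate and holdout selection of λ – can be sketched in Python with NumPy (the data, the λ grid, and the train/test split are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5))
y = X @ np.array([3.0, 1.5, -2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=40)

def ridge(X, y, lam):
    """Closed-form ridge estimate: (X'X + lam*I)^(-1) X'Y."""
    return np.linalg.inv(X.T @ X + lam * np.eye(X.shape[1])) @ X.T @ y

# Hold back part of the sample and pick lambda by goodness of fit on it
X_train, y_train, X_test, y_test = X[:30], y[:30], X[30:], y[30:]
grid = [0.001, 0.01, 0.1, 1.0, 10.0]
mse = {lam: np.mean((y_test - X_test @ ridge(X_train, y_train, lam)) ** 2)
       for lam in grid}
best_lam = min(mse, key=mse.get)
print(best_lam)
```

Raising λ shrinks the coefficient vector toward zero – the norm of `ridge(X, y, 10.0)` is smaller than that of `ridge(X, y, 0.001)` – which is exactly the bias-for-variance trade discussed above.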

Ridge Regression in Excel

So what sort of results can be obtained with ridge regression in the context of many predictors?

Consider the following toy example.

By construction, the true relationship is

$$y = 2x_1 + 5x_2 + 0.25x_1x_2 + 0.5x_1^2 + 1.5x_2^2 + 0.5x_1x_2^2 + 0.4x_1^2x_2 + 0.2x_1^3 + 0.3x_2^3$$

so the top row with the numbers in bold lists the “true” coefficients of the relationship.

Also, note that, strictly speaking, this underlying equation is not linear in the original variables, since some exponents of the explanatory variables are greater than 1, and there are cross products.

Still, for purposes of estimation we treat the setup as though the data come from ten separate explanatory variables, each weighted by separate coefficients.

Now, assuming no constant term and mean-centered data, the data matrix X is 6 rows by 10 columns, since there are six observations or cases and ten explanatory variables. Thus, the transpose $X^T$ is a 10 by 6 matrix. Accordingly, the product $X^T X$ is a 10 by 10 matrix, and adding the conditioning factor times the identity matrix to $X^T X$ leaves a 10 by 10 matrix to invert.

The ridge regression formula above, therefore, gives us estimates for ten beta-hats, as indicated in the following chart, using a λ or conditioning coefficient of .005.

The red bars indicate the true coefficient values, and the blue bars are the beta-hats estimated by the ridge regression formula.

As you can see, ridge regression does get into the zone in terms of these ten coefficients of this linear expression, but with only 6 observations, the estimate is very approximate.
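The N < p situation in the toy example is easy to reproduce in Python with NumPy (random data stands in for the spreadsheet values, which are not reproduced here): with 6 cases and 10 variables, $X^TX$ is rank-deficient, so the OLS inverse does not exist, while the ridge formula still returns estimates.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 10))   # 6 observations, 10 explanatory variables
y = rng.normal(size=6)

# X'X is 10 x 10 but has rank at most 6, so it cannot be inverted for OLS
print(np.linalg.matrix_rank(X.T @ X))  # 6

# Adding lambda*I restores invertibility, and ridge estimation goes through
lam = 0.005
beta_ridge = np.linalg.inv(X.T @ X + lam * np.eye(10)) @ X.T @ y
print(beta_ridge.shape)  # (10,)
```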

The Kernel Trick

Note that in order to estimate the ten coefficients by ordinary ridge regression, we had to invert a 10 by 10 matrix XTX. We also can solve the estimation problem by inverting a 6 by 6 matrix, using the kernel trick, whose derivation is outlined in a paper by Exertate.

The key point is that kernel ridge regression is no different from ordinary ridge regression…except for an algebraic trick.

To show this, we applied the ridge regression formula to the 6 by 10 data matrix indicated above, estimating the ten coefficients, using a λ or conditioning coefficient of .005. These coefficients broadly resemble the true values.

The above matrix formula works for our linear expression in ten variables, which we can express as

y = β1x1+ β2x2+… + β10x10

Now with suitable pre- and post-multiplications and resorting, it is possible to switch things around to arrive at another matrix formula,

$$\hat{\beta}_{ridge} = X^T (XX^T + \lambda I)^{-1} Y$$

The following table shows beta-hats estimated by these two formulas are similar and compares them with the “true” values of the coefficients.

Differences in the estimates by these formulas relate strictly to issues at the level of numerical analysis and computation.

Kernels

Notice that the ten variables could correspond to a Taylor expansion which might be used to estimate the value of a nonlinear function. This is important and illustrates the concept of a “kernel”.

Thus, designating $K = XX^T$, we find that the elements of K can be obtained without going through the indicated multiplication of these two matrices. This is because K is a polynomial kernel – its entries are inner products of monomials in the original variables $x_1$ and $x_2$, and so can be evaluated directly from those two variables.
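As a concrete (if standard) illustration of this point – not the exact ten-variable setup above – the degree-2 polynomial kernel $(1 + x \cdot z)^2$ reproduces the inner product of six-dimensional monomial feature vectors without ever forming them. A sketch in Python with NumPy:

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-vector (v1, v2)."""
    v1, v2 = v
    s = np.sqrt(2.0)
    return np.array([1.0, s * v1, s * v2, v1 ** 2, v2 ** 2, s * v1 * v2])

def poly_kernel(x, z):
    """Degree-2 polynomial kernel evaluated on the original 2-vectors."""
    return (1.0 + x @ z) ** 2

x = np.array([0.5, -1.0])
z = np.array([2.0, 0.25])

# The same number, computed two ways: the kernel never touches the features
print(poly_kernel(x, z))  # 3.0625
print(phi(x) @ phi(z))    # 3.0625
```

With the kernel, a Gram matrix over N cases costs N² kernel evaluations in the original low-dimensional space, regardless of how large the implicit feature space is.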

The second matrix formula listed just above involves inverting a smaller matrix than the original formula – in our example, a 6 by 6, rather than a 10 by 10 matrix. This does not seem like a big deal with this toy example, but in Big Data and data mining applications, involving matrices with hundreds or thousands of rows and columns, the reduction in computational burden can be significant.

Summing Up

There is a great deal more that can be said about this example and the technique in general. Two big areas are (a) arriving at the estimate of the conditioning factor λ and (b) discussing the range of possible kernels that can be used, what makes a kernel a kernel, how to generate kernels from existing kernels, where Hilbert spaces come into the picture, and so forth.

But perhaps the important thing to remember is that ridge regression is one way to pry open the problem of many predictors, making it possible to draw on innumerable explanatory variables regardless of the size of the sample (within reason of course). Other techniques that do this include principal components regression and the lasso.

# Links – 2014, Early January

US and Global Economy

Bernanke sees headwinds fading as US poised for growth – happy talk about how good things are going to be as quantitative easing is “tapered.”

Slow Growth and Short Tails – but Dr. Doom (Nouriel Roubini) is guardedly optimistic about 2014.

The good news is that economic performance will pick up modestly in both advanced economies and emerging markets. The advanced economies, benefiting from a half-decade of painful private-sector deleveraging (households, banks, and non-financial firms), a smaller fiscal drag (with the exception of Japan), and maintenance of accommodative monetary policies, will grow at an annual pace closer to 1.9%. Moreover, so-called tail risks (low-probability, high-impact shocks) will be less salient in 2014. The threat, for example, of a eurozone implosion, another government shutdown or debt-ceiling fight in the United States, a hard landing in China, or a war between Israel and Iran over nuclear proliferation, will be far more subdued.

GOLDMAN: Here’s What Will Happen With GDP, Housing, The Fed, And Unemployment Next Year – Goldman Sachs chief economist Jan Hatzius writes 10 Questions for 2014. Hatzius is very bullish on 2014!

Three big macro questions for 2014 Gavyn Davies – tapering QE, China, and the euro. Requires free registration to read.

The State of the Euro, In One Graph From Paul Krugman, the point being that the EU’s austerity policies have significantly worsened the debt ratios of Spain, Portugal, Ireland, Greece, and Italy, despite lower interest rates. (Click to enlarge)

Technology

JCal’s 2014 predictions: Intense competition for YouTube and a shake up in online video economics

Rumblings in the YouTube community in the midst of tremendous growth in video productions – interesting.

Do disruptive technologies really overturn market leadership?

Discusses tests of the idea that such technologies have the characteristic that they perform worse on an important metric (or metrics) than current market-leading technologies. Of course, if that were it, then the technologies could hardly be called disruptive and would be confined, at best, to niche uses.

The second critical property of such technologies is that while they start behind on key metrics, they improve relatively rapidly and eventually come to outperform existing technologies on many metrics. It is there that disruptive technologies have their bite. Initially, they are poor performers and established firms would not want to integrate them into their products as they would disappoint their customers who happen to be most of the current market. However, when performance improves, the current technologies are displaced and established firms want to get in on the game. The problem is that they may be too late. In other words, Christensen’s prediction was that established firms would have legitimate “blind spots” with regard to disruptive technologies leaving room open for new entrants to come in, adopt those technologies and, ultimately, displace the established firms as market leaders.

Big Data – A Big Opportunity for Telecom Players

Today, with the sharp increase in online and mobile shopping through apps, telecom companies have access to consumer buying behaviours and preferences, which are actually being used with real-time geo-location and social network analysis to target consumers. Hmmm.

5 Reasons Why Big Data Will Crush Big Research

Traditional marketing research or “big research” focuses disproportionately on data collection.  This mentality is a hold-over from the industry’s early post-WWII boom –when data was legitimately scarce.  But times have changed dramatically since Sputnik went into orbit and the Ford Fairlane was the No. 1-selling car in America.

Here is why big data is going to win.

Reason 1: Big research is just too small… Reason 2: Big research lacks relevance… Reason 3: Big research doesn’t handle complexity well… Reason 4: Big research’s skill sets are outdated… Reason 5: Big research lacks the will to change…

I know “market researchers” who fit the profile in this Forbes article, and who are more or less lost in the face of the new extent of data and techniques for its analysis. On the other hand, I hear from the grapevine that many executives and managers can’t really see what the Big Data guys in their company are doing. There are success stories on the Internet (see the previous post here, for example), but this may be best case. Worst case is a company splurges on the hardware to implement Big Data analytics, and the team just comes up with gibberish – very hard to understand relationships with no apparent business value.

Some 2013 Recaps

Top Scientific Discoveries of 2013

Humankind goes interstellar … Genome editing … Billions and billions of Earths … Global warming: a cause for the pause … See-through brains … Intergalactic neutrinos … A new meat-eating mammal … Pesticide controversy grows … Making organs from stem cells … Implantable electronics … Dark matter shows up – or doesn’t … Fears of the fathers

The 13 Most Important Charts of 2013

And finally, a miscellaneous item. Hedge funds apparently do beat the market, or at least companies operating in the tail of the performance distribution show distinctive characteristics.

How do Hedge Fund “Stars” Create Value? Evidence from Their Daily Trades

I estimate hedge fund performance by computing calendar-time transaction portfolios (see, e.g., Seasholes and Zhu, 2010) with holding periods ranging from 21 to 252 days. Across all holding periods, I find no evidence that the average or median hedge fund outperforms, after accounting for trading commissions. However, I find significant evidence of outperformance in the right-tail of the distribution. Specifically, bootstrap simulations indicate that the annual performance of the top 10-30% of hedge funds cannot be explained by luck. Similarly, I find that superior performance persists. The top 30% of hedge funds outperform by a statistically significant 0.25% per month over the subsequent year. In sharp contrast to my hedge fund findings, both bootstrap simulations and performance persistence tests fail to reveal any outperformance among non-hedge fund institutional investors….

My remaining tests investigate how outperforming hedge funds (i.e., “star” hedge funds) create value. My main findings can be summarized as follows. First, star hedge funds’ profits are concentrated over relatively short holding periods. Specifically, more than 25% (50%) of star hedge funds’ annual outperformance occurs within the first month (quarter) after a trade. Second, star hedge funds tend to be short-term contrarians with small price impacts. Third, the profits of star hedge funds are concentrated in their contrarian trades. Finally, the performance persistence of star hedge funds is substantially stronger among funds that follow contrarian strategies (or funds with small price impacts) and is not at all present for funds that follow momentum strategies (or funds with large price impacts).