# Some Ways in Which Bayesian Methods Differ From the “Frequentist” Approach

I’ve been doing a deep dive into Bayesian materials, the past few days. I’ve tried this before, but I seem to be making more headway this time.

One question is whether Bayesian methods and statistics informed by the more familiar frequency interpretation of probability can give different answers.

I found this question on CrossValidated, too – Examples of Bayesian and frequentist approach giving different answers.

Among other things, responders cite YouTube videos of John Kruschke – the author of Doing Bayesian Data Analysis A Tutorial With R and BUGS

Here is Kruschke’s “Bayesian Estimation Supercedes the t Test,” which, frankly, I recommend you click on after reading the subsequent comments here.

I guess my concern is not just whether Bayesian and the more familiar frequentist methods give different answers, but, really, whether they give different predictions that can be checked.

I get the sense that Kruschke focuses on the logic and coherence of Bayesian methods in a context where standard statistics may fall short.

But I have found a context where there are clear differences in predictive outcomes between frequentist and Bayesian methods.

This concerns Bayesian versus what you might call classical regression.

In lecture notes for a course on Machine Learning given at Ohio State in 2012, Brian Kulis demonstrates something I had heard mention of two or three years ago, and another result which surprises me big-time.

Let me just state this result directly, then go into some of the mathematical details briefly.

Suppose you have a standard ordinary least squares (OLS) linear regression, which might look like,

where we can assume the data for y and x are mean centered. Then, as is well, known, assuming the error process ε is N(0,σ) and a few other things, the BLUE (best linear unbiased estimate) of the regression parameters w is –

Now Bayesian methods take advantage of Bayes Theorem, which has a likelihood function and a prior probability on the right hand side of the equation, and the resulting posterior distribution on the left hand side of the equation.

What priors do we use for linear regression in a Bayesian approach?

Well, apparently, there are two options.

First, suppose we adopt priors for the predictors x, and suppose the prior is a normal distribution – that is the predictors are assumed to be normally distributed variables with various means and standard deviations.

In this case, amazingly, the posterior distribution for a Bayesian setup basically gives the equation for ridge regression.

On the other hand, assuming a prior which is a Laplace distribution gives a posterior distribution which is equivalent to the lasso.

This is quite stunning, really.

Obviously, then, predictions from an OLS regression, in general, will be different from predictions from a ridge regression estimated on the same data, depending on the value of the tuning parameter λ (See the post here on this).

Similarly with a lasso regression – different forecasts are highly likely.

Now it’s interesting to question which might be more accurate – the standard OLS or the Bayesian formulations. The answer, of course, is that there is a tradeoff between bias and variability effected here. In some situations, ridge regression or the lasso will produce superior forecasts, measured, for example, by root mean square error (RMSE).

This is all pretty wonkish, I realize. But it conclusively shows that there can be significant differences in regression forecasts between the Bayesian and frequentist approaches.

What interests me more, though, is Bayesian methods for forecast combination. I am still working on examples of these procedures. But this is an important area, and there are a number of studies which show gains in forecast accuracy, measured by conventional metrics, for Bayesian model combinations.

# Estimation and Variable Selection with Ridge Regression and the LASSO

I’ve posted on ridge regression and the LASSO (Least Absolute Shrinkage and Selection Operator) some weeks back.

Here I want to compare them in connection with variable selection  where there are more predictors than observations (“many predictors”).

1. Ridge regression does not really select variables in the many predictors situation. Rather, ridge regression “shrinks” all predictor coefficient estimates toward zero, based on the size of the tuning parameter λ. When ordinary least squares (OLS) estimates have high variability, ridge regression estimates of the betas may, in fact, produce lower mean square error (MSE) in prediction.

2. The LASSO, on the other hand, handles estimation in the many predictors framework and performs variable selection. Thus, the LASSO can produce sparse, simpler, more interpretable models than ridge regression, although neither dominates in terms of predictive performance. Both ridge regression and the LASSO can outperform OLS regression in some predictive situations – exploiting the tradeoff between variance and bias in the mean square error.

3. Ridge regression and the LASSO both involve penalizing OLS estimates of the betas. How they impose these penalties explains why the LASSO can “zero” out coefficient estimates, while ridge regression just keeps making them smaller. From
An Introduction to Statistical Learning

Similarly, the objective function for the LASSO procedure is outlined by An Introduction to Statistical Learning, as follows

4. Both ridge regression and the LASSO, by imposing a penalty on the regression sum of squares (RWW) shrink the size of the estimated betas. The LASSO, however, can zero out some betas, since it tends to shrink the betas by fixed amounts, as λ increases (up to the zero lower bound). Ridge regression, on the other hand, tends to shrink everything proportionally.

5.The tuning parameter λ in ridge regression and the LASSO usually is determined by cross-validation. Here are a couple of useful slides from Ryan Tibshirani’s Spring 2013 Data Mining course at Carnegie Mellon.

6.There are R programs which estimate ridge regression and lasso models and perform cross validation, recommended by these statisticians from Stanford and Carnegie Mellon. In particular, see glmnet at CRAN. Mathworks MatLab also has routines to do ridge regression and estimate elastic net models.

Here, for example, is R code to estimate the LASSO.

lasso.mod=glmnet(x[train,],y[train],alpha=1,lambda=grid)
plot(lasso.mod)
set.seed(1)
cv.out=cv.glmnet(x[train,],y[train],alpha=1)
plot(cv.out)
bestlam=cv.out\$lambda.min
lasso.pred=predict(lasso.mod,s=bestlam,newx=x[test,])
mean((lasso.pred-y.test)^2)
out=glmnet(x,y,alpha=1,lambda=grid)
lasso.coef=predict(out,type=”coefficients”,s=bestlam)[1:20,]
lasso.coef
lasso.coef[lasso.coef!=0]

What You Get

I’ve estimated quite a number of ridge regression and LASSO models, some with simulated data where you know the answers (see the earlier posts cited initially here) and other models with real data, especially medical or health data.

As a general rule of thumb, An Introduction to Statistical Learning notes,

..one might expect the lasso to perform better in a setting where a relatively small number of predictors have substantial coefficients, and the remaining predictors have coefficients that are very small or that equal zero. Ridge regression will perform better when the response is a function of many predictors, all with coefficients of roughly equal size.

The R program glmnet linked above is very flexible, and can accommodate logistic regression, as well as regression with continuous, real-valued dependent variables ranging from negative to positive infinity.

# The Problem of Many Predictors – Ridge Regression and Kernel Ridge Regression

You might imagine that there is an iron law of ordinary least squares (OLS) regression – the number of observations on the dependent (target) variable and associated explanatory variables must be less than the number of explanatory variables (regressors).

Ridge regression is one way to circumvent this requirement, and to estimate, say, the value of p regression coefficients, when there are N<p training sample observations.

This is very helpful in all sorts of situations.

Instead of viewing many predictors as a variable selection problem (selecting a small enough subset of the p explanatory variables which are the primary drivers), data mining operations can just use all the potential explanatory variables, if the object is primarily predicting the value of the target variable. Note, however, that ridge regression exploits the tradeoff between bias and variance – producing biased coefficient estimates with lower variance than OLS (if, in fact, OLS can be applied).

A nice application was developed by Edward Malthouse some years back. Malthouse used ridge regression for direct marketing scoring models (search and you will find a downloadable PDF). These are targeting models to identify customers for offers, so the response to a mailing is maximized. A nice application, but pre-social media in its emphasis on the postal service.

In any case, Malthouse’s ridge regressions provided superior targeting capabilities. Also, since the final list was the object, rather than information about the various effects of drivers, ridge regression could be accepted as a technique without much worry about the bias introduced in the individual parameter estimates.

Matrix Solutions for Ordinary and Ridge Regression Parameters

Before considering spreadsheets, let’s highlight the similarity between the matrix solutions for OLS and ridge regression. Readers can skip this section to consider the worked spreadsheet examples.

Suppose we have data which consists of N observations or cases on a target variable y and vector of explanatory variables x,

y1           x11         x12         ..             x1p

y2           x21         x22         ..             x2p

………………………………….

yN          xN1        xN2        ..             xNp

Here yi is the ith observation on the target variable, and xi=(xi1,xi2,..xip) are the associated values for p (potential) explanatory variables, i=1,2,..,N.

So we are interested in estimating the parameters of a relationship Y=f(X1,X2,..Xk).

Assuming f(.) is a linear relationship, we search for the values of k+1 parameters (β01,…,βp) such that  Σ(y-f(x))2 minimizes the sum of all the squared errors over the data – or sometimes over a subset called the training data, so we can generate out-of-sample tests of model performance.

Following Hastie, Tibshirani, and Friedman, the Regression Sum of Squares (RSS) can be expressed,

The solution to this least squares error minimization problem can be stated in a matrix formula,

β= (XTX)-1XTY

where X is the data matrix, Here XT denotes the transpose of the matrix X.

Now ridge regression involves creating a penalty in the minimization of the squared errors designed to force down the absolute size of the regression coefficients. Thus, the minimization problem is

This also can be solved analytically in a closed matrix formula, similar to that for OLS –

βridge= (XTX-λІ)-1XTY

Here λ is a penalty or conditioning factor, and I is the identity matrix. This conditioning factor λ, it should be noted, is usually determined by cross-validation – holding back some sample data and testing the impact of various values of λ on the goodness of fit of the overall relationship on this holdout or test data.

Ridge Regression in Excel

So what sort of results can be obtained with ridge regression in the context of many predictors?

Consider the following toy example.

By construction, the true relationship is

y = 2x1 + 5x2+0.25x1x2+0.5x12+1.5x22+0.5x1x22+0.4x12x2+0.2x13+0.3x23

so the top row with the numbers in bold lists the “true” coefficients of the relationship.

Also, note that, strictly speaking, this underlying equation is not linear, since some exponents of explanatory variables are greater than 1, and there are cross products.

Still, for purposes of estimation we treat the setup as though the data come from ten separate explanatory variables, each weighted by separate coefficients.

Now, assuming no constant term and mean-centered data. the data matrix X is 6 rows by 10 columns, since there are six observations or cases and ten explanatory variables. Thus, the transpose XT is a 10 by 6 matrix. Accordingly, the product XTX is a 10 by 10 matrix, resulting in a 10 by 10 inverse matrix after the conditioning factor and identity matrix is added in to XTX.

The ridge regression formula above, therefore, gives us estimates for ten beta-hats, as indicated in the following chart, using a λ or conditioning coefficient of .005.

The red bars indicate the true coefficient values, and the blue bars are the beta-hats estimated by the ridge regression formula.

As you can see, ridge regression does get into the zone in terms of these ten coefficients of this linear expression, but with only 6 observations, the estimate is very approximate.

The Kernel Trick

Note that in order to estimate the ten coefficients by ordinary ridge regression, we had to invert a 10 by 10 matrix XTX. We also can solve the estimation problem by inverting a 6 by 6 matrix, using the kernel trick, whose derivation is outlined in a paper by Exertate.

The key point is that kernel ridge regression is no different from ordinary ridge regression…except for an algebraic trick.

To show this, we applied the ridge regression formula to the 6 by 10 data matrix indicated above, estimating the ten coefficients, using a λ or conditioning coefficient of .005. These coefficients broadly resemble the true values.

The above matrix formula works for our linear expression in ten variables, which we can express as

y = β1x1+ β2x2+… + β10x10

Now with suitable pre- and post-multiplications and resorting, it is possible to switch things around to arrive at another matrix formula,

The following table shows beta-hats estimated by these two formulas are similar and compares them with the “true” values of the coefficients.

Differences in the estimates by these formulas relate strictly to issues at the level of numerical analysis and computation.

Kernels

Notice that the ten variables could correspond to a Taylor expansion which might be used to estimate the value of a nonlinear function. This is important and illustrates the concept of a “kernel”.

Thus, designating K = XXwe find that the elements of K can be obtained without going through the indicated multiplication of these two matrices. This is because K is a polynomial kernel.

The second matrix formula listed just above involves inverting a smaller matrix, than the original formula – in our example, a 6 by 6, rather than a 10 by 10 matrix. This does not seem like a big deal with this toy example, but in Big Data and data mining applications, involving matrices with hundreds or thousands of rows and columns, the reduction in computation burden can be significant.

Summing Up

There is a great deal more that can be said about this example and the technique in general. Two big areas are (a) arriving at the estimate of the conditioning factor λ and (b) discussing the range of possible kernels that can be used, what makes a kernel a kernel, how to generate kernels from existing kernels, where Hilbert spaces come into the picture, and so forth.

But perhaps the important thing to remember is that ridge regression is one way to pry open the problem of many predictors, making it possible to draw on innumerable explanatory variables regardless of the size of the sample (within reason of course). Other techniques that do this include principal components regression and the lasso.