# Data Mining and Spurious Correlation

Spurious correlation is a tricky notion. Everyone learns that “correlation does not imply causation.” But what distinguishes a spurious, or bad, correlation from a “nonspurious,” or valid, one? And how can we test for “bad” correlations and guard against them?

These are nontrivial questions in the age of data science and data mining.

There are lots of amusing examples strung across the history of statistics, but lately issues core to Big Data have been surfacing. For example, applications of the lasso to select regressors can be led astray by spurious correlations. Then there is the scary analysis of data mining of financial data in “Spurious Regressions in Financial Economics.” And there is the more encouraging analysis by Deloitte & Touche researchers, “Does Credit Score Really Help Explain Insurance Losses?”

## Basics

Spurious correlations are classically found working with time series, although Simpson’s paradox shows the problem also can exist with cross-sectional data.

KEY POINT: Whenever you are examining relationships between nonstationary time series, you run the risk of spurious correlations or regressions.

It’s easy to gin up an example to convince yourself of this.

Let’s consider the relationship between two random walks.

Recall that a random walk is defined, in the simplest case, by,

$$x_t = x_{t-1} + e_t$$

where $e_t$ is conventionally a normally distributed variable with zero mean and constant variance $\sigma^2$, or $e_t \sim N(0, \sigma^2)$.

A random walk is nonstationary. From any time $t$ in the random walk, the expected value of any future term $x_{t+k}$ is simply $x_t$, whatever that is, so the walk has no fixed mean level to which it returns. Also, the variance of $x_t$ grows without limit as $t$ gets larger. So a random walk does not fluctuate around a level in the same sense that a sequence of random, independent numbers like $e_t$ does. Rather, a random walk will tend to deviate away from its initial level for unpredictable periods, then cross and possibly deviate the other way, and so forth.
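The growing variance is easy to check by simulation. The following is a minimal sketch, assuming a starting value of 0 and unit-variance shocks; the function name `random_walk`, the seed, and the sample sizes are illustrative choices, not anything from the original spreadsheet.

```python
import random
import statistics

def random_walk(n, sigma=1.0, x0=0.0, rng=random):
    """Return [x_1, ..., x_n] where x_t = x_{t-1} + e_t and e_t ~ N(0, sigma^2)."""
    path, x = [], x0
    for _ in range(n):
        x += rng.gauss(0.0, sigma)
        path.append(x)
    return path

rng = random.Random(42)  # fixed seed (an arbitrary choice) so the run is repeatable
walks = [random_walk(200, rng=rng) for _ in range(2000)]

# Variance of x_t measured across the simulated walks, at a few values of t.
# With sigma = 1 and x0 = 0, theory gives Var(x_t) = t.
var_by_t = {t: statistics.pvariance([w[t - 1] for w in walks]) for t in (10, 50, 200)}
for t, v in var_by_t.items():
    print(f"t={t:3d}  variance across walks = {v:6.1f}  (theory: {t})")
```

The measured variances should track t closely, which is the sense in which the walk never settles around a level.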

Suppose we generate two random walks, X and Y, and regress Y on X. Intuitively, the regression coefficient should be close to zero, and its t-statistic should be statistically insignificant. There is no particular reason why there would be a statistically significant relationship between two independently generated random series.

Or perhaps more to the point, there could be a relationship for some number of periods, but there would be no basis for expecting this relationship to continue to be predictive.

I believe this is the sense in which spurious correlations and regressions can be distinguished from their valid counterparts.

Thus, a spurious correlation near, say, 1 may exist between two unrelated random series over some stretch of time. But nothing ties the two series together, so in further realizations of these series there is no reason to expect that correlation to persist.
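How often do two independent random walks look "significantly" related? Here is a rough Monte Carlo sketch, assuming 50 observations per series, unit-variance shocks, and a cutoff of |t| > 2. The helper names (`slope_t_stat`, `share_significant`) are made up for this illustration, and the same experiment is run on plain white noise for comparison.

```python
import math
import random

def random_walk(n, rng):
    x, path = 0.0, []
    for _ in range(n):
        x += rng.gauss(0.0, 1.0)
        path.append(x)
    return path

def white_noise(n, rng):
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def slope_t_stat(x, y):
    """t-statistic of b in the OLS fit y = a + b*x + u."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    syy = sum((yi - my) ** 2 for yi in y)
    b = sxy / sxx
    sse = syy - b * sxy                      # residual sum of squares
    return b / math.sqrt(sse / (n - 2) / sxx)

def share_significant(make_series, n=50, trials=1000, seed=0):
    """Fraction of independent pairs whose slope has |t| > 2."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x, y = make_series(n, rng), make_series(n, rng)
        if abs(slope_t_stat(x, y)) > 2.0:
            hits += 1
    return hits / trials

rate_walks = share_significant(random_walk)
rate_noise = share_significant(white_noise)
print(f"independent random walks: {rate_walks:.0%} of regressions have |t| > 2")
print(f"independent white noise:  {rate_noise:.0%} of regressions have |t| > 2")
```

For white noise the share should sit near the nominal 5 percent. For random walks it should be far higher, even though every pair is independent by construction.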

To complete the example, the following graph illustrates a realization of two random walks which, quite by chance in this realization, tend to move together for the first 50 or so terms of both time series.

Over these first 50 terms, the correlation coefficient between these random walks is 0.697 – the square root of “R square” in the regression table shown below.

The SUMMARY OUTPUT of the regression of Y on X in the Excel spreadsheet where all this is created indicates that Y = 0.45X + 27.91. Furthermore, the SUMMARY OUTPUT tells us that the coefficient 0.45 has a t-statistic of 6.74, well above the critical value of about 2 conventionally used for significance at the 5 percent level.

Interestingly, the old rule of thumb, that R² > DW indicates a spurious regression, fails here. The R² is 0.487. Analyzing the residuals of the above regression on the first 50 values of these two random walks, we find a Durbin-Watson statistic (DW) of 0.576, so R² is less than DW and the rule raises no flag.
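For reference, the Durbin-Watson statistic is the sum of squared successive differences of the residuals divided by the sum of squared residuals. The sketch below computes it and applies the rule of thumb to the two figures reported above; the function names are hypothetical, and the approximation DW ≈ 2(1 − ρ) for the first-order residual autocorrelation ρ is the standard textbook one.

```python
def durbin_watson(residuals):
    """DW = sum_{t>=2} (e_t - e_{t-1})^2 / sum_t e_t^2."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den

def rule_of_thumb_flags(r_squared, dw):
    """The rule of thumb: suspect a spurious regression when R^2 > DW."""
    return r_squared > dw

# Figures reported in the text for the first 50 terms.
r2, dw = 0.487, 0.576
flagged = rule_of_thumb_flags(r2, dw)
print(flagged)  # -> False: the rule does not flag this regression

# Since DW is roughly 2 * (1 - rho), a DW of 0.576 still implies
# strongly autocorrelated residuals.
rho_hat = 1 - dw / 2
print(f"implied residual autocorrelation: {rho_hat:.2f}")  # -> 0.71
```

So even though the rule of thumb is silent here, a DW this far below 2 is itself a warning that the residuals are strongly autocorrelated and the reported t-statistic should not be trusted.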

Yet this is a spurious regression, because the relationship Y = 0.45X + 27.91 is bound to break down as more and more terms are considered. This relationship really has a zero probability (in the limit) of being predictive.
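One way to see the breakdown is to fit a line on the first 50 terms of two independent random walks and then use it to predict later terms of Y. The sketch below is an illustration under assumed settings (50 fitting periods, 50 forecast periods, unit-variance shocks, hypothetical helper names). It compares the fitted line, which is even handed the true future values of X, against the naive forecast "Y stays at its last observed value."

```python
import random

def random_walk(n, rng):
    x, path = 0.0, []
    for _ in range(n):
        x += rng.gauss(0.0, 1.0)
        path.append(x)
    return path

def ols_fit(x, y):
    """Intercept a and slope b of the least-squares line y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def compare_forecasts(n_fit=50, n_ahead=50, trials=1000, seed=1):
    """Average out-of-sample squared error: fitted line vs. last-value forecast."""
    rng = random.Random(seed)
    sq_err_reg = sq_err_naive = 0.0
    for _ in range(trials):
        x = random_walk(n_fit + n_ahead, rng)
        y = random_walk(n_fit + n_ahead, rng)
        a, b = ols_fit(x[:n_fit], y[:n_fit])   # fit on the first n_fit terms only
        last_y = y[n_fit - 1]
        for t in range(n_fit, n_fit + n_ahead):
            sq_err_reg += (y[t] - (a + b * x[t])) ** 2
            sq_err_naive += (y[t] - last_y) ** 2
    total = trials * n_ahead
    return sq_err_reg / total, sq_err_naive / total

mse_reg, mse_naive = compare_forecasts()
print(f"out-of-sample MSE, fitted regression: {mse_reg:.1f}")
print(f"out-of-sample MSE, last-value forecast: {mse_naive:.1f}")
```

On average the fitted relationship should do worse than simply ignoring X, which is what "not predictive" means in practice.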

## More Complex Cases

Just as in the discussion of classical linear regression, there is a menagerie of nonstationary time series capable of producing spurious correlations or entering into spurious regressions. A 2009 review article by D. Ventosa-Santaularia of the Universidad de Guanajuato in Mexico provides a survey of the field.

Differencing is the standard remedy, but the extent of differencing needs to be discovered. To my eye, it is not always easy to distinguish a simple random walk from a random walk with drift, which also can produce spurious correlations and regressions.
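To illustrate the remedy, the sketch below repeats the regression experiment for random walks with drift, once in levels and once in first differences. The drift of 0.2 per period, the sample length of 50, and the helper names are all assumptions chosen for the illustration.

```python
import math
import random

def drifting_walk(n, drift, rng):
    """x_t = drift + x_{t-1} + e_t with e_t ~ N(0, 1)."""
    x, path = 0.0, []
    for _ in range(n):
        x += drift + rng.gauss(0.0, 1.0)
        path.append(x)
    return path

def levels(series):
    return series

def first_difference(series):
    return [b - a for a, b in zip(series, series[1:])]

def slope_t_stat(x, y):
    """t-statistic of b in the OLS fit y = a + b*x + u."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    syy = sum((yi - my) ** 2 for yi in y)
    b = sxy / sxx
    sse = syy - b * sxy
    return b / math.sqrt(sse / (n - 2) / sxx)

def share_significant(transform, n=50, drift=0.2, trials=1000, seed=2):
    """Fraction of independent pairs with |t| > 2 after applying `transform`."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x = transform(drifting_walk(n, drift, rng))
        y = transform(drifting_walk(n, drift, rng))
        if abs(slope_t_stat(x, y)) > 2.0:
            hits += 1
    return hits / trials

rate_levels = share_significant(levels)
rate_diffs = share_significant(first_difference)
print(f"levels:            {rate_levels:.0%} of regressions have |t| > 2")
print(f"first differences: {rate_diffs:.0%} of regressions have |t| > 2")
```

One round of differencing is enough here because the simulated series are integrated of order one. Real series may need more, or less, which is the "extent of differencing" problem.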

Then, there is the issue of determining whether a time series is nonstationary in the first place. The huge literature in econometrics on unit roots stands as a testament to the relatively low statistical power of many of the relevant tests. Maybe the series is in fact deterministic, or has a long memory of some sort, and so forth.
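The power problem can be illustrated with a bare-bones Dickey-Fuller test. This is a sketch under stated assumptions, not a substitute for a proper implementation: it uses the simplest test regression (a constant, no lagged differences), 100 observations, and an approximate 5 percent critical value of -2.89 taken from standard Dickey-Fuller tables for that sample size. All function names are hypothetical.

```python
import math
import random

def ar1(n, rho, rng):
    """x_t = rho * x_{t-1} + e_t, starting from 0; rho = 1 gives a random walk."""
    x, path = 0.0, []
    for _ in range(n):
        x = rho * x + rng.gauss(0.0, 1.0)
        path.append(x)
    return path

def slope_t_stat(x, y):
    """t-statistic of b in the OLS fit y = a + b*x + u."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    syy = sum((yi - my) ** 2 for yi in y)
    b = sxy / sxx
    sse = syy - b * sxy
    return b / math.sqrt(sse / (n - 2) / sxx)

def dickey_fuller_t(series):
    """t-statistic on g in the regression  dx_t = a + g * x_{t-1} + u_t."""
    lagged = series[:-1]
    diffs = [b - a for a, b in zip(series, series[1:])]
    return slope_t_stat(lagged, diffs)

CRIT_5PCT = -2.89  # approximate tabulated value (assumption: constant, ~100 obs)

def rejection_rate(rho, n=100, trials=1000, seed=3):
    """Share of simulated series for which the unit-root null is rejected."""
    rng = random.Random(seed)
    rejections = sum(dickey_fuller_t(ar1(n, rho, rng)) < CRIT_5PCT for _ in range(trials))
    return rejections / trials

rate_unit_root = rejection_rate(1.0)    # null is true: should be near 5%
rate_near_unit = rejection_rate(0.95)   # stationary but persistent
rate_stationary = rejection_rate(0.5)   # clearly stationary
for label, rate in [("rho = 1.00", rate_unit_root),
                    ("rho = 0.95", rate_near_unit),
                    ("rho = 0.50", rate_stationary)]:
    print(f"{label}: unit root rejected in {rate:.0%} of samples")
```

The middle case is the troublesome one: a stationary but highly persistent series is usually not distinguished from a random walk at this sample size.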

## Re-enter Causality, Stage Left

Somewhere, in preparing this post, I ran across a quote from a top Google manager to the effect that spurious correlations were why it was important for people recruited to data mining to have good judgment, to be well-grounded.

I take this to mean that it is usually wise for someone who discovers a relationship via data mining to make sure he or she can “tell the story” of why this relationship exists, when presenting the results to management.

In short, there may be something a tad hypocritical about the mantra that correlation is not causation. In fact, what we really want to discover are correlations that are underpinned by causal relationships. That way, we can have some assurance of the continuation of observed patterns.