# Revisiting the Predictability of the S&P 500

Almost exactly a year ago, I posted on an algorithm and associated trading model for the S&P 500, the stock index which supports the SPY exchange traded fund.

I wrote up an autoregressive (AR) model, using daily returns for the S&P 500 from 1993 to early 2008. This AR model outperforms a buy-and-hold strategy for the period 2008-2013, as the following chart shows.

The trading algorithm involves “buying the S&P 500” when the closing price indicates a positive return for the following trading day. Then, I “close out the investment” the next trading day at that day’s closing price. Otherwise, I stay in cash.

It’s important to be your own worst critic, and, along those lines, I’ve had the following thoughts.

First, the above graph disregards trading costs. Your broker would have to be pretty forgiving to execute 2000-3000 trades for less than the \$500 you make over the buy-and-hold strategy. SO, I should deduct something for the trades in calculating the cumulative value.

The other criticism concerns high frequency trading. The daily returns are calculated against closing values, but, of course, to use this trading system you have to trade prior to closing. However, even a few seconds can make a crucial difference in the price of the S&P 500 or SPY – and even smaller intervals.

An Up-Dated AR Model

Taking some of these criticisms into account, I re-estimate an autoregressive model on more recent data –again calculating returns against closing prices on successive trading days.

This time I start with an initial investment of \$100,000, and deduct \$5 per trade off the totals as they cumulate.

I also utilize only seven (7) lags for the daily returns. This compares with the 30 lag model from the post a year ago, and I estimate the current model with OLS, rather than maximum likelihood.

The model is

Rt = 0.0007-0.0651Rt-1+0.0486Rt-2-0.0999Rt-3-0.0128Rt-4-0.1256Rt-5 +0.0063Rt-6-0.0140Rt-7

where Rt is the daily return for trading day t. This model originates on data from June 11, 2011. The coefficients of the equation result from bagging OLS regressions – developing coefficient estimates for 100,000 similar size samples drawn with replacement from this dataset of 809 observations. These 100,000 coefficient estimates are averaged to arrive at the numbers shown above.

Here is the result of applying my revised model to recent stock market activity. The results are out-of-sample. In other words, I use the predictive equation which is calculated over data prior to the start of the investment comparison. I also filter the positive predictions for the next day closing price, only acting when they are a certain size or larger.

There is a 2-3 percent return on a hundred thousand dollar investment in one month, and a projected annual return on the order of 20-30 percent.

The current model also correctly predicts the sign of the daily return 58 percent of the time, compared with a much lower figure for the model from a year ago.

This looks like the best thing since sliced bread.

But wait – what about high frequency trading?

I’m exploring the implementation of this model – and maybe should never make it public.

But let me clue you in on what I suspect, and some evidence I have.

So, first, it is interesting the gains from trading on closing day prices more than evaporate by the opening of the New York Stock Exchange, following the generation of a “buy” signal according to this algorithm.

In other words, if you adjust the trading model to buy at the open of the following trading day, when the closing price indicates a positive return for the following day – you do not beat a buy-and-hold strategy. Something happens between the closing and the opening of the NYSE market for the SPY.

Someone else knows about this model?

I’m exploring the “final second’ volatility of the market, focusing on trading days when the closing prices look like they might come in to indicate a positive return the following day. This is complicated, and it puts me into issues of predictability in high frequency data.

I also am looking at the SPY numbers specifically to bring this discussion closer to trading reality.

Bottom line – It’s hard to make money in the market on trading algorithms if you are a day-trader – although probably easier with a super-computer at your command and when you sit within microseconds of executing an order on the NY Stock Exchange.

But these researches serve to indicate one thing fairly clearly. And that is that there definitely are aspects of stock prices which are predictable. Acting on the predictions is the hard part.

And Postscript: Readers may have noticed a lesser frequency of posting on Business Forecast blog in the past week or so. I am spending time running estimations and refreshing and extending my understanding of some newer techniques. Keep checking in – there is rapid development in “real world forecasting” – exciting and whiz bang stuff. I need to actually compute the algorithms to gain a good understanding – and that is proving time-consuming. There is cool stuff in the data warehouse though.

# Business Forecasting – Some Thoughts About Scope

In many business applications, forecasting is not a hugely complex business. For a sales forecasting, the main challenge can be obtaining the data, which may require sifting through databases compiled before and after mergers or other reorganizations. Often, available historical data goes back only three or four years, before which time product cycles make comparisons iffy. Then, typically, you plug the sales data into an automatic forecasting program, one that can assess potential seasonality, and probably employing some type of exponential smoothing, and, bang, you produce forecasts for one to several quarters going forward.

The situation becomes more complex when you take into account various drivers and triggers for sales. The customer revenues and income are major drivers, which lead into assessments of business conditions generally. Maybe you want to evaluate the chances of a major change in government policy or the legal framework – both which are classifiable under “triggers.” What if the Federal Reserve starts raising the interest rates, for example.

For many applications, a driver-trigger matrix can be useful. This is a qualitative tool for presentations to management. Essentially, it helps keep track of assumptions about the scenarios which you expect to unfold from which you can glean directions of change for the drivers – GDP, interest rates, market conditions. You list the major influences on sales in the first column. In the second column you indicate the direction of this influences (+/-) and in the third column you put in the expected direction of change, plus, minus, or no change.

The next step up in terms of complexity is to collect historical data on the drivers and triggers – “explanatory variables” driving sales in the company. This opens the way for a full-blown multivariate model of sales performance. The hitch is to make this operational, you have to forecast the explanatory variables. Usually, this is done by relying, again, on forecasts by other organizations, such as market research vendors, consensus forecasts such as available from the Survey of Professional Forecasters and so forth. Sometimes it is possible to identify “leading indicators” which can be built into multivariate models. This is really the best of all possible worlds, since you can plug in known values of drivers and get a prediction for the target variable.

The value of forecasting to a business is linked with benefits of improvements in accuracy, as well as providing a platform to explore “what-if’s,” supporting learning about the business, customers, and so forth.

With close analysis, it is often possible to improve the accuracy of sales forecasts by a few percentage points. This may not sound like much, but in a business with \$100 million or more in sales, competent forecasting can pay for itself several times over in terms of better inventory management and purchasing, customer satisfaction, and deployment of resources.

Time Horizon

When you get a forecasting assignment, you soon learn about several different time horizons. To some extent, each forecasting time horizon is best approached with certain methods and has different uses.

Conventionally, there are short, medium, and long term forecasting horizons.

In general business applications, the medium term perspective of a few quarters to a year or two is probably the first place forecasting is deployed. The issue is usually the budget, and allocating resources in the organization generally. Exponential smoothing, possibly combined with information about anticipated changes in key drivers, usually works well in this context. Forecast accuracy is a real consideration, since retrospectives on the budget are a common practice. How did we do last year? What mistakes were made? How can we do better?

The longer term forecast horizons of several years or more usually support planning, investment evaluation, business strategy. The M-competitions suggest the issue has to be being able to pose and answer various “what-if’s,” rather than achieving a high degree of accuracy. Of course, I refer here to the finding that forecast accuracy almost always deteriorates in direct proportion to the length of the forecast horizon.

Short term forecasting of days, weeks, a few months is an interesting application. Usually, there is an operational focus. Very short term forecasting in terms of minutes, hours, days is almost strictly a matter of adjusting a system, such as generating electric power from a variety of sources, i.e. combining hydro and gas fired turbines, etc.

As far as techniques, short term forecasting can get sophisticated and mathematically complex. If you are developing a model for minute-by-minute optimization of a system, you may have several months or even years of data at your disposal. There are, thus, more than a half a million minutes in a year.

Forecasting and Executive Decisions

The longer the forecasting horizon, the more the forecasting function becomes simply to “inform judgment.”

A smart policy for an executive is to look at several forecasts, consider several sources of information, before determining a policy or course of action. Management brings judgment to bear on the numbers. It’s probably not smart to just take the numbers on blind faith. Usually, executives, if they pay attention to a presentation, will insist on a coherent story behind the model and the findings, and also checking the accuracy of some points. Numbers need to compute. Round-off-errors need to be buried for purposes of the presentation. Everything should add up exactly.

As forecasts are developed for shorter time horizons and more for direct operation control of processes, acceptance and use of the forecast can become more automatic. This also can be risky, since developers constantly have to ask whether the output of the model is reasonable, whether the model is still working with the new data, and so forth.

Shiny New Techniques

The gap between what is theoretically possible in data analysis and what is actually done is probably widening. Companies enthusiastically take up the “Big Data” mantra – hiring “Chief Data Scientists.” I noticed with amusement an article in a trade magazine quoting an executive who wondered whether hiring a data scientist was something like hiring a unicorn.

There is a lot of data out there, more all the time. More and more data is becoming accessible with expansion of storage capabilities and of course storage in the cloud.

And really the range of new techniques is dazzling.

I’m thinking, for example, of bagging and boosting forecast models. Or of the techniques that can be deployed for the problem of “many predictors,” techniques including principal component analysis, ridge regression, the lasso, and partial least squares.

Probably one of the areas where these new techniques come into their own is in target marketing. Target marketing is kind of a reworking of forecasting. As in forecasting sales generally, you identify key influences (“drivers and triggers”) on the sale of a product, usually against survey data or past data on customers and their purchases. Typically, there is a higher degree of disaggregation, often to the customer level, than in standard forecasting.

When you are able to predict sales to a segment of customers, or to customers with certain characteristics, you then are ready for the sales campaign to this target group. Maybe a pricing decision is involved, or development of a product with a particular mix of features. Advertising, where attitudinal surveys supplement customer demographics and other data, is another key area.

Related Areas

Many of the same techniques, perhaps with minor modifications, are applicable to other areas for what has come to be called “predictive analytics.”

The medical/health field has a growing list of important applications. As this blog tries to show, quantitative techniques, such as logistic regression, have a lot to offer medical diagnostics. I think the extension of predictive analytics to medicine and health care ism at this point, merely a matter of access to the data. This is low-hanging fruit. Physicians diagnosing a guy with an enlarged prostate and certain PSA and other metrics should be able to consult a huge database for similarities with respect to age, health status, collateral medical issues and so forth. There is really no reason to suspect that normally bright, motivated people who progress through medical school and come out to practice should know the patterns in 100,000 medical records of similar cases throughout the nation, or have read all the scientific articles on that particular niche. While there are technical and interpretive issues, I think this corresponds well to what Nate Silver identifies as promising – areas where application of a little quantitative analysis and study can reap huge rewards.

And cancer research is coming to be closely allied with predictive analytics and data science. The paradigmatic application is the DNA assay, where a sample of a tumor is compared with healthy tissue from the same individual to get an idea of what cancer configuration is at play. Indeed, at that fine new day when big pharma will develop hundreds of genetically targeted therapies for people with a certain genetic makeup with a certain cancer – when that wonderful new day comes – cancer treatment may indeed go hand in hand with mathematical analysis of the patient’s makeup.

# Bootstrapping

I’ve been reading about the bootstrap. I’m interested in bagging or bootstrap aggregation.

The primary task of a statistician is to summarize a sample based study and generalize the finding to the parent population in a scientific manner..

The purpose of a sample study is to gather information cheaply in a timely fashion. The idea behind bootstrap is to use the data of a sample study at hand as a “surrogate population”, for the purpose of approximating the sampling distribution of a statistic; i.e. to resample (with replacement) from the sample data at hand and create a large number of “phantom samples” known as bootstrap samples. The sample summary is then computed on each of the bootstrap samples (usually a few thousand). A histogram of the set of these computed values is referred to as the bootstrap distribution of the statistic.

These well-phrased quotes come from Bootstrap: A Statistical Method by Singh and Xie.

OK, so let’s do a simple example.

Suppose we generate ten random numbers, drawn independently from a Gaussian or normal distribution with a mean of 10 and standard deviation of 1.

This sample has an average of 9.7684. We would like to somehow project a 95 percent confidence interval around this sample mean, to understand how close it is to the population average.

So we bootstrap this sample, drawing 10,000 samples of ten numbers with replacement.

Here is the distribution of bootstrapped means of these samples.

The mean is 9.7713.

Based on the method of percentiles, the 95 percent confidence interval for the sample mean is between 9.32 and 10.23, which, as you note, correctly includes the true mean for the population of 10.

Bias-correction is another primary use of the bootstrap. For techies, there is a great paper from the old Bell Labs called A Real Example That Illustrates Properties of Bootstrap Bias Correction. Unfortunately, you have to pay a fee to the American Statistical Association to read it – I have not found a free copy on the Web.

In any case, all this is interesting and a little amazing, but what we really want to do is look at the bootstrap in developing forecasting models.

Bootstrapping Regressions

There are several methods for using bootstrapping in connection with regressions.

One is illustrated in a blog post from earlier this year. I treated the explanatory variables as variables which have a degree of randomness in them, and resampled the values of the dependent variable and explanatory variables 200 times, finding that doing so “brought up” the coefficient estimates, moving them closer to the underlying actuals used in constructing or simulating them.

This method works nicely with hetereoskedastic errors, as long as there is no autocorrelation.

Another method takes the explanatory variables as fixed, and resamples only the residuals of the regression.

Bootstrapping Time Series Models

The underlying assumptions for the standard bootstrap include independent and random draws.

This can be violated in time series when there are time dependencies.

Of course, it is necessary to transform a nonstationary time series to a stationary series to even consider bootstrapping.

But even with a time series that fluctuates around a constant mean, there can be autocorrelation.

So here is where the block bootstrap can come into play. Let me cite this study – conducted under the auspices of the Cowles Foundation (click on link) – which discusses the asymptotic properties of the block bootstrap and provides key references.

There are many variants, but the basic idea is to sample blocks of a time series, probably overlapping blocks. So if a time series yt  has n elements, y1,..,yn and the block length is m, there are n-m blocks, and it is necessary to use n/m of these blocks to construct another time series of length n. Issues arise when m is not a perfect divisor of n, and it is necessary to develop special rules for handling the final values of the simulated series in that case.

Block bootstrapping is used by Bergmeir, Hyndman, and Benıtez in bagging exponential smoothing forecasts.

How Good Are Bootstrapped Estimates?

Consistency in statistics or econometrics involves whether or not an estimate or measure converges to an unbiased value as sample size increases – or basically goes to infinity.

This is a huge question with bootstrapped statistics, and there are new findings all the time.

Interestingly, sometimes bootstrapped estimates can actually converge faster to the appropriate unbiased values than can be achieved simply by increasing sample size.

And some metrics really do not lend themselves to bootstrapping.

Also some samples are inappropriate for bootstrapping.  Gelman, for example, writes about the problem of “separation” in a sample

[In} ..an example of a poll from the 1964 U.S. presidential election campaign, … none of the black respondents in the sample supported the Republican candidate, Barry Goldwater… If zero black respondents in the sample supported Barry Goldwater, then zero black respondents in any bootstrap sample will support Goldwater as well. Indeed, bootstrapping can exacerbate separation by turning near-separation into complete separation for some samples. For example, consider a survey in which only one or two of the black respondents support the Republican candidate. The resulting logistic regression estimate will be noisy but it will be finite.

Here is a video doing a good job of covering the bases on boostrapping. I suggest sampling portions of it first. It’s quite good, but it may seem too much going into it.

# Bagging Exponential Smoothing Forecasts

Bergmeir, Hyndman, and Benıtez (BHB) successfully combine two powerful techniques – exponential smoothing and bagging (bootstrap aggregation) – in ground-breaking research.

I predict the forecasting system described in Bagging Exponential Smoothing Methods using STL Decomposition and Box-Cox Transformation will see wide application in business and industry forecasting.

These researchers demonstrate their algorithms for combining exponential smoothing and bagging outperform all other forecasting approaches in the M3 forecasting competition database for monthly time series, and do better than many approaches for quarterly and annual data. Furthermore, the BHB approach can be implemented with extant routines in the programming language R.

This table compares bagged exponential smoothing with other approaches on monthly data from the M3 competition.

Here BaggedETS.BC refers to a variant of the bagged exponential smoothing model which uses a Box Cox transformation of the data to reduce the variance of model disturbances, The error metrics are the symmetric mean absolute percentage error (sMAPE) and the mean absolute scaled error (MASE). These are calculated for applications of the various models to out-of-sample, holdout, or test sample data from each of 1428 monthly time series in the competition.

See the online text by Hyndman and Athanasopoulos for motivations and discussions of these error metrics.

The BHB Algorithm

In a nutshell, here is the BHB description of their algorithm.

After applying a Box-Cox transformation to the data, the series is decomposed into trend, seasonal and remainder components. The remainder component is then bootstrapped using the MBB, the trend and seasonal components are added back, and the Box-Cox transformation is inverted. In this way, we generate a random pool of similar bootstrapped time series. For each one of these bootstrapped time series, we choose a model among several exponential smoothing models, using the bias-corrected AIC. Then, point forecasts are calculated using all the different models, and the resulting forecasts are averaged.

The MBB is the moving block bootstrap. It involves random selection of blocks of the remainders or residuals, preserving the time sequence and, hence, autocorrelation structure in these residuals.

Several R routines supporting these algorithms have previously been developed by Hyndman et al. In particular, the ets routine developed by Hyndman and Khandakar fits 30 different exponential smoothing models to a time series, identifying the optimal model by an Akaike information criterion.

Some Thoughts

This research lays out an almost industrial-scale effort to extract more information for prediction purposes from time series, and at the same time to use an applied forecasting workhorse – exponential smoothing.

Exponential smoothing emerged as a forecasting technique in applied contexts in the 1950’s and 1960’s. The initial motivation was error correction from forecasts of arbitrary origin, instead of an underlying stochastic model. Only later were relationships between exponential smoothing and time series processes, such as random walks, revealed with the work of Muth and others.

The M-competitions, initially organized in the 1970’s, gave exponential smoothing a big boost, since, by some accounts, exponential smoothing “won.” This is one of the sources of the meme – simpler models beat more complex models.

Then, at the end of the 1990’s, Makridakis and others organized a penultimate M-competition which was, in fact, won by the automatic forecasting software program Forecast Pro. This program typically compares ARIMA and exponential smoothing models, picking the best model through proprietary optimization of the parameters and tests on holdout samples. As in most sales and revenue forecasting applications, the underlying data are time series.

While all this was going on, the machine learning community was ginning up new and powerful tactics, such as bagging or bootstrap aggregation. Bagging can be a powerful technique for focusing on parameter estimates which are otherwise masked by noise.

So this application and research builds interestingly on a series of efforts by Hyndman and his associates and draws in a technique that has been largely confined to machine learning and data mining.

It is really almost the first of its kind – where bagging applications to time series forecasting have been less spectacularly successful than in cross-sectional regression modeling, for example.

A future post here will go through the step-by-step of this approach using some specific and familiar time series from the M competition data.

# Complete Subset Regressions

A couple of years or so ago, I analyzed a software customer satisfaction survey, focusing on larger corporate users. I had firmagraphics – specifying customer features (size, market segment) – and customer evaluation of product features and support, as well as technical training. Altogether, there were 200 questions that translated into metrics or variables, along with measures of customer satisfaction. Altogether, the survey elicited responses from about 5000 companies.

Now this is really sort of an Ur-problem for me. How do you discover relationships in this sort of data space? How do you pick out the most important variables?

Since researching this blog, I’ve learned a lot about this problem. And one of the more fascinating approaches is the recent development named complete subset regressions.

And before describing some Monte Carlo exploring this approach here, I’m pleased Elliot, Gargano, and Timmerman (EGT) validate an intuition I had with this “Ur-problem.” In the survey I mentioned above, I calculated a whole bunch of univariate regressions with customer satisfaction as the dependent variable and each questionnaire variable as the explanatory variable – sort of one step beyond calculating simple correlations. Then, it occurred to me that I might combine all these 200 simple regressions into a predictive relationship. To my surprise, EGT’s research indicates that might have worked, but not be as effective as complete subset regression.

Complete Subset Regression (CSR) Procedure

As I understand it, the idea behind CSR is you run regressions with all possible combinations of some number r less than the total number n of candidate or possible predictors. The final prediction is developed as a simple average of the forecasts from these regressions with r predictors. While some of these regressions may exhibit bias due to specification error and covariance between included and omitted variables, these biases tend to average out, when the right number r < n is selected.

So, maybe you have a database with m observations or cases on some target variable and n predictors.

And you are in the dark as to which of these n predictors or potential explanatory variables really do relate to the target variable.

That is, in a regression y = β01 x1 +…+βn xn some of the beta coefficients may in fact be zero, since there may be zero influence between the associated xi and the target variable y.

Of course, calling all the n variables xi i=1,…n “predictor variables” presupposes more than we know initially. Some of the xi could in fact be “irrelevant variables” with no influence on y.

In a nutshell, the CSR procedure involves taking all possible combinations of some subset r of the n total number of potential predictor variables in the database, and mapping or regressing all these possible combinations onto the dependent variable y. Then, for prediction, an average of the forecasts of all these regressions is often a better predictor than can be generated by other methods – such as the LASSO or bagging.

EGT offer a time series example as an empirical application. based on stock returns, quarterly from 1947-2010 and twelve (12) predictors. The authors determine that the best results are obtained with a small subset of the twelve predictors, and compare these results with ridge regression, bagging, Lasso and Bayesian Model Averaging.

The article in The Journal of Econometrics is well-worth purchasing, if you are not a subscriber. Otherwise, there is a draft in PDF format from 2012.

The combination of n things taken r at a time is n!/[(n-r)!(r!)] and increases faster than exponentially, as n increases. For large n, accordingly, it is necessary to sample from the possible set of combinations – a procedure which still can generate improvements in forecast accuracy over a “kitchen sink” regression (under circumstances further delineated below). Otherwise, you need a quantum computer to process very fat databases.

When CSR Works Best – Professor Elloitt

I had email correspondence with Professor Graham Elliott, one of the co-authors of the above-cited paper in the Journal of Econometrics.

His recommendation is that CSR works best with when there are “weak predictors” sort of buried among a superset of candidate variables,

If a few (say 3) of the variables have large coefficients such as that they result in a relatively large R-square for the prediction regression when they are all included, then CSR is not likely to be the best approach. In this case model selection has a high chance of finding a decent model, the kitchen sink model is not all that much worse (about 3/T times the variance of the residual where T is the sample size) and CSR is likely to be not that great… When there is clear evidence that a predictor should be included then it should be always included…, rather than sometimes as in our method. You will notice that in section 2.3 of the paper that we construct properties where beta is local to zero – what this math says in reality is that we mean the situation where there is very little clear evidence that any predictor is useful but we believe that some or all have some minor predictive ability (the stock market example is a clear case of this). This is the situation where we expect the method to work well. ..But at the end of the day, there is no perfect method for all situations.

I have been toying with “hidden variables” and, then, measurement error in the predictor variables in simulations that further validate Graham Elliot’s perspective that CSR works best with “weak predictors.”

Monte Carlo Simulation

Here’s the spreadsheet for a relevant simulation (click to enlarge).

It is pretty easy to understand this spreadsheet, but it may take a few seconds. It is a case of latent variables, or underlying variables disguised by measurement error.

The z values determine the y value. The z values are multiplied by the bold face numbers in the top row, added together, and then the epsilon error ε value is added to this sum of terms to get each y value. You have to associate the first bold face coefficient with the first z variable, and so forth.

At the same time, an observer only has the x values at his or her disposal to estimate a predictive relationship.

These x variables are generated by adding a Gaussian error to the corresponding value of the z variables.

Note that z5 is an irrelevant variable, since its coefficient loading is zero.

This is a measurement error situation (see the lecture notes on “measurement error in X variables” ).

The relationship with all six regressors – the so-called “kitchen-sink” regression – clearly shows a situation of “weak predictors.”

I consider all possible combinations of these 6 variables, taken 3 at a time, or 20 possible distinct combinations of regressors and resulting regressions.

In terms of the mechanics of doing this, it’s helpful to set up the following type of listing of the combinations.

Each digit in the above numbers indicates a variable to include. So 123 indicates a regression with y and x1, x2, and x3. Note that writing the combinations in this way so they look like numbers in order of increasing size can be done by a simple algorithm for any r and n.

And I can generate thousands of cases by allowing the epsilon ε values and other random errors to vary.

In the specific run above, the CSR average soundly beats the mean square error (MSE) of this full specification in forecasts over ten out-of-sample values. The MSE of the kitchen sink regression, thus, is 2,440 while the MSE of the regression specifying all six regressors is 2653. It’s also true that picking the lowest within-sample MSE among the 20 possible combinations for k = 3 does not produce a lower MSE in the out-of-sample run.

This is characteristics of results in other draws of the random elements. I hesitate to characterize the totality without further studying the requirements for the number of runs, given the variances, and so forth.

I think CSR is exciting research, and hope to learn more about these procedures and report in future posts.

# Random Subspace Ensemble Methods (Random Forest™ algorithm)

As a term, random forests apparently is trademarked, which is, in a way, a shame because it is so evocative – random forests, for example, are comprised of a large number of different decision or regression trees, and so forth.

Whatever the name we use, however, the Random Forest™ algorithm is a powerful technique. Random subspace ensemble methods form the basis for several real world applications, such as Microsoft’s Kinect, facial recognition programs in cell phone and other digital cameras, and figure importantly in many Kaggle competitions, according to Jeremy Howard, formerly Kaggle Chief Scientist.

I assemble here a Howard talk from 2011 called “Getting In Shape For The Sport Of Data Science” and instructional videos from a data science course at the University of British Columbia (UBC). Watching these involves a time commitment, but it’s possible to let certain parts roll and then to skip ahead. Be sure and catch the last part of Howard’s talk, since he’s good at explaining random subspace ensemble methods, aka random forests.

It certainly helps me get up to speed to watch something, as opposed to reading papers on a fairly unfamiliar combination of set theory and statistics.

By way of introduction, the first step is to consider a decision tree. One of the UBC videos notes that decision trees faded from popularity some decades ago, but have come back with the emergence of ensemble methods.

So a decision tree is a graph which summarizes the classification of multi-dimensional points in some space, usually based on creating rectangular areas with reference to the coordinates. The videos make this clearer.

So this is nice, but decision trees of this sort tend to over-fit; they may not generalize very well. There are methods of “pruning” or simplification which can help generalization, but another tactic is to utilize ensemble methods. In other words, develop a bunch of decision trees classifying some set of multi-attribute items.

Random forests simply build such decision trees with a randomly selected group of attributes, subsets of the total attributes defining the items which need to be classified.

The idea is to build enough of these weak predictors and then average to arrive at a modal or “majority rule” classification.

Here’s the Howard talk.

Then, there is an introductory UBC video on decision trees

This video goes into detail on the method of constructing random forests.

Then the talk on random subspace ensemble applications.

# Hal Varian and the “New” Predictive Techniques

Big Data: New Tricks for Econometrics is, for my money, one of the best discussions of techniques like classification and regression trees, random forests, and penalized  regression (such as lasso, lars, and elastic nets) that can be found.

Varian, pictured aove, is emeritus professor in the School of Information, the Haas School of Business, and the Department of Economics at the University of California at Berkeley. Varian retired from full-time appointments at Berkeley to become Chief Economist at Google.

He also is among the elite academics publishing in the area of forecasting according to IDEAS!.

Big Data: New Tricks for Econometrics, as its title suggests, uses the wealth of data now being generated (Google is a good example) as a pretext for promoting techniques that are more well-known in machine learning circles, than in econometrics or standard statistics, at least as understood by economists.

First, the sheer size of the data involved may require more sophisticated 18 data manipulation tools. Second, we may have more potential predictors than appropriate for estimation, so we need to do some kind of variable selection. Third, large data sets may allow for more flexible relationships than simple linear models. Machine learning techniques such as decision trees, support vector machines, neural nets, deep learning and so on may allow for more effective ways to model complex relationships.

He handles the definitional stuff deftly, which is good, since there is not standardization of terms yet in this rapidly evolving field of data science or predictive analytics, whatever you want to call it.

Thus, “NoSQL” databases are

sometimes interpreted as meaning “not only SQL.” NoSQL databases are more primitive than SQL databases in terms of data manipulation capabilities but can handle larger amounts of data.

The essay emphasizes out-of-sample prediction and presents a nice discussion of k-fold cross validation.

1. Divide the data into k roughly equal subsets and label them by s =1; : : : ; k. Start with subset s = 1.

2. Pick a value for the tuning parameter.

3. Fit your model using the k -1 subsets other than subset s.

4. Predict for subset s and measure the associated loss.

5. Stop if s = k, otherwise increment s by 1 and go to step 2.

Common choices for k are 10, 5, and the sample size minus 1 (“leave one out”). After cross validation, you end up with k values of the tuning parameter and the associated loss which you can then examine to choose an appropriate value for the tuning parameter. Even if there is no tuning parameter, it is useful to use cross validation to report goodness-of-t measures since it measures out-of-sample performance which is what is typically of interest.

Varian remarks that Test-train and cross validation, are very commonly used in machine learning and, in my view, should be used much more in economics, particularly when working with large datasets

But this essay is by no means methodological, and presents several nice worked examples, showing how, for example, regression trees can outperform logistic regression in analyzing survivors of the sinking of the Titanic – the luxury ship, and how several of these methods lead to different imputations of significance to the race factor in the Boston Housing Study.

The essay also presents easy and good discussions of bootstrapping, bagging, boosting, and random forests, among the leading examples of “new” techniques – new to economists.

For the statistics wonks, geeks, and enthusiasts among readers, here is a YouTube presentation of the paper cited above with extra detail.

# Analytics 2013 Conference in Florida

Looking for case studies of data analytics or predictive analytics, or for Big Data applications?

You can hardly do better, on a first cut, than peruse the material now available from October’s Analytics 2013 Conference, held at the Hyatt Regency Hotel in Orlando, Florida.

Presented by SAS, dozens of presentations and posters from the Conference can be downloaded as zip files, unbundling as PDF files.

Download the conference presentations and poster presentations (.zip)

I also took an hour to look at the Keynote Presentation of Dr. Sven Crone of Lancaster University in the UK, now available on YouTube.

Crone, who also is affiliated with the Lancaster Centre for Forecasting, gave a Keynote which was, in places, fascinating, and technical and a little obscure elsewhere – worth watching if you time, or can run it in the background while you sort through your desk, for example.

A couple of slides caught my attention.

One segment gave concrete meaning to the explosion of data available to forecasters and analysts. For example, for electric power load forecasting, it used be the case that you had, perhaps, monthly total loads for the system or several of its parts, or perhaps daily system loads. Now, Crone notes the data to be modeled has increased by orders of magnitude, for example, with Smart Meters recording customer demand at fifteen minute intervals.

Another part of Crone’s talk which grabbed my attention was his discussion of forecasting techniques employed by 300 large manufacturing concerns, some apparently multinational in scale. The following graph – which is definitely obscure by virtue of its use of acronyms for types of forecasting systems, like SOP for Sales and Operation Planning – highlights that almost no company uses anything except the simplest methods for forecasting, relying largely on judgmental approaches. This aligns with a survey I once did which found almost no utilities used anything except the simplest per capita forecasting approaches. Perhaps things have changed now.

Crone suggests relying strictly on judgment becomes sort of silly in the face of the explosion of information now available to management.

Another theme Crone spins in an amusing, graphic way is that the workhorses of business forecasting, such as exponential smoothing, are really products from many decades ago. He uses funny pics of old business/office environments, asking whether this characterizes your business today.

The analytic meat of the presentation comes with exposition of bagging and boosting, as well as creative uses for k-means clustering in time series analysis.

At which point he descends into a technical wonderland of complexity.

Incidentally, Analytics 2014 is scheduled for Frankfurt, Germany June 4-5 this coming Spring.

Watch here for my follow-on post on boosting time series.

# The On-Coming Tsunami of Data Analytics

More than 25,000 visited businessforecastblog, March 2012-December 2013, some spending hours on the site. Interest ran nearly 200 visitors a day in December, before my ability to post was blocked by a software glitch, and we did this re-boot.

Now I have hundreds of posts offline, pertaining to several themes, discussed below. How to put this material back up – as reposts, re-organized posts, or as longer topic summaries?

There’s a silver lining. This forces me to think through forecasting, predictive and data analytics.

One thing this blog does is compile information on which forecasting and data analytics techniques work, and, to some extent, how they work, how key results are calculated. I’m big on computation and performance metrics, and I want to utilize the SkyDrive more extensively to provide full access to spreadsheets with worked examples.

Often my perspective is that of a “line worker” developing sales forecasts. But there is another important focus – business process improvement. The strength of a forecast is measured, ultimately, by its accuracy. Efforts to improve business processes, on the other hand, are clocked by whether improvement occurs – whether costs of reaching customers are lower, participation rates higher, customer retention better or in stabilization mode (lower churn), and whether the executive suite and managers gain understanding of who the customers are. And there is a third focus – that of the underlying economics, particularly the dynamics of the institutions involved, such as the US Federal Reserve.

Right off, however, let me say there is a direct solution to forecasting sales next quarter or in the coming budget cycle. This is automatic forecasting software, with Forecast Pro being one of the leading products. Here’s a YouTube video with the basics about that product.

You can download demo versions and participate in Webinars, and attend the periodic conferences organized by Business Forecast Systems showcasing user applications in a wide variety of companies.

So that’s a good solution for starters, and there are similar products, such as the SAS/ETS time series software, and Autobox.

So what more would you want?

Well, there’s need for background information, and there’s a lot of terminology. It’s useful to know about exponential smoothing and random walks, as well as autoregressive and moving averages.  Really, some reaches of this subject are arcane, but nothing is worse than a forecast setup which gains the confidence of stakeholders, and then falls flat on its face. So, yes, eventually, you need to know about “pathologies” of the classic linear regression (CLR) model – heteroscedasticity, autocorrelation, multicollinearity, and specification error!

And it’s good to gain this familiarity in small doses, in connection with real-world applications or even forecasting personalities or celebrities. After a college course or two, it’s easy to lose track of concepts. So you might look at this blog as a type of refresher sometimes.

Anticipating Turning Points in Time Series

But the real problem comes with anticipating turning points in business and economic time series. Except when modeling seasonal variation, exponential smoothing usually shoots over or under a turning point in any series it is modeling.

If this were easy to correct, macroeconomic forecasts would be much better. The following chart highlights the poor performance, however, of experts contributing to the quarterly Survey of Professional Forecasters, maintained by the Philadelphia Fed.

So, the red line is the SPF consensus forecast for GDP growth on a three quarter horizon, and the blue line is the forecast or nowcast for the current quarter (there is a delay in release of current numbers). Notice the huge dips in the current quarter estimate, associated with four recessions 1981, 1992, 2001-2, and 2008-9. A mere three months prior to these catastrophic drops in growth, leading forecasters at big banks, consulting companies, and universities totally missed the boat.

This is important in a practical sense, because recessions turn the world of many businesses upside down. All bets are off. The forecasting team is reassigned or let go as an economy measure, and so forth.

Some forward-looking information would help business intelligence focus on reallocating resources to sustain revenue as much as possible, using analytics to design cuts exerting the smallest impact on future ability to maintain and increase market share.

Hedgehogs and Foxes

Nate Silver has a great table in his best-selling The
Signal and the Noise
on the qualities and forecasting performance of hedgehogs and foxes. The idea comes from a Greek poet, “The fox knows many little things, but the hedgehog knows one big thing.”

Following Tetlock, Silver finds foxes are multidisplinary, adaptable, self-critical, cautious, and empirical, tolerant of complexity. By contrast, the Hedgehog is specialized, sticks to the same approaches, stubbornly adheres to his model in spite of counter-evidence, is order-seeking, confident, and ideological. The evidence suggests foxes generally outperform hedgehogs, just as ensemble methods typically outperform a single technique in forecasting.

Message – be a fox.

So maybe this can explain some of the breadth of this blog. If we have trouble predicting GDP growth, what about forecasts in other areas – such as weather, climate change, or that old chestnut, sun spots? And maybe it is useful to take a look at how to forecast all the inputs and associated series – such as exchange rates, growth by global region, the housing market, interest rates, as well as profits.

And while we are looking around, how about brain waves? Can brain waves be forecast? Oh yes, it turns out there is a fascinating and currently applied new approach called neuromarketing, which uses headbands and electrodes, and even MRI machines, to detect deep responses of consumers to new products and advertising.

New Methods

I know I have not touched on cluster analysis and classification, areas making big contributions to improvement of business process. But maybe if we consider the range of “new” techniques for predictive analytics, we can see time series forecasting and analysis of customer behavior coming under one roof.

There is, for example, this many predictor thread emerging in forecasting in the late 1990’s and especially in the last decade with factor models for macroeconomic forecasting. Reading this literature, I’ve become aware of methods for mapping N explanatory variables onto a target variable, when there are M<N observations. These are sometimes called methods of data shrinkage, and include principal components regression, ridge regression, and the lasso. There are several others, and a good reference is The Elements of Statistical Learning, Data Mining, Learning and Prediction, 2nd edition, by Trevor Hastie, Robert Tibshirani, and Jerome Friedman. This excellent text is downloadable, accessible via the Tools, Apps, Texts, Free Stuff menu option located just to the left of the search utility on the heading for this blog.

There also is bagging, which is the topic of the previous post, as well as boosting, and a range of decision tree and regression tree modeling tactics, including random forests.

I’m actively exploring a number of these approaches, ginning up little examples to see how they work and how the computation goes. So far, it’s impressive. This stuff can really improve over the old approaches, which someone pointed out, have been around since the 1950’s at least.

It’s here I think that we can sight the on-coming wave, just out there on the horizon – perhaps hundreds of feet high. It’s going to swamp the old approaches, changing market research forever and opening new vistas, I think, for forecasting, as traditionally understood.

I hope to be able to ride that wave, and now I put it that way, I get a sense of urgency in keeping practicing my web surfing.

Hope you come back and participate in the comments section, or email me at cvj@economicdataresources.com

# Forecasting in Data-limited Situations – A New Day

Over the Holidays – while frustrated in posting by a software glitch – I looked at the whole “shallow data issue” in light of  a new technique I’ve learned called bagging.

Bottom line, using spreadsheet simulations, I can show bagging radically reduces out-of-sample forecast error, in a situation typical for a lot business forecasting – where there are just a few workable observations, quite a few candidate drivers or explanatory variables, and a lot of noise in the data.

Here is a comparison of the performance of OLS regression and bagging with out-of-sample data generated with the same rules which create the “sample data” in the example spreadsheet shown below.

The contrast is truly stark. Although, as we will see, the ordinary least squares (OLS) regression has an R2 or “goodness of fit” of 0.99, it does not generalize well out-of-sample, producing the purple line in the graph with 12 additional cases or observations. Bagging the original sample 200 times and re-estimating OLS regression on the bagged samples, then averaging the regression constants and coefficients, produces a much tighter fit on these out-of-sample observations.

Example Spreadsheet

The spreadsheet below illustrates 12 “observations” on a  TARGET or dependent variable and nine (9) explanatory variables, x1 through x9.

The top row with numbers in red lists the “true” values of these explanatory variables or drivers, and the column of numbers in red on the far right are the error terms (which are generated by a normal distribution with zero mean and standard deviation of 50).

So if we multiply 3 times 0.22 and add -6 times -2.79 and so forth, adding 68.68 at the end, we get the first value of the TARGET variable 60.17.

While this example is purely artificial, an artifact, one can imagine that these numbers are first differences – that is the current value of a variable minus its preceding value. Thus, the TARGET variable might record first differences in sales of a product quarter by quarter. And we suppose forecasts for  x1 through x9 are available, although not shown above. In fact, they are generated in simulations with the same generating mechanisms utilized to create the sample.

Using the simplest multivariate approach, the ordinary least squares (OLS) regression, displayed in the Excel format, is –

There’s useful information in this display, often the basis of a sort of “talk-through” the regression result. Usually, the R2 is highlighted, and it is terrific here, “explaining” 99 percent of the variation in the data, in, that is, the 12 in-sample values for the TARGET variable. Furthermore, four explanatory variables have statistically significant coefficients, judged by their t-statistics – x2, x6, x7, and x9. These are highlighted in a kind of purple in the display.

Of course, the estimated values of x1 through x9 are, for the most part, numerically quite different than the true values of the constant term and coefficients {10, 3, -6, 0.5, 15, 1, -1, -5, 0.25, 1}. Nevertheless, because of the large variances or standard errors of the estimates, as noted above some estimated coefficients are within a 95 percent confidence interval of these true values. It’s just that the confidence intervals are very wide.

The in-sample predicted values are accurate, generally speaking. These loopy coefficient estimates essentially balance one another off in-sample.

But it’s not the in-sample performance we are interested in, but the out-of-sample performance. And we want to compare the out-of-sample performance of this OLS regression estimate with estimates of the coefficients and TARGET variable produced by ridge regression and bagging.

Bagging

Bagging [bootstrap aggregating] was introduced by Breiman in the 1990’s to reduce the variance of predictors. The idea is that you take N bootstrap samples of the original data, and with each of these samples, estimate your model, creating, in the end, an ensemble prediction.

Bootstrap sampling draws random samples with replacement from the original sample, creating other samples of the same size. With 12 cases or observations on the TARGET and explanatory variables there are a large number of possible random samples of these 12 cases drawn with replacement; in fact, given nine explanatory variables and the TARGET variable, there are 129  or somewhat more than 5 billion distinct samples, 12 of which, incidentally, are comprised of exactly the same case drawn repeatedly from the original sample.

A primary application of bagging has been in improving the performance of decision trees and systems of classification. Applications to regression analysis seem to be more or less an after-thought in the literature, and the technique does not seem to be in much use in applied business forecasting contexts.

Thus, in the spreadsheet above, random draws with replacement are taken of the twelve rows of the spreadsheet (TARGET and drivers) 200 times, creating 200 samples. An ordinary least squares regression is estimated over each regression, and the constant and parameter estimates are averaged at the end of the process.

Here is a comparison of the estimated coefficients from Bagging and OLS, compared with the true values.

There’s still variation of the parameter estimates from the true values with bagging, but the variance of the error process (50) is, by design, high. For example, most of the value of TARGET is from the error process, so this is noisy data.

Discussion

Some questions. For example – Are there specific features of the problem presented here which tip the results markedly in favor of bagging? What are the criteria for determining whether bagging will improve regression forecasts? Another question regards the ease or difficulty of bagging regressions in Excel.

The criterion for bagging to deliver dividends is basically parameter instability over the sample. Thus, in the problem here, deleting any observation from the 12 cases and re-estimating the regression results in big changes to estimated parameters. The basic reason is the error terms constitute by far the largest contribution to the value of TARGET for each case.

In practical forecasting, this criterion, which not very clearly defined, can be explored, and then comparisons with regard to actual outcomes can be studied. Thus, estimate the bagged regression forecast,  wait a period, and compare bagged and simple OLS forecasts. Substantial improvement in forecast accuracy, combined with parameter instability in the sample, would seem to be a smoking gun.

Apart from the large contribution of the errors or residuals to the values of TARGET, the other distinctive feature of the problem presented here is the large number of predictors in comparison with the number of cases or observations. This, in part, accounts for the high coefficient of determination or R2, and also suggests that the close in-sample fit and poor out-of-sample performance are probably related to “over-fitting.”