First Cut Modeling – All Possible Regressions

If you can, form the regression

Y = β0 + β1X1 + β2X2 + … + βNXN

where Y is the target variable and the N variables Xi are the predictors with the highest correlations with the target variable, based on some cutoff value of the correlation, say ±0.3.
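As a sketch of this screening step (hypothetical data, numpy in place of R; the 0.3 cutoff is the one mentioned above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 observations, 10 candidate predictors,
# only X0 and X3 actually drive the target
n, p = 200, 10
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=n)

# Keep predictors whose correlation with y exceeds the cutoff in absolute value
cutoff = 0.3
corrs = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
keep = np.where(np.abs(corrs) > cutoff)[0]

# OLS on the screened predictors (with an intercept column)
Xk = np.column_stack([np.ones(n), X[:, keep]])
betas, *_ = np.linalg.lstsq(Xk, y, rcond=None)
print(keep, betas)
```

With enough observations, the screened regression recovers coefficients near the true values; the point of the cutoff is only to thin the candidate list before estimation.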

Of course, if the number of observations you have in the data is less than N, you can’t estimate this OLS regression. Some “many predictors” shrinkage or dimension reduction technique is then necessary – and will be covered in subsequent posts.

So, for this discussion, assume you have enough data to estimate the above regression.

Chances are that the accompanying measures of significance of the coefficients βi – the t-statistics or standard errors – will indicate that only some of these betas are statistically significant.

And, if you poke around some, you probably will find that it is possible to add some of the predictors which showed low correlation with the target variable and have them be “statistically significant.”

So this is all very confusing. What to do?

Well, if the number of predictors is, say, on the order of 20, you can, with modern computing power, simply calculate all possible regressions with combinations of these 20 predictors. That turns out to be around 1 million regressions (2^20 – 1). And you can reduce this number by enforcing known constraints on the betas, e.g. increasing family income should be unambiguously related to the target variable, so if its sign in a regression is reversed, throw that regression out of consideration.
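With a handful of predictors, the brute-force enumeration is only a few lines. A hedged sketch (hypothetical data; adjusted R² as the score, though the AIC or BIC discussed below would work the same way):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 6
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] + 1.0 * X[:, 2] + rng.normal(size=n)

def adj_r2(cols):
    # Fit OLS on the given predictor subset and return adjusted R-squared
    Z = np.column_stack([np.ones(n), X[:, list(cols)]])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    k = len(cols)
    return 1 - (ss_res / (n - k - 1)) / (ss_tot / (n - 1))

# Score all 2^p - 1 non-empty subsets of predictors and keep the best
best = max(
    (combo for r in range(1, p + 1) for combo in itertools.combinations(range(p), r)),
    key=adj_r2,
)
print(best)
```

The same loop scales to 20 predictors (about a million subsets), which is still quite feasible on a modern machine, though packages like leaps use branch-and-bound shortcuts rather than literal enumeration.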

The statistical programming language R has packages set up to run all possible regressions. See, for example, Quick-R, which usefully suggests the leaps package.

But what other metrics, besides R², should be used to evaluate the possible regressions?

In-Sample Regression Metrics

I am not an authority on the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC), which, in addition to good old R², are the leading in-sample metrics for regression adequacy.

With this disclaimer, here are a few points about the AIC and BIC.
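For a regression with k predictors, sample size n, and mean square error MSE, the usual forms (under Gaussian errors, up to an additive constant) are AIC = n·ln(MSE) + 2k and BIC = n·ln(MSE) + k·ln(n). A quick sketch:

```python
import math

def aic(mse, n, k):
    # Gaussian-likelihood form, up to an additive constant
    return n * math.log(mse) + 2 * k

def bic(mse, n, k):
    # BIC charges each extra predictor ln(n) instead of 2
    return n * math.log(mse) + k * math.log(n)

# BIC's penalty exceeds AIC's once ln(n) > 2, i.e. n > e^2 (about 7.4)
print(aic(1.5, 100, 3), bic(1.5, 100, 3))
```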


So, as you can see, both the AIC and BIC are functions of the mean square error (MSE), as well as the number of predictors in the equation and the sample size. Both metrics essentially penalize models with a lot of explanatory variables, compared with other models that might perform similarly with fewer predictors.

  • There is something called the AIC-BIC dilemma. In a valuable reference on variable selection, Serena Ng writes that the AIC is understood to fall short when it comes to consistent model selection. Hyndman, in another must-read on this topic, writes that because of the heavier penalty, the model chosen by BIC is either the same as that chosen by AIC, or one with fewer terms.

Consistency in discussions of regression methods relates to the large-sample properties of the metric or procedure in question. Basically, as the sample size n becomes indefinitely large (goes to infinity), consistent estimates or metrics converge in probability to the true values. So the AIC is not in every case consistent, although I’ve read research which suggests that the problem only arises in very unusual setups.

  • In many applications, the AIC and BIC can both be minimum for a particular model, suggesting that this model should be given serious consideration.

Out-of-Sample Regression Metrics

I’m all about out-of-sample (OOS) metrics of adequacy of forecasting models.

It’s too easy to over-parameterize models and come up with good testing on in-sample data.

So I have been impressed with endorsements of cross-validation such as Hal Varian’s.

So, ideally, you partition the sample data into training and test samples. You estimate the predictive model on the training sample, and then calculate various metrics of adequacy on the test sample.

The problem is that often you can’t really afford to give up that much data to the test sample.

So cross-validation is one solution.

In k-fold cross-validation, you partition the sample into k parts, estimate the designated regression on data from k-1 of those segments, and use the remaining kth segment to test the model. Do this k times and then average or somehow collate the various error metrics. That’s the drill.

Again, Quick-R suggests useful R code.

Hyndman also highlights a handy matrix formula to quickly compute the Leave Out One Cross Validation (LOOCV) metric.
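The formula in question computes LOOCV from a single fit: CV = (1/n) Σ [e_i / (1 − h_i)]², where the e_i are the ordinary OLS residuals and the h_i are the diagonal entries (leverages) of the hat matrix H = X(XᵀX)⁻¹Xᵀ. A sketch, checked against the brute-force leave-one-out loop:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.5, 1.0, -1.0]) + rng.normal(size=n)

# One OLS fit on the full sample
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)  # leverage values

# The shortcut: no refitting required
loocv = np.mean((resid / (1 - h)) ** 2)

# Brute force: refit n times, each time leaving one observation out
brute = []
for i in range(n):
    mask = np.arange(n) != i
    b, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    brute.append((y[i] - X[i] @ b) ** 2)
print(loocv, np.mean(brute))  # the two agree exactly for OLS
```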


LOOCV is not guaranteed to find the true model as the sample size increases, i.e. it is not consistent.

However, k-fold cross-validation can be consistent, if k increases with sample size.

Researchers recently have shown, however, that LOOCV can be consistent for the LASSO.

Selecting regression variables is, indeed, a big topic.

Coming posts will focus on the problem of “many predictors” when the set of predictors is greater in number than the set of observations on the relevant variables.

Top image from Washington Post

Selecting Predictors – the Specification Problem

I find toy examples helpful in exploratory work.

So here is a toy example showing the pitfalls of forward selection of regression variables, in the presence of correlation between predictors. In other words, this is an example of the specification problem.

Suppose the true specification or regression is –

y = 20x1 - 11x2 + 10x3

and the observations on x2 and x3 in the available data are correlated.

To produce examples of this system, I create columns of random numbers in which the second and third columns are correlated with a correlation coefficient of around 0.6. I also add a random error term with zero mean and constant variance of 10. Then, after generating the data and the error terms, I apply the coefficients indicated above and estimate values for the dependent variable y.
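A Python rendering of this toy setup (the post used Excel; the 0.6 correlation, variance-10 error, and 40 observations are as described above):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 40  # the post's 40 "observations"

x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# Build x3 correlated with x2, correlation about 0.6
rho = 0.6
x3 = rho * x2 + np.sqrt(1 - rho ** 2) * rng.normal(size=n)

e = rng.normal(scale=np.sqrt(10), size=n)  # zero mean, variance 10
y = 20 * x1 - 11 * x2 + 10 * x3 + e

# The full (true) specification recovers coefficients near (20, -11, 10)
Z = np.column_stack([np.ones(n), x1, x2, x3])
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
print(beta[1:])

# Dropping x3 biases the x2 coefficient: roughly -11 + 0.6*10 = -5
Z2 = np.column_stack([np.ones(n), x1, x2])
beta2, *_ = np.linalg.lstsq(Z2, y, rcond=None)
print(beta2[2])
```

The second regression illustrates the omitted-variable bias discussed below: the x2 coefficient absorbs part of the effect of the correlated, excluded x3.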

Then, specifying all three variables, x1, x2, and x3, I estimate regressions which characteristically have coefficient values not far from (20, -11, 10), such as,

This, of course, is regression output from Microsoft Excel, where I developed this simple Monte Carlo simulation with 40 “observations.”

If you were lucky enough to estimate this regression initially, you well might stop and not bother about dropping variables to estimate other potentially competing models.

However, if you start with fewer variables, you encounter a significant difficulty.

Here is the distribution of x2 in repeated estimates of a regression with explanatory variables x1 and x2 –


As you can see, the various estimates of the value of this coefficient, whose actual or true value is -11, are wide of the mark. In fact, none of the 1000 estimates in this simulation proved to be statistically significant at standard levels.

Using some flavors of forward regression, therefore, you well might decide to drop x2 in the specification and try including x3.

But you would have the same type of problem in that case, too, since x2 and x3 are correlated.

I sometimes hear people appealing to stability arguments in the face of the specification problem. In other words, they strive to find a stable set of core predictors, believing that if they can do this, they will have controlled as effectively as they can for this problem of omitted variables which are correlated with other variables that are included in the specification.

Selecting Predictors

In a recent post on logistic regression, I mentioned research which developed diagnostic tools for breast cancer based on true Big Data parameters – notably 62,219 consecutive mammography records from 48,744 studies in 18,270 patients reported using the Breast Imaging Reporting and Data System (BI-RADS) lexicon and the National Mammography Database format between April 5, 1999 and February 9, 2004.

This research built a logistic regression model with 36 predictors, selected from the following information residing in the National Mammography Database (click to enlarge).


The question arises – are all these 36 predictors significant? Or what is the optimal model? How does one select the subset of the available predictor variables which really count?

This is the problem of selecting predictors in multivariate analysis – my focus for several posts coming up.

So we have a target variable y and a set of potential predictors x = {x1, x2, …, xn}. We are interested in discovering a predictive relationship, y = F(x*), where x* is some possibly proper subset of x. Furthermore, we have data comprising m observations on y and x, which in due time we will label with subscripts.

There are a range of solutions to this very real, very practical modeling problem.

Here is my short list.

  1. Forward Selection. Begin with no candidate variables in the model. Select the variable that boosts some goodness-of-fit or predictive metric the most. Traditionally, this has been R-Squared for an in-sample fit. At each step, select the candidate variable that increases the metric the most. Stop adding variables when none of the remaining variables are significant. Note that once a variable enters the model, it cannot be deleted.
  2. Backward Selection. This starts with the superset of potential predictors and eliminates variables which have the lowest score by some metric – traditionally, the t-statistic.
  3. Stepwise regression. This combines backward and forward selection of regressors.
  4. Regularization and Selection by means of the LASSO. Here is the classic article and here is a post, and here is a post in this blog on the LASSO.
  5. Information criteria applied to all possible regressions – pick the best specification by applying the Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC) to all possible combinations of regressors. Clearly, this is only possible with a limited number of potential predictors.
  6. Cross-validation or other out-of-sample criteria applied to all possible regressions – Typically, the error metrics on the out-of-sample data cuts are averaged, and the lowest average error model is selected out of all possible combinations of predictors.
  7. Dimension reduction or data shrinkage with principal components. This is a many predictors formulation, whereby it is possible to reduce a large number of predictors to a few principal components which explain most of the variation in the data matrix.
  8. Dimension reduction or data shrinkage with partial least squares. This is similar to the PC approach, but employs a reduction to information from both the set of potential predictors and the dependent or target variable.
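To give a flavor of item 1, here is a bare-bones forward selection loop (hypothetical data; in-sample R² as the traditional metric, with a hypothetical minimum-improvement stopping rule standing in for a significance test):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 150, 8
X = rng.normal(size=(n, p))
y = 2.5 * X[:, 1] + 1.5 * X[:, 4] + rng.normal(size=n)

def r2(cols):
    # In-sample R-squared for the given predictor subset
    Z = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return 1 - resid @ resid / ((y - y.mean()) ** 2).sum()

selected, remaining = [], list(range(p))
current = 0.0
min_gain = 0.01  # stop when no candidate improves R-squared by at least this much
while remaining:
    gains = {c: r2(selected + [c]) for c in remaining}
    best = max(gains, key=gains.get)
    if gains[best] - current < min_gain:
        break  # no remaining variable earns its place; stop adding
    selected.append(best)
    remaining.remove(best)
    current = gains[best]

print(selected)
```

Note the greedy character: once a variable enters, it never leaves, which is exactly the weakness that backward selection and stepwise regression try to patch.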

There certainly are other candidate techniques, but this is a good list to start with.

Wonderful topic, incidentally. Dives right into the inner sanctum of the mysteries of statistical science as practiced in the real world.

Let me give you the flavor of how hard it is to satisfy the classical criterion for variable selection, arriving at unbiased or consistent estimates of effects of a set of predictors.

And, really, the paradigmatic model is ordinary least squares (OLS) regression in which the predictive function F(.) is linear.

The Specification Problem

The problem few analysts understand is called specification error.

So assume that there is a true model – some linear expression in variables multiplied by their coefficients, possibly with a constant term added.

Then, we have some data to estimate this model.

Now the specification problem is that when predictors are not orthogonal, i.e. when they are correlated, leaving out a variable from the “true” specification imparts a bias to the estimates of coefficients of variables included in the regression.

This complicates sequential methods of selecting predictors for the regression.

So in any case I will have comments forthcoming on methods of selecting predictors.

Predictive Models in Medicine and Health – Forecasting Epidemics

I’m interested in everything under the sun relating to forecasting – including sunspots (another future post). But the focus on medicine and health is special for me, since my closest companion, until her untimely death a few years ago, was a physician. So I pay particular attention to details on forecasting in medicine and health, with my conversations from the past somewhat in mind.

There is a major area which needs attention for any kind of completion of a first pass on this subject – forecasting epidemics.

Several major diseases ebb and flow according to a pattern many describe as an epidemic or outbreak – influenza being the most familiar to people in North America.

I’ve already posted on the controversy over Google flu trends, which still seems to be underperforming, judging from the 2013-2014 flu season numbers.

However, combining Google flu trends with other forecasting models, and, possibly, additional data, is reported to produce improved forecasts. In other words, there is information there.

In tropical areas, malaria and dengue fever, both carried by mosquitos, have seasonal patterns and time profiles that health authorities need to anticipate to stock supplies to keep fatalities lower and take other preparatory steps.

Early Warning Systems

The following slide from A Prototype Malaria Forecasting System illustrates the promise of early warning systems, keying off of weather and climatic predictions.


There is a marked seasonal pattern, in other words, to malaria outbreaks, and this pattern is linked with developments in weather.

Researchers from the Howard Hughes Medical Institute, for example, recently demonstrated that temperatures in a large area of the tropical South Atlantic are directly correlated with the size of malaria outbreaks in India each year – lower sea surface temperatures led to changes in how the atmosphere over the ocean behaved and, over time, led to increased rainfall in India.

Another mosquito-borne disease claiming many thousands of lives each year is dengue fever.

And there is interesting, sophisticated research detailing the development of an early warning system for climate-sensitive disease risk from dengue epidemics in Brazil.

The following exhibits show the strong seasonality of dengue outbreaks, and a revealing mapping application, showing geographic location of high incidence areas.


This research used out-of-sample data to test the performance of the forecasting model.

The model was compared to a simple conceptual model of current practice, based on dengue cases three months previously. It was found that the developed model, including climate, past dengue risk, and observed and unobserved confounding factors, enhanced dengue predictions compared to the model based on past dengue risk alone.


The latest global threat, of course, is MERS – or Middle East Respiratory Syndrome, which is a coronavirus. Its transmission from source areas in Saudi Arabia is pointedly suggested by the following graphic.


The World Health Organization is, as yet, refusing to declare MERS a global health emergency. Instead, spokesmen for the organization say,

…that much of the recent surge in cases was from large outbreaks of MERS in hospitals in Saudi Arabia, where some emergency rooms are crowded and infection control and prevention are “sub-optimal.” The WHO group called for all hospitals to immediately strengthen infection prevention and control measures. Basic steps, such as washing hands and proper use of gloves and masks, would have an immediate impact on reducing the number of cases…

Millions of people, of course, will travel to Saudi Arabia for Ramadan in July and the hajj in October. Thirty percent of the cases so far diagnosed have resulted in fatalities.

Trend Following in the Stock Market

Noah Smith highlights some amazing research on investor attitudes and behavior in Does trend-chasing explain financial markets?

He cites 2012 research by Greenwood and Schleifer where these researchers consider correlations between investor expectations, as measured by actual investor surveys, and subsequent investor behavior.

A key graphic is the following:


This graph shows, rather amazingly, as Smith points out, that when people say they expect stocks to do well, they actually put money into stocks. How do you find out what investor expectations are? You ask them. And then, interestingly, it is possible to show that for the most part they follow up attitudes with action.

This discussion caught my eye since Sornette and others attribute the emergence of bubbles to momentum investing or trend-following behavior. Sometimes Sornette reduces this to “herding” or mimicry. I think there are simulation models, combining trend investors with others following a market strategy based on “fundamentals”, which exhibit cumulating and collapsing bubbles.

More on that later, when I track all that down.

For the moment, some research put out by AQR Capital Management in Greenwich CT makes big claims for an investment strategy based on trend following –

The most basic trend-following strategy is time series momentum – going long markets with recent positive returns and shorting those with recent negative returns. Time series momentum has been profitable on average since 1985 for nearly all equity index futures, fixed income futures, commodity futures, and currency forwards. The strategy explains the strong performance of Managed Futures funds from the late 1980s, when fund returns and index data first becomes available.

This paragraph references research by Moskowitz and Pedersen published in the Journal of Financial Economics – an article called Time Series Momentum.

But more spectacularly, this AQR white paper presents this table of results for a trend-following investment strategy decade-by-decade.


There are caveats to this rather earth-shaking finding, but what it really amounts to for many investors is a recommendation to look into managed futures.

Along those lines there is this video interview, conducted in 2013, with Brian Hurst, one of the authors of the AQR white paper. He reports that trend-following investing has recently run up against “choppy” markets, but holds out hope for the longer term –

At the same time, caveat emptor. Bloomberg reported late last year that a lot of investors plunging into managed futures after the Great Recession of 2008-2009 have been disappointed, in many cases, because of the high, unregulated fees and commissions involved in this type of alternative investment.

Medical/Health Predictive Analytics – Logistic Regression

The case for assessing health risk with logistic regression is made by authors of a 2009 study, which is also a sort of model example for Big Data in diagnostic medicine.

As the variables that help predict breast cancer increase in number, physicians must rely on subjective impressions based on their experience to make decisions. Using a quantitative modeling technique such as logistic regression to predict the risk of breast cancer may help radiologists manage the large amount of information available, make better decisions, detect more cancers at early stages, and reduce unnecessary biopsies

This study – A Logistic Regression Model Based on the National Mammography Database Format to Aid Breast Cancer Diagnosis  – pulled together 62,219 consecutive mammography records from 48,744 studies in 18,270 patients reported using the Breast Imaging Reporting and Data System (BI-RADS) lexicon and the National Mammography Database format between April 5, 1999 and February 9, 2004.

The combination of medical judgment and an algorithmic diagnostic tool based on extensive medical records is, in the best sense, the future of medical diagnosis and treatment.

And logistic regression has one big thing going for it – a lot of logistic regressions have been performed to identify risk factors for various diseases or for mortality from a particular ailment.

A logistic regression, of course, maps a zero/one or categorical variable onto a set of explanatory variables.

This is not to say that there are not going to be speedbumps along the way. Interestingly, these are data science speedbumps, what some would call statistical modeling issues.

Picking the Right Variables, Validating the Logistic Regression

The problems of picking the correct explanatory variables for a logistic regression and model validation are linked.

The problem of picking the right predictors for a logistic regression is parallel to the problem of picking regressors in, say, an ordinary least squares (OLS) regression with one or two complications. You need to try various specifications (sets of explanatory variables) and utilize a raft of diagnostics to evaluate the different models. Cross-validation, utilized in the breast cancer research mentioned above, is probably better than in-sample tests. And, in addition, you need to be wary of some of the weird features of logistic regression.

A survey of medical research from a few years back highlights the fact that a lot of studies shortcut some of the essential steps in validation.

A Short Primer on Logistic Regression

I want to say a few words about how the odds-ratio is the key to what logistic regression is all about.

Logistic regression, for example, does not “map” a predictive relationship onto a discrete, categorical index, typically a binary, zero/one variable, in the same way ordinary least squares (OLS) regression maps a predictive relationship onto dependent variables. In fact, one of the first things one tends to read, when you broach the subject of logistic regression, is that, if you try to “map” a binary, 0/1 variable onto a linear relationship β0 + β1x1 + β2x2 with OLS regression, you are going to come up against the problem that the predictive relationship will almost always “predict” outside the [0,1] interval.

Instead, in logistic regression we have a kind of background relationship which relates an odds-ratio to a linear predictive relationship, as in,

ln(p/(1-p)) = β0 + β1x1 + β2x2

Here p is a probability or proportion and the xi are explanatory variables. The function ln() is the natural logarithm to the base e (a transcendental number), rather than the logarithm to the base 10.

The parameters of this logistic model are β0, β1, and β2.

This odds ratio is really primary and from the logarithm of the odds ratio we can derive the underlying probability p. This probability p, in turn, governs the mix of values of an indicator variable Z which can be either zero or 1, in the standard case (there being a generalization to multiple discrete categories, too).
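Inverting the log-odds to recover p is one line of algebra; a sketch with hypothetical parameter values:

```python
import math

def prob_from_logit(b0, b1, b2, x1, x2):
    # Invert ln(p/(1-p)) = b0 + b1*x1 + b2*x2 to get the probability p
    z = b0 + b1 * x1 + b2 * x2
    return 1 / (1 + math.exp(-z))

# Hypothetical parameters: when the linear predictor z is 0,
# the odds are even and p = 0.5
print(prob_from_logit(-1.0, 0.5, 0.5, 1, 1))  # z = 0, so p = 0.5
```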

Thus, the index variable Z can encapsulate discrete conditions such as hospital admissions, having a heart attack, or dying – generally, occurrences and non-occurrences of something.


It’s exactly analogous to flipping coins, say, 100 times. There is a probability of getting a heads on a flip, usually 0.50. The distribution of the number of heads in 100 flips is binomial, where the probability of getting, say, 60 heads and 40 tails is the combination of 100 things taken 60 at a time, multiplied into (0.5)^60 * (0.5)^40. The combination of 100 things taken 60 at a time equals 100!/(60!40!), where the exclamation mark indicates “factorial.”

Similarly, the probability of getting 60 occurrences of the index Z=1 in a sample of 100 observations is (p)^60 * (1-p)^40, multiplied by 100!/(60!40!).
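This arithmetic checks out directly in Python, where math.comb gives the “combination of 100 things taken 60 at a time”:

```python
import math

def binom_pmf(n, k, p):
    # P(k successes in n trials) = C(n, k) * p^k * (1-p)^(n-k)
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

print(binom_pmf(100, 60, 0.5))  # about 0.0108 for a fair coin
print(binom_pmf(100, 60, 0.6))  # the same count is far more likely if p = 0.6
```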

The parameters βi in a logistic regression are estimated by means of maximum likelihood (ML).  Among other things, this can mean the optimal estimates of the beta parameters – the parameter values which maximize the likelihood function – must be estimated by numerical analysis, there being no closed form solutions for the optimal values of β0, β1, and β2.

In addition, interpretation of the results is intricate, there being no real consensus on the best metrics to test or validate models.

SAS and SPSS, as well as software packages with smaller market shares of the predictive analytics space, offer algorithms whereby you can plug in data and pull out parameter estimates, along with suggested metrics for statistical significance and goodness of fit.

There also are logistic regression packages in R.

But you can do a logistic regression, if the data are not extensive, with an Excel spreadsheet.

This can be instructive, since, if you set it up from the standpoint of the odds-ratio, you can see that only certain data configurations are suitable. These configurations – I refer to the values which the explanatory variables xi can take, as well as the associated values of the βi – must be capable of being generated by the underlying probability model. Some data configurations are virtually impossible, while others are inconsistent.

This is a point I find lacking in discussions about logistic regression, which tend to note simply that sometimes the maximum likelihood techniques do not converge, but explode to infinity, etc.

Here is a spreadsheet example, where the predicting equation has three parameters and I determine the underlying predictor equation to be,


and we have the data-


Notice the explanatory variables x1 and x2 also are categorical, or at least, discrete, and I have organized the data into bins, based on the possible combinations of the values of the explanatory variables – where the number of cases in each of these combinations or populations is given to equal 10 cases. A similar setup can be created if the explanatory variables are continuous, by partitioning their ranges and sorting out the combination of ranges in however many explanatory variables there are, associating the sum of occurrences associated with these combinations. The purpose of looking at the data this way, of course, is to make sense of an odds-ratio.

The predictor equation above in the odds ratio can be manipulated into a form which explicitly indicates the probability of occurrence of something or of Z=1. Thus,

p = e^(β0 + β1x1 + β2x2) / (1 + e^(β0 + β1x1 + β2x2))

where this transformation takes advantage of the principle that e^(ln y) = y.

So with this equation for p, I can calculate the probabilities associated with each of the combinations in the data rows of the spreadsheet. Then, given the probability of that configuration, I calculate the expected value of Z=1 by the formula 10p. Thus, the mean of a binomial variable with probability p is np, where n is the number of trials. This sequence is illustrated below (click to enlarge).


Picking the “success rates” for each of the combinations to equal the expected value of the occurrences, given 10 “trials,” produces a highly consistent set of data.

Along these lines, the most valuable source I have discovered for ML with logistic regression is a paper by Scott Czepiel – Maximum Likelihood Estimation of Logistic Regression Models: Theory and Implementation.

I can readily implement Czepiel’s log likelihood function in his Equation (9) with an Excel spreadsheet and Solver.
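For readers without Excel handy, the same exercise can be sketched in Python. This is the standard logistic log likelihood (not a transcription of Czepiel's Equation (9)), maximized here by plain gradient ascent as a stand-in for Solver, on simulated data with hypothetical parameter values:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
true_beta = np.array([-0.5, 1.0, -2.0])  # hypothetical parameters
p = 1 / (1 + np.exp(-X @ true_beta))
z = (rng.uniform(size=n) < p).astype(float)

def log_likelihood(beta):
    # Standard logistic log likelihood: sum of z*eta - ln(1 + e^eta)
    eta = X @ beta
    return np.sum(z * eta - np.log1p(np.exp(eta)))

# Maximize by gradient ascent; Solver does the equivalent job in Excel
beta = np.zeros(3)
for _ in range(5000):
    grad = X.T @ (z - 1 / (1 + np.exp(-X @ beta)))
    beta += 0.001 * grad

print(beta)  # should land near the true (-0.5, 1.0, -2.0)
```

There is no closed-form solution, as noted above, which is why some numerical search of this kind is always lurking behind logistic regression output.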

It’s also possible to see what can go wrong with this setup.

For example, the variance of a binomial process with probability p and n trials is np(1-p), so the standard deviation is the square root of that. If we then simulate the possible “occurrences” for each of the nine combinations, some will be closer to the estimate of np used in the above spreadsheet, others will be more distant. Performing such simulations, however, highlights that some numbers of occurrences for some combinations will simply never happen, or are well-nigh impossible, based on the laws of chance.

Of course, this depends on the values of the parameters selected, too – but it’s easy to see that, whatever values selected for the parameters, some low probability combinations will be highly unlikely to produce a high number for successes. This results in a nonconvergent ML process, so some parameters simply may not be able to be estimated.

This means basically that logistic regression is less flexible in some sense than OLS regression, where it is almost always possible to find values for the parameters which map onto the dependent variable.

What This Means

Logistic regression, thus, is not the exact analogue of OLS regression, but has nuances of its own. This has not prohibited its wide application in medical risk assessment (and I am looking for a survey article which really shows the extent of its application across different medical fields).

There also are more and more reports of the successful integration of medical diagnostic systems, based in some way on logistic regression analysis, in informing medical practices.

But the march of data science is relentless. Just when doctors got a handle on logistic regression, we have a raft of new techniques, such as random forests and splines.

Header image courtesy of: National Kidney & Urologic Diseases Information Clearinghouse (NKUDIC)

Forecasts in the Medical and Health Care Fields

I’m focusing on forecasting issues in the medical field and health care for the next few posts.

One major issue is the cost of health care in the United States and future health care spending. Just when many commentators came to believe the growth in health care expenditures was settling down to a more moderate growth path, spending exploded in late 2013 and in the first quarter of 2014, growing at a year-over-year rate of 7 percent (or higher, depending on how you cut the numbers). Indeed, preliminary estimates of first quarter GDP growth would have been negative – indicating the start of a possible new recession – were it not for the surge in healthcare spending.

Annualizing March 2014 numbers, US health care spending is now on track to hit a total of $3.07 trillion.

Here are estimates of month-by-month spending from the Altarum Institute.


The Altarum Institute blends data from several sources to generate this data, and also compiles information showing how medical spending has risen in reference to nominal and potential GDP.


Payments from Medicare and Medicaid have been accelerating, as the following chart from the comprehensive Centers for Disease Control (CDC) report shows.


Projections of Health Care Spending

One of the primary forecasts in this field is the Centers for Medicare & Medicaid Services’ (CMS) National Health Expenditures (NHE) projections.

The latest CMS projections have health spending projected to grow at an average rate of 5.8 percent from 2012-2022, a percentage point faster than expected growth in nominal GDP.

The Affordable Care Act is one of the major reasons why health care spending is surging, as millions who were previously not covered by health insurance join insurance exchanges.

The effects of the ACA, as well as continued aging of the US population and entry of new and expensive medical technologies, are anticipated to boost health care spending to 19-20 percent of GDP by 2021.


The late Robert Fogel put together a projection for the National Bureau of Economic Research (NBER) which suggested the ratio of health care spending to GDP would rise to 29 percent by 2040.

The US Health Care System Is More Expensive Than Others

I get the feeling that the econometric and forecasting models for these extrapolations – as well as the strategically important forecasts for future Medicare and Medicaid costs – are sort of gnarly, compared to the bright shiny things which could be developed with the latest predictive analytics and Big Data methods.

Nevertheless, it is interesting that an accuracy analysis of the CMS 11-year projections shows them to be relatively good, at least one to three years out from current estimates. That was, of course, over a period with slowing growth.

But before considering any forecasting model in detail, I think it is advisable to note how anomalous the US health care system is in reference to other (highly developed) countries.

The OECD, for example, develops interesting comparisons of medical spending in the US and other developed and emerging economies.


The OECD data also supports a breakout of costs per capita, as follows.


So the basic reason why the US health care system is so expensive is that, for example, administrative costs per capita are more than double those in other developed countries. Practitioners also are paid almost double per capita what they receive in these other countries, countries with highly regarded healthcare systems. And so forth and so on.

The Bottom Line

Health care costs in the US typically grow faster than GDP, and are expected to accelerate for the rest of this decade. The ratio of health care costs to US GDP is rising, and longer range forecasts suggest almost a third of all productive activity by mid-century will be in health care and the medical field.

This suggests either a radically different type of society – a care-giving culture, if you will – or that something is going to break or make a major transition between now and then.

A Medical Forecasting Controversy – Increased Deaths from Opting-out From Expanding Medicaid Coverage

Menzie Chinn at Econbrowser recently posted – Estimated Elevated Mortality Rates Associated with Medicaid Opt-Outs. This features projections from a study which suggests an additional 7,000-17,000 persons will die annually if 25 states opt out of the Medicaid expansion associated with the Affordable Care Act (ACA). Thus, the Econbrowser chart with these extrapolations suggests that within only a few years the additional deaths in these 25 states would exceed casualties in the Vietnam War (58,220).
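The arithmetic behind that comparison is just accumulation of a constant annual toll, which is worth making explicit, since the constancy assumption is exactly what the extrapolation's critics question. A hedged sketch:

```python
# Back-of-the-envelope version of the Econbrowser extrapolation: a constant
# annual excess-death rate, accumulated year by year. The 7,000-17,000
# per-year range is from the cited study; holding it constant over time is
# the assumption built into the straight-line chart.

VIETNAM_WAR_DEATHS = 58_220

def years_to_exceed(annual_deaths, threshold=VIETNAM_WAR_DEATHS):
    """Years of a constant annual toll needed to pass the threshold."""
    years, total = 0, 0
    while total <= threshold:
        total += annual_deaths
        years += 1
    return years

print(years_to_exceed(17_000))  # upper-bound estimate: 4 years
print(years_to_exceed(7_000))   # lower-bound estimate: 9 years
```

So even at the lower bound, the cumulative figure passes the Vietnam War total in under a decade, which is what makes the chart so striking.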

The controversy ran hot in the Comments.

Apart from the smoke and mirrors, though, I wanted to look into the underlying estimates to see whether they support such a clear connection between policy choices and human mortality.

I think what I found is that the sources behind the estimates do, in fact, support the idea that expanding Medicaid can lower mortality and, additionally, generally improve the health status of participating populations.

But at what cost? The commenters mostly left that issue alone, preferring to rant about the pernicious effects of regulation and implying that more Medicaid would exert negative or no effects on mortality.

As an aside, the accursed “death panels” even came up, with a zinger by one commentator –

Ah yes, the old death panel canard. No doubt those death panels will be staffed by Nigerian born radical gay married Marxist Muslim atheists with fake birth certificates. Did I miss any of the idiotic tropes we hear on Fox News? Oh wait, I forgot…those death panels will meet in Benghazi. And after the death panels it will be on to fight the war against Christmas.

The Evidence

Econbrowser cites Opting Out Of Medicaid Expansion: The Health And Financial Impacts as the source of the impact numbers for 25 states opting out of expanded Medicaid.

This Health Affairs blog post draws on three statistical studies –

The Oregon Experiment — Effects of Medicaid on Clinical Outcomes

Mortality and Access to Care among Adults after State Medicaid Expansions

Health Insurance and Mortality in US Adults

I list these with the most recent first. Two of them appear in the New England Journal of Medicine, a publication with a reputation for high standards. The third and historically oldest article appears in the American Journal of Public Health.

The Oregon Experiment is exquisite statistical research with a randomized sample and control group, but does not directly estimate mortality. Rather, it highlights the reductions in a variety of health problems from a limited expansion of Medicaid coverage for low-income adults through a lottery drawing in 2008.

Data collection included –

..detailed questionnaires on health care, health status, and insurance coverage; an inventory of medications; and performance of anthropometric and blood-pressure measurements. Dried blood spots were also obtained.

If you are considering doing a similar study, I recommend the Appendix to this research for methodological ideas. Regression, both OLS and logistic, was a major tool to compare the experimental and control groups.

The data look very clean to me. Consider, for example, these comparisons between the experimental and control groups.


Here are the basic results.


The bottom line is that the Oregon study found –

..that insurance led to increased access to and utilization of health care, substantial improvements in mental health, and reductions in financial strain, but we did not observe reductions in measured blood-pressure, cholesterol, or glycated hemoglobin levels.

The second study, published in 2012, considered mortality impacts of expanding Medicaid in Arizona, Maine, and New York. New Hampshire, Pennsylvania, Nevada, and New Mexico were used as controls, in a study that encompassed five years before and after expansion of the Medicaid programs.

Here are the basic results of this research.


As another useful Appendix documents, the mortality estimates of this study are based on a regression analysis incorporating county-by-county data from the study states.
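The before/after, expansion-versus-control design lends itself to a difference-in-differences regression. The sketch below uses synthetic county-level data and an assumed treatment effect purely for illustration; the actual study's regressions are richer, with demographic covariates and county weights.

```python
import numpy as np

# Minimal difference-in-differences sketch of the study design:
# county-level mortality regressed on expansion-state and post-period
# indicators plus their interaction. All data and coefficients here are
# synthetic assumptions, not the study's estimates.

rng = np.random.default_rng(0)
n = 400
expansion = rng.integers(0, 2, n)   # 1 = county in an expansion state
post = rng.integers(0, 2, n)        # 1 = observation after expansion
true_effect = -20.0                 # assumed drop in deaths per 100,000
mortality = (320 + 15 * expansion - 5 * post
             + true_effect * expansion * post + rng.normal(0, 10, n))

# OLS: mortality ~ const + expansion + post + expansion:post
X = np.column_stack([np.ones(n), expansion, post, expansion * post])
beta, *_ = np.linalg.lstsq(X, mortality, rcond=None)
print(f"Estimated DiD effect: {beta[3]:.1f} deaths per 100,000")
```

The interaction coefficient `beta[3]` is the difference-in-differences estimate, and with this synthetic setup it recovers something close to the assumed -20.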

Key facts associated with these tables appear in the source links. You would also do well to click on the tables to enlarge them for reading.

The third study, by authors associated with the Harvard Medical School, had the following Abstract

Objectives. A 1993 study found a 25% higher risk of death among uninsured compared with privately insured adults. We analyzed the relationship between uninsurance and death with more recent data.

Methods. We conducted a survival analysis with data from the Third National Health and Nutrition Examination Survey. We analyzed participants aged 17 to 64 years to determine whether uninsurance at the time of interview predicted death.

Results. Among all participants, 3.1% (95% confidence interval [CI] = 2.5%, 3.7%) died. The hazard ratio for mortality among the uninsured compared with the insured, with adjustment for age and gender only, was 1.80 (95% CI = 1.44, 2.26). After additional adjustment for race/ethnicity, income, education, self- and physician-rated health status, body mass index, leisure exercise, smoking, and regular alcohol use, the uninsured were more likely to die (hazard ratio = 1.40; 95% CI = 1.06, 1.84) than those with insurance.

Conclusions. Uninsurance is associated with mortality. The strength of that association appears similar to that from a study that evaluated data from the mid-1980s, despite changes in medical therapeutics and the demography of the uninsured since that time.
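Hazard ratios like those in this abstract are estimated on the log scale, so the reported 95 percent confidence interval can be inverted to recover a standard error and a z-statistic. A quick sketch, using the adjusted estimates from the abstract (HR 1.40, CI 1.06 to 1.84):

```python
import math

# Recover the standard error of log(HR) from a reported 95% CI and form a
# z-statistic. Numbers are the adjusted hazard ratio and CI from the third
# study's abstract.

def z_from_ci(hr, lo, hi, level_z=1.96):
    """SE of log(HR) implied by a 95% CI, and the resulting z-statistic."""
    se = (math.log(hi) - math.log(lo)) / (2 * level_z)
    return math.log(hr) / se

z = z_from_ci(1.40, 1.06, 1.84)
print(f"z = {z:.2f}")  # about 2.39, comfortably above 1.96
```

That is, the adjusted mortality effect of uninsurance clears the conventional 5 percent significance bar, though not by a huge margin.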

Some Thoughts

Statistical information and studies are good for informing judgment. And on this basis, I would say the conclusion that health insurance increases life expectancy and reduces the incidence of some complaints is sound.

On the other hand, whether one can just go ahead and predict the deaths from a blanket adoption of an expansion of Medicaid seems like a stretch – particularly if one is going to present, as the Econbrowser post does, a linear projection over several years. Presumably, there are covariates which might change in these years, so why should it be straight-line? OK, maybe the upper and lower bounds are there to deal with this problem. But what are the covariates?

Forecasting in the medical and health fields has come of age, as I hope to show in several upcoming posts.

Looking Ahead, Looking Back

Looking ahead, I’m almost sure I want to explore forecasting in the medical field this coming week. Menzie Chinn at Econbrowser, for example, highlights forecasts suggesting that states opting out of expanded Medicaid are flirting with higher death rates. This set off a flurry of comments, highlighting the importance and controversy attached to various forecasts in the field of medical practice.

There’s a lot more – from bizarre and sad mortality trends among Russian men since the collapse of the Soviet Union, now stabilizing to an extent, to systems which forecast epidemics, to, again, cost and utilization forecasts.

Today, however, I want to wind up this phase of posts on forecasting the stock and related financial asset markets.

Market Expectations in the Cross Section of Present Values

That’s the title of Bryan Kelly and Seth Pruitt’s article in the Journal of Finance, downloadable from the Social Science Research Network (SSRN).

The following chart from this paper shows in-sample (IS) and out-of-sample (OOS) performance of Kelly and Pruitt’s new partial least squares (PLS) predictor, along with IS and OOS forecasts from another model based on the aggregate book-to-market ratio.


The Kelly-Pruitt PLS predictor performs much better, both in-sample and out-of-sample, than the more traditional regression model based on aggregate book-to-market ratios.

What Kelly and Pruitt do is use what I would call cross-sectional time series data to estimate aggregate market returns.

Basically, they construct a single factor which they use to predict aggregate market returns from cross-sections of portfolio-level book-to-market ratios.


To harness disaggregated information we represent the cross section of asset-specific book-to-market ratios as a dynamic latent factor model. We relate these disaggregated value ratios to aggregate expected market returns and cash flow growth. Our model highlights the idea that the same dynamic state variables driving aggregate expectations also govern the dynamics of the entire panel of asset-specific valuation ratios. This representation allows us to exploit rich cross-sectional information to extract precise estimates of market expectations.

This cross-sectional data presents a “many predictors” type of estimation problem, and the authors write that,

Our solution is to use partial least squares (PLS, Wold (1975)), which is a simple regression-based procedure designed to parsimoniously forecast a single time series using a large panel of predictors. We use it to construct a univariate forecaster for market returns (or dividend growth) that is a linear combination of assets’ valuation ratios. The weight of each asset in this linear combination is based on the covariance of its value ratio with the forecast target.

I think it is important to add that the authors extensively explore PLS as a procedure which can be built from a series of cross-cutting regressions, as it were (see their white paper on the three-pass regression filter).

But, it must be added, this PLS procedure can be summarized in a single matrix formula, which is


Readers wanting definitions of these matrices should consult the Journal of Finance article and/or the white paper mentioned above.

The Kelly-Pruitt analysis works where other methods essentially fail – in OOS prediction,

Using data from 1930-2010, PLS forecasts based on the cross section of portfolio-level book-to-market ratios achieve an out-of-sample predictive R2 as high as 13.1% for annual market returns and 0.9% for monthly returns (in-sample R2 of 18.1% and 2.4%, respectively). Since we construct a single factor from the cross section, our results can be directly compared with univariate forecasts from the many alternative predictors that have been considered in the literature. In contrast to our results, previously studied predictors typically perform well in-sample but become insignificant out-of-sample, often performing worse than forecasts based on the historical mean return …
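The out-of-sample R2 quoted here is conventionally computed against the historical-mean benchmark, as one minus the ratio of the model's squared forecast errors to the benchmark's. A sketch of the metric, on made-up numbers rather than Kelly and Pruitt's data:

```python
import numpy as np

# Out-of-sample R2 relative to the historical-mean benchmark:
# R2_oos = 1 - SSE(model) / SSE(mean). Inputs below are toy numbers chosen
# to illustrate the computation.

def oos_r2(actual, model_forecasts, mean_forecasts):
    """OOS R2; positive means the model beats the historical-mean forecast."""
    actual = np.asarray(actual, dtype=float)
    sse_model = np.sum((actual - np.asarray(model_forecasts)) ** 2)
    sse_mean = np.sum((actual - np.asarray(mean_forecasts)) ** 2)
    return 1.0 - sse_model / sse_mean

actual = [0.10, -0.05, 0.07, 0.02]       # realized returns
model = [0.08, -0.02, 0.05, 0.03]        # model forecasts
mean_bench = [0.04, 0.05, 0.03, 0.04]    # expanding historical mean (assumed)
print(f"OOS R2: {oos_r2(actual, model, mean_bench):.3f}")
```

A negative value of this statistic is what "performing worse than forecasts based on the historical mean return" means in the quote above; most predictors in the prior literature land there out-of-sample.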

So, the bottom line is that aggregate stock market returns are predictable from a common-sense perspective, without recourse to abstruse error measures. And I believe Amit Goyal, whose earlier article with Welch contests market predictability, now agrees (personal communication) that this application of a PLS estimator breaks new ground out-of-sample – even though its complexity asks quite a bit from the data.

Note, though, how volatile aggregate realized returns for the US stock market are, and how forecast errors of the Kelly-Pruitt analysis become huge during the 2008-2009 recession and some previous recessions – indicated by the shaded lines in the above figure.

Still, something is better than nothing, and I look for improvements to this approach – which already has been applied by Kelly and Pruitt to international stocks and to other slices of portfolio data.

Links May 2014

If there is a theme for this current Links page, it’s that trends spotted a while ago are maturing, becoming clearer.

So with the perennial topic of Big Data and predictive analytics, there is an excellent discussion in Algorithms Beat Intuition – the Evidence is Everywhere. There is no question – the machines are going to take over; it’s only a matter of time.

And, as far as freaky, far-out science, how about Scientists Create First Living Organism With ‘Artificial’ DNA.

Then there are China trends. Workers in China are better paid, have higher skills, and they are starting to use the strike. Striking Chinese Workers Are Headache for Nike, IBM, Secret Weapon for Beijing. This is a long way from the poor peasant women from rural areas living in dormitories, doing anything for five or ten dollars a day.

The Chinese dominance in the economic sphere continues, too, as noted by the Economist. Crowning the dragon – China will become the world’s largest economy by the end of the year


But there is the issue of the Chinese property bubble. China’s Property Bubble Has Already Popped, Report Says


Then, there are issues and trends of high importance surrounding the US Federal Reserve. And I can think of nothing more important and noteworthy than Alan Blinder’s recent comments.

Former Fed Leader Alan Blinder Sees Market-rattling Infighting at Central Bank

“The Fed may get more raucous about what to do next as tapering draws to a close,” Alan Blinder, a banking industry consultant and economics professor at Princeton University, said in a speech to the Investment Management Consultants Association in Boston.

The cacophony is likely to “rattle the markets” beginning in late summer as traders debate how precipitously the Fed will turn from reducing its purchases of U.S. government debt and mortgage securities to actively selling it.

The Open Market Committee will announce its strategy in October or December, he said, but traders will begin focusing earlier on what will happen with rates as some members of the rate-setting panel begin openly contradicting Fed Chair Janet Yellen, he said.

Then, there are some other assorted links with good infographics, charts, or salient discussion.

Alibaba IPO Filing Indicates Yahoo Undervalued Heck of an interesting issue.


Twitter Is Here To Stay

Three Charts on Secular Stagnation Krugman toying with secular stagnation hypothesis.

Rethinking Property in the Digital Era Personal data should be viewed as property

Larry Summers Goes to Sleep After Introducing Piketty at Harvard Great pic. But I have to have sympathy for Summers, having attended my share of sleep-inducing presentations on important economics issues.


Turkey’s Institutions Problem from the Stockholm School of Economics, nice infographics, visual aids. Should go along with your note cards on an important emerging economy.

Post-Crash economics clashes with ‘econ tribe’ – economics students in England are proposing reform of the university economics course of study, but, as this link points out, this is an uphill battle and has been suggested before.

The Life of a Bond – everybody needs to know what is in this infographic.

Very Cool Video of Ocean Currents From NASA