Mergers and Acquisitions

Are we on the threshold of a rise in corporate mergers and acquisitions (M&A)?

According to the KPMG Mergers & Acquisitions Predictor, the answer is ‘yes.’

The world’s largest corporates are expected to show a greater appetite for deals in 2014 compared to 12 months ago, according to analyst predictions. Predicted forward P/E ratios (our measure of corporate appetite) in December 2013 were 16 percent higher than in December 2012. This reflects the last half of the year, which saw a 17 percent increase in forward P/E between June and December 2013. This was compared to a 1 percent fall in the previous 6 months, after concerns over the anticipated mid-year tapering of quantitative easing in the US. The increase in appetite is matched by an anticipated increase of capacity of 12 percent over the next year.

This prediction is based on

..tracking and projecting important indicators 12 months forward. The rise or fall of forward P/E (price/earnings) ratios offers a good guide to the overall market confidence, while net debt to EBITDA (earnings before interest, tax, depreciation and amortization) ratios helps gauge the capacity of companies to fund future acquisitions.


Similarly, JPMorgan forecasts a 30% rebound in mergers and acquisitions in Asia for 2014.

Waves and Patterns in M&A Activity

Mergers and acquisitions tend to occur in waves, or clusters.


Source: Waves of International Mergers and Acquisitions

It’s not exactly clear what the underlying drivers of M&A waves are, although there is a rich literature on this.

Riding the wave, for example – an Economist article – highlights four phases of merger activity, based on the recent book Masterminding the Deal: Breakthroughs in M&A Strategy and Analysis:

In the first phase, usually when the economy is in poor shape, just a handful of deals are struck, often desperation sales at bargain prices in a buyer’s market. In the second, an improving economy means that finance is more readily available and so the volume of M&A rises—but not fast, as most deals are regarded as risky, scaring away all but the most confident buyers. It is in the third phase that activity accelerates sharply, because the “merger boom is legitimised; chief executives feel it is safe to do a deal, that no one is going to criticise them for it,” says Mr Clark.

This is when the premiums that acquirers are willing to pay over the target’s pre-bid share price start to rise rapidly. In the merger waves since 1980, bid premiums in phase one have averaged just 10-18%, rising in phase two to 20-35%. In phase three, they surge past 50%, setting the stage for the catastrophically frothy fourth and final phase. This is when premiums rise above 100%, as bosses do deals so bad they are the stuff of legend. Thus, the 1980s merger wave ended soon after the disastrous debt-fuelled hostile bid for RJR Nabisco by KKR, a private-equity fund. A bestselling book branded the acquirers “Barbarians at the Gate”. The turn-of-the-century boom ended soon after Time Warner’s near-suicidal (at least for its shareholders) embrace of AOL.

This typology comes from Clark and Mills’ book ‘Masterminding the Deal’, which suggests that two-thirds of mergers fail.

In their attempt to assess why some mergers succeed while most fail, the authors offer a ranking scheme by merger type. The most successful deals are made by bottom trawlers (87%-92%). Then, in decreasing order of success, come bolt-ons, line extension equivalents, consolidation mature, multiple core related complementary, consolidation-emerging, single core related complementary, lynchpin strategic, and speculative strategic (15%-20%). Speculative strategic deals, which prompt “a collective financial market response of ‘Is this a joke?’ have included the NatWest/Gleacher deal, Coca-Cola’s purchase of film producer Columbia Pictures, AOL/Time Warner, eBay/Skype, and nearly every deal attempted by former Vivendi Universal chief executive officer Jean-Marie Messier.” (pp. 159-60)

More simply put, acquisitions fail for three key reasons. The acquirer could have selected the wrong target (Conseco/Green Tree, Quaker Oats/Snapple), paid too much for it (RBS Fortis/ABN Amro, AOL/Huffington Post), or poorly integrated it (AT&T/NCR, Terra Firma/EMI, Unum/Provident).

Be all this as it may, the signs point to a significant uptick in M&A activity in 2014. Thus, Dealogic reports that Global Technology M&A volume totals $22.4bn in 2014 YTD, up from $6.4bn in 2013 YTD and the highest YTD volume since 2006 ($34.8bn).

Global Economy Outlook – Some Problems

There seems to be a meme evolving around the idea that – while the official business outlook for 2014 is positive – problems with Chinese debt, or more generally, emerging markets could be the spoiler.

The encouraging forecasts posted by bank and financial economists (see Hatzius, for example) present 2014 as a balance of forces, with things tipping in the direction of faster growth in the US and Europe. Austerity constraints, sequestration in the US and draconian EU policies, will loosen, allowing the natural robustness of the underlying economy to assert itself – after years of sub-par performance. In the meanwhile, growth in the emerging economies is admittedly slowing, but is still expected at much higher rates than in heartland areas of the industrial West or Japan.

So, fingers crossed, the World Bank and other official economic forecasting agencies show an uptick in economic growth in the US and, even, Europe for 2014.

But then we have articles that highlight emerging market risks:

China’s debt-fuelled boom is in danger of turning to bust – This Financial Times article develops the idea that only five developing countries have had a credit boom nearly as big as China’s, in each case leading to a credit crisis and slowdown. Chinese “total debt” – a concept not well-defined in this short piece – is currently running at about 230 per cent of gross domestic product. The article offers comparisons with “33 previous credit binges” and with smaller economies, such as Taiwan, Thailand, Zimbabwe, and so forth. Strident, but not compelling.

With China Awash in Money, Leaders Start to Weigh Raising the Floodgates – From the New York Times, a more solid discussion – The amount of money sloshing around China’s economy, according to a broad measure that is closely watched here, has now tripled since the end of 2006. China’s tidal wave of money has powered the economy to new heights, but it has also helped drive asset prices through the roof. Housing prices have soared, feeding fears of a bubble while leaving many ordinary Chinese feeling poor and left out.

The People’s Bank of China has been creating money to a considerable extent by issuing more renminbi to bankroll its purchase of hundreds of billions of dollars a year in currency markets to minimize the appreciation of the renminbi against the dollar and keep Chinese exports inexpensive in foreign markets; the central bank disclosed on Wednesday that the country’s foreign reserves, mostly dollars, soared $508.4 billion last year, a record increase.


Source: New York Times

Moreover, the rapidly expanding money supply reflects a flood of loans from the banking system and the so-called shadow banking system that have kept afloat many inefficient state-owned enterprises and bankrolled the construction of huge overcapacity in the manufacturing sector.

There are also at least two recent, relevant posts by Yves Smith – who is always on the watch for sources of instability in the banking system:

How Serious is China’s Shadow Banking/Wealth Management Products Problem?

China Credit Worries Rise as Large Shadow Banking Default Looms

In addition to concerns about China, of course, there are major currency problems developing for Russia, India, Chile, Brazil, Turkey, South Africa, and Argentina.


From the Economist: The plunging currency club

So there are causes for concern, especially with the US Fed, under Janet Yellen, planning on winding down QE or quantitative easing.

When Easy Money Ends is a good read in this regard, highlighting the current scale of QE (quantitative easing) programs globally, and savings from lower interest rates – coupled with impacts of higher interest rates.

Since the start of the financial crisis, the Fed, the European Central Bank, the Bank of England, and the Bank of Japan have used QE to inject more than $4 trillion of additional liquidity into their economies…If interest rates were to return to 2007 levels, interest payments on government debt could rise by 20%, other things being equal…US and European nonfinancial corporations saved $710 billion from lower debt-service payments, with ultralow interest rates thus boosting profits by about 5% in the US and the UK, and by 3% in the euro-zone. This source of profit growth will disappear as interest rates rise, and some firms will need to reconsider business models – for example, private equity – that rely on cheap capital…We could also witness the return of asset-price bubbles in some sectors, especially real estate, if QE continues. The International Monetary Fund noted in 2013 that there were already “signs of overheating in real-estate markets” in Europe, Canada, and some emerging-market economies. 

Climate Gotterdammerung

For video fans, here are three videos on climate change and global warming. Be sure and see the third – it’s very dramatic.

White House smacks down climate deniers in new video

“If you’ve been hearing that extreme cold spells like the one that we’re having in the United States now disprove global warming, don’t believe it,” Holdren [White House Science Advisor] says in the video, before launching into a succinct explanation of how uneven global temperature changes are destabilizing the polar vortex and making it “wavier.”

“The waviness means that there can be increased, larger excursions of wintertime cold air southward,” Holdren says. He adds that “increased excursions of relatively warmer” air can also move into the “far north” as the globe warms.

NASA Graphic Shows Six Terrifying Decades Of Global Warming (VIDEO)

Largest Glacier Calving Ever Filmed

This is from “Chasing Ice.” James Balog, the National Geographic photographer, speaks at the end of the film; his assistants were staked out on a high ridge above all this and took the videos.

Some of the shards of ice are three times taller than the skyscrapers in Lower Manhattan – a comparable area to the breakup zone.

Hal Varian and the “New” Predictive Techniques

Big Data: New Tricks for Econometrics is, for my money, one of the best discussions of techniques like classification and regression trees, random forests, and penalized regression (such as lasso, LARS, and elastic nets) that can be found.

Varian, pictured above, is emeritus professor in the School of Information, the Haas School of Business, and the Department of Economics at the University of California at Berkeley. Varian retired from full-time appointments at Berkeley to become Chief Economist at Google.

He also is among the elite academics publishing in the area of forecasting according to IDEAS!.

Big Data: New Tricks for Econometrics, as its title suggests, uses the wealth of data now being generated (Google is a good example) as a pretext for promoting techniques that are better known in machine learning circles than in econometrics or standard statistics, at least as understood by economists.

First, the sheer size of the data involved may require more sophisticated data manipulation tools. Second, we may have more potential predictors than appropriate for estimation, so we need to do some kind of variable selection. Third, large data sets may allow for more flexible relationships than simple linear models. Machine learning techniques such as decision trees, support vector machines, neural nets, deep learning and so on may allow for more effective ways to model complex relationships.

He handles the definitional stuff deftly, which is good, since there is no standardization of terms yet in this rapidly evolving field of data science or predictive analytics, whatever you want to call it.

Thus, “NoSQL” databases are

sometimes interpreted as meaning “not only SQL.” NoSQL databases are more primitive than SQL databases in terms of data manipulation capabilities but can handle larger amounts of data.

The essay emphasizes out-of-sample prediction and presents a nice discussion of k-fold cross validation.

1. Divide the data into k roughly equal subsets and label them by s = 1, …, k. Start with subset s = 1.

2. Pick a value for the tuning parameter.

3. Fit your model using the k - 1 subsets other than subset s.

4. Predict for subset s and measure the associated loss.

5. Stop if s = k, otherwise increment s by 1 and go to step 2.

Common choices for k are 10, 5, and the sample size minus 1 (“leave one out”). After cross validation, you end up with k values of the tuning parameter and the associated loss which you can then examine to choose an appropriate value for the tuning parameter. Even if there is no tuning parameter, it is useful to use cross validation to report goodness-of-fit measures since it measures out-of-sample performance which is what is typically of interest.
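To make the procedure concrete, here is a minimal Python sketch of k-fold cross validation. It is not Varian's code. The model (a one-variable ridge regression through the origin), the synthetic data, and the candidate penalty values are all assumptions chosen to keep the example self-contained. It follows the common variant in which every candidate value of the tuning parameter is scored on all k folds and the average out-of-sample loss is compared.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and split the indices into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[s::k] for s in range(k)]

def fit_ridge_1d(xs, ys, lam):
    """One-variable ridge regression through the origin (illustrative model)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def cv_loss(xs, ys, lam, k=5, seed=0):
    """Mean squared out-of-sample error, averaged over the k held-out folds."""
    folds = k_fold_indices(len(xs), k, seed)
    fold_losses = []
    for s in range(k):
        held_out = set(folds[s])
        train = [i for i in range(len(xs)) if i not in held_out]
        # Fit on the k - 1 other subsets, then predict for subset s
        beta = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], lam)
        errors = [(ys[i] - beta * xs[i]) ** 2 for i in folds[s]]
        fold_losses.append(sum(errors) / len(errors))
    return sum(fold_losses) / k

# Synthetic data, y = 2x + noise (made up for illustration)
rng = random.Random(1)
xs = [rng.uniform(-1, 1) for _ in range(100)]
ys = [2 * x + rng.gauss(0, 0.5) for x in xs]

# Hypothetical grid of tuning-parameter values
candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
losses = {lam: cv_loss(xs, ys, lam) for lam in candidates}
best_lam = min(losses, key=losses.get)
```

The dictionary `losses` is the table of tuning-parameter values and associated out-of-sample losses that the quoted passage says to examine. Picking the minimum is one simple way to choose among them.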

Varian remarks that “Test-train and cross validation are very commonly used in machine learning and, in my view, should be used much more in economics, particularly when working with large datasets.”

But this essay is by no means purely methodological; it presents several nice worked examples, showing how, for example, regression trees can outperform logistic regression in analyzing survivors of the sinking of the Titanic – the luxury ship – and how several of these methods lead to different imputations of significance to the race factor in the Boston Housing Study.

The essay also presents easy and good discussions of bootstrapping, bagging, boosting, and random forests, among the leading examples of “new” techniques – new to economists.

For the statistics wonks, geeks, and enthusiasts among readers, here is a YouTube presentation of the paper cited above with extra detail.


Some Observations on Cluster Analysis (Data Segmentation)

Here are some notes and insights relating to clustering or segmentation.

1. Cluster Analysis is Data Discovery

Anil Jain’s 50-year retrospective on data clustering emphasizes data discovery – Clustering is inherently an ill-posed problem where the goal is to partition the data into some unknown number of clusters based on intrinsic information alone.

Clustering is “ill-posed” because identifying groups turns on the concept of similarity, and there are, as Jain highlights, many usable distance metrics to deploy, defining similarity of groups.

Also, an ideal cluster is compact and isolated, but, implicitly, this involves a framework of specific dimensions or coordinates, which may themselves be objects of choice. Thus, domain knowledge almost always comes into play in evaluating any specific clustering.

The importance of “subjective” elements is highlighted by research into wines, where, according to Gallo research, the basic segments are sweet and fruity, light body and fruity, medium body and rich flavor, medium body and light oak, and full body and robust flavor.

K-means clustering on chemical features of wines may or may not capture these groupings – but they are compelling to Gallo product development and marketing.

The domain expert looks over the results of formal clustering algorithms and makes the judgment call as to how many significant clusters there are and what they are.

2. Cluster Analysis is Unsupervised Learning

Cluster analysis does not use category labels that tag objects with prior identifiers, i.e., class labels. The absence of category information distinguishes data clustering (unsupervised learning) from classification or discriminant analysis (supervised learning).

Supervised learning means that there are classification labels in the dataset.

Linear discriminant analysis, as developed by the statistician Fisher, seeks the projection that maximizes the separation between the labeled classes relative to the scatter within each class.

K-means clustering, on the other hand, classically minimizes the distances between the points in each cluster and the centroids of these clusters.

This is basically the difference between the inner products and the outer product of the relevant vectors.

3. Data reduction through principal component analysis can be helpful to clustering

K-means clustering in the sub-space defined by the first several principal components can boost a cluster analysis. I offer an example based on the University of California at Irvine (UCI) Machine Learning Repository, where it is possible to find the Wine database.


The Wine database dates from 1996 and involves 14 variables based on a chemical analysis of wines grown in the same region of Italy, but derived from three different cultivars.

Dataset variables include (1) alcohol, (2) Malic acid, (3) Ash, (4) Alcalinity of ash, (5) Magnesium, (6) Total phenols, (7) Flavanoids, (8) Nonflavanoid phenols, (9) Proanthocyanins, (10) Color intensity, (11) Hue, (12) OD280/OD315 of diluted wines, and (13) Proline.

The dataset also includes, in the first column, a 1, 2, or 3 for the cultivar whose chemical properties follow. A total of 178 wines are listed in the dataset.

To develop my example, I first run k-means clustering on the original wine dataset – without, of course, the first column designating the cultivars. My assumption is that cluster analysis ought to provide a guide as to which cultivar the chemical data come from.

I ran a search for three segments.

The best match I got was in predicting membership of wines from the first cultivar. I get an approximately 78 percent hit rate – 77.9 percent of the first cultivar are correctly identified by a k-means cluster analysis of all 13 variables in the Wine dataset.  The worst performance is with the third cultivar – less than 50 percent accuracy in identification of this segment. There were a lot of false positives, in other words – a lot of instances where the k-means analysis of the whole dataset indicates that a wine stems from the third cultivar, when in fact it does not.

For comparison, I calculate the first three principal components, after standardizing the wine data. Then, I run the k-means clustering algorithm on the scores produced by the first three (that is, the three most important) principal components.

There is dramatic improvement.

I achieve a 92 percent score in predicting association with the first cultivar, and higher scores in predicting the next two cultivars.

The Matlab code for this is straightforward. After importing the wine data from a spreadsheet with x=xlsread('Wine'), I execute IDX=kmeans(y,3), where the data matrix y is just the original data x stripped of its first column, for example y=x(:,2:end). The result vector IDX gives the segment numbers of the data, organized by row. These segment values can be compared with the cultivar number to see how accurate the segmentation is. The second run involved using [COEFF, SCORE, latent] = princomp(zscore(y)), grabbing the first three columns of SCORE, and then segmenting them with the kmeans(SCORE(:,1:3),3) command. It is not always this easy, but this example, which is readily replicated, shows that in some cases, radical improvements in segmentation can be achieved by working just with a subspace determined by the first several principal components.
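One practical wrinkle in comparing segment numbers with cultivar numbers is that k-means numbers its clusters arbitrarily, so cluster 2 may well correspond to cultivar 1. The Python sketch below (Python rather than Matlab, purely for illustration) shows one way to handle this: try every mapping of cluster ids to class labels, keep the mapping with the most agreements, and report a hit rate per class. The function names and the toy labels are hypothetical, and the sketch assumes there are no more clusters than classes.

```python
from itertools import permutations

def best_label_match(true_labels, cluster_ids):
    """Try every mapping of cluster ids to classes; keep the one with most agreements."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    best_map, best_hits = None, -1
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[c] == t for c, t in zip(cluster_ids, true_labels))
        if hits > best_hits:
            best_map, best_hits = mapping, hits
    return best_map

def hit_rates(true_labels, cluster_ids):
    """Share of each true class that lands in the cluster matched to that class."""
    mapping = best_label_match(true_labels, cluster_ids)
    rates = {}
    for cls in sorted(set(true_labels)):
        members = [c for c, t in zip(cluster_ids, true_labels) if t == cls]
        rates[cls] = sum(mapping[c] == cls for c in members) / len(members)
    return rates

# Toy labels, invented for illustration (not the Wine results)
cultivar = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
segment = [2, 2, 3, 1, 1, 1, 3, 3, 3, 2]
rates = hit_rates(cultivar, segment)
```

Exhaustive matching is fine for three clusters, since there are only six mappings to try. With many clusters an assignment algorithm would be the more sensible choice.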

4. More on Wine Segmentation

How Gallo Brings Analytics Into The Winemaking Craft is a case study on data mining as practiced by Gallo Wine, and is almost a textbook example of Big Data and product segmentation in the 21st Century.

Gallo’s analytics maturation mirrors the broader IT industry’s move from rear-view mirror reporting to predictive and proactive analytics. Gallo uses the deep insight it gets from all of this analytics to develop new breakout brands.

Based on consumer surveys at tasting events and in tasting rooms at its California and Washington vineyards, Gallo sees five core wine style clusters:

  • sweet and fruity
  • light body and fruity
  • medium body and rich flavor
  • medium body and light oak
  • full body and robust flavor

Gallo maps its own and competitors’ products to these clusters, and correlates them with internal sales data and third-party retail trend data to understand taste preferences and emerging trends in different markets.

In one brand development effort, Gallo spotted big potential demand for a blended red wine that would appeal to the first three of its style clusters. It used extensive knowledge of the flavor characteristics of more than 5,000 varieties of grapes and data on varietal business fundamentals—like the availability and cost patterns of different grapes from season to season—to come up with the Apothic brand last year. After just a year on the market, Apothic is expected to sell 1 million cases with the help of a new white blend.

The Cheapskates Wine Guide, incidentally, has a fairly recent and approving post on Apothic.

Unfortunately, Gallo is not inclined to share its burgeoning deep data on wine preferences and buying by customer type and sales region.

But there is an open access source for market research and other datasets for testing data science and market research techniques.

5. The Optimal Number of Clusters in K-means Clustering

Two metrics for assessing the number of clusters in k-means clustering employ the “within-cluster sum of squares” W(k) and the “between-cluster sum of squares” B(k) – where k indicates the number of clusters.

These metrics are the CH index and Hartigan’s index applied here to the “Wine” database from the Machine Learning databases at Cal Irvine.

Recall that the Wine data involve 14 variables from a chemical analysis of 178 wines grown in a region of Italy, derived from three cultivars. The dataset includes, in the first column, a 1, 2, or 3 indicating the cultivar of each wine.

This dataset is ideal for supervised learning with, for example, linear discriminant analysis, but it also supports an interesting exploration of the performance of k-means clustering – an unsupervised learning technique.

Previously, we used the first three principal components to cluster the Wine dataset, arriving at fairly accurate predictions of which cultivar the wines came from.

Matlab’s k-means algorithm returns several relevant pieces of data, including the cluster assignment for each case or, here, 13-dimensional point (stripping off the cultivar identifier at the start), but also information on the within-cluster and total cluster sum of squares, and, of course, the final centroid values.

This makes computation of the CH index and Hartigan’s index easy, and, in the case of the Wine dataset, leads to the following graph.


The CH index has an “elbow” at k=3 clusters. The possibility that this indicates the optimal number of clusters is enhanced by the rather abrupt change of slope in Hartigan’s index.

Of course, we know that there are good grounds for believing that three is the optimal number of clusters.

With this information, then, one could feel justified in looking at and clustering the first three principal components, which would produce a good indicator of the cultivar of the wine.


The objective of k-means clustering is to minimize the within-cluster sum of squares or the squared differences between the points belonging to a cluster and the associated centroid of that cluster.

Thus, suppose we have m observations on n-dimensional points x_i = (x_1i, x_2i, …, x_ni), and, for this discussion, assume these points are mean-centered and divided by their column standard deviations.

Designate the centroids by the vector m=(m1, m2, …,mk). These are the average values for each variable for all the points assigned to a cluster.

Then, the objective is to minimize,
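In the notation above, and writing C_j (a symbol introduced here for convenience) for the set of points assigned to cluster j, the standard k-means objective can be written as the within-cluster sum of squares:

```latex
W(k) \;=\; \sum_{j=1}^{k} \; \sum_{x_i \in C_j} \lVert x_i - m_j \rVert^{2}
```

Each inner sum is the squared Euclidean distance of a cluster's points from its centroid m_j, and the outer sum adds these up over the k clusters.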


This problem is, in general, NP-hard, but the standard algorithm is straightforward. The number of clusters k is selected. Then, an initial set of centroids is determined, by one means or another, and distances of points to these centroids are calculated. Each point is assigned to the cluster of its nearest centroid. Then, the centroids are recalculated, and distances between the points and centroids are computed again. This loops until there are no further changes in centroid values or cluster assignment.

Minimization of the above objective function frequently leads to a local minimum, so algorithms usually try multiple assignments of the initial centroids, with the final solution being the cluster assignment with the lowest value of W(k).
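The loop just described, together with the restart strategy, can be sketched in a few lines of Python. This is an illustrative implementation under simple assumptions (random points as initial centroids, squared Euclidean distance, a small made-up dataset), not Matlab's kmeans routine.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans_once(points, k, rng, max_iter=100):
    """One run of the standard algorithm from a random start."""
    centroids = rng.sample(points, k)            # random initial centroids
    assign = None
    for _ in range(max_iter):
        # Assign each point to its nearest centroid
        new_assign = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_assign == assign:                 # no change: converged
            break
        assign = new_assign
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:                          # keep old centroid if a cluster empties
                centroids[j] = centroid(members)
    w = sum(sq_dist(p, centroids[a]) for p, a in zip(points, assign))
    return w, assign, centroids

def kmeans(points, k, restarts=10, seed=0):
    """Repeat from several random starts; keep the run with the lowest W(k)."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: run[0])

# Three well-separated toy blobs (made up for illustration)
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
w, assign, centroids = kmeans(points, 3, restarts=50)
```

Even on data this clean, a single unlucky start can split one blob in two and merge the other two blobs, which is exactly the local-minimum problem that the restarts are meant to address.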

Calculating the CH and Hartigan Indexes

Let me throw up a text box from the excellent work Computational Systems Biology of Cancer.


Making allowances for slight variation in notation, one can see that the CH index is a ratio of the between-cluster sum of squares and the within-cluster sum of squares, multiplied by constants determined by the number of clusters and number of cases or observations in the dataset.

Also, Hartigan’s index is a ratio of successive within-cluster sum of squares, divided by constants determined by the number of observations and number of clusters.
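For readers who want to compute these, here is a short Python sketch using the usual textbook forms of the two indexes: CH(k) = [B(k)/(k-1)] / [W(k)/(n-k)] and H(k) = [W(k) - W(k+1)] / [W(k+1)/(n-k-1)], with n the number of observations. The notation may differ slightly from the book's text box, and the W(k) values below are invented for illustration; they are not the Wine results.

```python
def ch_index(w_k, b_k, n, k):
    """Calinski-Harabasz index: [B(k)/(k-1)] / [W(k)/(n-k)], for 1 < k < n."""
    return (b_k / (k - 1)) / (w_k / (n - k))

def hartigan_index(w_k, w_k_plus_1, n, k):
    """Hartigan's index: [W(k) - W(k+1)] / [W(k+1)/(n-k-1)]."""
    return (w_k - w_k_plus_1) / (w_k_plus_1 / (n - k - 1))

# Hypothetical within-cluster sums of squares for k = 1..5 (made-up numbers)
n = 100
w = {1: 1000.0, 2: 600.0, 3: 300.0, 4: 270.0, 5: 250.0}
total_ss = w[1]                      # with one cluster, W(1) is the total sum of squares

# B(k) follows from the decomposition: total = within + between
ch = {k: ch_index(w[k], total_ss - w[k], n, k) for k in range(2, 6)}
hartigan = {k: hartigan_index(w[k], w[k + 1], n, k) for k in range(1, 5)}

best_k_by_ch = max(ch, key=ch.get)
```

In this made-up series the CH index peaks at three clusters and Hartigan's index drops sharply once k reaches three, which is the kind of pattern one looks for in a graph of the two indexes.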

I focused on clustering the first ten principal components of the Wine data matrix in the graph above, incidentally.

Let me also quote from the above-mentioned work, which is available as a Kindle download from Amazon,

A generic problem with clustering methods is to choose the number of groups K. In an ideal situation in which samples would be well partitioned into a finite number of clusters and each cluster would correspond to the various hypothesis made by a clustering method (e.g. being a spherical Gaussian distribution), statistical criteria such as the Bayesian Information Criterion (BIC) can be used to select the optimal K consistently (Kass and Wasserman, 1995; Pelleg and Moore, 2000). On real data, the assumptions underlying such criteria are rarely met, and a variety of more or less heuristic criteria have been proposed to select a good number of clusters K (Milligan and Cooper, 1985;Gordon, 1999). For example, given a sequence of partitions C1, C2, …with k = 1, 2,…  groups, a useful and simple method is to monitor the decrease in W( Ck) (see Box 5.3) with k, and try to detect an elbow in the curve, i.e. a transition between sharp and slow decrease (see Figure 5.4, upper left). Alternatively, several statistics have been proposed to precisely detect a change of regime in this curve (see Box 5.4). For hierarchical clustering methods, the selection of clusters is often performed by searching the branches of dendrograms which are stable with respect to within- and between-group distance (Jain et al., 1999; Bertoni and Valentini, 2008).

It turns out that key references here are available on the Internet.

Thus, the Akaike or Bayesian Information Criteria applied to the number of clusters is described in Dan Pelleg and Andrew Moore’s 2000 paper X-means: Extending K-means with Efficient Estimation of the Number of Clusters.

Robert Tibshirani’s 2000 paper on the gap statistic also is available.

Another example of the application of these two metrics is drawn from this outstanding book on data analysis in cancer research, below.


These methods of identifying the optimal number of clusters might be viewed as heuristic, and certainly are not definitive. They are, however, possibly helpful in a variety of contexts.

Top Forecasting Institutions and Researchers According to IDEAS!

Here is a real goldmine of research on forecasting.

IDEAS! is a RePEc service hosted by the Research Division of the Federal Reserve Bank of St. Louis.

This website compiles rankings on authors who have registered with the RePEc Author Service, institutions listed on EDIRC, bibliographic data collected by RePEc, citation analysis performed by CitEc and popularity data compiled by LogEc – under the category of forecasting.

Here is a list of the top fifteen of the top 10% institutions in the field of forecasting, according to IDEAS!. The institutions are scored based on a weighted sum of all authors affiliated with the respective institutions.


The Economics Department of the University of Wisconsin, the #1 institution, lists 36 researchers who claim affiliation and whose papers are listed under the category forecasting in IDEAS!.

The same IDEAS! web page also lists the top 10% authors in the field of forecasting. I extract the top 20 of this list below. If you click through on an author, you can see their list of publications, many of which are available as PDF downloads.


This is a good place to start in updating your knowledge and understanding of current thinking and contextual issues relating to forecasting.

The Applied Perspective

For an applied forecasting perspective, there is Bloomberg with this fairly recent video on several top economic forecasters providing services to business and investors.

I believe Bloomberg will release extensive, updated lists of top forecasters by country, based on a two year perspective, in a few weeks.

What’s the Lift of Your Churn Model? – Predictive Analytics and Big Data

Churn analysis is a staple of predictive analytics and big data. The idea is to identify attributes of customers who are likely to leave a mobile phone plan or other subscription service, or, more generally, switch who they do business with. Knowing which customers are likely to “churn” can inform customer retention plans. Such customers, for example, may be contacted in targeted call or mailing campaigns with offers of special benefits or discounts.

Lift is a concept in churn analysis. The lift of a target group identified by churn analysis reflects the higher proportion of customers who actually drop the service or give someone else their business, when compared with the population of customers as a whole. If, typically, 2 percent of customers drop the service per month, and, within the group identified as “churners,” 8 percent drop the service, the “lift” is 4.
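The arithmetic is simple enough to sketch in Python. The first function reproduces the 8 percent versus 2 percent example. The second computes lift for the top-scoring fraction of a customer list, which is how lift curves are usually read. The function names, scores and outcomes are hypothetical.

```python
def lift(target_churn_rate, base_churn_rate):
    """Churn rate in the targeted group relative to the overall churn rate."""
    return target_churn_rate / base_churn_rate

def lift_at_top_fraction(scores, churned, fraction):
    """Lift among the `fraction` of customers with the highest model scores."""
    ranked = sorted(zip(scores, churned), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, round(fraction * len(ranked)))
    top_rate = sum(c for _, c in ranked[:n_top]) / n_top
    base_rate = sum(churned) / len(churned)
    return top_rate / base_rate

# The example from the text: 8 percent in the target group vs 2 percent overall
example_lift = lift(0.08, 0.02)

# Toy model scores and outcomes (1 = churned), invented for illustration
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
churned = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
top_20_lift = lift_at_top_fraction(scores, churned, 0.2)
```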

In interesting research, originally published in the Harvard Business Review, Gregory Piatetsky-Shapiro questions the efficacy of big data applied to churn analysis – based on an estimation of costs and benefits.

We looked at some 30 different churn-modeling efforts in banking and telecom, and surprisingly, although the efforts used different data and different modeling algorithms, they had very similar lift curves. The lists of top 1% likely defectors had a typical lift of around 9-11. Lists of top 10% defectors all had a lift of about 3-4. Very similar lift curves have been reported in other work. (See here and here.) All this suggests a limiting factor to prediction accuracy for consumer behavior such as churn.

Backtracking through earlier research by Piatetsky-Shapiro and his co-researchers, there is this nugget,

For targeted marketing campaigns, a good model lift at T, where T is the target rate in the overall population, is usually sqrt(1/T) +/- 20%.

So, if the likely “churners” are 5 percent of the customer group, a reasonable expectation of the lift that can be obtained from churn analysis is 4.47. This means probably no more than 25 percent of the target group identified by the churn analysis will, in fact, do business elsewhere in the defined period.
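As a quick check on this arithmetic, here is a small Python sketch of the rule of thumb. It assumes the targeted list is the same size as the churner share T, so the expected churn rate inside the list is simply lift times T. The function name is my own.

```python
import math

def expected_lift_range(target_rate, tolerance=0.20):
    """Rule-of-thumb lift sqrt(1/T), with a +/- tolerance band around it."""
    centre = math.sqrt(1 / target_rate)
    return centre * (1 - tolerance), centre, centre * (1 + tolerance)

T = 0.05                                   # churners are 5 percent of customers
low, centre, high = expected_lift_range(T)

# Share of the targeted group expected to churn = lift * base rate
precision_centre = centre * T
precision_high = high * T
```

With T at 5 percent the central lift is about 4.47, which puts the expected churn rate in the targeted group at roughly 22 percent, and at about 27 percent at the upper end of the 20 percent band.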

This is a very applied type of result, based on review of 30 or more studies.

But the point Piatetsky-Shapiro makes is that big data probably can’t push these lift numbers much higher, because of the inherent randomness in the behavior of consumers. And small gains to existing methods simply do not meet a cost/benefit criterion.

Some Israeli researchers may in fact best these numbers with a completely different approach based on social network analysis. Their initial working hypothesis was that social influence on churn is highly dominant in relatively tight social groups. Their approach is clearly telecommunications-based, since they analyzed patterns of calling between customers, identifying networks of callers who had more frequent communications.

Still, there is a good argument for an evolution from standard churn analysis to predictive analytics that uncovers the value-at-risk in the customer base, or even the value that can be saved by customer retention programs. Customers who have trouble paying their bill, for example, might well be romanced less strongly by customer retention efforts than premium customers.


Along these lines, I enjoyed reading the Stochastic Solutions piece on who can be saved and who will be driven away by retention activity, which is responsible for the above graphic.

It has been repeatedly demonstrated that the very act of trying to ‘save’ some customers provokes them to leave. This is not hard to understand, for a key targeting criterion is usually estimated churn probability, and this is highly correlated with customer dissatisfaction. Often, it is mainly lethargy that is preventing a dissatisfied customer from actually leaving. Interventions designed with the express purpose of reducing customer loss can provide an opportunity for such dissatisfaction to crystallise, provoking or bringing forward customer departures that might otherwise have been avoided, or at least delayed. This is especially true when intrusive contact mechanisms, such as outbound calling, are employed. Retention programmes can be made more effective and more profitable by switching the emphasis from customers with a high probability of leaving to those likely to react positively to retention activity.

This is a terrific point. Furthermore,

..many customers are antagonised by what they feel to be intrusive contact mechanisms; indeed, we assert without fear of contradiction that only a small proportion of customers are thrilled, on hearing their phone ring, to discover that the caller is their operator. In some cases, particularly for customers who are already unhappy, such perceived intrusions may act not merely as a catalyst but as a constituent cause of churn.

Bottom line, this is among the most interesting applications of predictive analytics.

Logistic regression is a favorite in analyzing churn data, although techniques range from neural networks to regression trees.

Measuring the Intelligence of Crowds

Researchers at Microsoft Research in the UK and Cambridge University report some fascinating and potentially useful results on crowdsourcing, based on a study of aggregating questions from a standard IQ test on Amazon’s Mechanical Turk (AMT).

The AMT site provides a place where workers can find problems that requesters have set up for crowdsourcing.

The introductory page to the site looks like this.


So here’s an interesting way for people to make some money working from home, at their own hours, and yet stay busy. I’d like to look more deeply into this in a future post, but what these Crowd IQ researchers did is divvy up the questions from a widely utilized IQ test on the AMT site. They studied the effects of changing several parameters on their measures of Crowd IQ, but basically found that, with five or more reputable workers in a group, the Crowd IQ was usually higher than that of the individual workers in the group.

The Abstract for their 2012 study Crowd IQ: Measuring the Intelligence of Crowdsourcing Platforms describes the research and findings succinctly:

We measure crowdsourcing performance based on a standard IQ questionnaire, and examine Amazon’s Mechanical Turk (AMT) performance under different conditions. These include variations of the payment amount offered, the way incorrect responses affect workers’ reputations, threshold reputation scores of participating AMT workers, and the number of workers per task. We show that crowds composed of workers of high reputation achieve higher performance than low reputation crowds, and the effect of the amount of payment is non-monotone—both paying too much and too little affects performance. Furthermore, higher performance is achieved when the task is designed such that incorrect responses can decrease workers’ reputation scores. Using majority vote to aggregate multiple responses to the same task can significantly improve performance, which can be further boosted by dynamically allocating workers to tasks in order to break ties.

The IQ test is Raven’s Standard Progressive Matrices (SPM). If you want to take the test, look here.

SPM is a nonverbal, multiple-choice intelligence test based on the theory of general ability. The general setup is as in the following example.


Free riders are an interesting problem in a site like the Mechanical Turk. So, if people get paid by the number of correct answers, some simply select responses at random to maximize the speed at which they can put up answers. Because of this, AMT has a reputation mechanism indicating the expected quality of work of a worker, based on his or her past performance.

This research has real-world implications. For example, increasing the payment for tasks too much actually diminishes the quality of the answers, for a variety of reasons the authors consider.

The “workers” in this AMT-based study did not consult with each other about the answers, but were grouped into teams somehow by the researchers.

Here is a chart showing the increase in crowd IQ with the number of people in the group.


Here a HIT refers to a Human Intelligence Task.


First, experiment and monitor the performance. Our results suggest that relatively small changes to the parameters of the task may result in great changes in crowd performance. Changing parameters of the task (e.g. reward, time limits, reputation range) and observing changes in performance may allow you to greatly increase performance. Second, make sure to threaten workers’ reputation by emphasizing that their solutions will be monitored and wrong responses rejected. Obviously, in a real-world setting it may be hard to detect free-riders without using a “gold-set” of test questions to which the requester already knows the correct response. However, designing and communicating HIT rejection conditions can discourage free riding or make it risky and more difficult. For instance, in the case of translation tasks requesters should determine what is not acceptable (e.g. using Google Translate) and may suggest that the response quality would be monitored and solutions of low quality would be rejected. Third, do not over-pay. Although the reward structure obviously depends on the task at hand and the expected amount of effort required to solve it, our results suggest that pricing affects not only the ability to source enough workers to perform the task but also the quality of the obtained results. Higher rewards are likely to encourage a free-riding behavior and may affect the cognitive abilities of workers by increasing psychological pressure. Thus, for long term projects or tasks that are run repeatedly in a production environment, we believe it is worthwhile to experiment with the reward scheme in order to find an optimum reward level. Fourth, aggregate multiple solutions to each HIT, preferably using an adaptive sourcing scheme. Even the simplest aggregation method – majority voting – has a potential to greatly improve the quality of the solution. In the context of more complicated tasks, e.g. translations, requesters may consider a two-stage design in which they first request several solutions, and then use another batch of workers to vote for the best one. Additionally, requesters may consider inspecting the responses provided by individuals that often disagree with the crowd – they might be coveted geniuses or free-riders deserving rejection.
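The fourth recommendation, majority voting with extra workers brought in to break ties, is easy to sketch in Python. This is only my illustration of the idea, not the researchers' code: the function names, the starting group of five workers, and the cap of nine are all assumptions.

```python
from collections import Counter

def majority_vote(responses):
    """Return (leading answer, is_tie) for the answers given to one task."""
    if not responses:
        raise ValueError("need at least one response")
    counts = Counter(responses).most_common()
    top_count = counts[0][1]
    leaders = [answer for answer, count in counts if count == top_count]
    return leaders[0], len(leaders) > 1

def adaptive_vote(ask_worker, initial_workers=5, max_workers=9):
    """Collect answers, adding one worker at a time while the vote is tied."""
    responses = [ask_worker() for _ in range(initial_workers)]
    winner, tied = majority_vote(responses)
    while tied and len(responses) < max_workers:
        responses.append(ask_worker())
        winner, tied = majority_vote(responses)
    return winner, responses

# Scripted answers stand in for real workers (hypothetical data).
scripted = iter(["B", "C", "B", "C", "A", "B"])
winner, used = adaptive_vote(lambda: next(scripted))
print(winner, len(used))
```

Here the first five answers split 2-2-1 between B, C and A, so a sixth worker is asked, and that answer settles the vote in favor of B.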

Interesting stuff, and makes you want to try crowdsourcing.

Crime Prediction

PredPol markets a crime prediction system tested in and currently used by Los Angeles, CA and Seattle, WA, and under evaluation elsewhere (London, UK). The product takes historic statistics and generates real-time predictions of where new crimes are likely to occur – within highly localized areas.

The spec sheet calls it “cloud-based, easy-to-use” software, offering this basic description.


This has generated lots of press and TV coverage.

In July 2013, there was a thoughtful article in the Economist, Don’t even think about it, and a piece on National Public Radio (NPR).

A YouTube video features a contribution from one of the company founders – Jeffrey Brantingham.

From what I glean, PredPol takes the idea of crime hotspots a step further, identifying behavioral patterns in burglaries and other property crimes – such as the higher probability of a repeat break-in, or increased probability of a break-in to a neighbor of a house that has been burglarized. Transportation access to and egress from crime sites is also important to criminals – the easier, the better.

The proof is in the pudding. And there have been reductions in property crime in locales where the PredPol system is being applied, although not necessarily increases in arrests. The rationale is that sending additional patrols into the targeted areas deters criminals.

Maybe some of these would-be criminals go elsewhere to rob and steal, but others may simply be deterred, given the criminal mind is at least partly motivated by sheer laziness.

Criticism of PredPol

I can think of several potential flaws.

  • Analytically, there have to be dynamic effects from the success of PredPol in any locale. If successful, in other words, the algorithm will change the crime pattern, and then what?
  • Also, there is a risk of sort of fooling oneself, if the lower crime stats are taken as evidence that the software is effective. Maybe crimes would have decreased anyway.
  • And there are constitutional issues, if police simply stop people to prevent their committing a crime before it has happened, based on the predictions of the software.

Last November, some of the first critical articles about PredPol came out, motivated in part by an SFWeekly article, All Tomorrow’s Crimes: The Future of Policing Looks a Lot Like Good Branding.

In the meantime, PredPol seems destined for wide application in larger urban areas, and surely has some of the best PR of any implementation of Big Data and predictive analytics.

Boosting Time Series

If you learned your statistical technique more than ten years ago, consider it necessary to learn a whole bunch of new methods. Boosting is certainly one of these.

Let me pick a leading edge of this literature here – boosting time series predictions.


Let’s go directly to the performance improvements.

In Boosting multi-step autoregressive forecasts (Souhaib Ben Taieb and Rob J Hyndman, International Conference on Machine Learning (ICML) 2014), we find the following table applying boosted time series forecasts to two forecasting competition datasets –


The three columns refer to three methods for generating forecasts over horizons of 1-18 periods (M3 Competition) and 1-56 periods (Neural Network Competition). The column labeled BOOST is, as its name suggests, the error metric for a boosted time series prediction. Either by the lowest symmetric mean absolute percentage error or a rank criterion, BOOST usually outperforms forecasts produced recursively from an autoregressive (AR) model, or forecasts from an AR model directly mapped onto the different forecast horizons.
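For readers who have not met the two benchmark strategies, here is a rough numpy sketch of recursive versus direct multi-step AR forecasts, scored with a symmetric MAPE. Everything in it is my own illustration: the synthetic series, the lag order of 12, and the function names are assumptions, and the sMAPE formula is one common variant that may differ in detail from the one used in the paper.

```python
import numpy as np

def lag_matrix(y, p):
    """Rows [y[t-1], ..., y[t-p]] for t = p, ..., len(y) - 1."""
    return np.column_stack([y[p - k: len(y) - k] for k in range(1, p + 1)])

def fit_ar(y, p, h=1):
    """Least-squares regression of y[t+h-1] on an intercept and p lags."""
    X = lag_matrix(y, p)
    X = X[: len(X) - (h - 1)]              # keep rows that have an h-step target
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y[p + h - 1:], rcond=None)
    return coef

def recursive_forecast(y, p, H):
    """One one-step AR model, iterated: each forecast is fed back in as a lag."""
    coef = fit_ar(y, p, h=1)
    hist = list(y)
    for _ in range(H):
        lags = hist[-1: -p - 1: -1]        # most recent value first
        hist.append(coef[0] + np.dot(coef[1:], lags))
    return np.array(hist[len(y):])

def direct_forecast(y, p, H):
    """A separate regression per horizon h, all fed the same last p values."""
    last = y[-1: -p - 1: -1]
    out = []
    for h in range(1, H + 1):
        coef = fit_ar(y, p, h)
        out.append(coef[0] + coef[1:] @ last)
    return np.array(out)

def smape(actual, forecast):
    """Symmetric MAPE in percent (one common definition)."""
    return 100 * np.mean(2 * np.abs(forecast - actual)
                         / (np.abs(actual) + np.abs(forecast)))

# Synthetic monthly-style series (made up): level, seasonality, noise.
rng = np.random.default_rng(0)
t = np.arange(120)
series = 50 + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 2, size=120)
train, test = series[:-18], series[-18:]

rec = recursive_forecast(train, p=12, H=18)
direct = direct_forecast(train, p=12, H=18)
print(f"recursive sMAPE: {smape(test, rec):.2f}")
print(f"direct sMAPE:    {smape(test, direct):.2f}")
```

The recursive forecast compounds its own errors as the horizon grows, while the direct forecast fits each horizon separately from fewer usable observations. As I read the paper, boosting is applied to improve on these two baselines.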

There were a lot of empirical time series involved in these two datasets –

The M3 competition dataset consists of 3003 monthly, quarterly, and annual time series. The time series of the M3 competition have a variety of features. Some have a seasonal component, some possess a trend, and some are just fluctuating around some level. The length of the time series ranges between 14 and 126. We have considered time series with a range of lengths between T = 117 and T = 126. So, the number of considered time series turns out to be M = 339. For these time series, the competition required forecasts for the next H = 18 months, using the given historical data. The NN5 competition dataset comprises M = 111 time series representing roughly two years of daily cash withdrawals (T = 735 observations) at ATM machines at one of the various cities in the UK. For each time series, the competition required to forecast the values of the next H = 56 days (8 weeks), using the given historical data.

This research, notice of which can be downloaded from Rob Hyndman’s site, builds on the methodology of Ben Taieb and Hyndman’s recent paper in the International Journal of Forecasting A gradient boosting approach to the Kaggle load forecasting competition. Ben Taieb and Hyndman’s submission came in 5th out of 105 participating teams in this Kaggle electric load forecasting competition, and used boosting algorithms.

Let me mention a third application of boosting to time series, this one from Germany. So we have Robinzonov, Tutz, and Hothorn’s Boosting Techniques for Nonlinear Time Series Models (Technical Report Number 075, 2010, Department of Statistics, University of Munich), which focuses on several synthetic time series and predictions of German industrial production.

Again, boosted time series models come out well in comparisons.


GLMBoost or GAMBoost are quite competitive at these three forecast horizons for German industrial production.

What is Boosting?

My presentation here is a little “black box” in exposition, because boosting is, indeed, mathematically intricate, although it can be explained fairly easily at a very general level.

Weak predictors and weak learners play an important role in bagging and boosting – techniques which are only now making their way into forecasting and business analytics, although the machine learning community has been discussing them for more than two decades.

Machine learning must be a fascinating field. For example, analysts can formulate really general problems –

In an early paper, Kearns and Valiant proposed the notion of a weak learning algorithm which need only achieve some error rate bounded away from 1/2 and posed the question of whether weak and strong learning are equivalent for efficient (polynomial time) learning algorithms.

So we get the “definition” of boosting in general terms:

Boosting algorithms are procedures that “boost” low-accuracy weak learning algorithms to achieve arbitrarily high accuracy.

And a weak learner is a learning method that achieves only slightly better than chance correct classification of binary outcomes or labeling.
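A toy example may help here. The sketch below is a bare-bones version of AdaBoost, Freund and Schapire's well-known boosting algorithm, using one-split "decision stumps" as the weak learner. The ten data points and the function names are my own, chosen only for illustration.

```python
import numpy as np

def fit_stump(x, y, w):
    """Best rule of the form 'predict s if x < v, else -s' under weights w."""
    xs = np.unique(x)
    thresholds = (xs[:-1] + xs[1:]) / 2     # midpoints between observed values
    best_err, best_rule = np.inf, None
    for v in thresholds:
        for s in (1, -1):
            pred = np.where(x < v, s, -s)
            err = w[pred != y].sum()        # weighted misclassification rate
            if err < best_err:
                best_err, best_rule = err, (v, s)
    return best_rule, best_err

def adaboost(x, y, rounds):
    """Reweight the data after each stump so later stumps focus on the misses."""
    w = np.full(len(x), 1.0 / len(x))
    ensemble = []
    for _ in range(rounds):
        (v, s), err = fit_stump(x, y, w)
        err = np.clip(err, 1e-12, 1 - 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)   # this stump's vote weight
        pred = np.where(x < v, s, -s)
        w = w * np.exp(-alpha * y * pred)       # up-weight the mistakes
        w /= w.sum()
        ensemble.append((alpha, v, s))
    return ensemble

def predict(ensemble, x):
    """Sign of the weighted vote across all stumps."""
    score = sum(a * np.where(x < v, s, -s) for a, v, s in ensemble)
    return np.sign(score)

# Ten made-up points that no single threshold can separate.
x = np.arange(10)
y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])

one = adaboost(x, y, rounds=1)
three = adaboost(x, y, rounds=3)
print("one stump accuracy:   ", np.mean(predict(one, x) == y))
print("three stumps accuracy:", np.mean(predict(three, x) == y))
```

On this toy data a single stump gets 7 of the 10 points right, while the weighted vote of three stumps classifies all 10 correctly. Real data will not be so obliging, but this shows the mechanism.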

This sounds like the best thing since sliced bread.

But there’s more.

For example, boosting can be understood as a functional gradient descent algorithm.

Now I need to mention that some of the most spectacular achievements in boosting come in classification. A key text is the recent book Boosting: Foundations and Algorithms (Adaptive Computation and Machine Learning series) by Robert E. Schapire and Yoav Freund. This is a very readable book focusing on AdaBoost, one of the early methods, and its extensions. The book can be read on Kindle and starts out –


So worth the twenty bucks or so for the download.

The papers discussed above vis a vis boosting time series apply p-splines in an effort to estimate nonlinear effects in time series. This is really unfamiliar to most of us in the conventional econometrics and forecasting communities, so we have to start conceptualizing stuff like “knots” and component-wise fitting algorithms.

Fortunately, there is a canned package for doing a lot of the grunt work in R, called mboost.
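To get a feel for what component-wise fitting means, here is a rough Python sketch of component-wise L2 boosting, which is one way to view boosting as functional gradient descent. This is my own simplified illustration, not mboost's code. It uses one-variable linear base learners rather than the p-splines in the papers, and the step size, number of steps, and simulated data are all assumptions.

```python
import numpy as np

def l2_boost(X, y, steps=200, nu=0.1):
    """Component-wise L2 boosting with one-variable linear base learners."""
    n, p = X.shape
    col_means = X.mean(axis=0)
    Xc = X - col_means                     # center so the start value is mean(y)
    coef = np.zeros(p)
    fit = np.full(n, y.mean())
    for _ in range(steps):
        resid = y - fit                    # negative gradient of squared-error loss
        # least-squares slope of the residual on each column, taken one at a time
        slopes = Xc.T @ resid / (Xc ** 2).sum(axis=0)
        sse = ((resid[:, None] - Xc * slopes) ** 2).sum(axis=0)
        j = np.argmin(sse)                 # component that fits the residual best
        coef[j] += nu * slopes[j]          # take only a small step toward it
        fit += nu * slopes[j] * Xc[:, j]
    intercept = y.mean() - col_means @ coef
    return intercept, coef

# Simulated data (made up): only columns 0 and 3 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(0, 0.5, size=200)

intercept, coef = l2_boost(X, y)
print("largest coefficients at columns:", np.argsort(-np.abs(coef))[:2])
print("coefficients:", np.round(coef, 2))
```

Each round updates only the single predictor that best fits the current residuals, so variable selection comes along almost for free. In a time series setting, the columns of X would be lagged values of the series.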

Bottom line, I really don’t think time series analysis will ever be the same.