Category Archives: data science

Links – February 1, 2014

IT and Big Data

Kayak and Big Data Kayak is adding prediction of prices of flights over the coming 7 days to its meta search engine for the travel industry.

China’s Lenovo steps into ring against Samsung with Motorola deal Lenovo Group, the Chinese technology company that earns about 80 percent of its revenue from personal computers, is betting it can also be a challenger to Samsung Electronics Co Ltd and Apple Inc in the smartphone market.

5 Things To Know About Cognitive Systems and IBM Watson Rob High video on Watson at Valuable to review. Watson is probably different than you think. Deep natural language processing.

Playing Computer Games and Winning with Artificial Intelligence (Deep Learning) Pesents the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards… [applies] method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm…outperforms all previous approaches on six of the games and surpasses a human expert on three of them.

Global Economy

China factory output points to Q1 lull Chinese manufacturing activity slipped to its lowest level in six months, with indications of slowing growth for the quarter to come in the world’s second-largest economy.

Japan inflation rises to a 5 year high, output rebounds Japan’s core consumer inflation rose at the fastest pace in more than five years in December and the job market improved, encouraging signs for the Bank of Japan as it seeks to vanquish deflation with aggressive money printing.

Coup Forecasts for 2014


World risks deflationary shock as BRICS puncture credit bubbles Ambrose Evans-Pritchard does some nice analysis in this piece.

Former IMF Chief Economist, Now India’s Central Bank Governor Rajan Takes Shot at Bernanke’s Destabilizing Policies

Some of his key points:

Emerging markets were hurt both by the easy money which flowed into their economies and made it easier to forget about the necessary reforms, the necessary fiscal actions that had to be taken, on top of the fact that emerging markets tried to support global growth by huge fiscal and monetary stimulus across the emerging markets. This easy money, which overlaid already strong fiscal stimulus from these countries. The reason emerging markets were unhappy with this easy money is “This is going to make it difficult for us to do the necessary adjustment.” And the industrial countries at this point said, “What do you want us to do, we have weak economies, we’ll do whatever we need to do. Let the money flow.”

Now when they are withdrawing that money, they are saying, “You complained when it went in. Why should you complain when it went out?” And we complain for the same reason when it goes out as when it goes in: it distorts our economies, and the money coming in made it more difficult for us to do the adjustment we need for the sustainable growth and to prepare for the money going out

International monetary cooperation has broken down. Industrial countries have to play a part in restoring that, and they can’t at this point wash their hands off and say we’ll do what we need to and you do the adjustment. ….Fortunately the IMF has stopped giving this as its mantra, but you hear from the industrial countries: We’ll do what we have to do, the markets will adjust and you can decide what you want to do…. We need better cooperation and unfortunately that’s not been forthcoming so far.

Science Perspective

Researchers Discover How Traders Act Like Herds And Cause Market Bubbles

Building on similarities between earthquakes and extreme financial events, we use a self-organized criticality-generating model to study herding and avalanche dynamics in financial markets. We consider a community of interacting investors, distributed in a small-world network, who bet on the bullish (increasing) or bearish (decreasing) behavior of the market which has been specified according to the S&P 500 historical time series. Remarkably, we find that the size of herding-related avalanches in the community can be strongly reduced by the presence of a relatively small percentage of traders, randomly distributed inside the network, who adopt a random investment strategy. Our findings suggest a promising strategy to limit the size of financial bubbles and crashes. We also obtain that the resulting wealth distribution of all traders corresponds to the well-known Pareto power law, while that of random traders is exponential. In other words, for technical traders, the risk of losses is much greater than the probability of gains compared to those of random traders.

Blogs review: Getting rid of the Euler equation – the equation at the core of modern macro The Euler equation is one of the fundamentals, at a deep level, of dynamic stochastic general equilibrium (DSGE) models promoted as the latest and greatest in theoretical macroeconomics. After the general failures in mainstream macroeconomics with 2008-09, DGSE have come into question, and this review is interesting because it suggests, to my way of thinking, that the Euler equation linking past and future consumption patterns is essentially grafted onto empirical data artificially. It is profoundly in synch with neoclassical economic theory of consumer optimization, but cannot be said to be supported by the data in any robust sense. Interesting read with links to further exploration.

BOSTON COLLOQUIUM FOR PHILOSOPHY OF SCIENCE: Revisiting the Foundations of Statistics – check this out – we need the presentations online.

Hal Varian and the “New” Predictive Techniques

Big Data: New Tricks for Econometrics is, for my money, one of the best discussions of techniques like classification and regression trees, random forests, and penalized  regression (such as lasso, lars, and elastic nets) that can be found.

Varian, pictured aove, is emeritus professor in the School of Information, the Haas School of Business, and the Department of Economics at the University of California at Berkeley. Varian retired from full-time appointments at Berkeley to become Chief Economist at Google.

He also is among the elite academics publishing in the area of forecasting according to IDEAS!.

Big Data: New Tricks for Econometrics, as its title suggests, uses the wealth of data now being generated (Google is a good example) as a pretext for promoting techniques that are more well-known in machine learning circles, than in econometrics or standard statistics, at least as understood by economists.

First, the sheer size of the data involved may require more sophisticated 18 data manipulation tools. Second, we may have more potential predictors than appropriate for estimation, so we need to do some kind of variable selection. Third, large data sets may allow for more flexible relationships than simple linear models. Machine learning techniques such as decision trees, support vector machines, neural nets, deep learning and so on may allow for more effective ways to model complex relationships.

He handles the definitional stuff deftly, which is good, since there is not standardization of terms yet in this rapidly evolving field of data science or predictive analytics, whatever you want to call it.

Thus, “NoSQL” databases are

sometimes interpreted as meaning “not only SQL.” NoSQL databases are more primitive than SQL databases in terms of data manipulation capabilities but can handle larger amounts of data.

The essay emphasizes out-of-sample prediction and presents a nice discussion of k-fold cross validation.

1. Divide the data into k roughly equal subsets and label them by s =1; : : : ; k. Start with subset s = 1.

2. Pick a value for the tuning parameter.

3. Fit your model using the k -1 subsets other than subset s.

4. Predict for subset s and measure the associated loss.

5. Stop if s = k, otherwise increment s by 1 and go to step 2.

Common choices for k are 10, 5, and the sample size minus 1 (“leave one out”). After cross validation, you end up with k values of the tuning parameter and the associated loss which you can then examine to choose an appropriate value for the tuning parameter. Even if there is no tuning parameter, it is useful to use cross validation to report goodness-of-t measures since it measures out-of-sample performance which is what is typically of interest.

Varian remarks that Test-train and cross validation, are very commonly used in machine learning and, in my view, should be used much more in economics, particularly when working with large datasets

But this essay is by no means methodological, and presents several nice worked examples, showing how, for example, regression trees can outperform logistic regression in analyzing survivors of the sinking of the Titanic – the luxury ship, and how several of these methods lead to different imputations of significance to the race factor in the Boston Housing Study.

The essay also presents easy and good discussions of bootstrapping, bagging, boosting, and random forests, among the leading examples of “new” techniques – new to economists.

For the statistics wonks, geeks, and enthusiasts among readers, here is a YouTube presentation of the paper cited above with extra detail.


Some Observations on Cluster Analysis (Data Segmentation)

Here are some notes and insights relating to clustering or segmentation.

1. Cluster Analysis is Data Discovery

Anil Jain’s 50-year retrospective on data clustering emphasizes data discovery – Clustering is inherently an ill-posed problem where the goal is to partition the data into some unknown number of clusters based on intrinsic information alone.

Clustering is “ill-posed” because identifying groups turns on the concept of similarity, and there are, as Jain highlights, many usable distance metrics to deploy, defining similarity of groups.

Also, an ideal cluster is compact and isolated, but, implicitly, this involves a framework of specific dimensions or coordinates, which may themselves be objects of choice. Thus, domain knowledge almost always comes into play in evaluating any specific clustering.

The importance of”subjective” elements is highlighted by research into wines, where, according to Gallo research, the basic segments are sweet and fruity, light body and fruity, medium body and rich flavor, medium body and light oak, and full body and robust flavor.

K-means clustering on chemical features of wines may or may not capture these groupings – but they are compelling to Gallo product development and marketing.

The domain expert looks over the results of formal clustering algorithms and makes the judgment call as to how many significant clusters they are and what they are.

2. Cluster Analysis is Unsupervised Learning

Cluster analysis does not use category labels that tag objects with prior identifiers, i.e., class labels. The absence of category information distinguishes data clustering (unsupervised learning) from classification or discriminant analysis (supervised learning).

Supervised  learning means is that there are classification labels in the dataset.

Linear discriminant analysis, as developed by the statistician Fisher, maximizes the distances between points in different clusters.

K-means clustering, on the other hand, classically minimizes the distances between the points in each cluster and the centroids of these clusters.

This is basically the difference between the inner products and the outer product of the relevant vectors.

3. Data reduction through principal component analysis can be helpful to clustering

K-means clustering in the sub-space defined by the first several principal components can boost a cluster analysis. I offer an example based on the University of California at Irving (UCI) Machine Learning Depository, where it is possible to find the Wine database.


The Wine database dates from 1996 and involves 14 variables based on a chemical analysis of wines grown in the same region of Italy, but derived from three different cultivars.

Dataset variables include (1) alcohol, (2) Malic acid, (3) Ash, (4) Alcalinity of ash, (5) Magnesium, (6) Total phenols, (7) Flavanoids, (8) Nonflavanoid phenols, (9) Proanthocyanins, (10) Color intensity, (11) Hue, (12) OD280/OD315 of diluted wines, and (13) Proline.

The dataset also includes, in the first column, a 1, 2, or 3 for the cultivar whose chemical properties follow. A total of 178 wines are listed in the dataset.

To develop my example, I first run k-means clustering on the original wine dataset – without, of course, the first column designating the cultivars. My assumption is that cluster analysis ought to provide a guide as to which cultivar the chemical data come from.

I ran a search for three segments.

The best match I got was in predicting membership of wines from the first cultivar. I get an approximately 78 percent hit rate – 77.9 percent of the first cultivar are correctly identified by a k-means cluster analysis of all 13 variables in the Wine dataset.  The worst performance is with the third cultivar – less than 50 percent accuracy in identification of this segment. There were a lot of false positives, in other words – a lot of instances where the k-means analysis of the whole dataset indicates that a wine stems from the third cultivar, when in fact it does not.

For comparison, I calculate the first three principal components, after standardizing the wine data. Then, I run the k-means clustering algorithm on the scores produced by these first three or the three most important principal components.

There is dramatic improvement.

I achieve a 92 percent score in predicting association with the first cultivar, and higher scores in predicting the next two cultivars.

The Matlab code for this is straight-forward. After importing the wine data from a spreadsheet with  x=xlsread(“Wine”), I execute IDX=kmeans(y,3), where the data matrix y is just the original data x stripped of its first column. The result matrix IDX gives the segment numbers of the data, organized by row. These segment values can be compared with the cultivar number to see how accurate the segmentation is. The second run involved using [COEFF, SCORE, latent] = princomp(zscore(y)) grabbing the first three SCORE’s and then segmenting them with the kmeans(3SCORE,3) command. It is not always this easy, but this example, which is readily replicated, shows that in some cases, radical improvements in segmentation can be achieved by working just with a subspace determined by the first several principal components.

4. More on Wine Segmentation

How Gallo Brings Analytics Into The Winemaking Craft is a case study on data mining as practiced by Gallo Wine, and is almost a text-book example of Big Data and product segmentation in the 21st Century.

Gallo’s analytics maturation mirrors the broader IT industry’s move from rear-view mirror reporting to predictive and proactive analytics. Gallo uses the deep insight it gets from all of this analytics to develop new breakout brands.

Based on consumer surveys at tasting events and in tasting rooms at its California and Washington vineyards, Gallo sees five core wine style clusters:

  • sweet and fruity
  • light body and fruity
  • medium body and rich flavor
  • medium body and light oak
  • full body and robust flavor

Gallo maps its own and competitors’ products to these clusters, and correlates them with internal sales data and third-party retail trend data to understand taste preferences and emerging trends in different markets.

In one brand development effort, Gallo spotted big potential demand for a blended red wine that would appeal to the first three of its style clusters. It used extensive knowledge of the flavor characteristics of more than 5,000 varieties of grapes and data on varietal business fundamentals—like the availability and cost patterns of different grapes from season to season—to come up with the Apothic brand last year. After just a year on the market, Apothic is expected to sell 1 million cases with the help of a new white blend.

The Cheapskates Wine Guide, incidentally, has a fairly recent and approving post on Apothic.

Unfortunately, Gallo is not inclined to share its burgeoning deep data on wine preferences and buying by customer type and sales region.

But there is an open access source for market research and other datasets for testing data science and market research techniques.

5. The Optimal Number of Clusters in K-means Clustering

Two metrics for assessing the number of clusters in k-means clustering employ the “within-cluster sum of squares” W(k) and the “between-cluster sum of squares” B(k) – where k indicates the number of clusters.

These metrics are the CH index and Hartigan’s index applied here to the “Wine” database from the Machine Learning databases at Cal Irvine.

Recall that the Wine data involve 14 variables from a chemical analysis of 178 wines grown in a region of Italy, derived from three cultivars. The dataset includes, in the first column, a 1,2, or 3 indicating the cultivar of each wine.

This dataset is ideal for supervised learning with, for example, linear discriminant analysis, but it also supports an interesting exploration of the performance of k-means clustering – an unsupervised learning technique.

Previously, we used the first three principal components to cluster the Wine dataset, arriving at fairly accurate predictions of which cultivar the wines came from.

Matlab’s k-means algorithm returns several relevant pieces of data, including the cluster assignment for each case or, here, 13 dimensional point (stripping off the cultivar identifier at the start), but also information on the within-cluster and total cluster sum of squares, and, of course, the final centroid values.

This makes computation of the CH index and Hartigan’s index easy, and, in the case of the Wine dataset, leads to the following graph.


The CH index has an “elbow” at k=3 clusters. The possibility that this indicates the optimal number of clusters is enhanced by the rather abupt change of slope in Hartigan’s index.

Of course, we know that there are good grounds for believing that three is the optimal number of clusters .

With this information, then, one could feel justified in looking at and clustering the first three principal components, which would produce a good indicator of the cultivar of the wine.


The objective of k-means clustering is to minimize the within-cluster sum of squares or the squared differences between the points belonging to a cluster and the associated centroid of that cluster.

Thus, suppose we have m observations on n-dimensional points xi=(x1i, x2i,..,xni), and, for this discussion, assume these points are mean-centered and divided by their column standard deviations.

Designate the centroids by the vector m=(m1, m2, …,mk). These are the average values for each variable for all the points assigned to a cluster.

Then, the objective is to minimize,


This problem is, in general, NP difficult, but the standard algorithm is straight-forward. The number of clusters k is selected. Then, an initial set of centroids is determined, by one means or another, and distances of points to these centroids are calculated. Points closest to the centroids are assigned to the cluster associated with that centroid. Then, the centroids are recalculated, and distances between the points and centroids are computed again. This loops until there are no further changes in centroid values or cluster assignment.

Minimization of the above objective function frequently leads to a local minima, so usually algorithms develop multiple assignments of the initial centroids, with the final solution being the cluster assignment with the lowest value of W(k).

Calculating the CH and Hartigan Indexes

Let me throw up a text box from the excellent work Computational Systems Biology of Cancer .


Making allowances for slight variation in notation, one can see that the CH index is a ratio of the between-cluster sum of squares and the within-cluster sum of squares, multiplied by constants determined by the number of clusters and number of cases or observations in the dataset.

Also, Hartigan’s index is a ratio of successive within-cluster sum of squares, divided by constants determined by the number of observations and number of clusters.

I focused on clustering the first ten principal components of the Wine data matrix in the graph above, incidentally.

Let me also quote from the above-mentioned work, which is available as a Kindle download from Amazon,

A generic problem with clustering methods is to choose the number of groups K. In an ideal situation in which samples would be well partitioned into a finite number of clusters and each cluster would correspond to the various hypothesis made by a clustering method (e.g. being a spherical Gaussian distribution), statistical criteria such as the Bayesian Information Criterion (BIC) can be used to select the optimal K consistently (Kass and Wasserman, 1995; Pelleg and Moore, 2000). On real data, the assumptions underlying such criteria are rarely met, and a variety of more or less heuristic criteria have been proposed to select a good number of clusters K (Milligan and Cooper, 1985;Gordon, 1999). For example, given a sequence of partitions C1, C2, …with k = 1, 2,…  groups, a useful and simple method is to monitor the decrease in W( Ck) (see Box 5.3) with k, and try to detect an elbow in the curve, i.e. a transition between sharp and slow decrease (see Figure 5.4, upper left). Alternatively, several statistics have been proposed to precisely detect a change of regime in this curve (see Box 5.4). For hierarchical clustering methods, the selection of clusters is often performed by searching the branches of dendrograms which are stable with respect to within- and between-group distance (Jain et al., 1999; Bertoni and Valentini, 2008).

It turns out that key references here are available on the Internet.

Thus, the Aikake or Bayesian Information Criteria applied to the number of clusters is described in Dan Pelleg and Andrew Moore’s 2000 paper X-means: Extending K-means with Efficient Estimation of the Number of Clusters.

Robert Tibshirani’s 2000 paper on the gap statistic also is available.

Another example of the application of these two metrics is drawn from this outstanding book on data analysis in cancer research, below.


These methods of identifying the optimal number of clusters might be viewed as heuristic, and certainly are not definitive. They are, however, possibly helpful in a variety of contexts.

Top Forecasting Institutions and Researchers According to IDEAS!

Here is a real goldmine of research on forecasting.

IDEAS! is a RePEc service hosted by the Research Division of the Federal Reserve Bank of St. Louis.

This website compiles rankings on authors who have registered with the RePEc Author Service, institutions listed on EDIRC, bibliographic data collected by RePEc, citation analysis performed by CitEc and popularity data compiled by LogEc – under the category of forecasting.

Here is a list of the top fifteen of the top 10% institutions in the field of forecasting, according to IDEAS!. The institutions are scored based on a weighted sum of all authors affiliated with the respective institutions (click to enlarge).


The Economics Department of the University of Wisconsin, the #1 institution, lists 36 researchers who claim affiliation and whose papers are listed under the category forecasting in IDEAS!.

The same IDEAS! Webpage also lists the top 10% authors in the field of forecasting. I extract the top 20 of this list here below. If you click through on an author, you can see their list of publications, many of which often are available as PDF downloads.


This is a good place to start in updating your knowledge and understanding of current thinking and contextual issues relating to forecasting.

The Applied Perspective

For an applied forecasting perspective, there is Bloomberg with this fairly recent video on several top economic forecasters providing services to business and investors.

I believe Bloomberg will release extensive, updated lists of top forecasters by country, based on a two year perspective, in a few weeks.

What’s the Lift of Your Churn Model? – Predictive Analytics and Big Data

Churn analysis is a staple of predictive analytics and big data. The idea is to identify attributes of customers who are likely leave a mobile phone plan or other subscription service, or, more generally, switch who they do business with. Knowing which customers are likely to “churn” can inform customer retention plans. Such customers, for example, may be contacted in targeted call or mailing campaigns with offers of special benefits or discounts.

Lift is a concept in churn analysis. The lift of a target group identified by churn analysis reflects the higher proportion of customers who actually drop the service or give someone else their business, when compared with the population of customers as a whole. If, typically, 2 percent of customers drop the service per month, and, within the group identified as “churners,” 8 percent drop the service, the “lift” is 4.

In interesting research, originally published in the Harvard Business Review, Gregory Piatetsky-Shapiro questions the efficacy of big data applied to churn analysis – based on an estimation of costs and benefits.

We looked at some 30 different churn-modeling efforts in banking and telecom, and surprisingly, although the efforts used different data and different modeling algorithms, they had very similar lift curves. The lists of top 1% likely defectors had a typical lift of around 9-11. Lists of top 10% defectors all had a lift of about 3-4. Very similar lift curves have been reported in other work. (See here and here.) All this suggests a limiting factor to prediction accuracy for consumer behavior such as churn.

Backtracking through earlier research by Piatetsky-Shapiro and his co-researchers, there is this nugget,

For targeted marketing campaigns, a good model lift at T, where T is the target rate in the overall population, is usually sqrt(1/T) +/- 20%.

So, if the likely “churners” are 5 percent of the customer group, a reasonable expectation of the lift that can be obtained from churn analysis is 4.47. This means probably no more than 25 percent of the target group identified by the churn analysis will, in fact, do business elsewhere in the defined period.

This is a very applied type of result, based on review of 30 or more studies.

But the point Piatetsky-Shapiro make is that big data probably can’t push these lift numbers much higher, because of the inherent randomness in the behavior of consumers. And small gains to existing methods simply do not meet a cost/benefit criterion.

Some Israeli researchers may in fact best these numbers with a completely different approach based on social network analysis. Their initial working hypothesis was that social influence on churn is highly dominant in relatively tight social groups. Their approach is clearly telecommunications-based, since they analyzed patterns of calling between customers, identifying networks of callers who had more frequent communications.

Still, there is a good argument for an evolution from standard churn analysis to predictive analytics that uncovers the value-at-risk in the customer base, or even the value that can be saved by customer retention programs. Customers who have trouble paying their bill, for example, might well be romanced less strongly by customer retention efforts, than premium customers.


Along these lines, I enjoyed reading the Stochastic Solutions piece on who can be saved and who will be driven away by retention activity, which is responsible for the above graphic.

It has been repeatedly demonstrated that the very act of trying to ‘save’ some customers provokes them to leave. This is not hard to understand, for a key targeting criterion is usually estimated churn probability, and this is highly correlated with customer dissatisfaction. Often, it is mainly lethargy that is preventing a dissatisfied customer from actually leaving. Interventions designed with the express purpose of reducing customer loss can provide an opportunity for such dissatisfaction to crystallise, provoking or bringing forward customer departures that might otherwise have been avoided, or at least delayed. This is especially true when intrusive contact mechanisms, such as outbound calling, are employed. Retention programmes can be made more effective and more profitable by switching the emphasis from customers with a high probability of leaving to those likely to react positively to retention activity.

This is a terrific point. Furthermore,

..many customers are antagonised by what they feel to be intrusive contact mechanisms; indeed, we assert without fear of contradiction that only a small proportion of customers are thrilled, on hearing their phone ring, to discover that the caller is their operator. In some cases, particularly for customers who are already unhappy, such perceived intrusions may act not merely as a catalyst but as a constituent cause of churn.

Bottom-line, this is among the most interesting applications of predictive analytics.

Logistic regression is a favorite in analyzing churn data, although techniques range from neural networks to regression trees.

Measuring the Intelligence of Crowds

Researchers at Microsoft Research in the UK and Cambridge University report some fascinating and potentially useful results on crowdsourcing, based on a study of aggregating questions from a standard IQ test on Amazon’s Mechanical Turk (AMT).

The AMT site provides a place where workers can find problems that requesters have set up for crowdsourcing.

The introductory page to the site looks like this (click to enlarge).


So here’s an interesting way for people to make some money working from home, at their own hours, and yet stay busy. I’d like to look more deeply into this in a future post, but what these Crowd IQ researchers did is divvy up the questions from a widely utilized IQ test on the AMT site. They studied the effects of changing several parameters on their measures of Crowd IQ, but basically found that, with five or more reputable workers in a group, the Crowd IQ was usually higher than that of the individual workers in the group.

The Abstract for their 2012 study Crowd IQ: Measuring the Intelligence of Crowdsourcing Platforms describes the research and findings succinctly:

We measure crowdsourcing performance based on a standard IQ questionnaire, and examine Amazon’s Mechanical Turk (AMT) performance under different conditions. These include variations of the payment amount offered, the way incorrect responses affect workers’ reputations, threshold reputation scores of participating AMT workers, and the number of workers per task. We show that crowds composed of workers of high reputation achieve higher performance than low reputation crowds, and the effect of the amount of payment is non-monotone—both paying too much and too little affects performance. Furthermore, higher performance is achieved when the task is designed such that incorrect responses can decrease workers’ reputation scores. Using majority vote to aggregate multiple responses to the same task can significantly improve performance, which can be further boosted by dynamically allocating workers to tasks in order to break ties.

The IQ test is Raven’s Standard Progressive Matrices (SPM). If you want to take the test, look here.

SPM is a nonverbal, multiple-choice intelligence test based on the theory of general ability. The general setup is as in the following example.


Free riders are an interesting problem in a site like the Mechanical Turk. So, if people get paid by the number of correct answers, some simply select responses at random to maximize the speed at which they can put up answers. Because of this, AMT has a reputation mechanism indicating the expected quality of work of a worker, based on his or her past performance.

This research is has real-world implications. For example, increasing the payment for tasks too much results in actually diminuishing the quality of the answers, for a variety of reasons the authors consider.

The “workers” in this AMT-based study did not consult with each other about the answers, but were grouped into teams somehow by the researchers.

Here is a chart showing the increase in crowd IQ with the number of people in the group.


Here a HIT refers to a Human Intelligence Task.


First, experiment and monitor the performance. Our results suggest that relatively small changes to the parameters of the task may result in great changes in crowd performance. Changing parameters of the task (e.g. reward, time limits, reputation rage) and observing changes in performance may allow you to greatly increase performance. Second, make sure to threaten workers’ reputation by emphasizing that their solutions will be monitored and wrong responses rejected. Obviously, in a real-world setting it may be hard to detect free-riders without using a “gold-set” of test questions to which the requester already knows the correct response. However, designing and communicating HIT rejection conditions can discourage free riding or make it risky and more difficult. For instance, in the case of translation tasks requesters should determine what is not acceptable (e.g. using Google Translate) and may suggest that the response quality would be monitored and solutions of low quality would be rejected. Third, do not over-pay. Although the reward structure obviously depends on the task at hand and the expected amount of effort required to solve it, our results suggest that pricing affects not only the ability to s source enough workers to perform the task but also the quality of the obtained results. Higher rewards are likely to encourage a free-riding behavior and may affect the cognitive abilities of workers by increasing psychological pressure. Thus, for long term projects or tasks that are run repeatedly in a production environment, we believe it is worthwhile to experiment with the reward scheme in order to
find an optimum reward level. Fourth, aggregate multiple solutions to each HIT, preferably using an adaptive sourcing scheme. Even the simplest aggregation method – majority voting – has a potential to greatly improve the quality of the solution. In the context of more complicated tasks, e.g. translations, requesters may consider a two-stage design in which they first request several solutions, and then use another batch of workers to vote for the best one. Additionally, requesters may consider inspecting the responses provided by individuals that often disagree with the crowd – they might be coveted geniuses or free-riders deserving rejection.

Interesting stuff, and makes you want to try crowdsourcing.

Boosting Time Series

If you learned your statistical technique more than ten years ago, consider it necessary to learn a whole bunch of new methods. Boosting is certainly one of these.

Let me pick a leading edge of this literature here – boosting time series predictions.


Let’s go directly to the performance improvements.

In Boosting multi-step autoregressive forecasts, (Souhaib Ben Taieb and Rob J Hyndman, International Conference on Machine Learning (ICML) 2014) we find the following Table applying boosted time series forecasts to two forecasting competition datasets –


The three columns refer to three methods for generating forecasts over horizons of 1-18 periods (M3 Competition and 1-56 period (Neural Network Competition). The column labeled BOOST is, as its name suggests, the error metric for a boosted time series prediction. Either by the lowest symmetric mean absolute percentage error or a rank criterion, BOOST usually outperforms forecasts produced recursively from an autoregressive (AR) model, or forecasts from an AR model directly mapped onto the different forecast horizons.

There were a lot of empirical time series involved in these two datasets –

The M3 competition dataset consists of 3003 monthly, quarterly, and annual time series. The time series of the M3 competition have a variety of features. Some have a seasonal component, some possess a trend, and some are just fluctuating around some level. The length of the time series ranges between 14 and 126. We have considered time series with a range of lengths between T = 117 and T = 126. So, the number of considered time series turns out to be M = 339. For these time series, the competition required forecasts for the next H = 18 months, using the given historical data. The NN5 competition dataset comprises M = 111 time series representing roughly two years of daily cash withdrawals (T = 735 observations) at ATM machines at one of the various cities in the UK. For each time series, the  competition required to forecast the values of the next H = 56 days (8 weeks), using the given historical data.

This research, notice of which can be downloaded from Rob Hyndman’s site, builds on the methodology of Ben Taieb and Hyndman’s recent paper in the International Journal of Forecasting A gradient boosting approach to the Kaggle load forecasting competition. Ben Taieb and Hyndman’s submission came in 5th out of 105 participating teams in this Kaggle electric load forecasting competition, and used boosting algorithms.

Let me mention a third application of boosting to time series, this one from Germany. So we have Robinzonov, Tutz, and Hothorn’s Boosting Techniques for Nonlinear Time Series Models (Technical Report Number 075, 2010 Department of Statistics University of Munich) which focuses on several synthetic time series and predictions of German industrial production.

Again, boosted time series models comes out well in comparisons.


GLMBoost or GAMBoost are quite competitive at these three forecast horizons for German industrial production.

What is Boosting?

My presentation here is a little “black box” in exposition, because boosting is, indeed, mathematically intricate, although it can be explained fairly easily at a very general level.

Weak predictors and weak learners play an important role in bagging and boosting –techniques which are only now making their way into forecasting and business analytics, although the machine learning community has been discussing them for more than two decades.

Machine learning must be a fascinating field. For example, analysts can formulate really general problems –

In an early paper, Kearns and Valiant proposed the notion of a weak learning algorithm which need only achieve some error rate bounded away from 1/2 and posed the question of whether weak and strong learning are equivalent for efficient (polynomial time) learning algorithms.

So we get the “definition” of boosting in general terms:

Boosting algorithms are procedures that “boost” low-accuracy weak learning algorithms to achieve arbitrarily high accuracy.

And a weak learner is a learning method that achieves only slightly better than chance correct classification of binary outcomes or labeling.

This sounds like the best thing since sliced bread.

But there’s more.

For example, boosting can be understood as a functional gradient descent algorithm.

Now I need to mention that some of the most spectacular achievements in boosting come in classification. A key text is the recent book Boosting: Foundations and Algorithms (Adaptive Computation and Machine Learning series) by Robert E. Schapire and Yoav Freund. This is a very readable book focusing on AdaBoost, one of the early methods and its extensions. The book can be read on Kindle and is starts out –


So worth the twenty bucks or so for the download.

The papers discussed above vis a vis boosting time series apply p-splines in an effort to estimate nonlinear effects in time series. This is really unfamiliar to most of us in the conventional econometrics and forecasting communities, so we have to start conceptualizing stuff like “knots” and component-wise fitting algortihms.

Fortunately, there is a canned package for doing a lot of the grunt work in R, called mboost.

Bottom line, I really don’t think time series analysis will ever be the same.

Analytics 2013 Conference in Florida

Looking for case studies of data analytics or predictive analytics, or for Big Data applications?

You can hardly do better, on a first cut, than peruse the material now available from October’s Analytics 2013 Conference, held at the Hyatt Regency Hotel in Orlando, Florida.

Presented by SAS, dozens of presentations and posters from the Conference can be downloaded as zip files, unbundling as PDF files.

Download the conference presentations and poster presentations (.zip)

I also took an hour to look at the Keynote Presentation of Dr. Sven Crone of Lancaster University in the UK, now available on YouTube.

Crone, who also is affiliated with the Lancaster Centre for Forecasting, gave a Keynote which was, in places, fascinating, and technical and a little obscure elsewhere – worth watching if you time, or can run it in the background while you sort through your desk, for example.

A couple of slides caught my attention.

One segment gave concrete meaning to the explosion of data available to forecasters and analysts. For example, for electric power load forecasting, it used be the case that you had, perhaps, monthly total loads for the system or several of its parts, or perhaps daily system loads. Now, Crone notes the data to be modeled has increased by orders of magnitude, for example, with Smart Meters recording customer demand at fifteen minute intervals.


Another part of Crone’s talk which grabbed my attention was his discussion of forecasting techniques employed by 300 large manufacturing concerns, some apparently multinational in scale. The following graph – which is definitely obscure by virtue of its use of acronyms for types of forecasting systems, like SOP for Sales and Operation Planning – highlights that almost no company uses anything except the simplest methods for forecasting, relying largely on judgmental approaches. This aligns with a survey I once did which found almost no utilities used anything except the simplest per capita forecasting approaches. Perhaps things have changed now.

Analytics13A1 Analytics13A2

Crone suggests relying strictly on judgment becomes sort of silly in the face of the explosion of information now available to management.

Another theme Crone spins in an amusing, graphic way is that the workhorses of business forecasting, such as exponential smoothing, are really products from many decades ago. He uses funny pics of old business/office environments, asking whether this characterizes your business today.

The analytic meat of the presentation comes with exposition of bagging and boosting, as well as creative uses for k-means clustering in time series analysis.

At which point he descends into a technical wonderland of complexity.

Incidentally, Analytics 2014 is scheduled for Frankfurt, Germany June 4-5 this coming Spring.

Watch here for my follow-on post on boosting time series.

Links January 16, 2014

Economic Outlook

Central Station: January Fed Taper on Track

Federal Reserve officials, including a strong supporter of their easy money policies, have so far brushed off the weak employment report as a blip in an otherwise strengthening economic recovery. This suggests they are likely to stick to their plan to gradually wind down their bond-buying program this year as the recovery picks up momentum…

“True, the December jobs report was disappointing,” said Chicago Fed President Charles Evans, who has been a champion of aggressive central bank efforts to spur stronger economic growth. But, he added, “the recent data on economic activity generally have been encouraging” and “importantly, the labor market has improved.”

He said the tentative plan to reduce the monthly bond buys in $10 billion increments “seems quite reasonable” and “it makes sense to continue that in January.

Atlanta Fed President Dennis Lockhart, a centrist on Fed policies, said Monday the December employment report hadn’t shaken his expectation that the central bank would stick to the taper plan.

Meanwhile two opponents of the bond-buying program, Dallas Fed President Richard Fisher and Philadelphia Fed President Charles Plosser indicated in separate speeches Tuesday they were all for winding it down.

Given that chorus, it appears probable Fed officials will trim their monthly bond purchases to $65 billion from $75 billion at their next policy meeting January 28-29 meeting. Now that tapering is under way, the bar for stopping the process seems quite high.

Big Data

Big Data systems are making a difference in the fight against cancer Open source, distributed computing tools speedup an important processing pipeline for genomics data

Big Data to increase e-tailer profits

As tablet and smartphone usage becomes more widespread, shopping online has become quicker and easier and the speed of delivery has become critical in the online fulfilment race.

The group of researchers, which includes Arne Strauss, Assistant Professor of Operational Research at Warwick Business School, propose an analytic approach that will predict when people want their shopping delivered depending on what delivery prices (or incentives such as discounts or loyalty points) are being quoted for different delivery time slots.

It takes into account accepted orders to date as well as orders that are still expected to come in….

The new approach was tested using real shopping data from a major e-grocer in the UK over a period of six months and generated a four per cent increase in profits on average in a simulation study, outperforming traditional delivery pricing policies.

Big Data and Data Science Books – A Baker’s Dozen – from Analytic Bridge

  1. Big Data: A Revolution That Will Transform How We Live, Work, and T…, by Viktor Mayer-Schonberger and Kenneth Cukier
  2. The Signal and the Noise: Why So Many Predictions Fail-but Some Don’t, by Nate Silver
  3. Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie…, by Eric Siegel
  4. The Human Face of Big Data, by Rick Smolan and Jennifer Erwitt
  5. Data Science for Business: What you need to know about data mining …, by Foster Provost and Tom Fawcett
  6. The Black Swan: The Impact of the Highly Improbable, by Nassim Nicholas Taleb
  7. Competing on Analytics: The New Science of Winning, by Thomas H. Davenport and Jeanne G. Harris
  8. Super Crunchers: Why Thinking-by-Numbers is the New Way to Be Smart, by Ian Ayres
  9. Big Data Marketing: Engage Your Customers More Effectively and Driv…, by Lisa Arthur
  10. Journeys to Data Mining: Experiences from 15 Renowned Researchers, by Mohamed Medhat Gaber (editor)
  11. The Fourth Paradigm: Data-Intensive Scientific Discovery, by T.Hey, S.Tansley, and K.Tolle (editors)
  12. Seven Databases in Seven Weeks: A Guide to Modern Databases and the…, by Eric Redmond and Jim Wilson
  13. Data Mining And Predictive Analysis: Intelligence Gathering And Cri…, by Colleen McCue

Social Media

The History and Evolution of Social Media an Infographic (click to enlarge)


If you like this, there are many more infographics focusing on social media at – some in Spanish. Also check out Top 10 Infographics of 2013 [Daily Infographic].


Economics: Science, Craft, or Snake Oil? – nice, but sort of equivocating essay by Dani Rodrik from his new offices at the Princeton Institute for Advanced Studies. Answer – all of the above.

Origin and Importance of the Salesman in US – A piece of US business history and culture. I like this.

The Evolution of Kaggle

Kaggle is evolving in industry-specific directions, although it still hosts general data and predictive analytics contests.

“We liked to say ‘It’s all about the data,’ but the reality is that you have to understand enough about the domain in order to make a business,” said Anthony Goldbloom, Kaggle’s founder and chief executive. “What a pharmaceutical company thinks a prediction about a chemical’s toxicity is worth is very different from what Clorox thinks shelf space is worth. There is a lot to learn in each area.”

Oil and gas, which for Kaggle means mostly fracking wells in the United States, have well-defined data sets and a clear need to find working wells. While the data used in traditional oil drilling is understood, fracking is a somewhat different process. Variables like how long deep rocks have been cooked in the earth may matter. So does which teams are working the fields, meaning early-stage proprietary knowledge is also in play. That makes it a good field to go into and standardize.

(as reported in

This December 2013 change of direction pushed out Jeremy Howard, Kaggle’s former Chief Data Scientist, who now says he is,

focusing on building new kinds of software that could better learn about the data it was crunching and offer its human owners insights on any subject.

“A lone wolf data scientist can still apply his knowledge to any industry,” he said. “I’m spending time in areas where I have no industrial knowledge and finding things. I’m going to have to build a company, but first I have to spend time as a lone wolf.”

A year or so ago, the company evolved into a service-provider with the objective of linking companies, top competitors and analytical talent, and the more than 100,000 data scientists who compete on its platform.

So Kaggle now features CUSTOMER SOLUTIONS ahead of COMPETITIONS at the head of its homepage, saying We’re the global leader in solving business challenges through predictive analytics. The homepage also features logos from Facebook GE, MasterCard, and NASA, as well as a link Compete as a data scientist for fortune, fame and fun ».

But a look at the competitions underway currently highlight the fact that just a few pay a prize now.


Presumeably, companies looking for answers are now steered into the Kaggle network. The Kaggle Team numbers six analysts with experience in several industries, and the Kaggle Community includes scores of data and predictive analytics whizzes, many with “with multiple Kaggle wins.”

Here is a selection of Kaggle Solutions.


This video gives you a good idea of the current focus of the company.

This is a big development in a way, and supports those who point to the need for industry-specific knowledge and experience to do a good job of data analytics.