
Superforecasting – The Art and Science of Prediction

Philip Tetlock’s recent Superforecasting says, basically, that some people do better at forecasting than others and, furthermore, that networking higher-performing forecasters and giving them access to pooled data can produce impressive results.

This is a change from Tetlock’s first study – Expert Political Judgment – which lasted about twenty years and concluded, famously, that “the average expert was roughly as accurate as a dart-throwing chimpanzee.”

Tetlock’s recent research comes out of a tournament sponsored by the Intelligence Advanced Research Projects Activity (IARPA). This forecasting competition fits with the mission of IARPA, which is to improve assessments by the “intelligence community,” or IC. The IC is a generic label, according to Tetlock, for “the Central Intelligence Agency, the National Security Agency, the Defense Intelligence Agency, and thirteen other agencies.”

It is relevant that the IC is surmised (exact figures are classified) to have “a budget of more than $50 billion … [and employ] one hundred thousand people.”

Thus, “Think how shocking it would be to the intelligence professionals who have spent their lives forecasting geopolitical events – to be beaten by a few hundred ordinary people and some simple algorithms.”

Of course, Tetlock reports, this actually happened – “Thanks to IARPA, we now know a few hundred ordinary people and some simple math can not only compete with professionals supported by multibillion-dollar apparatus but also beat them.”

IARPA’s motivation, apparently, traces back to the “weapons of mass destruction (WMD)” uproar surrounding the Iraq war –

“After invading in 2003, the United States turned Iraq upside down looking for WMDs but found nothing. It was one of the worst – arguably the worst – intelligence failure in modern history. The IC was humiliated. There were condemnations in the media, official investigations, and the familiar ritual of intelligence officials sitting in hearings …”

So the IC needs improved methods, including utilizing “the wisdom of crowds” and the practices of Tetlock’s “superforecaster” teams.
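Tetlock does not spell out the algorithms at this point, but the “simple math” is essentially pooling: average many independent probability forecasts and then push the average away from 0.5, which the book calls extremizing. Here is a minimal sketch in Python; the particular extremizing formula and the exponent are illustrative assumptions, not the tournament’s production algorithm.

```python
def pool_forecasts(probs, a=2.5):
    """Pool independent probability forecasts for a yes/no question.

    Averages the forecasts, then 'extremizes' the mean, i.e. pushes it
    away from 0.5, since simple averaging tends to produce forecasts
    that are too timid. The transform and a=2.5 are illustrative choices.
    """
    p_bar = sum(probs) / len(probs)               # unweighted average
    return p_bar**a / (p_bar**a + (1 - p_bar)**a)

# Example: five forecasters on a single geopolitical yes/no question
print(pool_forecasts([0.6, 0.7, 0.55, 0.65, 0.7]))   # ~0.81
```

Weighting forecasters by track record, which Tetlock also discusses, would replace the simple average here.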

Unlike the famous M-competitions, the IARPA tournament collates subjective assessments of geopolitical risk, such as “Will there be a fatal confrontation between vessels in the South China Sea?” or “Will either the French or Swiss inquiries find elevated levels of polonium in the remains of Yasser Arafat’s body?”

Tetlock’s book is entertaining and thought-provoking, but many in business will page directly to the Appendix – Ten Commandments for Aspiring Superforecasters.

    1. Triage – focus on questions which are in the “Goldilocks” zone where effort pays off the most.
    2. Break seemingly intractable problems into tractable sub-problems. Tetlock really explicates this recommendation with his discussion of “Fermi-izing” questions such as “How many piano tuners are there in Chicago?” The reference here, of course, is to Enrico Fermi, the nuclear physicist (a worked sketch of this kind of estimate follows the list).
    3. Strike the right balance between inside and outside views. The outside view, as I understand it, is essentially the base rate or “big picture.” If you are trying to understand the likelihood of a terrorist attack, how many terrorist attacks have occurred in similar locations in the past ten years? The inside view then includes facts about this particular time and place that help adjust the quantitative risk estimate.
    4. Strike the right balance between under- and overreacting to evidence. The problem with a precept like this is that its opposite is obviously false. Nobody would suggest “do not strike the right balance between under- and overreacting to evidence.” I guess the point is to keep the weight of the evidence in mind.
    5. Look for clashing causal forces at work in each problem. This reminds me of one of my models of predicting real world developments – tracing out “threads” or causal pathways. When several “threads” or chains of events and developments converge, possibility can develop into likelihood. You have to be a “fox” (rather than a hedgehog) to do this effectively – being open to diverse perspectives on what drives people and how things happen.
    6. Strive to distinguish as many degrees of doubt as the problem permits but no more. Another precept that could be cast as a truism, but the reference is to an interesting discussion in the book about how the IC now brings quantitative probability estimates to the table when developments – such as where Osama bin Laden was living – come under discussion.
    7. Strike the right balance between under- and overconfidence, between prudence and decisiveness. I really don’t see the particular value of this guideline, except to focus on whether you are being overconfident or indecisive. Give it some thought?
    8. Look for the errors behind your mistakes but beware of rearview-mirror hindsight biases. I had an intellectual mentor who served in the Marines and who was fond of saying, “we are always fighting the last war.” In this regard, I’m fond of the saying, “the only certain thing about the future is that there will be surprises.”
    9. Bring out the best in others and let others bring out the best in you. Tetlock’s following sentence is more to the point – “master the fine art of team management.”
    10. Master the error-balancing cycle. Good to think about managing this, too.
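
As flagged in item 2, here is what “Fermi-izing” the piano-tuner question looks like when written out as a quick Python sketch. Every input is a rough, assumed guess; the value of the exercise is the decomposition, not the particular numbers.

```python
# Fermi estimate: roughly how many piano tuners are there in Chicago?
# All inputs are assumed, order-of-magnitude guesses.

population        = 2_500_000   # people in Chicago (rough)
people_per_house  = 2.5         # persons per household
pianos_per_house  = 1 / 20      # share of households with a piano
tunings_per_year  = 1           # tunings per piano per year
tunings_per_tuner = 1 * 5 * 50  # 1 tuning/day, 5 days/week, 50 weeks/year

households   = population / people_per_house    # ~1,000,000
pianos       = households * pianos_per_house    # ~50,000
tunings      = pianos * tunings_per_year        # ~50,000 per year
piano_tuners = tunings / tunings_per_tuner      # ~200

print(round(piano_tuners))   # on the order of a couple hundred tuners
```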

Puckishly, Tetlock adds an 11th Commandment – don’t treat commandments as commandments.

Great topic – forecasting subjective geopolitical developments in teams. Superforecasting touches on some fairly subtle points, illustrated with examples. I think it is well worth having on the bookshelf.

There are some corkers, too, like when Tetlock highlights the recommendations of Galen, the second-century physician to Roman emperors and the medical authority for more than 1,000 years.

Galen once wrote, apparently,

“All who drink of this treatment recover in a short time, except those whom it does not help, who all die…It is obvious, therefore, that it fails only in incurable cases.”

Measuring the Intelligence of Crowds

Researchers at Microsoft Research in the UK and Cambridge University report some fascinating and potentially useful results on crowdsourcing, based on a study that posed questions from a standard IQ test on Amazon’s Mechanical Turk (AMT) and aggregated the workers’ answers.

The AMT site provides a place where workers can find problems that requesters have set up for crowdsourcing.

The introductory page to the site looks like this:

[Image: Amazon Mechanical Turk introductory page]

So here’s an interesting way for people to make some money working from home, at their own hours, and yet stay busy. I’d like to look more deeply into this in a future post, but what these Crowd IQ researchers did was post the questions from a widely utilized IQ test as tasks on the AMT site. They studied the effects of changing several parameters on their measures of Crowd IQ, but basically found that, with five or more reputable workers in a group, the Crowd IQ was usually higher than that of the individual workers in the group.

The Abstract for their 2012 study Crowd IQ: Measuring the Intelligence of Crowdsourcing Platforms describes the research and findings succinctly:

We measure crowdsourcing performance based on a standard IQ questionnaire, and examine Amazon’s Mechanical Turk (AMT) performance under different conditions. These include variations of the payment amount offered, the way incorrect responses affect workers’ reputations, threshold reputation scores of participating AMT workers, and the number of workers per task. We show that crowds composed of workers of high reputation achieve higher performance than low reputation crowds, and the effect of the amount of payment is non-monotone—both paying too much and too little affects performance. Furthermore, higher performance is achieved when the task is designed such that incorrect responses can decrease workers’ reputation scores. Using majority vote to aggregate multiple responses to the same task can significantly improve performance, which can be further boosted by dynamically allocating workers to tasks in order to break ties.
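
The aggregation step mentioned in the abstract is as simple as it sounds. Here is a minimal sketch of majority voting over several workers’ answers to one multiple-choice item; the function and the data layout are illustrative, not the authors’ code.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer given by workers for one test item."""
    return Counter(answers).most_common(1)[0][0]

# Five workers answer one Raven's-style multiple-choice item (options 1-8)
print(majority_vote([4, 4, 7, 4, 2]))   # -> 4
```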

The IQ test is Raven’s Standard Progressive Matrices (SPM). If you want to take the test, look here.

SPM is a nonverbal, multiple-choice intelligence test based on the theory of general ability. The general setup is as in the following example.

[Image: example item in the style of Raven’s Progressive Matrices]

Free riders are an interesting problem on a site like Mechanical Turk. If people are paid by the number of answers they submit, some simply select responses at random to maximize the speed at which they can put up answers. Because of this, AMT has a reputation mechanism indicating the expected quality of a worker’s output, based on his or her past performance.

This research has real-world implications. For example, increasing the payment for tasks too much actually diminishes the quality of the answers, for a variety of reasons the authors consider.

The “workers” in this AMT-based study did not consult with each other about the answers; rather, the researchers grouped their individual responses into crowds for aggregation.

Here is a chart showing the increase in crowd IQ with the number of people in the group.

[Chart: crowd IQ as a function of the number of workers per HIT]

Here a HIT refers to a Human Intelligence Task.

Recommendations

First, experiment and monitor the performance. Our results suggest that relatively small changes to the parameters of the task may result in great changes in crowd performance. Changing parameters of the task (e.g. reward, time limits, reputation range) and observing changes in performance may allow you to greatly increase performance.

Second, make sure to threaten workers’ reputation by emphasizing that their solutions will be monitored and wrong responses rejected. Obviously, in a real-world setting it may be hard to detect free-riders without using a “gold-set” of test questions to which the requester already knows the correct response. However, designing and communicating HIT rejection conditions can discourage free riding or make it risky and more difficult. For instance, in the case of translation tasks requesters should determine what is not acceptable (e.g. using Google Translate) and may suggest that the response quality would be monitored and solutions of low quality would be rejected.

Third, do not over-pay. Although the reward structure obviously depends on the task at hand and the expected amount of effort required to solve it, our results suggest that pricing affects not only the ability to source enough workers to perform the task but also the quality of the obtained results. Higher rewards are likely to encourage a free-riding behavior and may affect the cognitive abilities of workers by increasing psychological pressure. Thus, for long term projects or tasks that are run repeatedly in a production environment, we believe it is worthwhile to experiment with the reward scheme in order to find an optimum reward level.

Fourth, aggregate multiple solutions to each HIT, preferably using an adaptive sourcing scheme. Even the simplest aggregation method – majority voting – has a potential to greatly improve the quality of the solution. In the context of more complicated tasks, e.g. translations, requesters may consider a two-stage design in which they first request several solutions, and then use another batch of workers to vote for the best one. Additionally, requesters may consider inspecting the responses provided by individuals that often disagree with the crowd – they might be coveted geniuses or free-riders deserving rejection.
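
The “adaptive sourcing scheme” in the fourth recommendation can be sketched in the same spirit: start with a few responses per HIT and pay for one more only when majority voting is tied. The loop below is illustrative, not the authors’ implementation; get_response stands in for posting the HIT to one additional worker.

```python
import random
from collections import Counter

def adaptive_majority(get_response, initial=3, max_workers=9):
    """Collect answers to one HIT until a strict majority emerges
    or the worker budget is exhausted; return the winning answer."""
    answers = [get_response() for _ in range(initial)]
    while len(answers) < max_workers:
        top_two = Counter(answers).most_common(2)
        if len(top_two) < 2 or top_two[0][1] > top_two[1][1]:
            break                        # clear winner, stop adding workers
        answers.append(get_response())   # tie: dynamically allocate one more
    return Counter(answers).most_common(1)[0][0]

# Toy usage: simulate workers who pick option 4 with 70 percent probability
worker = lambda: 4 if random.random() < 0.7 else random.choice([1, 2, 3, 5])
print(adaptive_majority(worker))
```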

Interesting stuff, and makes you want to try crowdsourcing.