Well, I signed up for Andrew Ng’s Machine Learning Course at Stanford. It began a few weeks ago, and is a next generation to lectures by Ng circulating on YouTube. I’m going to basically audit the course, since I started a little late, but I plan to take several of the exams and work up a few of the projects. *This course provides a broad introduction to machine learning, datamining, and statistical pattern recognition. Topics include: (i) Supervised learning (parametric/non-parametric algorithms, support vector machines, kernels, neural networks). (ii) Unsupervised learning (clustering, dimensionality reduction, recommender systems, deep learning). (iii) Best practices in machine learning (bias/variance theory; innovation process in machine learning and AI). The course will also draw from numerous case studies and applications, so that you’ll also learn how to apply learning algorithms to building smart robots (perception, control), text understanding (web search, anti-spam), computer vision, medical informatics, audio, database mining, and other areas.* I like the change in format. The YouTube videos circulating on the web are lengthly, and involve Ng doing derivations on white boards. This is a more informal, expository format. Here is a link to a great short introduction to neural networks. Click on the link above this picture, since the picture itself does not trigger a YouTube. Ng’s introduction on this topic is fairly short, so here is the follow-on lecture, which starts the task of representing or modeling neural networks. I really like the way Ng approaches this is grounded in biology. I believe there is still time to sign up. **Comment on Neural Networks and Machine Learning** I can’t do much better than point to Professor Ng’s definition of machine learning – *Machine learning is the science of getting computers to act without being explicitly programmed. In the past decade, machine learning has given us self-driving cars, practical speech recognition, effective web search, and a vastly improved understanding of the human genome. Machine learning is so pervasive today that you probably use it dozens of times a day without knowing it. Many researchers also think it is the best way to make progress towards human-level AI. In this class, you will learn about the most effective machine learning techniques, and gain practice implementing them and getting them to work for yourself. More importantly, you’ll learn about not only the theoretical underpinnings of learning, but also gain the practical know-how needed to quickly and powerfully apply these techniques to new problems. Finally, you’ll learn about some of Silicon Valley’s best practices in innovation as it pertains to machine learning and AI.* And now maybe this is the future – the robot rock band.

# Tag Archives: Bayesian methods

# Bayesian Reasoning and Intuition

In thinking about Bayesian methods, I wanted to focus on whether and how Bayesian probabilities are or can be made “intuitive.”

Or are they just numbers plugged into a formula which sometimes is hard to remember?

A classic example of Bayesian reasoning concerns breast cancer and mammograms.

1% of the women at age forty who participate in routine screening have breast cancer |

80% of women with breast cancer will get positive mammograms. |

9.6% of women with no breast cancer will also get positive mammograms |

Question – A women in this age group has a positive mammogram in a routine screening. What is the probability she has cancer?

There is a tendency for intuition to anchor on the high percentage of women with breast cancer with positive mammograms – 80 percent. In fact, this type of scenario elicits significant over-estimates of cancer probabilities among mammographers!

Bayes Theorem, however, shows that the probability of women with a positive mammogram having cancer is ** an order of magnitude** less than the percent of women with breast cancer and positive mammograms.

**By the Formula**

Recall Bayes Theorem –

Let A stand for the event a women has breast cancer, and B denote the event that a women tests positive on the mammogram.

We need the conditional probability of a positive mammogram, given that a woman has breast cancer, or P(B|A). In addition, we need the prior probability that a woman has breast cancer P(A), as well as the probability of a positive mammogram P(B).

So we know P(B|A)=0.8, and P(B|~A)=0.096, where the tilde ~ indicates “not”.

For P(B) we can make the following expansion, based on first principles –

P(B)=P(B|A)P(A)+P(B|~A)P(B)= P(B|A)P(A)+P(B|~A)(1-P(A))=0.10304

Either a woman has cancer or does not have cancer. The probability of a woman having cancer is P(A), so the probability of not having cancer is 1-P(A). These are mutually exclusive events, that is, and the probabilities sum to 1.

Putting the numbers together, we calculate the probability of a forty-year-old women with a positive mammogram having cancer is 0.0776.

So this woman has about an 8 percent chance of cancer, even though her mammogram is positive.

Survey after survey of physicians shows that this type of result in not very intuitive. Many doctors answer incorrectly, assigning a much higher probability to the woman having cancer.

**Building Intuition**

This example is the subject of a 2003 essay by Eliezer Yudkowsky – An Intuitive Explanation of Bayes’ Theorem.

As An Intuitive (and Short) Explanation of Bayes’ Theorem notes, Yudkowsky’s intuitive explanation is around 15,000 words in length.

For a shorter explanation that helps build intuition, the following table is useful, showing the crosstabs of women in this age bracket who (a) have or do not have cancer, and (b) who test positive or negative.

The numbers follow from our original data. The percentage of women with cancer who test positive is given as 80 percent, so the percent with cancer who test negative must be 20 percent, and so forth.

Now let’s embed the percentages of *true and false positives and negatives *into the table, as follows:

So 1 percent of forty year old women (who have routine screening) have cancer. If we multiply this 1 percent by the percent of women who have cancer and test positive, we get .008 or **the chances of a true positive. **Then, the chance of getting any type of positive result is .008+.99*.096=.008+.0954=0.10304.

The ratio then of the chances of a true positive to the chance of any type of positive result is 0.07763 – exactly the result following from Bayes Theorem!

This may be an easier two-step procedure than trying to develop conditional probabilities directly, and plug them into a formula.

Allen Downey lists other problems of this type, with YouTube talks on Bayesian stuff that are good for beginners.

**Closing Comments**

I have a couple more observations.

First, this analysis is consistent with a frequency interpretation of probability.

In fact, the 1 percent figure for women who are forty getting cancer could be calculated from cause of death data and Census data. Similarly with the other numbers in the scenario.

So that’s interesting.

Bayes theorem is, in some phrasing, true by definition (of conditional probability). It can just be tool for reorganizing data about observed frequencies.

The magic comes when we transition from events to variables y and parameters θ in a version like,

What is this parameter θ? It certainly does not exist in “event” space in the same way as does the event of “having cancer and being a forty year old woman.” In the batting averages example, θ is a vector of parameter values of a Beta distribution – parameters which encapsulate our view of the likely variation of a batting average, given information from the previous playing season. So I guess this is where we go into “belief space”and subjective probabilities.

In my view, the issue is always whether these techniques are predictive.

Top picture courtesy of Siemens

# Predicting Season Batting Averages, Bernoulli Processes – Bayesian vs Frequentist

Recently, Nate Silver boosted Bayesian methods in his popular book The Signal and the Noise – Why So Many Predictions Fail – But Some Don’t. I’m guessing the core application for Silver is estimating batting averages. Silver first became famous with PECOTA, a system for forecasting the performance of Major League baseball players.

Let’s assume a player’s probability p of getting a hit is constant over a season, but that it varies from year to year. He has up years, and down years. And let’s compare frequentist (gnarly word) and Bayesian approaches at the beginning of the season.

The frequentist approach is based on maximum likelihood estimation with the binomial formula

Here the n and the k in parentheses at the beginning of the expression stand for the combination of n things taken k at a time. That is, the number of possible ways of interposing k successes (hits) in n trials (times at bat) is the combination of n things taken k at a time (formula here).

If p is the player’s probability of hitting at bat, then the entire expression is the probability the player will have k hits in n times at bat.

**The Frequentist Approach **

There are a couple of ways to explain the frequentist perspective.

One is that this binomial expression is approximated to a higher and higher degree of accuracy by a normal distribution. This means that – with large enough n – the ratio of hits to total times at bat is the best estimate of the probability of a player hitting at bat – or k/n.

This solution to the problem also can be shown to follow from maximizing the likelihood of the above expression for any n and k. The large sample or asymptotic and maximum likelihood solutions are numerically identical.

The problem comes with applying this estimate early on in the season. So if the player has a couple of rough times at bat initially, the frequentist estimate of his batting average for the season at that point is zero.

**The Bayesian Approach **

The Bayesian approach is based on the posterior probability distribution for the player’s batting average. From Bayes Theorem, this is a product of the likelihood and a prior for the batting average.

Now generally, especially if we are baseball mavens, we have an idea of player X’s batting average. Say we believe it will be .34 – he’s going to have a great season, and did last year.

In this case, we can build that belief or information into a prior that is a beta distribution with two parameters α and β that generate a mean of α/(α+β).

In combination with the binomial likelihood function, this beta distribution prior combines algebraically into a closed form expression for another beta function with parameters which are adjusted by the values of k and n-k (the number of strike-outs). Note that walks (also being hit by the ball) do not count as times at bat.

This beta function posterior distribution then can be moved back to the other side of the Bayes equation when there is new information – another hit or strikeout.

Taking the average of the beta posterior as the best estimate of p, then, we get successive approximations, such as shown in the following graph.

So the player starts out really banging ‘em, and the frequentist estimate of his batting average for that season starts at 100 percent. The Bayesian estimate on the other hand is conditioned by a belief that his batting average should be somewhere around 0.34. In fact, as the grey line indicates, his actual probability p for that year is 0.3. Both the frequentist and Bayesian estimates converge towards this value with enough times at bat.

I used α=33 and β=55 for the initial values of the Beta distribution.

See this for a great discussion of the intuition behind the Beta distribution.

This, then, is a worked example showing how Bayesian methods can include prior information, and have small sample properties which can outperform a frequentist approach.

Of course, in setting up this example in a spreadsheet, it is possible to go on and generate a large number of examples to explore just how often the Bayesian estimate beats the frequentist estimate in the early part of a Bernoulli process.

Which goes to show that what you might call the classical statistical approach – emphasizing large sample properties, covering all cases, still has legs.

# Bayesian Methods in Biomedical Research

I’ve come across an interesting document – Guidance for the Use of Bayesian Statistics in Medical Device Clinical Trials developed by the Federal Drug Administration (FDA).

It’s billed as “Guidance for Industry and FDA Staff,” and provides recent (2010) evidence of the growing acceptance and success of Bayesian methods in biomedical research.

This document, which I’m just going to refer to as “the Guidance,” focuses on using Bayesian methods to incorporate evidence from prior research in clinical trials of medical equipment.

*Bayesian statistics is an approach for learning from evidence as it accumulates. In clinical trials, traditional (frequentist) statistical methods may use information from previous studies only at the design stage. Then, at the data analysis stage, the information from these studies is considered as a complement to, but not part of, the formal analysis. In contrast, the Bayesian approach uses Bayes’ Theorem to formally combine prior information with current information on a quantity of interest. The Bayesian idea is to consider the prior information and the trial results as part of a continual data stream, in which inferences are being updated each time new data become available.*

This Guidance focuses on medical devices and equipment, I think, because changes in technology can be incremental, and sometimes do not invalidate previous clinical trials of similar or earlier model equipment.

Thus,

*When good prior information on clinical use of a device exists, the Bayesian approach may enable this information to be incorporated into the statistical analysis of a trial. In some circumstances, the prior information for a device may be a justification for a smaller-sized or shorter-duration pivotal trial.*

*Good prior information is often available for medical devices because of their mechanism of action and evolutionary development. The mechanism of action of medical devices is typically physical. As a result, device effects are typically local, not systemic. Local effects can sometimes be predictable from prior information on the previous generations of a device when modifications to the device are minor. Good prior information can also be available from studies of the device overseas. In a randomized controlled trial, prior information on the control can be available from historical control data.*

The Guidance says that Bayesian methods are more commonly applied now because of computational advances – namely Markov Chain Monte Carlo (MCMC) sampling.

The Guidance also recommends that meetings be scheduled with the FDA for any Bayesian experimental design, where the nature of the prior information can be discussed.

An example clinical study is referenced in the Guidance – relating to a multi-frequency impedence breast scanner. This study combined clinical trials conducted in Israel with US trials,

The Guidance provides extensive links to the literature and to WinBUGS where BUGS stands for Bayesian Inference Using Gibbs Sampling.

**Bayesian Hierarchical Modeling **

One of the more interesting sections in the Guidance is the discussion of Bayesian hierarchical modeling. Bayesian hierarchical modeling is a methodology for combining results from multiple studies to estimate safety and effectiveness of study findings. This is definitely an analysis-dependent approach, involving adjusting results of various studies, based on similarities and differences in covariates of the study samples. In other words, if the ages of participants were quite different in one study than in another, the results of the study might be adjusted for this difference (by regression?).

An example of Bayesian hierarchical modeling is provided in approval of a device called for Cervical Interbody Fusion Instrumentation.

*The BAK/Cervical (hereinafter called the BAK/C) Interbody Fusion System is indicated for use in skeletally mature patients with degenerative disc disease (DDD) of the cervical spine with accompanying radicular symptoms at one disc level.*

The Summary of the FDA approval for this device documents extensive Bayesian hierarchical modeling.

**Bottom LIne**

Stephen Goodman from the Stanford University Medical School writes in a recent editorial,

*“First they ignore you, then they laugh at you, then they fight you, then you win,” a saying reportedly misattributed to Mahatma Ghandi, might apply to the use of Bayesian statistics in medical research. The idea that Bayesian approaches might be used to “affirm” findings derived from conventional methods, and thereby be regarded as more authoritative, is a dramatic turnabout from an era not very long ago when those embracing Bayesian ideas were considered barbarians at the gate. I remember my own initiation into the Bayesian fold, reading with a mixture of astonishment and subversive pleasure one of George Diamond’s early pieces taking aim at conventional interpretations of large cardiovascular trials of the early 80’s..It is gratifying to see that the Bayesian approach, which saw negligible application in biomedical research in the 80’s and began to get traction in the 90’s, is now not just a respectable alternative to standard methods, but sometimes might be regarded as preferable.*

There’s a tremendous video provided by Medscape (not easily inserted directly here) involving an interview with one of the original and influential medical Bayesians – Dr. George Diamond of UCLA.

URL: http://www.medscape.com/viewarticle/813984

# Bayesian Methods in Forecasting and Data Analysis

The basic idea of Bayesian methods is outstanding. Here is a way of incorporating prior information into analysis, helping to manage, for example, small samples that are endemic in business forecasting.

What I am looking for, in the coming posts on this topic, is what difference does it make.

**Bayes Theorem**

Just to set the stage, consider the simple statement and derivation of Bayes Theorem –

Here A and B are events or occurrences, and P(.) is the probability (of the argument . ) function. So P(A) is the probability of event A. And P(A|B) is the **conditional probability** of event A, given that event B has occurred.

A Venn diagram helps.

Here, there is the universal set U, and the two subsets A and B. The diagram maps some type of event or belief space. So the probability of A or P(A) is the ratio of the areas A and U.

Then, the conditional probability of the occurrence of A, given the occurrence of B is the ratio of the area labeled AB to the area labeled B in the diagram. Also area AB is the intersection of the areas A and B or A ∩ B in set theory notation. So we have P(A|B)=P(A ∩ B)/P(B).

By the same logic, we can create the expression for P(B|A) = P(B ∩ A)/P(A).

Now to be mathematically complete here, we note that intersection in set theory is commutative, so A ∩ B = B ∩ A, and thus P(A ∩ B)=P(B|A)•P(A). This leads to the initially posed formulation of Bayes Theorem by substitution.

So Bayes Theorem, in its simplest terms, follows from the concept or definition of conditional probability – nothing more.

**Prior and Posterior Distributions and the Likelihood Function**

With just this simple formulation, one can address questions that are essentially what I call “urn problems.” That is, having drawn some number of balls of different colors from one of several sources (urns), what is the probability that the combination of, say, red and white balls drawn comes from, say, Urn 2? Some versions of even this simple setup seem to provide counter-intuitive values for the resulting P(A|B).

But I am interested primarily in forecasting and data analysis, so let me jump ahead to address a key interpretation of the Bayes Theorem.

Thus, what is all this business about prior and posterior distributions, and also the likelihood function?

Well, considering Bayes Theorem as a statement of beliefs or subjective probabilities, P(A) is the **prior distribution**, and P(A|B) is the **posterior distribution**, or the probability distribution that follows revelation of the facts surrounding event (or group of events) B.

P(B|A) then is the **likelihood function**.

Now all this is more understandable, perhaps, if we reframe Bayes rule in terms of data y and parameters θ of some statistical model.

So we have

In this case, we have some data observations {y_{1}, y_{2},…,y_{n}}, and can have covariates x={x_{1},..,x_{k}}, which could be inserted in the conditional probability of the data, given the parameters on the right hand side of the equation, as P(y|θ,x).

In any case, clear distinctions between the Bayesian and frequentist approach can be drawn with respect to the likelihood function P(y|θ).

So the frequentist approach focuses on maximizing the likelihood function with respect to the unknown parameters θ, which of course can be a vector of several parameters.

As one very clear overview says,

*One maximizes the likelihood function L(·) with respect the parameters to obtain the maximum likelihood estimates; i.e., the parameter values most likely to have produced the observed data. To perform inference about the parameters, the frequentist recognizes that the estimated parameters ˆ result from a single sample, and uses the sampling distribution to compute standard errors, perform hypothesis tests, construct confidence intervals, and the like..*

In the Bayesian perspective, the unknown parameters θ are treated as random variables, while the observations y are treated as fixed in some sense.

The focus of attention is then on how the observed data y changes the prior distribution P(θ) into the posterior distribution P(y|θ).

*The posterior distribution, in essence, translates the likelihood function into a proper probability distribution over the unknown parameters, which can be summarized just as any probability distribution; by computing expected values, standard deviations, quantiles, and the like. What makes this possible is the formal inclusion of prior information in the analysis.*

One difference then is that the frequentist approach optimizes the likelihood function with respect to the unknown parameters, while the Bayesian approach is more concerned with integrating the posterior distribution to obtain values for key metrics and parameters of the situation, after data vector y is taken into account.

**Extracting Parameters From the Posterior Distribution**

The posterior distribution, in other words, summarizes the statistical model of a phenomenon which we are analyzing, given all the available information.

That sounds pretty good, but the issue is that the result of all these multiplications and divisions on the right hand side of the equation can lead to a posterior distribution which is difficult to evaluate. It’s a probability distribution, for example, and thus some type of integral equation, but there may be no closed form solution.

Prior to Big Data and the muscle of modern computer computations, Bayesian statisticians spent a lot of time and energy searching out conjugate prior’s. Wikipedia has a whole list of these.

So the Beta distribution is a conjugate prior for a Bernoulli distribution – the familiar probability p of success and probability q of failure model (like coin-flipping, when p=q=0.5). This means simply that multiplying a Bernoulli likelihood function by an appropriate Beta distribution leads to a posterior distribution that is again a Beta distribution, and which can be integrated and, also, which supports a sort of loop of estimation with existing and then further data.

Here’s an example and prepare yourself for the flurry of symbolism –

Note the update of the distribution of whether the referendum is won or lost results in a much sharper distribution and increase in the probability of loss of the referendum.

**Monte Carlo Methods**

Stanislaus Ulam, along with John von Neumann, developed Monte Carlo simulation methods to address what might happen if radioactive materials were brought together in sufficient quantities and with sufficient emissions of neutrons to achieve a critical mass. That is, researchers at Los Alamos at the time were not willing to simply experiment to achieve this effect, and watch the unfolding.

Monte Carlo computation methods, thus, take complicated mathematical relationships and calculate final states or results from random assignments of values of the explanatory variables.

Two algorithms—the Gibbs sampling and Metropolis-Hastings algorithms— are widely used for applied Bayesian work, and both are Markov chain Monte Carlo methods.

The Markov chain aspect of the sampling involves selection of the simulated values along a path determined by prior values that have been sampled.

The object is to converge on the key areas of the posterior distribution.

**The Bottom Line**

It has taken me several years to comfortably grasp what is going on here with Bayesian statistics.

The question, again, is what difference does it make in forecasting and data analysis? And, also, if it made a difference in comparison with a frequentist interpretation or approach, would that be an entirely good thing?

A lot of it has to do with a reorientation of perspective. So some of the enthusiasm and combative qualities of Bayesians seems to come from their belief that their system of concepts is simply the only coherent one.

But there are a lot of medical applications, including some relating to trials of new drugs and procedures. What goes there? Is the representation that it is not necessary to take all this time required by the FDA to test a drug or procedure, when we can access prior knowledge and bring it to the table in evaluating outcomes?

Or what about forecasting applications? Is there something more productive about some Bayesian approaches to forecasting – something that can be measured in, for example, holdout samples or the like? Or I don’t know whether that violates the spirit of the approach – holdout samples.

I’m planning some posts on this topic. Let me know what you think.

Top picture from Los Alamos laboratories