As a term, "random forests" is apparently trademarked, which is a shame in a way, because the name is so evocative: a random forest is built from a large number of different decision or regression trees.
Whatever name we use, though, the Random Forest™ algorithm is a powerful technique. Random subspace ensemble methods form the basis for several real-world applications, such as Microsoft's Kinect and the facial recognition in cell phone and other digital cameras, and they figure importantly in many Kaggle competitions, according to Jeremy Howard, formerly Kaggle's Chief Scientist.
Here I collect a talk Howard gave in 2011, "Getting In Shape For The Sport Of Data Science," along with instructional videos from a data science course at the University of British Columbia (UBC). Watching these takes a time commitment, but you can let certain parts roll and skip ahead. Be sure to catch the last part of Howard's talk, since he is good at explaining random subspace ensemble methods, aka random forests.
It certainly helps me get up to speed to watch something, as opposed to reading papers on a fairly unfamiliar combination of set theory and statistics.
By way of introduction, the first step is to consider a decision tree. One of the UBC videos notes that decision trees faded from popularity some decades ago, but have come back with the emergence of ensemble methods.
A decision tree is a graph that summarizes the classification of multi-dimensional points: it splits the space one coordinate at a time, so the resulting regions are rectangles aligned with the axes. The videos make this clearer.
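If a concrete example helps, here is a minimal sketch, assuming scikit-learn and NumPy are installed, that fits a shallow tree to two clusters of 2-D points and prints its split rules; each rule is a threshold on a single coordinate, so the leaves correspond to rectangular regions.

```python
# Minimal sketch (assumes scikit-learn and NumPy): fit a small decision
# tree to 2-D points and print the axis-aligned split rules that carve
# the plane into rectangular regions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: two clusters of points in the plane, labeled 0 and 1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
               rng.normal(loc=[2, 2], scale=0.5, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
# Each printed rule thresholds one coordinate, so each leaf is a rectangle.
print(export_text(tree, feature_names=["x1", "x2"]))
```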
This is nice, but decision trees of this sort tend to over-fit; they may not generalize very well. Methods of "pruning" or simplification can help, but another tactic is to use ensemble methods, that is, to grow a whole collection of decision trees over the same set of multi-attribute items.
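To make the over-fitting point concrete, here is a small hedged example on invented toy data: an unconstrained tree memorizes noisy training labels, while a depth-limited tree, a crude stand-in for pruning, usually scores better on held-out points.

```python
# Toy illustration of over-fitting (data invented for this example):
# a fully grown tree memorizes noisy labels; a depth-limited tree
# generalizes better to the held-out test set.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(400, 2))
# Label by "inside a circle of radius 2", then flip ~15% of labels as noise.
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 4).astype(int)
flip = rng.random(400) < 0.15
y[flip] = 1 - y[flip]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

full = DecisionTreeClassifier().fit(X_tr, y_tr)            # grown to purity
pruned = DecisionTreeClassifier(max_depth=3).fit(X_tr, y_tr)

print("full tree   train/test:", full.score(X_tr, y_tr), full.score(X_te, y_te))
print("pruned tree train/test:", pruned.score(X_tr, y_tr), pruned.score(X_te, y_te))
```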
Random forests simply build each of these trees from a randomly selected subset of the attributes that define the items to be classified (and, typically, a bootstrap sample of the items themselves).
The idea is to build enough of these weak predictors and then combine them, taking the modal or "majority rule" class for classification (or averaging the trees' outputs for regression).
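Putting those two paragraphs together, here is a rough sketch of the idea, not Breiman's exact algorithm, assuming scikit-learn for the individual trees: each tree sees a bootstrap sample of the rows and a randomly chosen subset of the attributes, and the ensemble classifies by majority vote.

```python
# Rough from-scratch sketch of a random subspace ensemble (not the exact
# Random Forest™ algorithm): each tree trains on a bootstrap sample of
# the rows and a random subset of the columns; prediction is a majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

def fit_random_subspace_forest(X, y, n_trees=50, n_features=2, seed=0):
    rng = np.random.default_rng(seed)
    forest = []
    n_rows, n_cols = X.shape
    for _ in range(n_trees):
        rows = rng.integers(0, n_rows, size=n_rows)                 # bootstrap sample
        cols = rng.choice(n_cols, size=n_features, replace=False)   # random attribute subset
        tree = DecisionTreeClassifier().fit(X[rows][:, cols], y[rows])
        forest.append((tree, cols))
    return forest

def predict_majority(forest, X):
    # Each tree votes on its own attribute subset; the modal class wins.
    votes = np.array([tree.predict(X[:, cols]) for tree, cols in forest])
    return np.array([np.bincount(sample_votes).argmax() for sample_votes in votes.T])

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
forest = fit_random_subspace_forest(X_tr, y_tr)
print("ensemble test accuracy:", (predict_majority(forest, X_te) == y_te).mean())
```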
Here’s the Howard talk.
Then there is an introductory UBC video on decision trees.
This video goes into detail on the method of constructing random forests.
Finally, the talk on random subspace ensemble applications.