Introduction to Random Forests
What is a Random Forests model?
Random Forests is an ensemble machine learning method for classification or regression that combines the results of a multitude of decision trees. The "Forests" in the name comes from how the model is made up of decision trees, and the "Random" part comes from the stochastic (i.e., random) methodology used to create those trees.
In this article, we'll be walking you through how a Random Forests model is constructed, and how its properties make it a good classification model.
What are Random Forests models used for?
Random Forests models are used to predict a class or a value. The model aggregates the votes (or predicted values) of every tree in its "forest" to come up with an overall prediction. In the case of classification, the decision goes to the class with the most votes. For regression, the predicted value is the average of the predictions of all trees in the forest.
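The aggregation step can be sketched in a few lines of Python. The per-tree outputs below are made up purely for illustration:

```python
from collections import Counter
from statistics import mean

# Hypothetical per-tree outputs for a single sample.
classification_votes = ["spam", "spam", "ham", "spam", "ham"]
regression_values = [2.4, 2.9, 3.1, 2.6]

# Classification: the class with the most votes wins.
predicted_class = Counter(classification_votes).most_common(1)[0][0]

# Regression: average the per-tree predictions.
predicted_value = mean(regression_values)

print(predicted_class)  # spam
print(predicted_value)  # 2.75
```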
How are Random Forests models constructed?
Random Forests are made up of decision trees. Formally, these trees are known as Classification and Regression Trees (CARTs). Each tree is grown to predict some value or class, with
- Nodes acting as bifurcation points, where samples are funneled down one branch or the other based on the values of their variables, and
- Leaves (terminal nodes) acting as the predicted outcomes.
Decision trees are constructed such that the more relevant and discriminatory a variable is, the higher up it is on the tree structure. This allows for a more efficient means of classifying samples.
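To make "discriminatory" concrete, here is a minimal sketch of picking the best top split. The two-feature dataset and the use of Gini impurity as the split criterion are illustrative assumptions, not something prescribed by Random Forests itself:

```python
from collections import Counter

def gini(labels):
    # Gini impurity: 0 for a pure node, higher when classes are mixed.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

# Hypothetical samples: two binary features and a class label.
samples = [
    {"f1": 1, "f2": 0, "label": "yes"},
    {"f1": 1, "f2": 1, "label": "yes"},
    {"f1": 0, "f2": 0, "label": "no"},
    {"f1": 0, "f2": 1, "label": "no"},
]

def split_impurity(feature):
    # Weighted impurity of the children after splitting on a binary feature.
    left = [s["label"] for s in samples if s[feature] == 0]
    right = [s["label"] for s in samples if s[feature] == 1]
    n = len(samples)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# f1 separates the classes perfectly, so it wins the top split.
best = min(["f1", "f2"], key=split_impurity)
```

Here `f1` yields pure child nodes (impurity 0), so it sits at the top of the tree, exactly as described above.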
Trees are easy to understand and diagram, and computationally cheap to construct. However, they can carry a high bias towards what your training dataset looks like. If your training dataset includes a variable that just happens to have high discriminatory ability, the decision tree will be heavily biased towards that variable. In other words, it doesn't leave much room for the other potentially important variables to "shine."
In order to circumvent this bias, Random Forests constructs an ensemble (or forest) of these decision trees, and averages out the results in order to come up with a prediction.
How does Random Forests create these decision trees such that they're slightly different from one another?
In the Random Forests model, hundreds of trees are created, with each tree differing slightly from the others. Two aspects of the trees' construction are randomized:
1) Bagging (Bootstrap Aggregation)
In order to lower the bias that can come from the training set, Random Forests uses a method known as Bootstrap Aggregation, or bagging. Here, each tree is built from a bootstrap sample: samples are drawn from the training set at random with replacement, so each tree sees a slightly different dataset containing about 2/3 of the unique samples.
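A minimal sketch of drawing one bootstrap sample, using only Python's standard library (the dataset size and seed are arbitrary):

```python
import random

random.seed(0)  # for reproducibility; any seed works
n_samples = 12
indices = list(range(n_samples))

# A bootstrap sample: n draws from the training set *with replacement*,
# so some samples appear more than once.
bootstrap = [random.choice(indices) for _ in range(n_samples)]

# Samples never drawn are "out-of-bag" for this tree (~1/3 on average).
oob = [i for i in indices if i not in bootstrap]
```

Each tree in the forest gets its own `bootstrap` draw, which is what makes the trees differ from one another.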
2) Randomly selecting variables per split point (mtry)
Another point of randomization while growing each tree is limiting the number of variables the tree may split by. If the constructed decision trees had access to all p predictors, each would simply select the variable with the most discriminative ability. If that variable only had such ability by chance, this could severely bias our model. Instead, we can limit the number of variables the tree may choose from at each split with the mtry parameter, allowing other variables a fair chance to be a discriminatory factor in the tree.
In most cases, mtry is set to the square root of the number of available variables m (for classification), or to m/3 (for regression).
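The mtry defaults and the per-split subsampling can be sketched as follows; the feature names are hypothetical:

```python
import math
import random

# Hypothetical predictor names.
features = ["age", "income", "height", "weight", "score",
            "tenure", "visits", "rating", "balance"]
m = len(features)  # 9 available variables

# Common defaults: sqrt(m) for classification, m/3 for regression.
mtry_classification = round(math.sqrt(m))  # 3
mtry_regression = max(1, m // 3)           # 3

# At each split, the tree may only choose among a random subset.
candidates = random.sample(features, mtry_classification)
```

A fresh `candidates` subset is drawn at every split point, not once per tree, so even a dominant variable is regularly excluded from contention.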
The Random Forests model continues to grow trees until a "forest" of trees is grown. The number of trees grown this way is referred to as ntree.
How does OOB Cross-Validation work in Random Forests?
For each tree constructed (from its ~2/3 bootstrap sample), the model predicts on the samples left out of that sample (about 1/3) for cross-validation. These remaining samples are known as the out-of-bag (OOB) sample. The resulting OOB error rate serves as a built-in accuracy estimate, so other cross-validation methods do not have to be employed. It can be shown that the OOB error rate comes close to that of leave-one-out cross-validation.
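A sketch of how the OOB error rate is tallied for classification. The per-tree votes and labels below are invented for illustration; in a real forest they come from each tree predicting on the samples it never saw:

```python
from collections import Counter

# Hypothetical: for each sample, the votes from only those trees
# that held it out-of-bag.
oob_votes = {
    0: ["cat", "cat", "dog"],
    1: ["dog", "dog"],
    2: ["cat"],
}
true_labels = {0: "cat", 1: "dog", 2: "dog"}

errors = 0
for sample, votes in oob_votes.items():
    majority = Counter(votes).most_common(1)[0][0]
    if majority != true_labels[sample]:
        errors += 1

oob_error_rate = errors / len(true_labels)  # 1 of 3 misclassified
```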
How does Random Forests determine variable importance?
Now that the model has been created, how do we know which variables were most important? To measure this, Random Forests takes each variable in turn and randomly permutes its values.
- If accuracy barely changes, the variable wasn't very important (low %IncMSE).
- If accuracy drops noticeably, the variable is important, since scrambling its values degrades the model (high %IncMSE).
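The permutation idea can be sketched with a toy regression "model" whose predictions depend only on one of two features. For determinism, the permutation here is a fixed reshuffle standing in for a random one; everything else about the model and data is invented for illustration:

```python
from statistics import mean

# Toy fitted model: depends only on x1; x2 is irrelevant.
def predict(x1, x2):
    return 2.0 * x1

rows = [(1, 5), (2, 1), (3, 4), (4, 2)]        # (x1, x2) pairs
y = [2.0 * x1 for x1, _ in rows]               # true targets

def mse(xs):
    return mean((predict(x1, x2) - t) ** 2 for (x1, x2), t in zip(xs, y))

baseline = mse(rows)  # 0.0: the toy model fits perfectly

# Permute one column at a time (a fixed reshuffle stands in for random).
perm_x1 = [(new_x1, x2) for new_x1, (_, x2) in zip([4, 3, 2, 1], rows)]
perm_x2 = [(x1, new_x2) for new_x2, (x1, _) in zip([2, 4, 1, 5], rows)]

inc_mse_x1 = mse(perm_x1) - baseline  # large: x1 matters
inc_mse_x2 = mse(perm_x2) - baseline  # zero: x2 does not
```

Scrambling x1 wrecks the predictions (high %IncMSE), while scrambling x2 changes nothing (low %IncMSE), mirroring the two bullet points above.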