Inductive biases of the Random Forest and their consequences
part 4 of the inductive bias series
Machine learning needs inductive biases to work, and these biases have real consequences for model interpretability, model robustness, and how the model extrapolates. In this post, I discuss the inductive biases of the Random Forest and how they shape the final model.
Quick Random Forest [1] recap: A Random Forest is an ensemble of decision trees, usually a few hundred. For regression, the forest prediction is the average of the individual tree predictions; for classification, it's the majority vote. Each decision tree is trained independently of the others. To make the individual trees differ, the training data for each tree is bootstrapped and a random subset of features is drawn at each split.
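To make this recipe concrete, here is a minimal sketch of the idea, using scikit-learn's DecisionTreeRegressor as the base learner. The toy data, seed, and hyperparameters are my own choices; this mirrors bagging plus per-split column sampling, not Breiman's exact implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

trees = []
for _ in range(100):  # "a few hundred" trees; 100 here for brevity
    idx = rng.integers(0, len(X), size=len(X))         # bootstrap the rows
    tree = DecisionTreeRegressor(max_features="sqrt")  # sample columns at each split
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Regression: the forest prediction is the average over all trees
pred = np.mean([t.predict(X[:5]) for t in trees], axis=0)
print(pred)
```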
Before we go into the inductive biases of the Random Forest, we must distinguish between the inductive biases of the learning algorithm and the "model assumptions" they imply.
Usually, model assumptions are explicit hypotheses about the relationships in the data, as in statistics, where we might assume linearity between variables and targets. In this post, I argue that we do the same in machine learning; it's just more implicit and happens via inductive biases.
Some Inductive Biases of the Random Forest
For the Random Forest, we can roughly say that the inductive biases come from:
the decision tree learning algorithm
the random factors (bootstrapping and column sampling)
ensembling the trees
The original Random Forest implementation is based on CART [2] (classification and regression trees), a decision tree algorithm that produces trees using binary splits. Within the terminal nodes, we have constant models: the majority class for classification or the mean for regression. Many inductive biases stem from these CART trees forming the basis of the Random Forest.
Inductive Bias: CART only allows step functions
By using data partitioning (node splits) to form the prediction function, CART trees only allow changes in the form of “jumps”.
This leads to the following implicit model assumptions and consequences:
The relation between features and target is modeled as a non-smooth function. Ensembling multiple trees makes the steps much more fine-grained, but the result is still a step function.
The gradient of the prediction function is undefined for jump points and zero elsewhere.
Linear relations between a feature and the target can only be approximated via step functions.
When extrapolating a feature beyond its minimum or maximum, the prediction remains fixed at a constant value, since the node that the data point falls into doesn't change no matter how far beyond the limits we go (see the sketch after this list).
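Here is a toy demonstration of the last two points (data and hyperparameters are my own choices): the forest approximates a linear target with fine-grained steps inside the training range and predicts a constant beyond it.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(500, 1))
y = 2 * X[:, 0] + rng.normal(scale=0.5, size=500)  # truly linear target

rf = RandomForestRegressor(n_estimators=200, random_state=1).fit(X, y)

# Inside the training range, the step function tracks the line:
print(rf.predict([[2.0], [5.0], [8.0]]))       # roughly 4, 10, 16
# Beyond the training maximum (10), the prediction stays constant:
print(rf.predict([[10.0], [50.0], [1000.0]]))  # all roughly the same value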
Inductive Bias: CART favors deep interactions
CART uses binary splits. If you grow the tree very deep, the algorithm produces models with many interactions between the features.
How this affects the model:
When the true relation between features and target is additive (without interactions), the decision tree learner has a tough time replicating it. Imagine the first split is at X1 > 10: the subtree in the left node would have to produce a prediction function identical to the one in the right node; otherwise, the model contains an interaction between X1 and the remaining features. Ensembling trees might smooth this out a bit (the toy check below makes the effect visible).
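To make this concrete, here is a toy check of my own construction: for an exactly additive function f(x1, x2) = g(x1) + h(x2), swapping coordinates between two points leaves the sum of predictions unchanged. A deep CART tree fit to an additive target typically violates this equality, revealing the interactions it introduced.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(7)
X = rng.uniform(-1, 1, size=(2000, 2))
y = X[:, 0] + X[:, 1]  # purely additive target, no interactions

tree = DecisionTreeRegressor(max_depth=10).fit(X, y)

pair = np.array([[-0.5, -0.5], [0.5, 0.5]])
swapped = np.array([[-0.5, 0.5], [0.5, -0.5]])  # coordinates exchanged
print(tree.predict(pair).sum())     # for an additive model, these two sums
print(tree.predict(swapped).sum())  # would be identical; here they differ
```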
Inductive Bias: CART favors features with many unique values
This is a more implicit inductive bias that I assume wasn't Breiman's intention when proposing CART; it was later described by Hothorn et al. (2006) [3]. The problem is one of multiple testing: CART searches for the next split via exhaustive search over all features and all possible split points within them. For a binary feature, there is only one possible split. A categorical feature with 10 categories has 9 possible splits. A numerical feature has up to n-1 splits, where n is the number of data points in the node. Since features with more possible splits are tested more often, they will be picked more often purely by chance.
This inductive bias makes the following implicit model assumptions:
The more unique values a feature has, the more likely it is to be used by the Random Forest. This also affects model interpretation: the more unique values, the higher the feature's importance tends to be.
When the tree picks up irrelevant features, they are more likely to be numerical or categorical with many levels (a toy simulation follows below).
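Here is a toy simulation of this multiple-testing effect (my own construction): both features are pure noise, yet the continuous one, with its many candidate split points, typically receives the higher impurity-based importance.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
n = 1000
X = np.column_stack([
    rng.normal(size=n),          # irrelevant feature with many unique values
    rng.integers(0, 2, size=n),  # irrelevant binary feature
])
y = rng.integers(0, 2, size=n)   # random labels: there is nothing to learn

rf = RandomForestClassifier(n_estimators=300, random_state=3).fit(X, y)
print(rf.feature_importances_)   # the continuous feature usually dominates
```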
Inductive Bias: Maxdepth controls interactions
The maximum depth parameter controls how deep a tree may grow. This parameter also caps the maximum number of features that can interact with each other.
What happens at maxdepth=1:
For a single decision tree, the resulting model will be a simple if-else rule with just one condition.
Ensembles of trees with depth=1 are additive models (no interactions between the features).
What happens at maxdepth=k:
A Random Forest with maxdepth=k excludes interactions among more than k features, since at most k distinct features can appear along any root-to-leaf path (see the sketch below).
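A minimal sketch in scikit-learn (toy additive data; the setup is my own):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(1000, 2))
y = X[:, 0] + X[:, 1]  # purely additive target

# Stumps: each tree is a single split on one feature, so the ensemble
# prediction is a sum of one-feature step functions, i.e. additive.
rf_stumps = RandomForestRegressor(n_estimators=500, max_depth=1).fit(X, y)

# max_depth=2: at most two features per root-to-leaf path, so at most
# pairwise interactions can appear in the resulting model.
rf_pairs = RandomForestRegressor(n_estimators=500, max_depth=2).fit(X, y)
```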
Inductive Bias: Trees are not rotation invariant
If you multiply your data with a rotation matrix, you will get a very different decision tree model, since tree-based learning methods attend to features one at a time (split by split) rather than as a whole. Multi-layer perceptrons (MLPs), in contrast, are rotation-invariant: rotating the input space doesn't fundamentally change the model that is learned.
According to Grinsztajn et al. (2022) [4], tree-based methods work well for tabular data precisely because they are not rotationally invariant. In tabular data, the feature columns are often individually meaningful, and mixing them with other columns via rotation destroys that meaning. An MLP first has to learn the right rotation and therefore faces a more difficult task.
Sparse solutions: rotationally invariant models have a hard time distinguishing relevant from irrelevant features. Trees and forests are good at this separation and offer sparser solutions (the toy comparison below illustrates the rotation effect).
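A toy comparison of my own, using scipy's ortho_group to draw a random rotation matrix: the same forest does noticeably worse once the columns are mixed by a rotation.

```python
import numpy as np
from scipy.stats import ortho_group
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 10))
y = X[:, 0] + rng.normal(scale=0.1, size=500)  # only one relevant feature

R = ortho_group.rvs(dim=10, random_state=4)    # random rotation matrix
X_rot = X @ R                                  # mix the columns together

rf = RandomForestRegressor(n_estimators=200, random_state=4)
print(cross_val_score(rf, X, y).mean())       # axis-aligned: easy
print(cross_val_score(rf, X_rot, y).mean())   # rotated: typically worse
```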
Inductive Bias: Column sampling creates preferences for redundant features
The Random Forest algorithm samples feature columns whenever a tree makes the next split. This column sampling has many consequences for the model:
Sampling features increases the model’s reliance on redundant features. If X2 is a noisy copy of a strong feature X1, then by sampling the features that a node can use for splitting, in many cases X2 will be available, but not X1, and the split will be based on X2. A learning algorithm that can pick from all features (e.g. gradient boosting) might prefer X1 over X2. This makes the Random Forest less sparse.
The sampling also affects the interpretation of feature importance: Sampling only a few features at each split causes (prediction-based) feature importance to spread more equally between the features.
But why does the Random Forest even use this inductive bias? The column sampling makes the ensemble more robust since it can rely on different features for prediction.
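Here is a toy simulation of the noisy-copy scenario from above (data and hyperparameters are my own choices): with max_features=1, each split sees only one candidate feature, so the splits, and hence the importance, spread over X1 and its noisy copy X2.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
n = 1000
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.3, size=n)  # X2 is a noisy copy of X1
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.1, size=n)   # only X1 drives the target

rf = RandomForestRegressor(
    n_estimators=300,
    max_features=1,  # only one candidate feature per split
    random_state=5,
).fit(X, y)
print(rf.feature_importances_)  # importance is split between X1 and X2
```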
If you use scikit-learn, there are different defaults for the number of sampled features:
RandomForestRegressor defaults to max_features=1.0, which means all features are used; strictly speaking, you then don't get a Random Forest but bagged trees.
RandomForestClassifier defaults to max_features='sqrt', which means sqrt(p) features are used, where p is the total number of features.
If you use the randomForest package in R, it’s p/3 for regression and sqrt(p) for classification.
So either set these parameters yourself or be aware of the default parameters in different implementations.
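For illustration, this is how you would set max_features explicitly in scikit-learn; the values shown mirror the classic defaults mentioned above, not a recommendation.

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

clf = RandomForestClassifier(max_features="sqrt")  # sqrt(p) features per split
reg = RandomForestRegressor(max_features=1/3)      # p/3, as in R's randomForest
```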
Inductive Bias: Tree-ensembles can fit non-smooth patterns
Neural networks tend to fit very smooth functions, while the Random Forest excels at fitting non-smooth patterns, a combined inductive bias of decision trees and ensembling. Grinsztajn et al. (2022) [4] studied which inductive biases make tree-based models so strong compared to neural networks and found that this non-smoothness seems to be a strength for tabular data.
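A toy comparison of my own (the MLP uses scikit-learn defaults, so results vary with tuning): the forest captures a sharp jump that the MLP tends to smooth out.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(6)
X = rng.uniform(0, 1, size=(1000, 1))
y = (X[:, 0] > 0.5).astype(float)  # a sharp jump at 0.5

rf = RandomForestRegressor(n_estimators=100, random_state=6).fit(X, y)
mlp = MLPRegressor(max_iter=2000, random_state=6).fit(X, y)

X_test = np.array([[0.49], [0.51]])
print(rf.predict(X_test))   # close to 0 and 1: the jump is captured
print(mlp.predict(X_test))  # typically changes gradually across the jump
```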
Understand your ML algorithms
Not too long ago I said you don't have to know in detail how each ML algorithm works: hide everything behind a layer of AutoML, take whatever algorithm produced the best-performing model, slap on some model-agnostic tools for interpretability and uncertainty, and you're done.
But today I say that it does matter what algorithm you use. As discussed in this post, inductive biases affect interpretation and robustness.
Next week we will bring classic statistical modeling and machine learning closer together by seeing them through the lens of inductive biases.
In other news: Modeling Mindsets for teaching
Many readers told me they found my book Modeling Mindsets: The Many Cultures of Learning From Data useful for sorting their thoughts on the different approaches, from causal inference to reinforcement learning.
Are you using Modeling Mindsets for teaching or communicating with other people? I’d be happy to hear about it. Just send an email to christoph.molnar.ai at gmail.com.
If you use it in teaching, I’d be happy to provide free digital versions for your students. Just get in touch.
References

[1] Breiman, Leo. "Random forests." Machine Learning 45 (2001): 5-32.
[2] Breiman, Leo. Classification and Regression Trees. Routledge, 2017.
[3] Hothorn, Torsten, Kurt Hornik, and Achim Zeileis. "Unbiased recursive partitioning: A conditional inference framework." Journal of Computational and Graphical Statistics 15.3 (2006): 651-674.
[4] Grinsztajn, Léo, Edouard Oyallon, and Gaël Varoquaux. "Why do tree-based models still outperform deep learning on typical tabular data?" Advances in Neural Information Processing Systems 35 (2022): 507-520.