Inductive biases of the Random Forest and their consequences
part 4 of the inductive bias series
Machine learning needs inductive biases to work, and these biases have real consequences for model interpretability, model robustness, and how the model extrapolates. In this post, I discuss the inductive biases of the Random Forest and how they shape the final model.
Quick Random Forest [1] recap: A Random Forest is an ensemble of decision trees, usually a few hundred. For regression, the forest prediction is the average of the individual tree predictions; for classification, it's the majority vote. Each decision tree is trained independently of the others. To make the individual trees differ, the training data for each tree is bootstrapped and a random subset of features is drawn at each split.
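To make this recipe concrete, here is a minimal sketch of the idea, using scikit-learn's DecisionTreeRegressor as the base learner. The toy data, seed, and hyperparameters are my own choices; this mirrors bagging plus per-split column sampling, not Breiman's exact implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

trees = []
for _ in range(100):  # "a few hundred" trees; 100 here for brevity
    idx = rng.integers(0, len(X), size=len(X))         # bootstrap the rows
    tree = DecisionTreeRegressor(max_features="sqrt")  # sample columns at each split
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Regression: the forest prediction is the average over all trees
pred = np.mean([t.predict(X[:5]) for t in trees], axis=0)
print(pred)
```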
Before we go into the inductive biases of the Random Forest, we must distinguish between the inductive biases of the learning algorithm and the "model assumptions" they imply.
Usually, model assumptions are explicit hypotheses about the relationships in the data, as in statistics, where we might assume linearity between variables and targets. In this post, I argue that we do the same in machine learning; it's just more implicit and happens via inductive biases.
Some Inductive Biases of the Random Forest
For the Random Forest, we can roughly say that the inductive biases come from:
the decision tree learning algorithm
the random factors (bootstrapping and column sampling)
ensembling the trees
The original Random Forest implementation is based on CART [2] (classification and regression trees), a decision tree algorithm that produces trees using binary splits. Within the terminal nodes, we have constant models: the majority class for classification or the mean for regression. Many inductive biases stem from these CART trees forming the basis of the Random Forest.
Inductive Bias: CART only allows step functions
By using data partitioning (node splits) to form the prediction function, CART trees only allow changes in the form of “jumps”.
This leads to the following implicit model assumptions and consequences:
The relation between features and target is modeled as a non-smooth function. Ensembling multiple trees makes the steps much more fine-grained, but the result is still a step function.
The gradient of the prediction function is undefined for jump points and zero elsewhere.
Linear relations between a feature and the target can only be approximated via step functions.
When extrapolating a feature beyond its minimum or maximum, the prediction remains fixed at a constant value, since the node that the data point falls into doesn't change no matter how far beyond the limits we go (see the sketch after this list).
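Here is a toy demonstration of the last two points (data and hyperparameters are my own choices): the forest approximates a linear target with fine-grained steps inside the training range and predicts a constant beyond it.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(500, 1))
y = 2 * X[:, 0] + rng.normal(scale=0.5, size=500)  # truly linear target

rf = RandomForestRegressor(n_estimators=200, random_state=1).fit(X, y)

# Inside the training range, the step function tracks the line:
print(rf.predict([[2.0], [5.0], [8.0]]))       # roughly 4, 10, 16
# Beyond the training maximum (10), the prediction stays constant:
print(rf.predict([[10.0], [50.0], [1000.0]]))  # all roughly the same value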
Inductive Bias: CART favors deep interactions
CART uses binary splits. If you grow the tree very deep, the algorithm produces models with many interactions between the features.
How this affects the model:
When the true relation between features and target is additive (without interactions), the decision tree learner has a tough time replicating it. Imagine the first split is at X1 > 10: the subtree in the left node would have to produce a prediction function identical to the one in the right node; otherwise, the model contains an interaction between X1 and the remaining features. Ensembling trees might smooth this out a bit (the toy check below makes the effect visible).
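To make this concrete, here is a toy check of my own construction: for an exactly additive function f(x1, x2) = g(x1) + h(x2), swapping coordinates between two points leaves the sum of predictions unchanged. A deep CART tree fit to an additive target typically violates this equality, revealing the interactions it introduced.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(7)
X = rng.uniform(-1, 1, size=(2000, 2))
y = X[:, 0] + X[:, 1]  # purely additive target, no interactions

tree = DecisionTreeRegressor(max_depth=10).fit(X, y)

pair = np.array([[-0.5, -0.5], [0.5, 0.5]])
swapped = np.array([[-0.5, 0.5], [0.5, -0.5]])  # coordinates exchanged
print(tree.predict(pair).sum())     # for an additive model, these two sums
print(tree.predict(swapped).sum())  # would be identical; here they differ
```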
Inductive Bias: CART favors features with many unique values
This is a more implicit inductive bias that I assume wasn't Breiman's intention when proposing CART; it was later described by Hothorn et al. (2006) [3]. The problem is one of multiple testing: CART searches for the next split via exhaustive search over all features and all possible split points within them. For a binary feature, there is only one possible split. A categorical feature with 10 categories has 9 possible splits. A numerical feature has up to n-1 splits, where n is the number of data points in the node. Since features with more possible splits are tested more often, they will be picked more often purely by chance.
This inductive bias makes the following implicit model assumptions:
The more unique values a feature has, the more likely it is to be used by the Random Forest. This also affects model interpretation: the more unique values, the higher the feature's importance tends to be.
When the tree picks up irrelevant features, they are more likely to be numerical or categorical with many levels (a toy simulation follows below).
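Here is a toy simulation of this multiple-testing effect (my own construction): both features are pure noise, yet the continuous one, with its many candidate split points, typically receives the higher impurity-based importance.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
n = 1000
X = np.column_stack([
    rng.normal(size=n),          # irrelevant feature with many unique values
    rng.integers(0, 2, size=n),  # irrelevant binary feature
])
y = rng.integers(0, 2, size=n)   # random labels: there is nothing to learn

rf = RandomForestClassifier(n_estimators=300, random_state=3).fit(X, y)
print(rf.feature_importances_)   # the continuous feature usually dominates
```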
Inductive Bias: Maxdepth controls interactions
The maximum depth parameter controls how deep a tree may grow. This parameter also caps the maximum number of features that can interact with each other.
What happens at maxdepth=1:
For a single decision tree, the resulting model will be a simple if-else rule with just one condition.
Ensembles of trees with depth=1 are additive models (no interactions between the features).
What happens at maxdepth=k:
A Random Forest with maxdepth=k excludes interactions among more than k features, since at most k distinct features can appear along any root-to-leaf path (see the sketch below).
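A minimal sketch in scikit-learn (toy additive data; the setup is my own):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(1000, 2))
y = X[:, 0] + X[:, 1]  # purely additive target

# Stumps: each tree is a single split on one feature, so the ensemble
# prediction is a sum of one-feature step functions, i.e. additive.
rf_stumps = RandomForestRegressor(n_estimators=500, max_depth=1).fit(X, y)

# max_depth=2: at most two features per root-to-leaf path, so at most
# pairwise interactions can appear in the resulting model.
rf_pairs = RandomForestRegressor(n_estimators=500, max_depth=2).fit(X, y)
```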
Inductive Bias: Trees are not rotation invariant
If you multiply your data with a rotation matrix, you will get a very different decision tree model, since tree-based learning methods attend to features one at a time (split by split) rather than as a whole. Multi-layer perceptrons (MLPs), in contrast, are rotation-invariant: rotating the input space doesn't fundamentally change the model that is learned.
According to Grinsztajn et al. (2022) [4], tree-based methods work well for tabular data precisely because they are not rotationally invariant. In tabular data, the feature columns are often individually meaningful, and mixing them with other columns via rotation destroys that meaning. An MLP first has to learn the right rotation and therefore faces a more difficult task.
Sparse solutions: rotationally invariant models have a hard time distinguishing relevant from irrelevant features. Trees and forests are good at this separation and offer sparser solutions (the toy comparison below illustrates the rotation effect).
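A toy comparison of my own, using scipy's ortho_group to draw a random rotation matrix: the same forest does noticeably worse once the columns are mixed by a rotation.

```python
import numpy as np
from scipy.stats import ortho_group
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 10))
y = X[:, 0] + rng.normal(scale=0.1, size=500)  # only one relevant feature

R = ortho_group.rvs(dim=10, random_state=4)    # random rotation matrix
X_rot = X @ R                                  # mix the columns together

rf = RandomForestRegressor(n_estimators=200, random_state=4)
print(cross_val_score(rf, X, y).mean())       # axis-aligned: easy
print(cross_val_score(rf, X_rot, y).mean())   # rotated: typically worse
```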
Inductive Bias: Column sampling creates preferences for redundant features
The Random Forest algorithm samples feature columns whenever a tree makes the next split. This column sampling has many consequences for the model:
Sampling features increases the model’s reliance on redundant features. If X2 is a noisy copy of a strong feature X1, then by sampling the features that a node can use for splitting, in many cases X2 will be available, but not X1, and the split will be based on X2. A learning algorithm that can pick from all features (e.g. gradient boosting) might prefer X1 over X2. This makes the Random Forest less sparse.
The sampling also affects the interpretation of feature importance: Sampling only a few features at each split causes (prediction-based) feature importance to spread more equally between the features.
But why does the Random Forest even use this inductive bias? The column sampling makes the ensemble more robust since it can rely on different features for prediction.
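Here is a toy simulation of the noisy-copy scenario from above (data and hyperparameters are my own choices): with max_features=1, each split sees only one candidate feature, so the splits, and hence the importance, spread over X1 and its noisy copy X2.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
n = 1000
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.3, size=n)  # X2 is a noisy copy of X1
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.1, size=n)   # only X1 drives the target

rf = RandomForestRegressor(
    n_estimators=300,
    max_features=1,  # only one candidate feature per split
    random_state=5,
).fit(X, y)
print(rf.feature_importances_)  # importance is split between X1 and X2
```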
If you use scikit-learn, there are different defaults for the number of sampled features:
RandomForestRegressor defaults to max_features=1.0, which means all features are used; strictly speaking, you then don't get a Random Forest but bagged trees.
RandomForestClassifier defaults to max_features='sqrt', which means sqrt(p) features are used, where p is the total number of features.
If you use the randomForest package in R, it’s p/3 for regression and sqrt(p) for classification.
So either set these parameters yourself or be aware of the default parameters in different implementations.
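For illustration, this is how you would set max_features explicitly in scikit-learn; the values shown mirror the classic defaults mentioned above, not a recommendation.

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

clf = RandomForestClassifier(max_features="sqrt")  # sqrt(p) features per split
reg = RandomForestRegressor(max_features=1/3)      # p/3, as in R's randomForest
```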
Inductive Bias: Tree-ensembles can fit non-smooth patterns
Neural networks tend to fit very smooth functions, while the Random Forest excels at fitting non-smooth patterns, a combined inductive bias of decision trees and ensembling. Grinsztajn et al. (2022) [4] studied which inductive biases make tree-based models so strong compared to neural networks and found that this non-smoothness seems to be a strength for tabular data.
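A toy comparison of my own (the MLP uses scikit-learn defaults, so results vary with tuning): the forest captures a sharp jump that the MLP tends to smooth out.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(6)
X = rng.uniform(0, 1, size=(1000, 1))
y = (X[:, 0] > 0.5).astype(float)  # a sharp jump at 0.5

rf = RandomForestRegressor(n_estimators=100, random_state=6).fit(X, y)
mlp = MLPRegressor(max_iter=2000, random_state=6).fit(X, y)

X_test = np.array([[0.49], [0.51]])
print(rf.predict(X_test))   # close to 0 and 1: the jump is captured
print(mlp.predict(X_test))  # typically changes gradually across the jump
```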
Understand your ML algorithms
Not too long ago I said you don't have to know in detail how each ML algorithm works: hide everything behind a layer of AutoML, take whatever algorithm produced the best-performing model, slap on some model-agnostic tools for interpretability and uncertainty, and you're done.
But today I say that it does matter what algorithm you use. As discussed in this post, inductive biases affect interpretation and robustness.
Next week we will bring classic statistical modeling and machine learning closer together by seeing them through the lens of inductive biases.
In other news: Modeling Mindsets for teaching
Many readers told me they found my book Modeling Mindsets: The Many Cultures of Learning From Data useful for sorting their thoughts on the different approaches, from causal inference to reinforcement learning.
Are you using Modeling Mindsets for teaching or communicating with other people? I’d be happy to hear about it. Just send an email to christoph.molnar.ai at gmail.com.
If you use it in teaching, I’d be happy to provide free digital versions for your students. Just get in touch.
References

[1] Breiman, Leo. "Random forests." Machine Learning 45 (2001): 5-32.
[2] Breiman, Leo. Classification and Regression Trees. Routledge, 2017.
[3] Hothorn, Torsten, Kurt Hornik, and Achim Zeileis. "Unbiased recursive partitioning: A conditional inference framework." Journal of Computational and Graphical Statistics 15.3 (2006): 651-674.
[4] Grinsztajn, Léo, Edouard Oyallon, and Gaël Varoquaux. "Why do tree-based models still outperform deep learning on typical tabular data?" Advances in Neural Information Processing Systems 35 (2022): 507-520.