Inductive biases have a reputation for being arcane academic knowledge, something you study for statistical learning theory or when you propose a new ML algorithm. However, inductive biases have many practical implications for model robustness, interpretability, and leveraging domain knowledge.
By thinking more about inductive biases, we can become more mindful modelers.
In a way, everyone who applies machine learning is an algorithm designer. Even if you start with, say, plain CatBoost, you might do feature engineering, tune hyperparameters, try other algorithms, and maybe end up with an ensemble. With each step, you compile an application-specific set of inductive biases. However, this process is indirect because it’s usually guided by optimizing predictive performance rather than by explicitly deciding on inductive biases. But you can also proactively work with inductive biases to make models interpretable, robust, and plausible.
Introduce simplicity and interpretability into models
One way to make machine learning models more understandable is to pick ML algorithms that produce “inherently” interpretable models. This category includes simple linear regression models, decision rule lists, and even some neural network architectures designed with interpretation in mind. This type of interpretability is achieved by restricting the hypothesis space of the learning algorithm (= restrictive inductive biases).
You get a better handle on interpretability when you move away from the binary category of “inherently interpretable models” and instead think about which inductive biases help with interpretability.
For example, when you set the maximum depth of the trees in xgboost to 2, you get a maximum interaction depth of 2 for the entire model. This means you can plot the effects of the entire model by plotting all first- and second-order ALE plots. Depending on how many features you have, this arguably constitutes a fully interpretable model.
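Here is a minimal sketch of that setup, with made-up data and feature names. I use scikit-learn's partial dependence plots as a stand-in for the ALE plots you would otherwise get from a dedicated ALE package:

```python
import numpy as np
import pandas as pd
from xgboost import XGBRegressor
from sklearn.inspection import PartialDependenceDisplay

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 3)), columns=["snow", "temp", "rain"])
y = 2 * X["snow"] + X["snow"] * X["temp"] + rng.normal(scale=0.1, size=500)

# max_depth=2 means each tree splits on at most two features along a path,
# so the model contains only main effects and pairwise interactions.
model = XGBRegressor(max_depth=2, n_estimators=300, learning_rate=0.1)
model.fit(X, y)

# First-order effects plus one second-order effect; with few features you
# can plot all of them and read off the entire model.
PartialDependenceDisplay.from_estimator(
    model, X, features=["snow", "temp", ("snow", "temp")]
)
```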
Fix “data problems”
Caruana et al. (2015)1 worked on a model to predict the probability of death from pneumonia for patients presenting at the ER. The model falsely learned that asthma has a protective effect, while it is known that asthma increases the risk. The reason was a lack of information: the model was trained on information available at the time of admission, but not on the treatment. Asthma patients tend to receive more aggressive treatment, which lowers their risk.
Caruana et al. trained a GAM with interactions (they call it GA2M). They modified the GA2M by removing the asthma effect. This is notable in light of inductive biases for two reasons: First, they had designed their model to be interpretable, so it’s an example of using inductive biases for interpretability. Second, they discovered a problem with the data and removed all terms involving asthma by zeroing them out. This is like adding a further restriction to the hypothesis space. Arguably, since it happens after learning, it might or might not count as an inductive bias, but it has the same restrictive effect on the model space.
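Here is a rough sketch of that idea (not the authors’ code), using interpretml’s ExplainableBoostingClassifier, a GA2M-style model, on toy data. The term_names_ and term_scores_ attributes are assumptions on my part; the exact attribute names differ across interpret versions:

```python
import numpy as np
import pandas as pd
from interpret.glassbox import ExplainableBoostingClassifier

# Toy stand-in data; the real features would be admission-time variables.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "age": rng.integers(20, 90, size=500),
    "asthma": rng.integers(0, 2, size=500),
    "blood_pressure": rng.normal(120, 15, size=500),
})
y = (rng.random(500) < 0.1).astype(int)

ebm = ExplainableBoostingClassifier(interactions=5)
ebm.fit(X, y)

# Zero out every learned term that involves asthma, main effect or interaction.
# (Attribute names assumed; older interpret versions use additive_terms_.)
for i, name in enumerate(ebm.term_names_):
    if "asthma" in name.lower():
        ebm.term_scores_[i][:] = 0.0
```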
While it’s an interesting example, I’m not sure whether the authors actually fixed the problem. There may be other features through which the model (indirectly) picked up asthma, or other issues caused by the missing treatment information. Still, it’s a nice illustration of leveraging inductive biases to address data issues.
Intentional extrapolation
I trained an xgboost model to forecast water supply for different sites in the Western US. I tried various models, including linear regression, but xgboost performed best. However, while writing about inductive biases of the random forest, I realized something.
All tree-based models have, as a consequence of their inductive biases (partitioning with constant models), a particular extrapolation behavior: if you increase a feature beyond its maximum in the training data, the prediction no longer changes.
The most important feature was snow coverage, and the rough relation is: the more snow, the more water will flow. What if my model encountered a year with a new snow record? Put differently: if I replaced the new record with the previous snow maximum from the training data, both would produce the same prediction. It doesn’t matter whether it’s 1% or 100% more snow than the previous maximum. Reality wouldn’t behave like this; the inductive biases don’t match the domain knowledge.
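You can see this flat extrapolation directly in a small sketch with made-up snow data:

```python
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
snow = rng.uniform(0, 100, size=(300, 1))           # snow coverage, max 100 in training
water = 5 * snow.ravel() + rng.normal(0, 20, 300)   # roughly linear relation

model = XGBRegressor(n_estimators=200, max_depth=3).fit(snow, water)

# Everything beyond the training maximum falls into the same leaves,
# so all three predictions are identical.
print(model.predict(np.array([[101.0], [150.0], [500.0]])))
```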
But switching to a linear model wasn’t an option either: its predictive performance was meh. Clearly, some hybrid approach would have been great, one with the performance of the xgboost model inside the convex hull of the data, but with explicit assumptions when extrapolating beyond it.
I could, for example, blend the xgboost model with a linear model, so that the prediction still increases when extrapolating. Or I could rely on an algorithm that mixes linear and partition-based models. But this would also imply a linear relation between snow and water supply beyond the observed range, which should be discussed with a domain expert first.
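Here is one possible (untested) way such a blend could look, assuming a linear snow-water relation beyond the observed range. The class and all names are made up for illustration, and X is assumed to be a numpy array:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor

class BlendedExtrapolator:
    """Hypothetical helper: xgboost inside the training range, linear trend beyond it."""

    def fit(self, X, y, extrapolation_col=0):
        self.col = extrapolation_col
        self.x_max = X[:, self.col].max()
        self.xgb = XGBRegressor(n_estimators=200, max_depth=3).fit(X, y)
        self.lin = LinearRegression().fit(X, y)
        return self

    def predict(self, X):
        pred = self.xgb.predict(X)
        # Beyond the training maximum, add the linear model's slope for that
        # feature times the excess, so predictions keep increasing with snow.
        excess = np.clip(X[:, self.col] - self.x_max, 0, None)
        return pred + self.lin.coef_[self.col] * excess
```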
Reflect the data-generating process
By picking our inductive biases, we can adapt our model better to the data-generating process, thereby making it more robust and reflective of reality.
If you have repeated measurements, say you work with medical data (doctor visits) and some patients have visited multiple times, then you may want to introduce inductive biases into your model to reflect that.
What happens if you don’t? Let’s say you fit a random forest. When the random forest subsamples the data, depending on how often the measurements are repeated, most trees may contain data from all subjects. That means the trees are more highly correlated than they would be for independent data, so we lose an inductive bias that contributes to performance (de-correlated trees). It also means that out-of-bag performance estimates are overly optimistic. Further, if your goal is to apply the forest to new patients, it would better reflect reality if the algorithm subsampled patients, not visits. A solution is a modified random forest with group-based subsampling, for example the one by Karpievitch et al. (2009)2.
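Here is a minimal sketch of the subsampling idea (not the RF++ implementation), assuming X, y, and an array of patient IDs as numpy arrays:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def group_bootstrap_forest(X, y, groups, n_trees=100, random_state=0):
    """Build trees on bootstrap samples of patients (groups), not of visits (rows)."""
    rng = np.random.default_rng(random_state)
    unique_groups = np.unique(groups)
    trees = []
    for _ in range(n_trees):
        # Sample patients with replacement, then take all of their visits.
        sampled = rng.choice(unique_groups, size=len(unique_groups), replace=True)
        idx = np.concatenate([np.where(groups == g)[0] for g in sampled])
        tree = DecisionTreeRegressor(
            max_features="sqrt", random_state=int(rng.integers(0, 2**31 - 1))
        )
        trees.append(tree.fit(X[idx], y[idx]))
    return trees  # average the trees' predictions to get the forest prediction
```

For honest performance estimates on new patients, scikit-learn's GroupKFold follows the same logic by keeping all visits of a patient in the same fold.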
Admittedly, the more you want to adapt your model to your application, the further you stray from standard implementations of ML algorithms, and the more you have to rely on less well-tested implementations by academic researchers or you have to implement your own algorithms. It can be difficult to find a compromise between implementation robustness and application specificity.
Study inductive biases to understand your data
Comparing different ML algorithms means comparing different sets of inductive biases. This is even true when it’s just different hyperparameter configurations. When you understand how the inductive biases of two ML algorithms differ, you can extract insights about your data. For example, when you compare a linear regression model and a GAM, you know that any performance gain is due to replacing linear relationships with smooth, non-linear ones.
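Here is a small sketch of such a comparison on synthetic data. The only difference between the two pipelines is the spline expansion, so any gap in the cross-validated scores is attributable to allowing smooth, non-linear effects:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.1, 500)

linear = LinearRegression()
gam_like = make_pipeline(SplineTransformer(n_knots=8), LinearRegression())

print("linear:          ", cross_val_score(linear, X, y, cv=5).mean())
print("additive smooth: ", cross_val_score(gam_like, X, y, cv=5).mean())
```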
So it goes both ways: We can intentionally choose inductive biases to improve interpretability, robustness, and plausibility, but we can also learn from the inductive biases that helped with performance increases.
Next week, I will talk about how we need better language to talk about inductive biases and beyond to become better at modeling.
Caruana, Rich, et al. "Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission." Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2015.
Karpievitch, Y. V., et al. "An introspective comparison of random forest-based classifiers for the analysis of cluster-correlated data by way of RF++." PLoS One 4.9 (2009): e7087.