In this week’s issue of the inductive bias series, we finally get to the core of inductive biases: how to think about them, why they are usually hidden from us, and why you should care about them.
Machine learning solves a fundamental inductive problem: there are infinitely many functions that map from the features to the target. Even when we only look at functions with a low loss, infinitely many remain.
There is no single “correct” function that perfectly predicts our training data and all our future data, except for simulated data. As I talked about last week, induction — learning general rules from specific examples — requires assumptions. Machine learning is no exception: it needs principles to pick one of the infinitely many functions.
These principles are called “inductive biases”.
An inductive bias is a learning algorithm’s preference for one prediction function (aka hypothesis) over another.
Machine learning without inductive biases would be a database of the training data. Only by introducing inductive biases can any learning take place.
The space of all prediction functions is also called the hypothesis space. Imagine it as a bag that holds all possible functions:
This bag contains every function that maps from X to Y that you can think of:
xgboost trained on the training data
a constant function that always predicts pi: f(x) = π
a linear regression using only the first feature
an ensemble of a million trees
…
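As a rough sketch (the functions here are toy stand-ins, not from the article), several very different members of that bag can look equally plausible on the training data alone:

```python
import numpy as np

# Three members of the "bag": very different functions from X to Y.
X_train = np.array([[1.0], [2.0], [3.0]])
y_train = np.array([1.0, 2.0, 3.0])

def constant_pi(X):
    return np.full(len(X), np.pi)            # always predicts pi

def linear_first_feature(X):
    return 1.0 * X[:, 0]                     # a linear model on the first feature

def memorizer(X):
    # perfect on the training points, arbitrary (here: zero) everywhere else
    lookup = {tuple(x): t for x, t in zip(X_train, y_train)}
    return np.array([lookup.get(tuple(x), 0.0) for x in X])

# The last two fit the training data perfectly but disagree on a new point:
X_new = np.array([[2.0], [5.0]])
for f in (constant_pi, linear_first_feature, memorizer):
    print(f.__name__, f(X_new))
```

Both the linear model and the memorizer achieve zero training loss; only an inductive bias tells the learner which one to trust at x = 5.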
We can think of machine learning algorithms as a search for a function in that hypothesis space. Inductive biases guide where the algorithm searches:
A restriction bias makes the learning algorithm ignore many functions and only consider specific forms. For example, linear regression puts a restriction bias on the hypothesis space: only functions of the form f(X) = βX are allowed.
A preference bias makes the learning algorithm prefer one model over another. If you use LASSO (a linear regression model with an additional L1 penalty), then linear models with some coefficients set to zero are preferred (in addition to the restriction to linear models).
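To make the two flavors concrete, here is a minimal scikit-learn sketch on simulated data (the parameter values are arbitrary): both learners carry the same restriction bias to linear functions, but only LASSO’s preference bias pushes coefficients to exactly zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# Restriction bias: both learners only ever consider linear functions of X.
ols = LinearRegression().fit(X, y)
# Preference bias: the L1 penalty makes LASSO prefer linear models
# with some coefficients set exactly to zero.
lasso = Lasso(alpha=5.0).fit(X, y)

print("OLS coefficients that are exactly zero:  ", np.sum(ols.coef_ == 0))
print("LASSO coefficients that are exactly zero:", np.sum(lasso.coef_ == 0))
```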
These are classic examples of inductive biases. But really, all modeling choices introduce inductive biases:
Feature engineering: Transforming the features introduces inductive biases. For example, how you represent a categorical feature (one-hot encoding, integer encoding, target encoding, …), whether you scale features, and so on.
Dimensionality reduction: If you use dimensionality reduction as part of the learning algorithm’s pipeline (e.g. PCA, LASSO), it affects the sparsity of the model.
Model choice: A linear regression model implies very different inductive biases from a decision tree.
Hyperparameter settings: Some hyperparameter settings can introduce rather strong inductive biases. Setting the maximum depth of decision trees in CatBoost to 1 creates a restriction bias so that only additive models can be produced (see the sketch after this list).
Architecture choices: Adding, removing, and changing layers in deep neural networks also fiddles with inductive biases.
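For the hyperparameter point above, here is a rough sketch using scikit-learn’s gradient boosting as a stand-in for CatBoost (both accept this kind of depth setting): a single hyperparameter decides whether the hypothesis space contains interactions at all.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_friedman1(n_samples=500, random_state=0)

# max_depth=1 restricts every tree to a single split on a single feature
# (a "stump"); a boosted sum of stumps is an additive model and cannot
# represent interactions between features.
additive_boost = GradientBoostingRegressor(max_depth=1, n_estimators=300,
                                           random_state=0).fit(X, y)

# max_depth=3 lifts that restriction: a deeper tree can split on several
# features along one path and therefore model interactions.
interacting_boost = GradientBoostingRegressor(max_depth=3, n_estimators=300,
                                              random_state=0).fit(X, y)
```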
Modeling doesn’t feel like picking inductive biases
You don’t start a machine learning project with “Let’s list all the inductive biases we need for this project”. You can finish an ML project without thinking about inductive biases. But I wouldn’t recommend it.
We always pick our inductive biases. The choice is just hidden.
Inductive biases hidden in state-of-the-art algorithms
Why don’t you use a convolutional neural network on tabular data? Because CNNs are for image data. Everyone knows this. In theory, you could rearrange your tabular data to form images and use CNNs despite all cells in your body saying no. CNNs come with very specific inductive biases for image data, such as the assumption that neighboring pixels should be processed together. Instead of saying that CNNs are state-of-the-art for image data, we could also say that the inductive biases of CNNs have proven useful on many image datasets. When ML researchers develop state-of-the-art algorithms, they add and remove certain inductive biases. Using such an ML algorithm means bringing a huge set of inductive biases to your task. These inductive biases have been tried, tested, and refined over time.
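To make the “neighboring pixels” bias tangible, here is a minimal PyTorch sketch (the layer sizes are arbitrary): the convolution hard-codes locality and weight sharing, while a fully connected layer on the same image assumes nothing of the sort.

```python
import torch
import torch.nn as nn

# A 3x3 convolution encodes two of the CNN's inductive biases:
# locality (each output looks only at a 3x3 neighborhood of pixels) and
# translation equivariance (the same 9 weights are reused at every position).
conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1)

image = torch.randn(1, 1, 28, 28)                    # one grayscale "image"
conv_features = conv(image)                          # shape: (1, 8, 28, 28)

# A fully connected layer on the flattened image makes no such assumption:
# every output can depend on every pixel, wherever it sits.
dense = nn.Linear(28 * 28, 8)
dense_features = dense(image.flatten(start_dim=1))   # shape: (1, 8)

print(sum(p.numel() for p in conv.parameters()))     # 80 parameters
print(sum(p.numel() for p in dense.parameters()))    # 6280 parameters
```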
Inductive biases are automatically selected
Machine learning can be automated to a large degree. Think of model selection, hyperparameter tuning, and regularization. But all these steps are just part of the search for a prediction function, a hypothesis. So it’s automated testing of different model configurations and seeing which ones work well. As discussed before, choices of architectures, model classes, and hyperparameter configurations imply different inductive biases, so these automated processes also figure out which inductive biases work well.
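A rough scikit-learn sketch of that automation (the grid here is made up): each configuration implies different inductive biases, and cross-validation simply reports which of them happened to work on this data.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

# Each hyperparameter configuration restricts or re-weights the hypothesis
# space differently; the search automates the choice among those biases.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [1, 3, None], "min_samples_leaf": [1, 20]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_)   # the inductive biases that worked best here
```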
This automation of inductive bias selection is very particular to supervised machine learning. When you work in self-supervised learning, the choice of inductive biases is much more deliberate and explicit (see this paper, this paper, and this one).
Inductive biases are difficult to speak about
We don’t have good language or mental models to speak about inductive biases. It’s difficult to list all the inductive biases that e.g. CatBoost implies, especially since these biases depend on the exact hyperparameter configuration. The biases also interact strongly with the data: LASSO trained on a few data points may yield a sparse linear model, but the same learning algorithm with the same hyperparameters on a larger dataset might produce a model that isn’t sparse at all.
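If you want to see that interaction for yourself, a minimal sketch (simulated data, arbitrary penalty) is to train LASSO with the same alpha on different sample sizes and count the nonzero coefficients; how sparse the resulting models are depends on the data you feed in.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Same learning algorithm, same hyperparameters; only the amount of data changes.
for n_samples in (30, 3000):
    X, y = make_regression(n_samples=n_samples, n_features=20, n_informative=10,
                           noise=10.0, random_state=0)
    model = Lasso(alpha=1.0).fit(X, y)
    print(n_samples, "samples ->", np.sum(model.coef_ != 0), "nonzero coefficients")
```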
We need to understand inductive biases
Inductive biases are like verbs that describe the learning process. Once the learning process is done, we are left with a “static” prediction model. But the inductive biases haven’t vanished; they have just turned from verbs into adjectives. Inductive biases shaped the model.
Sometimes it’s obvious: If you used linear regression, you know the final model is a linear regression model, which comes with many implications and an understanding of how that model will predict new data. We know that it’s an additive model, we know how it extrapolates, we know that it doesn’t capture interactions, and so on.
But what about more complex model classes? What if you trained a random forest? Certain inductive biases also guide the random forest algorithm. What inductive imprints did the algorithm leave on the final model? This is more difficult to describe than for linear regression, but not impossible.
These inductive imprints have real implications for how to interpret the model, how robust it is, how it extrapolates, how it will deal with distribution shifts and adversarial examples, and so on.
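Extrapolation is the easiest of these imprints to see. A minimal sketch on simulated data (assuming a simple linear trend): the linear model’s bias lets it extend the trend, while the forest’s tree-based biases pin its predictions to the range of the training targets.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(200, 1))
y_train = 3.0 * X_train[:, 0] + rng.normal(0, 1, size=200)

linear = LinearRegression().fit(X_train, y_train)
forest = RandomForestRegressor(random_state=0).fit(X_train, y_train)

X_far = np.array([[20.0]])        # far outside the training range
print(linear.predict(X_far))      # continues the linear trend (around 60)
print(forest.predict(X_far))      # stuck near the largest training targets
                                  # (around 30): trees cannot extrapolate
```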
If we ignore inductive biases, we fail to understand all these nuances that make or break a model. Ultimately, all these inductive biases instill model assumptions that we may not be aware of but can deduce from the inductive biases themselves.
Next week, I will discuss some of the inductive biases of the random forest, and what model assumptions they produce.
First of all, this is a very nice article! I have some questions regarding the inductive biases introduced by feature engineering.
Do all types of inductive biases essentially boil down to a preference over specific functions? For example, let’s say we are asked to predict house prices. If we fix a learner, does the usage of features x1, x2 (e.g. house area in m^2, number of tables) instead of x3, x4 (e.g. number of rooms, number of swimming pools) introduce any inductive bias? I mean, in both cases we are just left with a mapping from R^2 to R.
The only way I can think of features introducing an inductive bias is if we view the problem as follows. There is an input space X and we are interested in finding a map X -> Y. If we use a learner with x1 and x2, then this amounts to finding a function:
X -> g(X) -> h(g(X)) -> Y
where (x1, x2) = g(X). Now we have an inductive bias, since the function the learner must pick is the total composition h(g(X)), and the constraint arises because it must include g(X).
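To make that framing concrete, a minimal sketch (all names here are made up for illustration): the feature map g is fixed by the feature choice, the learner only fits h, so every reachable hypothesis has the composed form h(g(X)).

```python
def g(house):
    # fixed feature map: keep only area (x1) and number of tables (x2)
    return (house["area_m2"], house["n_tables"])

def fit(training_houses, training_prices):
    # toy h: a price-per-square-meter rate estimated from the pairs (g(X), y)
    features = [g(house) for house in training_houses]
    rate = sum(price / area for (area, _), price in zip(features, training_prices))
    rate /= len(training_prices)

    def h(area, n_tables):
        return rate * area

    return lambda house: h(*g(house))   # the learned model is the composition h(g(X))

model = fit([{"area_m2": 80, "n_tables": 2}, {"area_m2": 100, "n_tables": 1}],
            [320_000, 400_000])
print(model({"area_m2": 120, "n_tables": 3}))   # 480000.0
```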