No Free Dessert in Machine Learning
Generalization in machine learning is too narrow. If you want more, you have to give more.
A machine learning model generalizes well when it performs effectively on unseen data.
Generalization simply means that we transfer the ability to predict well from training data to unseen data.
Generalization is a powerful concept: we can’t train a model on unseen data, since by definition it then becomes "seen" data. Yet we have designed training procedures to measure a model's performance on unseen data and even to optimize for it (hyperparameter tuning, model selection, and feature engineering). The solution to the generalization problem is data splitting. It can get complex, though, with techniques such as repeated nested cross-validation, or when the data are not IID.
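As a minimal sketch of what that looks like in practice (scikit-learn, a synthetic dataset, and the hyperparameter grid are my own illustrative choices, not from the post):

```python
# Sketch: estimating performance on "unseen" data via data splitting.
# Dataset, model, and hyperparameter grid are illustrative placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# Simple split: the held-out test set plays the role of "unseen" data.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("hold-out accuracy:", model.score(X_test, y_test))

# Nested cross-validation: the inner loop tunes hyperparameters,
# the outer loop estimates the generalization error of the whole procedure.
inner = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, None]},
    cv=3,
)
outer_scores = cross_val_score(inner, X, y, cv=5)
print("nested CV accuracy:", outer_scores.mean())
```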
Not only have we developed the “art” of machine learning, but there’s also science to back up the process of learning. Statistical learning theory tells us all about the consistency of estimators, about bounds on the generalization error, that the error decomposes into bias + variance + irreducible error, and so on.
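For squared loss, that decomposition takes the familiar textbook form, where f̂ is the learned model, f the true function, and σ² the irreducible noise:

```latex
% Bias-variance decomposition of the expected squared error at a fixed input x
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```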
Writing the book Supervised Machine Learning for Science made me think more deeply about generalization. And more often than not, we need more than generalization from seen to unseen.
Seen-to-unseen generalization is narrow. This type of generalization only works when the training data and the application data come from the same fountain (read: are samples from the same distribution). Plus, ideally, the training data are independent, and please, please, no distribution shifts. These scenarios exist mostly in introductory machine learning courses.
What we usually desire is generalization from training to application. However, the application data will rarely behave just like the training data. Training-to-application is a much trickier type of generalization.
There’s a third type of generalization, which I would call sample-to-population. This type of generalization is when you transfer any kind of insight from your model to a larger context. Some examples:
A researcher uses ML to predict almond yield and draws a partial dependence plot for the effect of fertilizer. The plot is interpreted to inform farming practices (a minimal sketch of such a computation follows this list).
A paper claims that a model can diagnose pneumonia from chest X-rays better than radiologists.
A company uses machine learning predictions to report the average churn probabilities for certain subgroups of customers.
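Here is the rough sketch of a partial dependence computation mentioned in the first example (the data, features, and model are invented placeholders, not the actual almond study):

```python
# Sketch: partial dependence of the prediction on one feature
# (think of feature 0 as "fertilizer" in the almond-yield example).
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = make_regression(n_samples=500, n_features=5, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Average prediction as feature 0 is varied over a grid,
# with the other features kept at their observed values.
PartialDependenceDisplay.from_estimator(model, X, features=[0])
plt.show()
```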
In all these cases the modeler might claim that it’s only a statement about the training data. Or, more typically, they don’t spell out how far their claim will carry. But as a statistician, I’ve learned that such results are at risk of being generalized to a larger setting. If not by the modeler, then by someone else.
But to achieve sample-to-population generalization, you need to pay special attention to the data, especially to whether they are representative of the population about which you make a claim. If you say that your model is better than a radiologist, you have to ask: What does the test set represent? Does it represent a typical set of X-ray images a radiologist sees annually? The same distribution of healthy/non-healthy? Similar difficulties? Similar quality?
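One small, concrete check along these lines: compare the test set’s disease prevalence to the prevalence in the setting you claim to generalize to (the numbers below are made-up placeholders):

```python
# Sketch: is the test set's class balance compatible with the population
# the claim is about? Prevalence values here are invented placeholders.
from scipy.stats import binomtest

n_test, n_positive = 500, 150        # test-set size and number of non-healthy cases
population_prevalence = 0.05         # assumed prevalence in routine radiology practice

result = binomtest(k=n_positive, n=n_test, p=population_prevalence)
print(f"test-set prevalence: {n_positive / n_test:.1%}, "
      f"p-value vs. assumed population prevalence: {result.pvalue:.2g}")
```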
No Free Dessert
The No Free Lunch theorem in machine learning says that there is no single best predictor: averaged across all possible problems, all predictors perform equally well.
That isn’t what we observe in reality, of course: a deep neural network works better than a random number generator because, when we deploy our models, they are not confronted with “all possible problems”, and we can trust that the world has some predictability.
A takeaway from the No Free Lunch theorem is that even for the simplest form of generalization, the one from seen to unseen data, you have to make assumptions. You have to take an inductive leap, like assuming that unseen data will come from the same distribution.
And, unfortunately, there is no free dessert either. Even if we have established seen-to-unseen generalization, we have to “pay” for any further generalization. When it comes to generalization from training to application or from sample to population, we need to make even more assumptions and put in extra effort. And sometimes we might not achieve these generalizations after all. Generalization is never free.
How do you pay for training-to-application generalization?
Describe how the training data was generated
Identify selection biases in the data-generating process
Make assumptions about the distribution of the application data
Anticipate and prepare for distribution shifts
Monitor the model in production (a minimal drift-check sketch follows this list)
Conduct external evaluations
Make the model more robust
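To make the monitoring item concrete, here is a minimal drift-check sketch (the feature data, sample sizes, and significance threshold are all placeholder choices on my part):

```python
# Sketch: a per-feature distribution-shift check between training data
# and recent production data, using a two-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(1000, 3))   # features seen during training
X_prod = rng.normal(0.3, 1.0, size=(200, 3))     # features from recent production traffic

for j in range(X_train.shape[1]):
    statistic, p_value = ks_2samp(X_train[:, j], X_prod[:, j])
    if p_value < 0.01:                           # placeholder alert threshold
        print(f"feature {j}: possible drift (KS statistic {statistic:.2f}, p={p_value:.2g})")
```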
The same goes for sample-to-population generalization. Many things to do:
Describe the population about which you want to make a statement
Study how representative the sample is of that population
Maybe re-weight the sample to match the population
Calibrate the model, if it’s a classifier (a calibration sketch follows this list)
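As a sketch of that calibration step (dataset and base model are placeholders; the point is that reported subgroup probabilities, like the churn example above, should come from calibrated scores):

```python
# Sketch: post-hoc probability calibration before reporting average probabilities.
# Dataset and base classifier are illustrative placeholders.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the base classifier plus an isotonic calibration map on cross-validation folds.
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(random_state=0), method="isotonic", cv=5
).fit(X_train, y_train)

# Calibrated probabilities are what you would aggregate (or re-weight toward
# the population) before reporting quantities such as average churn risk.
print("average predicted risk:", calibrated.predict_proba(X_test)[:, 1].mean())
```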
Generalization is always going to be a bit messy and tedious. But I believe it’s worth having more discussions about generalization and being more explicit about the reach of the models.