The Way Of Model-Agnostic Machine Learning
What it means to always pick the best performing model
One of my first steps into machine learning was the book “Elements of Statistical Learning”.
The book covers the theory behind machine learning models such as SVMs, gradient boosting, and neural networks. I was fascinated by all the different motivations and workings behind these ML algorithms. The kernel trick for SVMs, the de-correlated ensembles of trees in random forests, and the beauty behind the gradient updates in boosting.
Not only do models differ in their algorithms, but they come with different perks or entirely different philosophies. Boosting allows you to fit additive models, random forests have a built-in measure of feature importance, Bayesian models come with an entire philosophy and interpretation of probability, and regularized linear models (LASSO) can do feature selection.
Choosing a model class felt like a personal choice, like picking a character in a video game. Do you want to play the archer? A mage? Or wield swords? I became a fan of random forests, because they usually work reasonably well, and have built-in interpretability and other tools.
Picking favorites, however, ultimately contradicts the following maxim:
Select the model with the best predictive performance on out-of-sample data.
And you simply can’t know, for a new task, which model will work the best. In machine learning, this maxim isn’t really controversial or surprising. But it’s not always followed to its full conclusion.
Model-agnostic modeling as a consequence of model selection
Let’s assume your main goal is to select the best-performing model, based on an evaluation setup with out-of-sample data.
Naturally, there will be constraints, like ruling out models that are too slow or not using certain features for legal reasons. But all such constraints should be practical ones that come from the “outside”, not the “inside”. An “inside” example: data scientists on the team using random forests because they already have a script for it, or using linear regression because “that’s how we’ve always done it.”
Taking the best-performance maxim to its full conclusion means that any model could be the best-performing one. Therefore you shouldn’t be attached to the particular outcome of your model selection process. But if any model can come out of the process, this has a lot of implications for everything downstream of your model.
Taking performance-driven modeling to its full conclusion means embracing model-agnostic machine learning.
For most machine learners, that’s not a secret. If you have ever participated in a Kaggle challenge, you know you won’t get far without a model-agnostic mindset. The leaderboard doesn’t care which model you used. But still, let’s walk through what the way of model-agnostic machine learning entails:
Software: Ideally, use AutoML or software services that do AutoML for you. Or you might have your own tuning and model selection pipeline (e.g. in sklearn).
API: Whatever the final model, the API should be the same. That’s why sklearn is such a strong library: it gives different types of models the same interface. Making predictions with a linear model is the same as for a random forest. Switching out the model can be as short as a 1-line change.
Use model-agnostic interpretation methods such as permutation feature importance, ALE plots, and Shapley values. Ignore model-specific interpretations like the random forest feature importance.
Choose the evaluation metric before making modeling choices. Don’t use metrics like AIC or BIC that are based on model assumptions.
Use model-agnostic uncertainty quantification such as conformal prediction: conformal prediction produces prediction sets and intervals without relying on model specifics, in contrast to, for example, Bayesian posteriors. You can find my introductory book on conformal prediction here.
Choose model constraints based on the task, not based on your model choice. For example, only restrict a feature effect to be linear because of domain knowledge, not because it’s convenient for fitting and interpretation.
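The uniform-API point above can be sketched in a few lines of sklearn. This is just an illustration with arbitrarily chosen models and data: swapping the model is a one-line change, and model-agnostic interpretation (here, permutation feature importance) works the same regardless of which model you picked.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Swapping the model is a one-line change; everything downstream stays the same.
model = RandomForestRegressor(random_state=0)  # or: LinearRegression()
model.fit(X_train, y_train)
print("out-of-sample R^2:", model.score(X_test, y_test))

# Model-agnostic interpretation: permutation feature importance
# treats the fitted model as a black box.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
print("feature importances:", result.importances_mean)
```

Because every estimator exposes the same `fit`/`predict`/`score` interface, nothing downstream needs to know which model won the selection process.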
Model-agnostic machine learning means fully embracing the fact that any model might be the best-performing and acting accordingly.
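To make the conformal prediction point concrete, here is a minimal sketch of split conformal prediction for regression (my own toy example, assuming exchangeable data; the model and dataset are arbitrary stand-ins). The model is treated entirely as a black box: residuals on a calibration set are turned into prediction intervals with a coverage guarantee.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=5, noise=10, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_cal, X_test, y_cal, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Fit any model; conformal prediction only needs its predictions.
model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

# Calibration: absolute residuals on held-out data are the conformity scores.
scores = np.abs(y_cal - model.predict(X_cal))
alpha = 0.1  # target 90% coverage
level = np.ceil((len(scores) + 1) * (1 - alpha)) / len(scores)
q = np.quantile(scores, level)

# Prediction intervals: point prediction +/- the calibrated quantile.
preds = model.predict(X_test)
lower, upper = preds - q, preds + q
coverage = np.mean((y_test >= lower) & (y_test <= upper))
print(f"empirical coverage: {coverage:.2f}")  # should be close to 0.90
```

Replace the gradient boosting model with any other regressor and the calibration step stays exactly the same, which is the model-agnostic appeal.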
Why machine learning is often model-specific
Not everyone accepts the best-performance maxim; some have a mix of modeling goals. If you a priori (pun intended) only consider Bayesian models, you don’t follow the maxim, but you have other guiding principles for modeling.
But even if you want to follow the maxim, you might not be able to: the context of the model and of your work might make it hard to do so in a model-agnostic way. A handful of examples:
Your client might be used to the interpretation of linear regression models and therefore not accept other models.
The model in production might be rather entangled with the rest of the backend. For example, because it’s a hard-coded tree, or it’s written in a programming language that doesn’t offer many machine learning algorithms.
Switching out the model would have a domino effect of changes in downstream software services.
You have a favorite algorithm.
There isn’t enough time or resources to do proper model selection, so just one go-to model is trained.
The data are prepared in a very specific way for a specific ML algorithm. Exchanging this would be tedious.
The hardware is only suitable for a subset of algorithms. For example, no GPU is available.
The people working with the model are used to the specific way the model is to be interpreted, e.g. decision rules.
Uncertainty quantification is based on the variance between trees in a random forest. And this metric is used in dashboards, reports, etc., which would be painful to change and educate everyone about.
You want to publish a paper, but the prevailing culture in your field only accepts, say, GAMs and similar models, and you would have a hard time arguing for your xgboost model.
The underlying reasons for staying model-specific range from cultural and technical reasons to convenience. So in real life, model-agnostic modeling is more of a sliding scale than an absolute yes-or-no question.
Rule of thumb: The more you have to change the environment after exchanging the model, the less model-agnostic the setup is.
Model-agnostic modeling — Why now?
As I’ve argued, by following the pure best-performance maxim, you end up with a model-agnostic setup, at least in theory. That isn’t new or surprising. What’s different now is that many of the puzzle pieces needed for model-agnostic modeling are coming together, many only very recently. Here are some factors that make model-agnostic modeling possible in practice:
We have made great leaps in model-agnostic interpretation methods. Before that, you were mostly limited to picking inherently interpretable models if interpretability was important.
But we are also getting more tools in other areas. For uncertainty quantification, for example, we have conformal prediction, which is a hot research topic at the moment.
More and more companies offer machine learning as a service.
With every improvement in processors, it becomes easier to tune and compare models.
Machine learning is becoming more and more of a commodity, a function that you just call with an API to get a prediction.
Machine learning research moves fast. The more model-agnostic your work, the better your implementations will age.
Personally, I’m fascinated by all of the model-agnostic “periphery”. That’s why I wrote Interpretable Machine Learning, which focuses on model-agnostic methods, and Introduction to Conformal Prediction with Python. These are my personal investments in model-agnostic modeling, which I believe to be a good bet for the future.
In other news: the print version of the conformal prediction book is coming along
I’m working on the print version of the conformal prediction book. It should be ready in a week or two. I already have the test print and I’m making some final adjustments. Stay tuned.
If you haven't, check out the ebook version on Leanpub: