Defending machine learning in a room full of old-school statisticians
A story of my little statistical rebellion
In 2013, I participated in a Master’s seminar on statistical modeling. To get credits we had to present a paper and discuss it with the prof and the other students. We got to choose our own topic to present and I picked the famous "Statistical Modeling: The Two Cultures" by Leo Breiman.
If you are not familiar with the paper, it’s about two different approaches to modeling data. We can either try to come up with a classic statistical model — random variables, interpretable models, coefficients, and all — that explains some phenomenon, or we can treat it as a prediction problem and allow more algorithm-driven modeling. Leo Breiman argued that we should use more of the second type, algorithmic modeling, which I would call machine learning. See also The Statistician Who Loved Machine Learning. The paper was quite controversial, but for me, it was an eye-opener.
The seminar was part of my Master’s in statistics, which I pursued after a Bachelor’s in statistics. You could say that my education was in the first type of modeling, classic statistical modeling, which Breiman called “data modeling”. But between my Bachelor's and Master’s, I started diving into machine learning, competing in Kaggle challenges, and reading up on machine learning approaches in “The Elements of Statistical Learning”.
I was hooked on the second type of culture, the “algorithmic modeling” culture.
When I presented the Two Cultures paper, I also adopted Breiman’s position that statisticians have a blind spot and should favor machine learning more. If you are interested, my slides are still online.
After telling all the statisticians in the room that they should do machine learning, a discussion ensued.
On one side was me, siding with machine learning. I had discovered for myself random forests, the idea of the generalization error, and other wonders of prediction. And I was willing to defend my new views.
On the other side were the prof, his assistant, and most of the students, siding with classical statistical modeling. That makes sense, since all you learned at that Statistics Institute, at least in my time, was classic statistical modeling.
The prof argued that this over-parameterized, hyper-prediction-focused approach (aka machine learning) would only mean that the model would "parrot the error" of the data. I understood this critique as the danger of overfitting. So I repeatedly explained that this problem is "solved" by separating training, tuning, and evaluation steps. But we never reached common ground in the discussion.
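In hindsight, the point I was trying to make is easiest to show in code. Here is a minimal sketch, assuming scikit-learn and a random forest (a hypothetical illustration of the train/tune/evaluate split, not anything from the seminar): hyperparameters are tuned by cross-validation on the training data only, and the generalization error is estimated on a held-out test set the model never sees.

```python
# Hypothetical sketch of separating training, tuning, and evaluation
# using scikit-learn and a random forest.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set that is never touched during training or tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Tune hyperparameters with cross-validation on the training data only.
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"max_depth": [3, 5, None], "n_estimators": [100, 300]},
    cv=5,
)
search.fit(X_train, y_train)

# Estimate the generalization error on the untouched test set.
print("Test accuracy:", search.score(X_test, y_test))
```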
How could the prof not understand the ML mindset? Today I interpret the prof’s statement about "parroting the error" differently. He might have meant that, in general, one should have lots of functional constraints on the model (like linearity) for reasons of robustness and interpretability.
Another part of the discussion was about the general lack of interpretability in machine learning. The discussion drifted a bit in the direction of distribution shift, where model interpretability would not only help you detect a distribution shift but actually understand it.
Despite the clash in mindsets, the discussion was very friendly. I don’t think I convinced anyone to try out machine learning, but it was great food for thought for me and hopefully for the others as well.
Lessons Learned
Not many people realize how caught up they are in their current modeling mindsets. Me included, at least at the time. I’m trying to be more conscious now about the assumptions I make and have an overview of other modeling mindsets or cultures out there.
The differences in mindset are truly cultural. This time around, I understand that the arguments are not to be interpreted as stand-alone but as embedded in the modeling mindset or culture they come from.
Interpretability is a major aspect for many modelers. I see this today with practitioners using machine learning to solve their problems. While interpretability isn’t “solved” for machine learning, we have made huge steps and have many tools at our disposal. See my book Interpretable Machine Learning.
Just a quick note.
Being a classically trained statistician, I am always surprised that people claim ML is newish. Most statistical methods for prediction were well known by the time of Leo's text (it was assigned reading when it came out in my first year of PhD math stat work in 2001, and I do not remember anyone finding it controversial or even that novel. Although, post graduation in 2006, it did seem to have a much bigger impact on the applied stats/biostats communities at my next department, causing rather big debates about who was a 'real' statistician vs. just a computer scientist. I was rather confused, to say the least.)
McLachlan's wonderful book on discriminant analysis and statistical pattern recognition is my personal go-to for all the methods pre-1992, with Elements of Statistical Learning filling in the gap up to 2010 or so. Now I tend to use the two new ProbML books by Murphy. Of particular importance are the classical works of Aitchison and Dunsmore on statistical prediction from the 70s and the related work on density estimation by Silverman. (Oh shoot, almost forgot Brian Ripley's texts, which I was actually trained out of in grad school in the early 2000s!!! We just tended to call these subjects multivariate statistics for some reason...)
I guess in general I do not think that statisticians ignored the problems in ML; at the time there was just more money in inference for clinical trials, basic science, and economics, focused on understanding why something happened (vs. being able to predict the future, as in ML or pattern recognition). Also, biggish data sets removed much of the need for theoretical foundations in the mid-2000s, which opened the door to ignoring the efficient use of observations, considerably lowering the theoretical threshold for entering the pattern recognition game.
Part of the problem is that there has been a radical renaming of algorithms in ML... As an example, if I am talking to someone in math stats I will typically use the language of empirical processes, whereas for ML I will use the language of PAC learning. Or maximum likelihood estimates of discriminant functions (stats) vs. Bayesian classification methods (ML).
Larry Wasserman's All of Statistics has a good breakdown of the breakdown in communication between stats and CS people.
Anyway enjoyed your article! Thank you.
Great piece, matches my experience with statisticians very closely (and by background I am one ;-)
- I totally agree with them on the need for interpretability, but their go-to example is logistic regression: how exactly is fixing all variables except for one realistic?
- My personal rabbit hole: I started checking how many papers in stats / econometrics report performance on the training set ONLY. Maybe I have a sample selection problem, but I walked away horrified.