Two days ago, I talked to a causal ML researcher. They publish at ML conferences, and one problem they have is reviewers asking for benchmarks that don't make much sense. This gave me a flashback to my PhD on ML interpretability, where I had the same discussions. Interpretable ML methods don't lend themselves to benchmarks because we don't have real datasets with the ground truth SHAP values or the ground truth partial dependence curves.
You could easily dismiss it as bad luck with reviewers unfamiliar with causality or interpretability. But the issue runs deeper. It's a machine-learning mindset thing. Especially in supervised machine learning, it's all about beating the benchmark, beating the state of the art, winning the Kaggle competition, and finding the best hyperparameters. We see the same thing happening with LLMs, where the discussion is also about which LLM is better on which benchmark.
I'm torn about this hyper-focus on competition and benchmarks. Ruthless evaluation is a big reason machine learning works so well for many tasks. It acts as a great filter, and it gave us xgboost and transformers. But this focus can also hinder other lines of research and modeling considerations (causality, robustness, interpretability, …). The obsession with benchmarks and SOTA runs deep:
People on social media arguing over which ML algorithm is better.
Difficulties in publishing new approaches that don’t beat the state-of-the-art.
LLM evaluation based on benchmarks, even when the models start memorizing them.
The hope is that performance on these benchmark tasks and datasets is predictive of performance on new datasets. Ideally, the benchmark datasets would be representative of the typical dataset you will work on in the future. But it's not like we can sample from the distribution of datasets. Benchmarks are guided by which datasets are openly available (a huge selection bias already) and which datasets are convenient to use (for example, a clean CSV file rather than some wild Excel construct). Benchmarks are not representative samples; they are arbitrary samples.
So while benchmarks are essential, we shouldn't be too obsessed with them. The no-free-lunch theorem and Goodhart's law also advise us not to be.
And by benchmarks, I mean comparing against some ground truth in real data.
New benchmark for benchmark takes