A lot of machine learning research has detached itself from solving real problems and created its own "benchmark islands".
How are benchmark islands created? Why do researchers choose to live on them?
Let’s look at an example of a benchmark island: hundreds, probably thousands of papers about COVID-19 diagnosis from X-ray images. The initial motivation: A COVID-19 diagnostic tool based on X-ray images would be useful in everyday clinical work.
The first papers published on this topic paved the way for more research:
They established motivation and relevance for the topic: a precedent that others can refer to.
The papers might come with open data and code. They did in the case of COVID-19 X-ray images.
They provide a baseline against which to compare new models. This allows the creation of benchmarks.
To create a benchmark island, we now need more researchers to enter the field. While their papers might restate the original motivation (COVID-19 classification), they are really comparison papers. This shift matters because it changes the research objective: predictive performance becomes the measure of progress, even as the improvements get smaller, and the original research motivation becomes an unchallenged assumption. A paper now only needs a sprinkle of novelty and to "beat the benchmark" to get published.
At some point, this becomes its own little community, whose members cite and review each other: a benchmark island.
This is not the fault of individual researchers but a systemic problem:
There is strong pressure to publish (publish or perish), and a benchmark island provides a convenient publishing environment.
It’s easier to get funding and traction for “trendy” topics. What is better than mixing AI and world events?
If a single paper is out of touch with reality, that’s not a problem. But if a lot of them are, it is.
More speculative, courageous, and interdisciplinary research is difficult to perform, to get funded, and to publish.
Am I being too critical here? Maybe this particular benchmark island is slowly being reconnected to the Real World ™️. Maybe each paper is a crucial building block toward the initial motivation. Well, Wynants et al. (2020) did a systematic review of 232 prediction models for COVID-19. Only 2 were promising. All the others had problems that made them unusable. Conclusion: it’s more than fair to criticize this benchmark island.
Real progress would mean asking the difficult questions: Do clinicians actually need this? How could it be implemented in practice? This type of work is difficult and interdisciplinary, and it stretches the bounds of academia. Much less convenient than tinkering with ML algorithms.
This "benchmark island" problem is actually identical to the one in many social science disciplines that rely on experimentation (known as mutual-internal-validity problem). Paper in case you're interested (and it's my paper): https://doi.org/10.1177/174569162097477
I think another reason for these benchmark islands is that academia, maybe due to its limited access to real-world data and industrial use cases, focuses mainly on innovations in learning algorithms. Breaking this trend would require more collaboration with industry, but such collaborations are hard when the company is not open to publishing its research.