3 Comments
yashas Nadig:

Is there any specific advantage of using Elo-like benchmarking instead of general benchmarking, like how we do it with LLMs?

Christoph Molnar:

These are different situations:

For LLMs, we can't boil it down to one single metric, because LLMs are so general-purpose. And the metrics that we optimize during training are not the same ones we would use for evaluation.

For tabular supervised machine learning, we can easily say which model is better, at least for a given task, because we typically have a meaningful evaluation metric. But across different tasks with different metrics (regression, binary classification, and multi-class), we can't aggregate the metrics directly. So Elo is one way to aggregate across tasks. In my opinion a meaningful one, but with the caveat that it drops the margin by which models win.
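To make the trade-off concrete, here is a minimal sketch of Elo aggregation across tasks. This is illustrative only, not TabArena's actual implementation; the model names, scores, and K-factor are made up. Note how only the per-task ordering enters the update, so a blowout counts the same as a narrow win:

```python
def expected_score(r_a, r_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(ratings, winner, loser, k=32):
    """Shift both ratings after one pairwise comparison."""
    e_w = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1 - e_w)
    ratings[loser] -= k * (1 - e_w)

# Hypothetical per-task scores (higher = better). The metric can
# differ across tasks because only the ordering is used.
task_scores = [
    {"model_a": 0.91, "model_b": 0.88},  # e.g. AUC on task 1
    {"model_a": 0.42, "model_b": 0.45},  # e.g. R^2 on task 2
    {"model_a": 0.97, "model_b": 0.80},  # blowout, counts like a narrow win
]

ratings = {"model_a": 1000.0, "model_b": 1000.0}
for scores in task_scores:
    # Rank the two models on this task; margin of victory is dropped.
    winner, loser = sorted(scores, key=scores.get, reverse=True)
    update_elo(ratings, winner, loser)

print(ratings)  # model_a wins 2 of 3 tasks, so it ends up rated higher
```

Whether dropping the margin is acceptable depends on the use case: it makes heterogeneous metrics comparable, at the cost of treating a 0.17 AUC gap and a 0.01 gap identically.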

Victor GUILLER:

Thanks for this article presenting TabArena, very instructive!

Looking at its recent use for modern AutoML and TFMs models, I have some questions about TabArena:

- How diverse and representative are these final 51 datasets with regard to real-world use-case scenarios?

- How can we ensure that none of these validation datasets is highly similar to one of the synthetic datasets the TFMs were pre-trained on (which could lead to data leakage and an over-optimistic performance assessment)?

- As Cassie Kozyrkov explains in one of her ML lectures, "repeatedly validating model after model pollutes your validation data and erodes your protection against overfitting". With that in mind, is it planned to add or rotate datasets in TabArena so that newer models cannot overfit to the validation sets?

Curious about your thoughts on these questions :)