3 Comments
yashas Nadig:

Is there any specific advantage of using Elo-like benchmarking instead of general benchmarking, like how we do it with LLMs?

Christoph Molnar:

These are different situations:

For LLMs, we can't boil it down to one single metric, because LLMs are so general-purpose. And the metrics that we optimize during training are not the same ones we would use for evaluation.

For tabular supervised machine learning, we can easily say which model is better, at least for a given task, because we typically have a meaningful evaluation metric. But across different tasks with different metrics (regression, binary classification, and multi-class), we can't aggregate the metrics directly. So Elo is one way to aggregate across tasks. In my opinion a meaningful one, but with the caveat that it drops the margin by which models win.
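To make the trade-off concrete, here is a minimal sketch of Elo aggregation across tasks. This is illustrative only, not TabArena's actual implementation; the model names, scores, and K-factor are made up. Note how only the per-task ordering enters the update, so a blowout counts the same as a narrow win:

```python
def expected_score(r_a, r_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(ratings, winner, loser, k=32):
    """Shift both ratings after one pairwise comparison."""
    e_w = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1 - e_w)
    ratings[loser] -= k * (1 - e_w)

# Hypothetical per-task scores (higher = better). The metric can
# differ across tasks because only the ordering is used.
task_scores = [
    {"model_a": 0.91, "model_b": 0.88},  # e.g. AUC on task 1
    {"model_a": 0.42, "model_b": 0.45},  # e.g. R^2 on task 2
    {"model_a": 0.97, "model_b": 0.80},  # blowout, counts like a narrow win
]

ratings = {"model_a": 1000.0, "model_b": 1000.0}
for scores in task_scores:
    # Rank the two models on this task; margin of victory is dropped.
    winner, loser = sorted(scores, key=scores.get, reverse=True)
    update_elo(ratings, winner, loser)

print(ratings)  # model_a wins 2 of 3 tasks, so it ends up rated higher
```

Whether dropping the margin is acceptable depends on the use case: it makes heterogeneous metrics comparable, at the cost of treating a 0.17 AUC gap and a 0.01 gap identically.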

Victor GUILLER:

Thanks for this article presenting TabArena, very instructive!

Looking at its recent use for modern AutoML and TFMs models, I have some questions about TabArena:

- How diverse and representative are these final 51 datasets with regard to real-world use-case scenarios?

- How can we ensure that none of these validation datasets is highly similar to one of the synthetic datasets the TFMs were pre-trained on (which could lead to data leakage and an over-optimistic performance assessment)?

- As Cassie Kozyrkov explains in one of her ML lectures, "repeatedly validating model after model pollutes your validation data and erodes your protection against overfitting". With that in mind, is it planned to add or rotate datasets in TabArena so that newer models cannot overfit to the validation sets?

Curious about your thoughts on these questions :)