Discussion about this post

yashas Nadig

Is there any specific advantage to using Elo-like benchmarking instead of general benchmarking, as we do with LLMs?

Victor GUILLER

Thanks for this article presenting TabArena, very instructive!

Looking at its recent use for modern AutoML systems and TFMs, I have some questions about TabArena:

- How diverse and representative are these final 51 datasets with respect to real-world use cases?

- How can we ensure that none of these validation datasets is highly similar to one of the synthetic datasets the TFMs were pre-trained on (which could lead to data leakage and over-optimistic performance assessments)?

- As Cassie Kozyrkov explains in one of her ML lectures, "repeatedly validating model after model pollutes your validation data and erodes your protection against overfitting". In this case, are there plans to add or change datasets in TabArena so that newer models do not overfit to the validation sets?

Curious about your thoughts on these questions :)
