5 Comments
Jan 9 · Liked by Christoph Molnar

Hey Christoph, great post.

One of the most common mistakes I see people make with non-i.i.d. data involves data augmentation. As you explain in the post, the details of the data splitting are crucial. In NLP projects I have often seen people augment their data and then cross-validate on that augmented set. However, this order of execution lets examples that are only small variations of the training data leak into the test set. This way, augmentation always seems to improve the system 😏. To assess the effect of augmentation on test performance correctly, you have to first split the data and then augment only your training data.
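
A minimal sketch of this split-then-augment order in R (the toy data and the `augment_text()` helper are hypothetical placeholders for a real augmentation step such as synonym replacement or back-translation):

```r
set.seed(42)

# Hypothetical augmentation step: stands in for synonym replacement,
# back-translation, etc. Returns the original rows plus perturbed copies.
augment_text <- function(df) {
  augmented <- df
  augmented$text <- paste(df$text, "(augmented)")
  rbind(df, augmented)
}

# Toy labelled data set
dat <- data.frame(
  text  = paste("example", 1:100),
  label = sample(c(0, 1), 100, replace = TRUE)
)

# 1) Split on the ORIGINAL examples first
test_idx  <- sample(nrow(dat), size = 20)
test_set  <- dat[test_idx, ]
train_set <- dat[-test_idx, ]

# 2) Augment only the training split; the test set stays untouched, so no
#    near-duplicates of training examples can leak into it. With k-fold
#    cross-validation, apply the same logic inside each fold: augment the
#    training folds only, never the held-out fold.
train_aug <- augment_text(train_set)
```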

Jan 9 · Liked by Christoph Molnar

In economics, group dependencies like the classroom effect you've mentioned are taken into account. When modelling such data, economists usually go for the wild cluster bootstrap.

There is an R package called fwildclusterboot by Alexander Fischer and David Roodman. Their package is based on Roodman et al. (2019) (https://econpapers.repec.org/paper/qedwpaper/1406.htm) and MacKinnon, Nielsen & Webb (2022) (https://www.econ.queensu.ca/sites/econ.queensu.ca/files/wpaper/qed_wp_1485.pdf).
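
A minimal sketch of how the package's `boottest()` function can be used on clustered data; the toy data set and variable names below are made up for illustration:

```r
library(fwildclusterboot)

set.seed(123)

# Made-up classroom data: pupils nested in classrooms (the cluster variable)
toy <- data.frame(
  score     = rnorm(500),
  treatment = rbinom(500, 1, 0.5),
  classroom = factor(sample(1:20, 500, replace = TRUE))
)

# Ordinary least squares fit; the point estimates ignore the clustering
fit <- lm(score ~ treatment, data = toy)

# Wild cluster bootstrap inference for the treatment coefficient,
# clustering at the classroom level
boot <- boottest(fit, clustid = "classroom", param = "treatment", B = 9999)
summary(boot)
```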


Do you recommend any book on this topic where we can learn more? :)
