Hey Christoph, great post.
One of the most common mistakes I see people make with non-i.i.d. data involves data augmentation. As you explain in the post, the details of the data splitting are crucial. In NLP projects I have often seen people augment their data and then cross-validate on that augmented set. This order of execution lets examples that are only small variations of the training data leak into the test set, so augmentation always seems to improve the system 😏. To assess the effect of augmentation on test performance correctly, you have to split the data first and only then augment the training data.
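Roughly, the safe order looks like this (a minimal sketch with scikit-learn; `augment` is a hypothetical helper standing in for whatever augmentation method you use):

```python
# Minimal sketch of "split first, then augment" to avoid leakage.
from sklearn.model_selection import train_test_split

def augment(text):
    # Placeholder for a real augmentation method (synonym swap, back-translation, ...)
    return [text.lower(), text.upper()]

texts = ["A great movie", "Terrible plot", "Loved the acting", "Would not watch again"]
labels = [1, 0, 1, 0]

# 1) Split the original examples first
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=0
)

# 2) Augment only the training portion; the test set stays untouched
X_train_aug, y_train_aug = list(X_train), list(y_train)
for text, label in zip(X_train, y_train):
    for variant in augment(text):
        X_train_aug.append(variant)
        y_train_aug.append(label)
```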
Great example. I guess data augmentation falls into the "repeated measurements" type of correlation.
In economics, group dependencies like the classroom effect you mentioned are routinely taken into account. When modelling such data, economists usually go for the wild cluster bootstrap.
There is an R package called fwildclusterboot by Alexander Fischer and David Roodman. The package is based on Roodman et al. (2019) (https://econpapers.repec.org/paper/qedwpaper/1406.htm) and MacKinnon, Nielsen & Webb (2022) (https://www.econ.queensu.ca/sites/econ.queensu.ca/files/wpaper/qed_wp_1485.pdf).
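For intuition, the core mechanic can be sketched in a few lines (a toy sketch in Python, not the fwildclusterboot implementation; the data and regression are made up):

```python
# Toy wild cluster bootstrap for an OLS slope (made-up data).
import numpy as np

rng = np.random.default_rng(0)

# 20 clusters (e.g. classrooms) with 30 observations each
cluster = np.repeat(np.arange(20), 30)
x = rng.normal(size=cluster.size)
y = 1.0 + 0.5 * x + rng.normal(size=cluster.size)

# Fit OLS and keep fitted values and residuals
X = np.column_stack([np.ones_like(x), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted, resid = X @ beta_hat, y - X @ beta_hat

# Wild cluster bootstrap: flip the residual signs per cluster
# (Rademacher weights), rebuild y*, refit, and collect the slope
boot_slopes = []
for _ in range(999):
    w = rng.choice([-1.0, 1.0], size=cluster.max() + 1)
    y_star = fitted + resid * w[cluster]
    b, *_ = np.linalg.lstsq(X, y_star, rcond=None)
    boot_slopes.append(b[1])

# The spread of the bootstrap slopes reflects cluster-robust uncertainty
print(np.std(boot_slopes))
```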
Nice post!
We faced a similar choice while creating a credit risk model. When predicting the default probability of a deal, should you model at the deal level or at the deal-client level? If you model at the deal-client level, the data points belonging to the same deal are non-i.i.d., which creates small clusters of non-i.i.d. data. GroupKFold should alleviate some of the issues. The better approach is to model at the deal level, so your data is i.i.d.
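For anyone curious, a minimal sketch of the grouped split with scikit-learn's GroupKFold (the deal ids, features, and model here are made up):

```python
# Cross-validate deal-client rows while keeping all rows of one deal
# in the same fold, so deals never leak between train and test.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
n_rows = 300                                  # deal-client rows
deal_id = rng.integers(0, 60, size=n_rows)    # ~60 deals, several clients each
X = rng.normal(size=(n_rows, 5))              # made-up features
y = rng.integers(0, 2, size=n_rows)           # default / no default

cv = GroupKFold(n_splits=5)
scores = cross_val_score(
    GradientBoostingClassifier(), X, y, cv=cv, groups=deal_id, scoring="roc_auc"
)
print(scores.mean())
```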
Do you recommend any book on this topic where we can learn more? :)
I don't know of a good book on non-i.i.d. data + ML. I'll let you know when I find one.