Hey Christoph, great post.
One of the most common mistakes I see people make with non-i.i.d. data involves data augmentation. As you explain in the post, the details of the data splitting are crucial. In NLP projects I have often seen people augment their data and then cross-validate on that augmented set. This order of execution lets examples that are only small variations of the training data leak into the test set, so augmentation always seems to improve the system 😏. To assess the effect of augmentation on test performance correctly, you have to split the data first and only then augment the training data.
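Roughly, the safe order looks like this (a minimal sketch with scikit-learn; `augment` is a hypothetical helper standing in for whatever augmentation method you use):

```python
# Minimal sketch of "split first, then augment" to avoid leakage.
from sklearn.model_selection import train_test_split

def augment(text):
    # Placeholder for a real augmentation method (synonym swap, back-translation, ...)
    return [text.lower(), text.upper()]

texts = ["A great movie", "Terrible plot", "Loved the acting", "Would not watch again"]
labels = [1, 0, 1, 0]

# 1) Split the original examples first
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=0
)

# 2) Augment only the training portion; the test set stays untouched
X_train_aug, y_train_aug = list(X_train), list(y_train)
for text, label in zip(X_train, y_train):
    for variant in augment(text):
        X_train_aug.append(variant)
        y_train_aug.append(label)
```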
Great example. I guess data augmentation falls into the "repeated measurements" type of correlation.
In economics, group dependencies like the classroom effect you mentioned are routinely taken into account. When modelling such data, economists usually go for the wild cluster bootstrap.
There is an R package called fwildclusterboot by Alexander Fischer and David Roodman. The package is based on Roodman et al. (2019) (https://econpapers.repec.org/paper/qedwpaper/1406.htm) and MacKinnon, Nielsen & Webb (2022) (https://www.econ.queensu.ca/sites/econ.queensu.ca/files/wpaper/qed_wp_1485.pdf).
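For intuition, the core mechanic can be sketched in a few lines (a toy sketch in Python, not the fwildclusterboot implementation; the data and regression are made up):

```python
# Toy wild cluster bootstrap for an OLS slope (made-up data).
import numpy as np

rng = np.random.default_rng(0)

# 20 clusters (e.g. classrooms) with 30 observations each
cluster = np.repeat(np.arange(20), 30)
x = rng.normal(size=cluster.size)
y = 1.0 + 0.5 * x + rng.normal(size=cluster.size)

# Fit OLS and keep fitted values and residuals
X = np.column_stack([np.ones_like(x), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted, resid = X @ beta_hat, y - X @ beta_hat

# Wild cluster bootstrap: flip the residual signs per cluster
# (Rademacher weights), rebuild y*, refit, and collect the slope
boot_slopes = []
for _ in range(999):
    w = rng.choice([-1.0, 1.0], size=cluster.max() + 1)
    y_star = fitted + resid * w[cluster]
    b, *_ = np.linalg.lstsq(X, y_star, rcond=None)
    boot_slopes.append(b[1])

# The spread of the bootstrap slopes reflects cluster-robust uncertainty
print(np.std(boot_slopes))
```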
Nice post!
We faced a similar choice while creating a credit risk model. When predicting the default probability of a deal, should you model at the deal level or at the deal-client level? If you model at the deal-client level, the data points belonging to the same deal are non-i.i.d., which creates small clusters of non-i.i.d. data. GroupKFold should alleviate some of the issues. The better approach is to model at the deal level, so your data is i.i.d.
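For anyone curious, a minimal sketch of the grouped split with scikit-learn's GroupKFold (the deal ids, features, and model here are made up):

```python
# Cross-validate deal-client rows while keeping all rows of one deal
# in the same fold, so deals never leak between train and test.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
n_rows = 300                                  # deal-client rows
deal_id = rng.integers(0, 60, size=n_rows)    # ~60 deals, several clients each
X = rng.normal(size=(n_rows, 5))              # made-up features
y = rng.integers(0, 2, size=n_rows)           # default / no default

cv = GroupKFold(n_splits=5)
scores = cross_val_score(
    GradientBoostingClassifier(), X, y, cv=cv, groups=deal_id, scoring="roc_auc"
)
print(scores.mean())
```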
Do you recommend any book on this topic where we can learn more? :)
I don't know of a good book on non-i.i.d. data + ML. I'll let you know when I find one.