14 Comments
Sep 5 · Liked by Christoph Molnar

Agree with the whole article except one point: "do nothing" is not the right advice. You use cost-sensitive learning or a class-weighted cost function.
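For readers who want a concrete starting point, here is a minimal sketch of the class-weighted approach using scikit-learn's built-in class_weight option (the synthetic dataset and all parameters are purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced data: roughly 1% positives (illustrative only)
X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01], random_state=0)

# class_weight="balanced" rescales the loss by inverse class frequency,
# which is one simple form of cost-sensitive learning
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X, y)
```

One caveat: reweighting implicitly shifts the decision threshold, so the predicted probabilities are no longer calibrated to the original class frequencies.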


Great write-up.


I appreciate your perspective on the default strategy for imbalanced classification tasks, but I'd like to share a real-life scenario where "doing nothing" was not a viable option.

In our case, we were dealing with a dataset comprising billions of training examples, making it practically impossible to train a model with all the available data. Additionally, we faced the challenge of having very few positive examples, and achieving a high recall was crucial for our application.

If we had followed the "Do Nothing" approach and sampled the data uniformly, it would have resulted in the loss of most of the positive examples from the training data. This would have severely impacted the model's ability to learn from these crucial instances.

To overcome this challenge, we opted for a different approach. We decided to use all the available positive examples and then sample the negative ones. To ensure that the model didn't become overly biased towards the positive class, we assigned larger weights to the sampled negative examples. This approach allowed us to strike a balance and create a model that was sensitive to both positive and negative instances.
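A minimal sketch of that scheme, assuming binary labels in {0, 1} (the function name and the 1% keep rate are my illustration, not the commenter's actual pipeline):

```python
import numpy as np

def downsample_negatives(X, y, neg_keep_rate=0.01, seed=0):
    """Keep all positives, subsample negatives at neg_keep_rate, and
    upweight the kept negatives so the weighted class ratio matches
    the original data in expectation."""
    rng = np.random.default_rng(seed)
    pos = y == 1
    keep = pos | (rng.random(len(y)) < neg_keep_rate)
    # Each kept negative stands in for 1 / neg_keep_rate original negatives
    weights = np.where(pos[keep], 1.0, 1.0 / neg_keep_rate)
    return X[keep], y[keep], weights
```

The returned weights can then be passed as sample_weight to most estimators' fit methods, so the model sees the original class balance in expectation despite training on far fewer negatives.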

I acknowledge that our situation was unique due to the exceptionally large dataset we had to work with. However, it highlights the importance of considering the specific characteristics of the dataset and the problem at hand. In some cases, "doing nothing" may not be an effective strategy, as it can lead to a low-recall model, especially when you need to subsample the data and positive examples are scarce.


Lol, if you do nothing, you will fail DS onsites and take-home assignments, as the interviewers are all used to treating class imbalance. You've been warned.


Thanks for your article.

While reading, two questions came to mind:

1.) Does applying SMOTE to a multiclass rather than a binary classification problem accentuate or diminish any of the listed flaws?

2.) Does the way step_smote() is implemented in themis (the tidymodels package), where the oversampling is achieved via a nearest-neighbours approach, address any of the listed drawbacks?
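For context on question 2, the nearest-neighbours step is part of SMOTE's original definition rather than a themis-specific addition. Here is a minimal NumPy sketch of the core interpolation for a binary problem (the function name and parameters are illustrative):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by moving each chosen minority
    point a random fraction of the way toward one of its k nearest
    minority-class neighbours."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, idx = nn.kneighbors(X_minority)  # idx[:, 0] is the point itself
    base = rng.integers(len(X_minority), size=n_new)
    neigh = idx[base, rng.integers(1, k + 1, size=n_new)]
    gap = rng.random((n_new, 1))
    return X_minority[base] + gap * (X_minority[neigh] - X_minority[base])
```

Regarding question 1, implementations typically apply this per minority class against the rest in the multiclass case, so the flaws discussed in the article would apply to each oversampled class in turn.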

Thanks for sharing your opinion on this topic.


Can you add some citations? I have several concerns with your thinking here: adding more samples is not always more informative, and removing samples can improve model performance.

I do understand the motivation: if one type of sample appears 99% of the time, you want the model to see it 99% of the time so that it is calibrated. For probabilistic models like Gaussian processes this is reasonable; for neural networks, not so, as severe imbalance often leads to the model not learning the minority-class patterns at all. Also, once the patterns of a common class have been learnt, further similar samples carry very little information and may as well be ignored; look at things like informative sampling, information gain, KL divergence, and similar metrics. Without citations this has little weight, but it's interesting thinking.

Active learning is another great place to understand how each sample can give varying levels of information gain. Here's an interesting paper on the topic:

https://openreview.net/pdf?id=UVDAKQANOW
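To make the information-gain idea concrete, here is a minimal entropy-based uncertainty-sampling sketch, one common flavour of active learning (the model is assumed to expose a scikit-learn-style predict_proba; all names are placeholders, not from the linked paper):

```python
import numpy as np

def entropy(probs):
    """Shannon entropy per row; higher means the model is less certain."""
    eps = 1e-12
    return -np.sum(probs * np.log(probs + eps), axis=1)

def select_most_informative(model, X_pool, batch_size=32):
    """Pick the unlabelled pool points the model is most uncertain about.
    Minority-class examples near the decision boundary tend to score
    high, which is why active learning often surfaces rare samples."""
    scores = entropy(model.predict_proba(X_pool))
    return np.argsort(scores)[-batch_size:]  # indices of the most uncertain points
```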

I would love to hear your views and to open up a discussion on this very practical topic of how to handle imbalanced datasets. Thanks for the post.
