15 Comments
Sep 5, 2023 · Liked by Christoph Molnar

I agree with the whole article except one point: "do nothing" is not the right advice. You should use cost-sensitive learning or a class-weighted cost function.
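For illustration only, class weighting in scikit-learn could look roughly like this (the synthetic data, model choice, and weight values are placeholders, not a recommendation):

```python
# Sketch: cost-sensitive learning via class weights in scikit-learn.
# Dataset, model, and weights are illustrative only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# "balanced" reweights each class inversely to its frequency;
# an explicit dict such as {0: 1, 1: 10} encodes asymmetric misclassification costs.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)
```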

author

Yeah, I will discuss those in a later post. With "do nothing" I kind of assumed that the classification task is otherwise set up in a meaningful way -- including the right evaluation metric. And if the costs of misclassifications differ, this should be reflected in the evaluation metric (and loss function) before you even know the class frequencies. (In theory.)
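As a rough sketch of what "reflect the costs in the evaluation metric" could mean, one can score a classifier by its expected misclassification cost instead of accuracy (the cost values below are made up for illustration):

```python
# Sketch: evaluate a classifier by expected misclassification cost.
# The false-positive/false-negative costs are purely illustrative.
import numpy as np
from sklearn.metrics import confusion_matrix

def expected_cost(y_true, y_pred, cost_fp=1.0, cost_fn=10.0):
    # Binary labels: 0 = negative, 1 = positive.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return (fp * cost_fp + fn * cost_fn) / len(y_true)

y_true = np.array([0, 0, 0, 1, 1])
y_pred = np.array([0, 1, 0, 0, 1])
print(expected_cost(y_true, y_pred))  # (1 * 1.0 + 1 * 10.0) / 5 = 2.2
```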


Great write up.

author

Thanks!


I appreciate your perspective on the default strategy for imbalanced classification tasks, but I'd like to share a real-life scenario where "doing nothing" was not a viable option.

In our case, we were dealing with a dataset comprising billions of training examples, making it practically impossible to train a model with all the available data. Additionally, we faced the challenge of having very few positive examples, and achieving a high recall was crucial for our application.

If we had followed the "Do Nothing" approach and sampled the data uniformly, it would have resulted in the loss of most of the positive examples from the training data. This would have severely impacted the model's ability to learn from these crucial instances.

To overcome this challenge, we opted for a different approach. We decided to use all the available positive examples and then sample the negative ones. To ensure that the model didn't become overly biased towards the positive class, we assigned larger weights to the sampled negative examples. This approach allowed us to strike a balance and create a model that was sensitive to both positive and negative instances.
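To make the idea concrete, such a scheme could look roughly like the sketch below (column names, sampling rate, and the commented-out model call are illustrative assumptions, not the actual production setup):

```python
# Sketch: keep all positives, downsample negatives, and upweight the sampled
# negatives so the loss still reflects the original class ratio.
import numpy as np
import pandas as pd

def downsample_negatives(df, label_col="y", neg_keep_rate=0.01, seed=0):
    pos = df[df[label_col] == 1]
    neg = df[df[label_col] == 0].sample(frac=neg_keep_rate, random_state=seed)
    out = pd.concat([pos, neg])
    # Weight sampled negatives by the inverse of the sampling rate.
    out["weight"] = np.where(out[label_col] == 0, 1.0 / neg_keep_rate, 1.0)
    return out

# train = downsample_negatives(full_data)
# Any estimator that accepts sample_weight can then use train["weight"] in fit().
```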

I acknowledge that our situation was unique due to the exceptionally large dataset we had to work with. However, it highlights the importance of considering the specific characteristics of the dataset and problem at hand. In some cases, "doing nothing" may not be an effective strategy, as it can lead to a low recall model, especially when you need to sample the data and positive examples are scarce.

author

Hey Alan, thanks a lot for sharing this case.

Billions of examples ... that's a lot. To clarify: if you have more data than you need or can handle, downsampling is often the easiest way to make the task feasible.

The "Do Nothing" recommendation was certainly a bit punchy, but I think it applies to a lot of cases where people would default to over- or undersampling. Having such a large dataset is a good example where undersampling may be the only good option.


A lot of good points.

How I usually approached this: give up-/downsampling strategies a try, but always validate on non-resampled data (as you emphasized: not looking at calibration at this point) -- see the sketch at the end of this comment for one way to set this up. I have seen improvements in model performance even for 'strong learners' like LGBM.

Are there any red flags about this approach that I am not aware of?
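One way to set this up (a sketch assuming imbalanced-learn and LightGBM; the sampler is applied only when the pipeline is fit, so held-out folds keep the original class distribution):

```python
# Sketch: oversample only inside the training folds; validation folds stay
# untouched, so performance is estimated on the original distribution.
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=5_000, weights=[0.9, 0.1], random_state=0)

pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),   # applied during fit only
    ("model", LGBMClassifier()),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="average_precision")
```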


Lol, if you do nothing, you will fail DS onsites and take-home assignments, as the interviewers are all used to treating class imbalance. You've been warned.


Thanks for your article.

While reading, two questions came to mind:

1.) Does applying SMOTE to a multiclass rather than a binary classification problem accentuate or diminish any of the listed flaws?

2.) Does the way step_smote() is implemented in the tidymodels package themis, where the oversampling is achieved using a nearest-neighbours approach, address any of the drawbacks listed?

Thanks for sharing your opinion on this topic.


Furthermore, can stratifying on the target variable during data splitting and resampling have a mitigating effect?
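For what it's worth, a minimal sketch of what stratified splitting does (scikit-learn, synthetic data, purely illustrative):

```python
# Sketch: stratified splitting keeps the class ratio roughly identical in the
# train/test split and in every CV fold, avoiding folds with too few positives.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, train_test_split

X, y = make_classification(n_samples=2_000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.2, random_state=0)
print(y_tr.mean(), y_te.mean())  # both close to the overall positive rate of ~0.05

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # stratified CV folds
```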


Can you add some citations? I have several concerns with your reasoning here: adding more samples is not always more informative, and removing samples can improve model performance. I do understand the motivation behind your thinking: if one type of sample appears 99% of the time, then you want the model to see it 99% of the time to calibrate it. For probabilistic models like Gaussian processes this is reasonable; for neural networks, not so much, as they will often end up not learning the minority-class patterns at all. Also, once the patterns of a common class have been learnt, future similar samples carry very little information and may as well be ignored -- look at things like informative sampling, information gain, KL divergence and similar metrics. Without citations this has little weight, but interesting thinking.

Active learning is another great place to understand how each sample can give varying levels of information gain. Here's an interesting paper around the topic:

https://openreview.net/pdf?id=UVDAKQANOW

I would love to hear your views and to open up a discussion on this very practical topic of how to handle imbalanced datasets. Thanks for the post.

author

Part of this post is based on the "To Smote or not to Smote?" paper: https://arxiv.org/abs/2201.08528


Thanks. "To Smote or not to Smote?", that is the question! :) Any idea if this paper was published anywhere, or just a preprint on arxiv?

I noticed this paper focuses on a particular type of data and task: "challenges of imbalanced binary classification in tabular data". I have real experience with text data, and there using data augmentation to increase the minority class certainly improved performance, focal loss I don't think helped so much, but seems like a good approach too. So maybe this argument is more relevant to binary classification on tabular data?

I am not won over, but I will think twice and look more closely at the paper. I wonder what are the aspects to consider when to Smote and when not to Smote?

Thanks, Jon


The paper was written recently, but contrary to popular opinion (including Google's own ML Crash Course), this has been understood by people doing ML for a long time.

The simple reason: you do class-imbalance sampling if your data is not representative of the true distribution. For example, if the distribution in the real world is not skewed but your training data is skewed, then you balance it.

But you don't balance the data if the real world itself is not balanced. By balancing the data you bias the model: it is trained on a representation of the world in which it expects non-skewed data, so when it is deployed in the real world, it will quickly fail, because the real world is highly skewed.


I agree and also support this idea, in theory. However, take another foundational assumption about the data in machine learning: independent and identically distributed (i.i.d.*). The i.i.d. assumption is at the foundation of almost all ML models, and it is pretty much always completely wrong :D However, it works.

So I agree that the data should be presented to the model as is, so that it can learn the true distributions in the data: in theory this is great. I believe some models can do this, like statistical models -- take Gaussian processes, which need the original data because they actually learn probability distributions. Neural networks are another story (and note that the paper this article refers to, "To Smote or not to Smote", focuses on tabular data with LightGBM, XGBoost and CatBoost, not neural networks). In my experience they do not perform well on imbalanced data without class balancing, and the theory around vanishing gradients can support this. So, in conclusion: when doing binary classification on tabular data with a gradient-boosting algorithm, think twice about using SMOTE, and more generally think twice about artificially balancing classes before training. However, what is not made clear in this article is that for deep learning (neural networks) there are real issues with class imbalance, so it is smart to use techniques to address it (one common option is sketched at the end of this comment). Here is a recent survey on this topic:

https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0192-5

* https://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables
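The option sketched below is class weighting in the loss instead of resampling the data; this assumes PyTorch, and the class counts and inverse-frequency weights are illustrative, not taken from the survey:

```python
# Sketch: class-weighted cross-entropy for a neural network (PyTorch).
# Class counts and the inverse-frequency weighting are illustrative.
import torch
import torch.nn as nn

class_counts = torch.tensor([9_900.0, 100.0])     # e.g. ~1% positives
weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = nn.CrossEntropyLoss(weight=weights)   # upweights the rare class

logits = torch.randn(8, 2)                         # dummy batch of model outputs
targets = torch.randint(0, 2, (8,))
loss = criterion(logits, targets)
```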
