In a previous post, I argued against using SMOTE or other oversamplers to “fix” imbalanced classification tasks.
Imbalanced classification: a task with two (or more) classes where one class is much more frequent than another, for example 99% versus 1%.
Instead, you should “Do Nothing”.
“Do Nothing” is meant to be a default for imbalanced classification, not a law. You should only “Do Something” for a good reason.
I’ll explain why “Do Nothing” is a good default and when “Do Something” is better.
Why “Do Nothing” should be the default for imbalanced data
For classification, we have three general evaluation choices (each is computed in the sketch after this list):
Class-based metrics, like accuracy or F1, which depend only on the classifications, not the scores.
Score-based metrics, like log-loss, which depend on the predicted scores or probabilities.
Calibration: how well the output probabilities match the actual probabilities.
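To make the three choices concrete, here is a minimal sketch, assuming scikit-learn and a synthetic 99%/1% dataset (the dataset, model, and numbers are illustrative, not from the original post):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, log_loss, brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic 99% / 1% binary classification task.
X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]  # probability of the rare class
labels = (scores >= 0.5).astype(int)        # classifications at the default 0.5 threshold

# Class-based metrics: depend only on the classifications.
print("accuracy:", accuracy_score(y_test, labels), "F1:", f1_score(y_test, labels))
# Score-based metric: depends on the probabilities themselves.
print("log-loss:", log_loss(y_test, scores))
# Calibration: the Brier score is one simple summary of it.
print("Brier score:", brier_score_loss(y_test, scores))
```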
Doing nothing with the imbalanced data is a good choice for all three:
If you have a strong classifier (like xgboost), doing nothing will be your best choice since oversampling doesn’t help (it just breaks calibration).
For class-based metrics, you need a threshold to turn your scores into classifications. It’s better to tune that threshold than to fudge the data (a sketch follows below).
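Here is a minimal sketch of threshold tuning on validation data, again assuming scikit-learn and an illustrative synthetic dataset: train on the data as-is, then pick the threshold that maximizes F1 (or whatever class-based metric you care about) on a held-out validation set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
val_scores = model.predict_proba(X_val)[:, 1]

# Sweep candidate thresholds and keep the one with the best validation F1.
thresholds = np.linspace(0.01, 0.99, 99)
f1_per_threshold = [f1_score(y_val, (val_scores >= t).astype(int)) for t in thresholds]
best = thresholds[int(np.argmax(f1_per_threshold))]
print(f"best threshold: {best:.2f}, F1: {max(f1_per_threshold):.3f}")
```

Newer scikit-learn versions (1.5+) also ship TunedThresholdClassifierCV, which automates this search with cross-validation.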
But there are many cases where you should address the imbalance and we’ll discuss them now.
When to override the default for imbalanced data
When I posted the “Do Nothing” recommendation, I got some objections. Many were valid, but my recommendation stands. “Do Nothing” isn’t a law, but a default. Defaults are starting points, and you should override them when you have a good reason. Hint: imbalance alone is not a good reason.
The default of doing nothing only works well in the following scenario: making an error is equally costly for all data points, and you expect the distribution of future data to be similar to that of your training data.
Let’s talk about some reasons to intervene in the class imbalance. The solution, however, is still mostly not to use SMOTE, but rather to adapt the design of the task and the evaluation:
You may have additional information indicating that the class balance in your training data differs from the target distribution (like the future production data). → Most ML algorithms allow weighting data points: give underrepresented data points a higher weight and overrepresented ones a lower weight (a sketch follows this list).
The cost of misclassification may differ between the classes. → Use cost-sensitive machine learning to train models (also covered in the sketch below).
You may be using a class-based metric that is severely affected by the imbalance. → Do threshold tuning on validation data (see the sketch above).
You may be using a weak classifier like RIPPER, CART, or C4.5. → Here SMOTE can be beneficial.
The data may be extremely large. → Under-sampling the majority class may be useful for technical reasons; see this comment.
This list isn’t exhaustive, but it shows that addressing imbalance usually doesn’t require fudging the data.
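As a sketch of the weighting and cost-sensitive items above (the 20:1 cost ratio is a made-up example, and the API shown is scikit-learn’s; most other libraries expose similar class- or sample-weight parameters):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01], random_state=0)

# Cost-sensitive learning: suppose missing a minority-class case (class 1)
# is 20x as costly as a false alarm. Encode that as class weights instead
# of resampling the data.
cost_sensitive = LogisticRegression(max_iter=1000, class_weight={0: 1, 1: 20}).fit(X, y)

# Per-data-point weighting: the same mechanism also lets you re-weight
# individual points, e.g., to match a known target distribution.
sample_weight = np.where(y == 1, 20.0, 1.0)
weighted = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=sample_weight)
```

Gradient-boosting libraries offer the same levers; XGBoost, for example, exposes a scale_pos_weight parameter for binary tasks.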
Even if you do nothing, many models may not be well-calibrated. I recommend applying conformal prediction, a post-hoc uncertainty quantification method that can calibrate the probabilities.
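Conformal prediction itself is implemented in dedicated libraries (MAPIE, for example). As a simpler stand-in, the sketch below illustrates post-hoc calibration with scikit-learn’s CalibratedClassifierCV; this is not the conformal method the paragraph above refers to, but it attacks the same problem of miscalibrated probabilities.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Uncalibrated model vs. the same model wrapped in post-hoc isotonic calibration.
raw = GradientBoostingClassifier().fit(X_train, y_train)
calibrated = CalibratedClassifierCV(
    GradientBoostingClassifier(), method="isotonic", cv=5
).fit(X_train, y_train)

for name, m in [("raw", raw), ("calibrated", calibrated)]:
    proba = m.predict_proba(X_test)[:, 1]
    print(name, "Brier score:", brier_score_loss(y_test, proba))
```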
Do you have good recommendations for where we can find code implementations (Python and/or R) of cost-sensitive ML and threshold tuning on validation data? Especially in the case of the latter, I haven't found an end-to-end walkthrough, although my search may just be ineffectual. For weighting data points, I know that tidymodels in R now supports case weights, which seems to be a good implementation.
Great post as usual.
Is there any reason not to always utilize sample weights for classification when they're available?
Initially, it seems that either results will improve for unbalanced data, or in the ideal scenario where your data is perfectly balanced, nothing would change. However, I'm intrigued why this isn't the default choice in machine learning libraries.