Imagine you have data from healthy subjects and patients with a disease. Out of your 10,000 data points, only 50 have the disease (0.5%). Your goal is to classify disease status based on, e.g., blood work.
A typical case of imbalanced data.
What do you do?
A commonly suggested “solution” to the imbalanced data problem is to change the data. You can either oversample the minority class (or create synthetic cases) or you can downsample the majority class. Or do both.
I’ve always found that odd, but as a student, I accepted this solution to imbalanced data.
Today I’m telling you: Don’t do it.
“Fixing” imbalanced data by sampling is (mostly) a stupid idea. This advice might be a “The Emperor’s New Clothes” type of situation¹, where people privately doubted the advice but didn’t question it publicly. At least that was my impression. But it turns out that the emperor is in his underwear: sampling approaches don’t work. At least in most cases (more about that later).
Our punching bag for this post will be SMOTE. With over 25k citations, it’s one of the most famous “fixes” for imbalanced data in machine learning.
SMOTE
SMOTE stands for Synthetic Minority Over-sampling Technique. The idea is to oversample the minority class, but instead of copy-pasting existing data points, SMOTE creates synthetic minority examples by interpolating between existing ones and their nearest minority neighbors. The SMOTE paper also recommends under-sampling the majority class. That’s all you need to know.
If your sample represented the population before, SMOTE ensures that it no longer does. But why would you train a machine learning model on a 50% / 50% balance if the class balance in future data is expected to be 99.5% / 0.5%?
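To make that concrete, here is a minimal sketch using the imbalanced-learn package (the synthetic dataset is just a stand-in for the blood-work example, and the counts in the comments are approximate):

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Toy stand-in for the blood-work example: 10,000 samples, ~0.5% positives
X, y = make_classification(
    n_samples=10_000, n_features=10, weights=[0.995], flip_y=0, random_state=42
)
print(Counter(y))  # roughly {0: 9950, 1: 50}

# Oversample the minority class with synthetic points until it matches the majority
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_res))  # both classes now equally large, e.g. {0: 9950, 1: 9950}
```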
SMOTE destroys calibration. Calibration means that a predicted probability of 30% really means 30%: if you bundle together all the 30% predictions of a well-calibrated classifier, you will find that, in expectation, 30% of them match the ground truth. Not that classifiers are necessarily well calibrated by default. But by resampling to change the distribution of the target classes, you guarantee that your model is not calibrated.
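You can see this directly: train the same model once on the raw data and once on SMOTE-resampled data, then compare the average predicted risk to the true prevalence. The logistic regression and toy data below are illustrative choices, not a prescription:

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=20_000, n_features=10, weights=[0.995], flip_y=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Same model, fitted once on the raw data and once on SMOTE-resampled data
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
X_sm, y_sm = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
smoted = LogisticRegression(max_iter=1000).fit(X_sm, y_sm)

print("positive rate in test data:    ", y_te.mean())                              # ~0.005
print("mean predicted risk, raw fit:  ", plain.predict_proba(X_te)[:, 1].mean())   # close to the base rate
print("mean predicted risk, SMOTE fit:", smoted.predict_proba(X_te)[:, 1].mean())  # inflated far above it
```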
Another fun fact: you don’t add any new information by over- or under-sampling. If you under-sample the majority class, you actually reduce the amount of information available to your model.
To SMOTE or Not to SMOTE?
What about the original SMOTE paper? Did they not evaluate their approach? Did they make a mistake? Or did the machine learning community collectively agree on the wrong take-home message for imbalanced data?
It’s the latter.
Here’s what the abstract says:
This paper also shows that a combination of our method of over-sampling the minority class and under-sampling the majority class can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or class priors in Naive Bayes. Our method of over-sampling the minority class involves creating synthetic minority class examples. Experiments are performed using C4.5, Ripper and a Naive Bayes classifier.
What a weird choice of models, right? C4.5, Ripper, and Naive Bayes are all so-called weak learners. A weak learner is a machine learning model that won’t make you rich on Kaggle. Examples of strong learners, on the other hand, are XGBoost, LightGBM, CNNs, transformers, and random forests.
Could it be that SMOTE actually does work, but only for some cases?
That’s what the authors of the paper “To SMOTE, or not to SMOTE?” found out: SMOTE works, but only for weak learners, and only if you use ranking measures such as AUC and don’t care about calibration. For strong classifiers, SMOTE didn’t improve performance.
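If you want to sanity-check this on your own data, here is a rough sketch of that kind of comparison; the weak learner, strong learner, toy data, and metrics are my own choices, not the paper’s exact setup. AUC only looks at ranking, while the Brier score also punishes miscalibrated probabilities:

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(
    n_samples=20_000, n_features=10, weights=[0.99], flip_y=0, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)
X_sm, y_sm = SMOTE(random_state=1).fit_resample(X_tr, y_tr)

models = [("weak learner (Naive Bayes)", GaussianNB()),
          ("strong learner (random forest)", RandomForestClassifier(random_state=1))]
for name, model in models:
    for data_label, (X_fit, y_fit) in [("raw  ", (X_tr, y_tr)), ("SMOTE", (X_sm, y_sm))]:
        # Fit on raw vs. resampled training data, always evaluate on the untouched test set
        proba = model.fit(X_fit, y_fit).predict_proba(X_te)[:, 1]
        print(f"{name:32s} {data_label}  "
              f"AUC={roc_auc_score(y_te, proba):.3f}  "
              f"Brier={brier_score_loss(y_te, proba):.4f}")
```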
Don’t use SMOTE if you use state-of-the-art classifiers. Don’t use SMOTE if you care about calibration.
So the original SMOTE paper is actually fine. The authors are not scammers who sell magic clothes to the ML community. It’s just that we as a community have to realize that the scope of model classes and situations where SMOTE works is narrow.
What do you do instead when you have imbalanced data? Well, that’s something for another newsletter issue.
But for now, a good default strategy for imbalanced classification tasks: Do Nothing!
¹ The Emperor’s New Clothes is a fairytale where a scammer sells “clothes” to the emperor that he claims are only visible to clever people. Well, the clothes don’t exist, but everyone goes along with it because no one wants to be seen as dumb.
Agree with the whole article, except that “do nothing” is not the right advice. Use cost-sensitive learning or a class-weighted cost function.
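A minimal sketch of what that suggestion looks like with scikit-learn’s class_weight parameter (the model and toy data are purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Illustrative imbalanced data; in practice this would be your original samples
X, y = make_classification(
    n_samples=10_000, n_features=10, weights=[0.995], flip_y=0, random_state=0
)

# class_weight="balanced" upweights errors on the rare class in the loss,
# i.e. cost-sensitive learning, while the training data itself stays untouched
model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
```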
Great write up.