Do you have good recommendations for where we can find code implementations (Python and/or R) where cost-sensitive ML and threshold tuning on validation data are implemented? Especially in the case of the latter, I haven't found an end-to-end walkthrough, although my search may just be ineffectual. For weighting data points, I know that tidymodels in R now supports case weights, which seems like a good implementation.
Great post as usual.
Junaid, re: "code implementations where cost-sensitive ML and threshold tuning on validation data are implemented"
These resources at scikit-learn.org might be helpful:
"Post-tuning the decision threshold for cost-sensitive learning",
webpage, https://scikit-learn.org/stable/auto_examples/model_selection/plot_cost_sensitive_learning.html
notebook, https://scikit-learn.org/stable/_downloads/133f2198d3ab792c75b39a63b0a99872/plot_cost_sensitive_learning.ipynb
Here are related videos:
"Get the best from your scikit-learn classifier: trusted probabilities and optimal binary decision", https://pretalx.com/euroscipy-2023/talk/GYYTCH/
https://www.youtube.com/watch?v=6YnhoCfArQo
“Imbalanced-learn: regrets and onwards - with Guillaume Lemaitre, core-maintainer [of scikit-learn and imbalanced-learn]”, https://youtu.be/npSkuNcm-Og
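If it helps to see the moving parts in one place, here's a minimal sketch of the pattern that scikit-learn example demonstrates, assuming scikit-learn >= 1.5 (which is where TunedThresholdClassifierCV was added); the synthetic dataset and the cost/gain values are illustrative assumptions, not taken from the post:

```python
# Minimal sketch: cost-sensitive post-hoc threshold tuning, scikit-learn >= 1.5.
# The gain/cost values below are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import TunedThresholdClassifierCV, train_test_split

X, y = make_classification(n_samples=5_000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

def total_gain(y_true, y_pred):
    # Price each confusion-matrix cell: a false positive costs 1,
    # a false negative costs 5, and a true positive gains 10.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return -1 * fp - 5 * fn + 10 * tp

# Wrap the business metric as a scorer, then search for the decision
# threshold that maximizes it on internal validation splits.
model = TunedThresholdClassifierCV(
    LogisticRegression(max_iter=1_000),
    scoring=make_scorer(total_gain),
    cv=5,
)
model.fit(X_train, y_train)
print(f"tuned threshold: {model.best_threshold_:.3f}")
print(f"test-set gain:   {total_gain(y_test, model.predict(X_test))}")
```

The same object also accepts a plain metric name (e.g. scoring="balanced_accuracy") if you don't have explicit costs to encode.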
Is there any reason not to always use sample weights for classification when they're available?
At first glance, it seems that either results will improve on imbalanced data, or, in the ideal case where the data is perfectly balanced, nothing would change. So I'm curious why this isn't the default in machine learning libraries.
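To make the question concrete: weighting is a one-line change in scikit-learn (class_weight="balanced" assigns each sample a weight inversely proportional to its class frequency), so the intuition is easy to test. A minimal sketch on a synthetic imbalanced dataset; the data and model here are illustrative assumptions, not from the post:

```python
# Minimal sketch: unweighted vs. class_weight="balanced" logistic regression
# on a deliberately imbalanced synthetic dataset (illustrative assumptions).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

for cw in (None, "balanced"):
    clf = LogisticRegression(max_iter=1_000, class_weight=cw)
    clf.fit(X_train, y_train)
    score = balanced_accuracy_score(y_test, clf.predict(X_test))
    print(f"class_weight={cw}: balanced accuracy = {score:.3f}")
```

Note that on a perfectly balanced dataset the "balanced" weights all reduce to 1, so the two settings coincide, which matches the intuition above.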