Do you have good recommendations for where we can find code implementations (Python and/or R) where cost-sensitive ML and threshold tuning on validation data are implemented? Especially in the case of the latter, I haven't found an end-to-end walkthrough, although my search may just be ineffectual. For weighting data points, I know that tidymodels in R now supports case weights, which seems like a good implementation.
Great post as usual.
Junaid, re: "code implementations where cost-sensitive ML and threshold tuning on validation data are implemented"
These resources at scikit-learn.org might be helpful:
"Post-tuning the decision threshold for cost-sensitive learning",
webpage, https://scikit-learn.org/stable/auto_examples/model_selection/plot_cost_sensitive_learning.html
notebook, https://scikit-learn.org/stable/_downloads/133f2198d3ab792c75b39a63b0a99872/plot_cost_sensitive_learning.ipynb
Here are related videos:
"Get the best from your scikit-learn classifier: trusted probabilities and optimal binary decision", https://pretalx.com/euroscipy-2023/talk/GYYTCH/
https://www.youtube.com/watch?v=6YnhoCfArQo
“Imbalanced-learn: regrets and onwards - with Guillaume Lemaitre, core-maintainer [of scikit-learn and imbalanced-learn]”, https://youtu.be/npSkuNcm-Og
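If it helps to see the moving parts in one place, here's a minimal sketch of the pattern that scikit-learn example demonstrates, assuming scikit-learn >= 1.5 (which is where TunedThresholdClassifierCV was added); the synthetic dataset and the cost/gain values are illustrative assumptions, not taken from the post:

```python
# Minimal sketch: cost-sensitive post-hoc threshold tuning, scikit-learn >= 1.5.
# The gain/cost values below are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import TunedThresholdClassifierCV, train_test_split

X, y = make_classification(n_samples=5_000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

def total_gain(y_true, y_pred):
    # Price each confusion-matrix cell: a false positive costs 1,
    # a false negative costs 5, and a true positive gains 10.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return -1 * fp - 5 * fn + 10 * tp

# Wrap the business metric as a scorer, then search for the decision
# threshold that maximizes it on internal validation splits.
model = TunedThresholdClassifierCV(
    LogisticRegression(max_iter=1_000),
    scoring=make_scorer(total_gain),
    cv=5,
)
model.fit(X_train, y_train)
print(f"tuned threshold: {model.best_threshold_:.3f}")
print(f"test-set gain:   {total_gain(y_test, model.predict(X_test))}")
```

The same object also accepts a plain metric name (e.g. scoring="balanced_accuracy") if you don't have explicit costs to encode.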
Is there any reason not to always use sample weights for classification when they're available?
At first glance, it seems that either results will improve on imbalanced data, or, in the ideal case where the data is perfectly balanced, nothing would change. So I'm curious why this isn't the default in machine learning libraries.
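To make the question concrete: weighting is a one-line change in scikit-learn (class_weight="balanced" assigns each sample a weight inversely proportional to its class frequency), so the intuition is easy to test. A minimal sketch on a synthetic imbalanced dataset; the data and model here are illustrative assumptions, not from the post:

```python
# Minimal sketch: unweighted vs. class_weight="balanced" logistic regression
# on a deliberately imbalanced synthetic dataset (illustrative assumptions).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

for cw in (None, "balanced"):
    clf = LogisticRegression(max_iter=1_000, class_weight=cw)
    clf.fit(X_train, y_train)
    score = balanced_accuracy_score(y_test, clf.predict(X_test))
    print(f"class_weight={cw}: balanced accuracy = {score:.3f}")
```

Note that on a perfectly balanced dataset the "balanced" weights all reduce to 1, so the two settings coincide, which matches the intuition above.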