Thanks a lot for this course! Love the intuitive step-by-step explanation and the hands-on example (e.g., choosing a concrete dataset like the Beans dataset is great).
A little question about the procedure though. Say you take the frequency plot (which contains the predicted probabilities for all predictions, across all the different classes). We applied the 0.999 threshold so that 95% of the data points include the true class. Now, 95% was chosen arbitrarily. Let's say someone just sets the threshold to 1.0 to get 100% coverage. In other words, with that procedure anyone can always achieve 100% coverage by setting the threshold to 1.0, which sounds good on paper but isn't actually doing anything useful. Maybe I am misunderstanding something, but would you mind clarifying?
Conformal prediction becomes meaningless at $\alpha=0$.
You've got the same happening with any other frequentist confidence interval:
95% CI for a coefficient in a linear regression model -> makes sense.
100% interval -> since the sampling distribution of the coefficient has positive density everywhere, the interval will be (-inf, +inf).
Excellent point. As you showed, it's simple to construct prediction sets that have > 1-\alpha coverage. This implies that coverage can't be everything.
Researchers who propose conformal predictors therefore evaluate at least 3 things:
- coverage: Does the predictor reach the guaranteed coverage?
- average set size: Are the set sizes smaller than SOTA?
- adaptivity: How well do the prediction sets adapt to different prediction difficulties?
Often there are also guaranteed upper bounds. For example, for a conformal classifier that uses 1 - score as the non-conformity score, the coverage probability is bounded between 1-\alpha and 1 - \alpha + \frac{1}{n+1}, with n being the size of the calibration set. See p. 4 in https://arxiv.org/pdf/2107.07511.pdf
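If you want to check these criteria empirically, here is a minimal sketch of how coverage, average set size, and the set-size distribution could be computed. It assumes the prediction sets are available as a boolean matrix (e.g. the format MAPIE returns for a single alpha); the toy data is made up purely for illustration:

```python
import numpy as np

# Made-up stand-ins:
#   y_test: true labels, shape (n,)
#   pred_sets: boolean matrix, shape (n, n_classes),
#              pred_sets[i, k] is True if class k is in the prediction set of sample i
rng = np.random.default_rng(0)
n, n_classes = 1000, 7
y_test = rng.integers(0, n_classes, size=n)
pred_sets = rng.random((n, n_classes)) < 0.3
pred_sets[np.arange(n), y_test] |= rng.random(n) < 0.95  # include the true class with high probability (toy data only)

# 1) empirical coverage: fraction of samples whose set contains the true class
coverage = pred_sets[np.arange(n), y_test].mean()

# 2) average set size
avg_size = pred_sets.sum(axis=1).mean()

# 3) a rough look at adaptivity: distribution of set sizes
sizes, counts = np.unique(pred_sets.sum(axis=1), return_counts=True)

print(f"coverage: {coverage:.3f}, average set size: {avg_size:.2f}")
print(dict(zip(sizes.tolist(), counts.tolist())))
```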
Thanks for explaining. So, if I understand correctly, when we use tools like MAPIE, they will construct prediction intervals based not only on coverage but also on the other criteria you mentioned? I think discussing the issues that come with maximizing coverage and thereby ending up with arbitrarily big prediction intervals could be a useful future article -- or maybe the relationship between prediction interval size and coverage.
Excellent introduction to CP. Looking forward to reading more blog posts on it. Unfortunately, the 'conformal prediction' approach to evaluating uncertainty has been missing from popular machine learning books and tutorials, although plenty of academic articles have been published in the last few years.
Great post, very easy to digest and intuitive.
We ran into a similar problem a few weeks ago and, taking inspiration from the top-3 and top-5 predictions reported for benchmark datasets such as ImageNet, ended up doing exactly what you explained for adaptive conformal prediction.
It's a pain to put the results into production though 😅, as the UI team prefers a fixed number of outputs.
Thanks for sharing that insight. Especially interesting how UI and UQ (uncertainty quantification) requirements clash.
Did you see the "top_k" method in MAPIE? It produces prediction sets of the same size for every instance, at the cost of being non-adaptive.
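In case it helps, here is a rough sketch of how that could look with MAPIE's MapieClassifier (a sketch only, assuming the prefit split-conformal workflow; toy data, and argument names may differ between MAPIE versions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from mapie.classification import MapieClassifier

# toy stand-in for a multi-class dataset like the beans
X, y = make_classification(n_samples=2000, n_classes=4, n_informative=8, random_state=0)
X_train, X_cal, y_train, y_cal = train_test_split(X, y, test_size=0.5, random_state=0)

clf = GaussianNB().fit(X_train, y_train)

# "top_k" gives every instance a prediction set of the same size (non-adaptive)
mapie = MapieClassifier(estimator=clf, method="top_k", cv="prefit")
mapie.fit(X_cal, y_cal)
y_pred, y_sets = mapie.predict(X_cal[:5], alpha=0.05)  # y_sets: boolean, shape (5, n_classes, 1)
```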
I haven't. I'll definitely be trying it out in the next few days though. 😊
Thanks for the tutorial. It is helping me program conformal prediction from scratch in R so I can better understand how it works. I'm using the Beans data with an SVM. When I set the coverage level to 0.8 or 0.9, I get a number of beans that do not fall into any class.
For example, for a particular future sample, my 1-p values are [0.99, 0.99, 0.99, 0.73, 0.99, 0.27, 0.99], with qhat_80 = 0.19, qhat_95 = 0.69 and qhat_99 = 0.90. The 95% prediction set is [6] and the 99% prediction set is [4, 6]; this makes sense and seems correct. But none of the values are less than qhat_80, so qhat_80 does not classify the sample at all? How should I handle such future samples -- assign them to the most probable class? Or did I make a conceptual mistake in the process?
If I use an adaptive prediction set, I get cumulative p-values of [0.73, 0.99, 0.99, 0.99, 0.99, 0.99, 0.99]. Here 0.73 is greater than both qhat_80 and qhat_95, so both predict [6]. The second entry is the first value greater than qhat_99, so the 99% prediction set is [4, 6]. So, should the take-home message be to 'always' use adaptive prediction sets?
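To make the comparison concrete, the score-method thresholding described above can be sketched like this (purely illustrative, using the numbers from this comment):

```python
import numpy as np

s = np.array([0.99, 0.99, 0.99, 0.73, 0.99, 0.27, 0.99])  # uncertainty scores 1 - p
qhats = {"80%": 0.19, "95%": 0.69, "99%": 0.90}

# score method: include every class whose score is at or below the quantile qhat
for level, q in qhats.items():
    prediction_set = (np.where(s <= q)[0] + 1).tolist()   # 1-based class indices
    print(level, prediction_set)
# 80% -> []      (empty prediction sets are possible with the score method)
# 95% -> [6]
# 99% -> [4, 6]
```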
Hi Christoph,
Could you please comment on the size of the calibration dataset and how it factors into the outputs of conformal prediction?
Many thanks for the course!
Rashid
"1000 is enough (Angelopoulos and Bates 2021). Below that, the more the better. But there’s a trade-off with the size of the training and evaluation data."
Quote from my book Introduction to Conformal Prediction With Python
Hi Christoph Molnar,
Thanks for this great and interesting lesson. I want to work on conformal prediction, but I have one question; please help me with it if possible. Question: Is conformal prediction beneficial for an imbalanced dataset in binary classification?
You can apply conformal prediction as-is to imbalanced datasets as well.
The coverage guarantee is only marginal, however, and the individual classes might not each have the same guarantee.
There's a technique called class-conditional conformal prediction which can give you the guarantee for all classes. This, however, requires splitting the calibration data by class. And if the calibration data for the minority class becomes very small, you won't get good results from the calibration (high variance).
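To sketch the idea (a rough, made-up illustration of class-conditional calibration, not code from the course):

```python
import numpy as np

def classwise_qhats(scores_cal, y_cal, alpha=0.05):
    """Compute one conformal quantile per class from its own calibration scores.

    scores_cal: non-conformity score of the true class for each calibration sample, shape (n,)
    y_cal:      true class labels, shape (n,)
    """
    qhats = {}
    for c in np.unique(y_cal):
        s_c = scores_cal[y_cal == c]
        n_c = len(s_c)
        # finite-sample corrected quantile level, as in split conformal prediction
        level = min(1.0, np.ceil((n_c + 1) * (1 - alpha)) / n_c)
        qhats[c] = np.quantile(s_c, level, method="higher")
    return qhats

# a class c is then included in the prediction set whenever its score is <= qhats[c];
# if the minority class has only a few calibration samples, its qhat becomes very noisy
```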
Thanks, sir. Can you give me an idea of where I can find a tutorial on class-conditional conformal prediction?
Shouldn’t this be “equivalent to class “probabilities” > 0.999”?
“We know that for bean uncertainties s_i below 0.999 (equivalent to class “probabilities” > 0.001) we can be sure that with a probability of 95% we have the correct class included.”
Hi Christoph, thank you for the nice materials. Anyway, I'd like to ask whether the underlying code is (or could be) published in a fully reproducible manner.
For instance, I have dealt with Bean dataset discrepancies, as I don't actually know in which exact format you loaded it -- I then encountered issues with the MAPIE library.
Thanks,
Štěpán
Hey Štěpán, I will publish a book that builds on the course and contains fully reproducible code examples.
Hi Christoph,
Great post!! One thing I'm confused about is the assumption that probabilistic predictions are themselves 'confident', i.e., a classifier that says "f(bean=navy) = 0.8" may itself be uncertain and it turns out the full prediction (with uncertainty) should be "f(bean=navy) = 0.8 [0.1, 0.9]" (a simple example of this would be creating predictive intervals from B classification trees in a random forest). Does conformal prediction still work in that case?
Good point. My understanding is that this is exactly why you want conformal prediction: that type of uncertainty is factored in. While for one data point you only observe a single probability for a class (e.g. the 0.8), across all the data points with a score of 0.8 the actual class may be covered at quite a different rate, and that is what the conformal calibration accounts for.
Hi Christoph,
Great first post, I'm looking forward to the next few weeks. I had a question when it comes to implementing conformal prediction and generating prediction sets.
1. Is the calibration dataset the same as the test dataset when we perform a training/validation/test split for model fitting? So we would train a model in the normal way and, when it comes to generating calibrated predictions and prediction sets, use the MAPIE function to do this on the test set? I've tried to include a pipeline with the bean dataset here:
https://github.com/JunaidMB/conformal_prediction_examples/blob/master/drybeans_multiclass_randomforest_conformal.py
Hopefully I've generated the prediction sets at the correct stage and with the appropriate datasplit.
2. Will the point predictions from the MAPIE classifier be different to those produced by the classifier using the predict method? In my own case they're mostly the same with 1 or 2 predictions being different. The difference is more pronounced in the regression case.
I found the plots you used very helpful; maybe computing conformity scores on a single probability vector by hand would be instructive as well? But it's a question of whether it's more effort than it's worth. Thanks!
Hey Junaid
1. I've wondered that myself. I think you can use the same dataset for testing (aka performance evaluation) and calibration, because both are separate from model training and don't influence model choices. Your script looks fine, because you don't use the calibration data for model training (see the sketch below this reply). Next week I will also talk about parallels between conformal prediction and model evaluation.
2. The most likely class will be contained in the prediction set (except when empty sets are produced, which is a bit of a special case). The bean classification is "easy", so well-trained classifiers are right most of the time (accuracy of 90% or more). So most prediction sets for a good classifier will contain just one class, which is the same as the predicted class. For illustration I therefore used Naive Bayes, so that the results are more instructive.
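To make point 1 concrete, the split could look roughly like this (a sketch only, assuming the prefit MAPIE workflow and a random forest as in your script; toy data stands in for the beans, and argument names may vary between MAPIE versions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from mapie.classification import MapieClassifier

# toy stand-in for the bean features and labels
X, y = make_classification(n_samples=3000, n_classes=4, n_informative=8, random_state=1)

X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=1)
X_cal, X_test, y_cal, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=1)

model = RandomForestClassifier(random_state=1).fit(X_train, y_train)  # training data only
mapie = MapieClassifier(estimator=model, cv="prefit")
mapie.fit(X_cal, y_cal)                             # calibration data only
y_pred, y_sets = mapie.predict(X_test, alpha=0.05)  # prediction sets evaluated on the test data
```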
Hi, why do we actually need the uncertainty scores s = 1 - f(x,y) (which is also a bit confusing since it says 1 - s(x,y) below the histogram), instead of working with the equivalent "probabilities" as certainties? Is it being lazy to fit the definition of a 95% quantile, rather than saying "take the top 95% of the certainties"?
It's totally possible. You can easily convert and express it as you said. Not lazy or anything.
The main reason to frame it as an uncertainty score is that it looks more like the general recipe that appears in week #2, and it's easier to see the similarities with other CP approaches.
Hi, thanks for embarking on this course. I'm happy to join and learn.
I think the problem with interpreting model scores as probabilities (the paragraph that starts with "Unfortunately, we shouldn’t interpret these scores as actual probabilities") is not explained clearly enough and would benefit from a counterexample showing why model scores are not probabilities.
typos:
"but stew seems to be the"
"The data scientist doesn’t fully the model scores"
Thank you,
Yonatan.
Thanks, typo is corrected.
And thanks also for the feedback about why we can't trust probabilities. It does come up quite late in the post with the beans example.