Keeping calibration and test sets different is usually a good idea, as otherwise the model will be peeking into the test set during the calibration stage. In the research papers we have always kept them separate.
Thanks for the input, Valeriy. I agree that you are on the safe side if you keep them separate.
Thanks, Christoph, for this series!
Do you plan to use R as well?
In R, there are several packages for conformal analysis, most of them for regression.
And just one for classification (conformalClassification), but only for "randomForest" models.
Thanks,
Carlos.
Hey Carlos, for now I'm focusing on Python.
Could you recommend an R library for conformal prediction that is comparable with Python's MAPIE?
Thanks for this series. I am just following it now after subscribing to your newsletter.
I loved your IML book and was quite pleased that it was done in R, which I prefer to Python. Just totally off the cuff: I would like your personal comments on why you've switched from R to Python.
That's a good question.
I use both R and Python and preferred to do a project in Python again. And I thought I would reach more readers, as Python seems more dominant in machine learning: https://www.kaggle.com/kaggle-survey-2022
Thanks for the response. Oh well. I am sorry that I have to AGREE with you--although I love R (specifically, Tidyverse R) and prefer it to Python, I certainly recognize that Python is far more popular, so it is indeed the better choice to reach more readers.
Thanks for this series; I'm learning a lot!
Regarding which data to use for calibration, it seems to depend on how the conformal prediction results will factor into the ultimate decision-making. If it's viewed as a "nice to have" measure of prediction uncertainty, then perhaps it's not a big deal. But if we want rigor behind assuring the bean company CEO that our [conformal sets which include only two levels] contain the correct class 95% of the time, it would seem we would want to test the robustness of the q threshold on truly unseen data?
That's my hunch as well.
If the test set is used to evaluate the uncalibrated model outputs, it might be OK to use the same data for both calibration and testing, as I pondered in the post.
To evaluate the prediction sets (i.e., the calibrated or conformalized model), you definitely need calibration data that is separate from the test data.
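As a rough illustration of that point, here is a minimal sketch of checking the empirical coverage of prediction sets on a test set kept separate from the calibration set. It uses the simple score method on made-up softmax outputs, not the bean data from the post; all names and numbers are placeholders:

```python
import numpy as np

def coverage(pred_sets, y_true):
    """Fraction of test points whose prediction set contains the true class."""
    return np.mean([y in s for s, y in zip(pred_sets, y_true)])

rng = np.random.default_rng(0)
alpha = 0.05

# Made-up softmax outputs and labels for a held-out calibration set
probs_calib = rng.dirichlet(np.ones(3), size=500)
y_calib = rng.integers(0, 3, size=500)

# Conformity score: 1 minus the predicted probability of the true class
scores = 1 - probs_calib[np.arange(len(y_calib)), y_calib]
n = len(scores)
qhat = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

# Prediction sets and empirical coverage on a *separate* test set
probs_test = rng.dirichlet(np.ones(3), size=500)
y_test = rng.integers(0, 3, size=500)
pred_sets = [np.where(1 - p <= qhat)[0] for p in probs_test]
print(f"empirical coverage: {coverage(pred_sets, y_test):.3f}")  # should be around 0.95 or higher
```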
Great article, Christoph!
I have a comment regarding using test data for calibration.
I'll try to elaborate based on conformal regression because I know that a bit better :---)
In the case of conformal regression, when we use quantile regression we may choose to determine a conformal correction to our predicted quantiles. This correction is determined from a dataset where we have known labels, but which the training algorithm has not seen, so as not to risk overfitting.
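A minimal sketch of that correction step (in the style of conformalized quantile regression), assuming we already have lower and upper quantile predictions for a held-out calibration set; all data and names below are made-up stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for a held-out calibration set and the quantile regression
# model's lower/upper quantile predictions for it
y_calib = rng.normal(size=1000)
q_lo = y_calib - 1.0 + rng.normal(scale=0.3, size=1000)
q_hi = y_calib + 1.0 + rng.normal(scale=0.3, size=1000)

# Conformity score: how far the true label falls outside the predicted interval
# (negative if the interval already covers the label)
scores = np.maximum(q_lo - y_calib, y_calib - q_hi)

# Conformal correction: widen (or shrink) every future interval by this amount
alpha = 0.1
n = len(scores)
correction = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

# At prediction time, report [q_lo_new - correction, q_hi_new + correction]
```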
If we used the test set for calibration, we would a) use the test set as if we knew its labels (which we don't in reality), and b) when we then compute a test score with the resulting conformal quantile correction, we would leak information from the test labels into the scoring.
I wonder if it is hence better to just split our validation dataset into a validation and a calibration dataset, with a test dataset held out until the very last step. If we tune hyperparameters, we also can't use the validation set itself for calibration, since we'd risk overfitting to the validation dataset; reusing it for calibration would mean reusing already "used" data.
If we don't care about scoring, e.g. after retraining on all training data, then we can just use all of the test data for calibration and go into production.
What do you think?
If test and calibration are the same data, you will run into trouble if the performance evaluation is for the conformalized/calibrated model, as you pointed out.
I think they can only be the same in the case that the performance evaluation only concerns the uncalibrated model outputs.
If we want to test the calibrated outputs, then the calibration data would have to be split off from the training or validation set. Or you could frame it as a 4-way split into training, validation, calibration, and testing.
If we don't require a final performance evaluation, I agree you could use the "test data" for calibration.
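To make that 4-way split concrete, here is a minimal sketch using scikit-learn's train_test_split; the data and proportions are arbitrary placeholders:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data just to show the splitting mechanics
X = np.random.rand(1000, 5)
y = np.random.randint(0, 3, size=1000)

# 60% training, 20% validation, 10% calibration, 10% testing
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_rest, y_val, y_rest = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
X_calib, X_test, y_calib, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# train: fit the model; val: tune hyperparameters;
# calib: compute the conformal threshold; test: final evaluation of the prediction sets
```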
Thanks a lot for this! Do you know if there is an implementation somewhere of *group* Kfold? I need this in my use case, as data points within groups can be very similar. If not, I'll have to dig into how to combine the output of multiple folds...
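For what it's worth, scikit-learn ships a group-aware splitter, sklearn.model_selection.GroupKFold, which could serve as a starting point; combining the per-fold outputs in the conformal setting would still be up to you. A tiny sketch on toy data:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 12 samples in 4 groups of similar points
X = np.arange(24).reshape(12, 2)
y = np.arange(12)
groups = np.repeat([0, 1, 2, 3], 3)

gkf = GroupKFold(n_splits=4)
for fold, (train_idx, test_idx) in enumerate(gkf.split(X, y, groups)):
    # each group lands entirely in the train part or entirely in the test part
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
    print(f"fold {fold}: test groups = {set(groups[test_idx])}")
```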