Keeping the calibration and test sets separate is usually a good idea; otherwise the calibration stage peeks at the test set. In the research papers we have always kept them separate.

Dec 28, 2022 · Liked by Christoph Molnar

Thanks, Christoph, for this series!

Do you plan to use R too?

In R, there are several packages for conformal analysis, most of them for regression.

There is just one for classification (conformalClassification), and it only supports "randomForest" models.




Thanks for this series; I'm learning a lot!

Regarding which data to use for calibration, it seems to depend on how the conformal prediction results will factor into the ultimate decision-making. If they are viewed as a "nice to have" measure of prediction uncertainty, then perhaps it's not a big deal. But if we want rigor behind assuring the bean company CEO that our [conformal sets which include only two levels] contain the correct value 95% of the time, it would seem we would want to test the robustness of the q threshold on truly unseen data?
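To make the "truly unseen data" check concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the `simulate` helper is a hypothetical stand-in for a fitted classifier, the score is one minus the predicted probability of the true class, and the data are synthetic. It computes the threshold q on a calibration set and then measures how often the prediction sets contain the true label on a separate held-out set.

```python
import numpy as np

rng = np.random.default_rng(42)
n_classes, alpha = 3, 0.05

def simulate(n):
    """Hypothetical stand-in for a classifier: labels plus noisy class probabilities."""
    y = rng.integers(0, n_classes, size=n)
    logits = rng.normal(size=(n, n_classes))
    logits[np.arange(n), y] += 2.0  # make the true class more likely
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return probs, y

probs_cal, y_cal = simulate(1000)   # calibration set
probs_new, y_new = simulate(5000)   # held-out set, never used to pick q

# Nonconformity score: 1 - predicted probability of the true class
scores = 1.0 - probs_cal[np.arange(len(y_cal)), y_cal]

# Finite-sample corrected threshold: the ceil((n + 1) * (1 - alpha))-th smallest score
n = len(scores)
k = int(np.ceil((n + 1) * (1 - alpha)))
q = np.sort(scores)[k - 1]

# Prediction sets on the held-out data: every class whose score is at most q
sets = (1.0 - probs_new) <= q
coverage = sets[np.arange(len(y_new)), y_new].mean()
avg_set_size = sets.sum(axis=1).mean()
print(f"q = {q:.3f}, held-out coverage = {coverage:.3f}, average set size = {avg_set_size:.2f}")
```

If the held-out coverage lands close to 1 - alpha, the threshold is behaving as promised on data that played no role in choosing it.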


Great article, Christoph!

I have a comment regarding using test data for calibration.

I'll try to elaborate based on conformal regression because I know that a bit better :---)

In the case of conformal regression, when we use quantile regression we may choose to determine a conformal correction to our predicted quantiles. This is determined from some dataset where we have known labels, but which the training algorithm has not seen, so as not to risk overfitting.
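As a rough sketch of that correction (not any particular library's implementation): assume we already have predicted lower and upper quantiles for a labeled calibration set. The arrays below are synthetic and the fixed interval [-1, 1] is a hypothetical, deliberately too narrow prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1
n = 500

# Synthetic calibration labels and hypothetical, too narrow quantile predictions
y_cal = rng.normal(size=n)
lo_cal = np.full(n, -1.0)
hi_cal = np.full(n, 1.0)

# Score: how far the label falls outside the predicted interval (negative if inside)
scores = np.maximum(lo_cal - y_cal, y_cal - hi_cal)

# Conformal correction: the ceil((n + 1) * (1 - alpha))-th smallest score
k = int(np.ceil((n + 1) * (1 - alpha)))
correction = np.sort(scores)[k - 1]

def corrected_interval(lo, hi):
    """Widen (or shrink, if the correction is negative) the predicted interval."""
    return lo - correction, hi + correction

print(f"correction = {correction:.3f}")
```

A positive correction means the raw quantiles were too narrow on the calibration data; a negative one means they were wider than needed.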

If we used the test set for calibration, we would a) treat the test labels as known (which in reality they are not), and b) leak information from the test labels into the test score, because that score would be computed with a conformal quantile correction fitted on those same labels.

I wonder if it is hence better to just split our validation dataset into a validation and a calibration dataset, with a test dataset held out until the very last step. If we tune hyperparameters, we also can't use the validation set for calibration, since we'd risk overfitting to the validation dataset; reusing it for calibration would mean reusing already "used" data.

If we don't care about scoring, e.g. after retraining on all training data, then we can just use all test data for calibration and go into production.
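A minimal sketch of such a four-way split, using index arrays only. The 60/15/15/10 proportions and the dataset size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # hypothetical number of labeled rows

# Shuffle once, then cut into disjoint blocks (assumed 60/15/15/10 proportions)
idx = rng.permutation(n)
train_idx, val_idx, cal_idx, test_idx = np.split(idx, [600, 750, 900])

# train_idx: fit the model
# val_idx:   tune hyperparameters
# cal_idx:   compute the conformal correction
# test_idx:  final scoring only, touched once at the very end
print(len(train_idx), len(val_idx), len(cal_idx), len(test_idx))
```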

What do you think?


Thanks a lot for this! Do you know if there is an implementation somewhere of *group* Kfold? I need this in my use case, as data points within groups can be very similar. If not, I'll have to dig into how to combine the output of multiple folds...
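For splitting alone, scikit-learn ships `sklearn.model_selection.GroupKFold`, which keeps each group inside a single fold. Combining the folds can be sketched by hand, as below, using NumPy only. Everything here is an assumption for illustration: the data are synthetic, the model is a plain least-squares line, and pooling the out-of-fold residuals into one quantile is a simplification rather than the exact CV+ or jackknife+ construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_groups, n_folds, alpha = 200, 20, 5, 0.1

# Synthetic data with a group label per row
groups = rng.integers(0, n_groups, size=n)
X = rng.normal(size=n)
y = 2.0 * X + rng.normal(scale=0.5, size=n)

# Assign whole groups to folds so that no group is split across folds
unique_groups = np.unique(groups)
fold_of_group = {g: i % n_folds for i, g in enumerate(unique_groups)}
fold = np.array([fold_of_group[g] for g in groups])

# Collect out-of-fold absolute residuals as nonconformity scores
scores = np.empty(n)
for f in range(n_folds):
    held_out = fold == f
    fit = ~held_out
    A = np.column_stack([X[fit], np.ones(fit.sum())])
    coef, *_ = np.linalg.lstsq(A, y[fit], rcond=None)
    pred = coef[0] * X[held_out] + coef[1]
    scores[held_out] = np.abs(y[held_out] - pred)

# Pool the scores from all folds and take the conformal quantile
k = int(np.ceil((n + 1) * (1 - alpha)))
q = np.sort(scores)[k - 1]
print(f"interval half-width q = {q:.3f}")
```

Each row's score comes from a model that never saw any row of the same group, which is the point of grouping when rows within a group are very similar.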
