Understanding Different Uncertainty Mindsets
How conformal prediction relates to bootstrapping, Bayesian inference, and frequentist probability
The course on conformal prediction has ended and the newsletter has returned to its usual mode. If you missed out on the course, you can find all the materials here.
One of the most common questions I got about conformal prediction was:
How does conformal prediction relate to Bayesian posteriors or bootstrapping?
This week we take a higher-level view on uncertainty quantification. How does conformal prediction fit in with other approaches?
We’ll start with two competing interpretations of probability: Bayesian versus frequentist.
Bayesian Probability
In my book Modeling Mindsets, I described Bayesian modeling like this:
Bayesian models start from the premise that the model parameters are random variables. Therefore, the modeling goal has to be the parameter distribution given the data. But there is a problem. It’s unclear how parameters given data are distributed. The distribution of data given parameters is more natural to estimate. Fortunately, a mathematical “trick” helps: Bayes’ theorem.
The theorem inverts the conditioning: it expresses the posterior in terms of P(data | parameters), which is the good old likelihood function. Bayes’ theorem also involves P(parameters), the prior distribution.
That’s why Bayesians must specify a parameter distribution before observing data.
Model estimation equals an update from prior to posterior.
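In symbols: P(parameters | data) = P(data | parameters) · P(parameters) / P(data). The left-hand side is the posterior, the first factor on the right is the likelihood, the second is the prior, and the denominator is just a normalizing constant.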
Conformal prediction, however, describes the uncertainty of the model prediction, and not of the model parameters.
For Bayesian models, we can derive the uncertainty of the model predictions from the model parameter distributions.
To predict, the Bayesian must simulate. Because if parameters are random variables, so are the predictions. Prediction with a Bayesian model means: sample parameters from the posteriors, then make predictions with the model. Repeat that multiple times and get a distribution of the prediction.
This gives us the posterior predictive distribution, so that the prediction is not a point prediction but the entire distribution for that prediction. This makes Bayesian inference a holistic approach to uncertainty quantification. Uncertainty from the priors and from the data propagates through the model to the predictive uncertainty.
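Here is a minimal sketch of that sampling loop in Python. The posterior samples below are made up for illustration; in a real analysis they would come from an MCMC sampler or a variational approximation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up posterior samples for a simple linear model y = a + b * x + noise.
# In a real analysis these would come from an MCMC sampler or a variational fit.
posterior_intercept = rng.normal(1.0, 0.1, size=4000)
posterior_slope = rng.normal(2.0, 0.2, size=4000)
posterior_sigma = np.abs(rng.normal(0.5, 0.05, size=4000))

x_new = 3.0  # the input we want a prediction for

# One prediction per posterior draw: plug the sampled parameters into the model
# and add observation noise. Together these draws form the posterior predictive
# distribution for x_new.
predictive_samples = (
    posterior_intercept
    + posterior_slope * x_new
    + rng.normal(0.0, posterior_sigma)
)

# Summarize the predictive distribution, for example with a 90% interval.
lower, upper = np.quantile(predictive_samples, [0.05, 0.95])
print(f"90% posterior predictive interval: [{lower:.2f}, {upper:.2f}]")
```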
The downside of this holistic approach: you don’t get the posterior predictive distributions for free. You have to do all the parts of Bayesian modeling, including deciding on priors and restricting your modeling to the class of Bayesian models.
In this regard, conformal prediction is very different, as it post-processes the model outputs. That means with conformal prediction, we can post-process even Bayesian posterior predictive distributions.
Another difference: conformal prediction produces prediction regions, at least in most cases. These prediction regions come with a coverage guarantee (which Bayesian models usually don’t have) that says that, on average, 1 - α of the true outcomes are covered by these regions. This interpretation is very similar to the interpretation of frequentist confidence intervals.
And in fact, conformal prediction regions are based on a frequentist interpretation of probability.
Conformal Prediction Has A Frequentist Interpretation
The frequentist interpretation of probability:
Probability is seen as the relative frequency of an event in infinitely repeated trials.
This interpretation is a bit clunky. When doing statistical consulting as a student, I found that many people actually preferred the Bayesian interpretation, even though they only used frequentist statistics.
The confidence interval is a typical frequentist product. Here is its interpretation (excerpt from Modeling Mindsets):
The confidence interval is the region that likely covers the true parameter, which is assumed to be fixed but unknown. But it would be wrong to say: There is a 95% chance that the true parameter falls into this interval. A big No-No. If the modeler wants this type of interpretation, they better knock on the Bayesians' door. But for the frequentist, the experiment is done. Either the parameter is in the resulting interval, or it isn’t. Instead, the frequentist interprets the 95% confidence interval in terms of long-run frequencies: If the modeler repeated the experiment many times, 95% of the time, the corresponding confidence interval would cover the true parameter.
Prediction sets and intervals produced by conformal prediction also have a frequentist interpretation: in the long run, a fraction of 1 - α of these sets/intervals covers the true outcome.
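That long-run statement is also what you can check empirically: on a labeled test set, count how often the intervals cover the true outcome. A tiny sketch with made-up arrays:

```python
import numpy as np

# Made-up prediction intervals and true outcomes for five test points.
lower = np.array([1.0, 2.5, 0.3, 4.0, 3.1])
upper = np.array([2.0, 3.5, 1.3, 5.5, 4.0])
y_test = np.array([1.5, 3.7, 0.9, 4.2, 3.3])

# Empirical coverage: fraction of intervals that contain the true outcome.
coverage = np.mean((y_test >= lower) & (y_test <= upper))
print(coverage)  # should be close to 1 - alpha on a large test set if the guarantee holds
```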
Next up: how does conformal prediction relate to bootstrapping?
Bootstrapping
The other approach people wanted a comparison with was bootstrapping. Bootstrapping is similar to conformal prediction in the sense that it can be applied to any model:
draw multiple samples from the data (with replacement)
on each sample, fit a model
for a data point, study how much the prediction varies
And to get a bootstrapped prediction interval, we can do something very similar to conformal prediction: for a new data point, get the full bootstrap distribution of predictions and compute a 90% interval from its quantiles. For example, if you have 100 model refits, remove the lowest 5 and the highest 5 predictions, and you have the 90% interval.
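Here is a minimal sketch of that recipe, with a toy dataset and a scikit-learn linear model standing in for whatever model you actually use:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Toy data and model; any model works the same way.
X = rng.uniform(0, 10, size=(200, 1))
y = 2 * X[:, 0] + rng.normal(0, 1, size=200)
x_new = np.array([[5.0]])

n_boot = 100
preds = np.empty(n_boot)
for b in range(n_boot):
    # Draw a bootstrap sample (with replacement) and refit the model.
    idx = rng.integers(0, len(X), size=len(X))
    model = LinearRegression().fit(X[idx], y[idx])
    preds[b] = model.predict(x_new)[0]

# Percentile interval: cut off the lowest and highest 5% of the 100 predictions.
lower, upper = np.quantile(preds, [0.05, 0.95])
print(f"90% bootstrap interval: [{lower:.2f}, {upper:.2f}]")
```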
The problem: again, there’s no guarantee that this interval really covers the true outcome 90% of the time. Bootstrapping tends to underestimate the variance because each refitted model shares a large fraction of its data with the other models.
And yet again, you can use conformal prediction on top of bootstrapping results to conformalize the bootstrapped intervals.
Conformal Prediction
Recap: Conformal prediction takes a model and produces prediction regions instead of point predictions. It turns a heuristic notion of uncertainty into one with guaranteed coverage (e.g. 90%).
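For regression, a minimal sketch of the split conformal variant looks like this. The toy data and linear model are placeholders; the key ingredient is a calibration set with known outcomes:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Toy data, split into a training set and a calibration set.
X = rng.uniform(0, 10, size=(500, 1))
y = 2 * X[:, 0] + rng.normal(0, 1, size=500)
X_train, y_train = X[:300], y[:300]
X_calib, y_calib = X[300:], y[300:]
x_new = np.array([[5.0]])

model = LinearRegression().fit(X_train, y_train)

# Nonconformity scores on the calibration set: absolute residuals.
scores = np.abs(y_calib - model.predict(X_calib))

# Conformal quantile for 90% coverage (alpha = 0.1), with finite-sample correction.
alpha = 0.1
n = len(scores)
q_level = np.ceil((n + 1) * (1 - alpha)) / n
q_hat = np.quantile(scores, q_level, method="higher")

# Prediction interval for the new point: point prediction +/- q_hat.
pred = model.predict(x_new)[0]
print(f"90% prediction interval: [{pred - q_hat:.2f}, {pred + q_hat:.2f}]")
```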
To cite my book on conformal prediction (coming soon) on the interpretation of these prediction regions:
The prediction regions are random variables that have a frequentist interpretation. Meaning they follow a frequentist interpretation of probability, rather similar to confidence intervals of coefficients in linear regression models. Since the true value is seen as fixed, but the prediction region is the random variable, it would be wrong to say that the true value “falls” into the interval. Because the true value is fixed but unknown. Also, if we have just one interval, we can’t make probabilistic statements, since it’s one realization of the prediction region random variable. And either it covers the true value or it doesn’t, not subject to probability. Very nitpicky, I know, but that’s how it is. Instead we can only speak about the average behavior of these prediction regions in the long-run, e.g. how the region “variable” behaves in repeated observations.
Another mindset I write about in Modeling Mindsets is supervised machine learning. In supervised machine learning, there’s a strong focus on evaluation. The evaluation has to happen out-of-sample, meaning a separation between training data and evaluation data. It also requires that we have the ground truth for the evaluation data. That’s very similar to conformal prediction, which requires a calibration data set for which we also have the ground truth.
I like to think of CP as partially bringing a supervised learning mindset to uncertainty quantification. This is, of course, a simplification, since the true motivation behind requiring separate calibration data rests on math and statistical theory. But the parallels to model evaluation in supervised learning are there, and understanding the supervised learning mindset helps to understand conformal prediction.
That’s also why conformal prediction feels natural to me: I studied statistics for 5 years, which was mostly frequentist modeling. After that I switched to machine learning, mostly supervised learning. A perfect home for conformal prediction.
As we have seen, conformal prediction can be combined with Bayesian models and bootstrapping. CP is not an alternative uncertainty method but can be used as a post-processing method for other uncertainty quantification approaches. In fact, conformal prediction needs some notion of uncertainty to work.
To summarize:
Bayesian versus frequentist interpretation of probability is fundamental to understanding uncertainty quantification
Prediction regions produced by conformal prediction have a frequentist interpretation
Conformal prediction shares ideas with supervised learning, especially evaluation and requirements for ground truth and out-of-sample data
Bayesian posterior predictive distributions and bootstrapping are not in competition with conformal prediction. Conformal prediction can post-process other uncertainty outputs
Benefits Of Modeling Mindsets
When a paradigm such as conformal prediction comes along, it can be hard to “integrate” it with other approaches that you already know. At least for me, that was one of the reasons why I didn’t immediately get the idea of conformal prediction.
I’ve gotten much better at understanding new methods by knowing the different “archetypes” or mindsets of modeling. And I’ve put all this high-level knowledge into the book Modeling Mindsets. Without the math and details, just get the gist of Bayesian inference, unsupervised machine learning, causal inference, and all the other modeling mindsets.
Getting an overview makes it much easier to understand and categorize new approaches. If you haven’t read Modeling Mindsets yet, check it out: