A Pragmatic View of Uncertainty in Machine Learning
Untangling aleatoric and epistemic uncertainty
Today’s post is based on the paper Sources of Uncertainty in Machine Learning -- A Statisticians' View.
Aleatoric versus epistemic uncertainty
Machine learning models make predictions. But these predictions are not always correct. There's uncertainty in the predictions (though it's not always quantified), which makes them deviate from the ground truth.
A typical way to talk about uncertainty is by distinguishing aleatoric and epistemic uncertainty.
Imagine you have a system. The system has some internal rules that tell it how to produce an output based on some inputs. But before the output is returned, a random number generator adds a little bit of noise on top. You can train a machine learning model to try to reproduce the system’s output.
If the model learns the system’s rules perfectly, the predictions will still deviate from the actual outputs. That’s the aleatoric uncertainty or irreducible error. And it comes from the random noise that was added to the output.
But the model might also fail to learn the rules of the system perfectly, which makes the predictions deviate further from the actual output. This is epistemic uncertainty, also called reducible error, since, in theory, you can reduce it by training a better model.
So aleatoric uncertainty has something to do with true randomness and epistemic uncertainty with shortcomings of the model. Got it.
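To make this concrete, here is a minimal simulation sketch (my own toy example, not from the paper): the system's rules are a sine curve, a noise generator sits on top, and a deliberately too-simple linear model tries to reproduce the output. The noise variance is the aleatoric part; the gap between the model and the true rules is the epistemic part.

```python
# Toy simulation of the "system with noisy output" described above.
# The decomposition below is illustrative, not taken from the paper.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

def system_rules(x):
    # The deterministic internal rules of the system.
    return np.sin(3 * x)

n = 2000
X = rng.uniform(-2, 2, size=(n, 1))
noise = rng.normal(scale=0.3, size=n)        # the random number generator on top
y = system_rules(X[:, 0]) + noise            # observed output

# A deliberately misspecified model: a straight line can't learn sin().
model = LinearRegression().fit(X, y)
pred = model.predict(X)

total_mse = np.mean((y - pred) ** 2)
aleatoric = 0.3 ** 2                                        # irreducible: variance of the noise
epistemic = np.mean((system_rules(X[:, 0]) - pred) ** 2)    # reducible: model vs. true rules

print("total MSE:", round(total_mse, 3))
print("aleatoric (noise variance):", round(aleatoric, 3))
print("epistemic (model vs. rules):", round(epistemic, 3))
```

The total error is (approximately) the sum of the two parts: a better model could shrink the epistemic term, but the 0.09 from the noise stays no matter what.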
A dice roll is aleatoric, isn’t it?
While I neatly separated aleatoric and epistemic uncertainty, it’s not so easy in reality.
When you roll dice, the outcome is usually considered aleatoric: all the uncertainty is due to randomness which makes it an irreducible error. In other words, there is no model that can capture the mechanics, right? The word “aleatoric” even comes from “alea”, the Latin word for dice.
But you can make the case that dice rolls are not aleatoric:
“Although rolling a fair dice is commonly used as an example of pure randomness and thus aleatoric uncertainty, it can also be seen as a purely physical process. Knowing the initial position, each rotation and movement of the dice, it is possible to predict exactly which number will be rolled.”
So what we think of as “random” is actually deterministic; it's just too much effort to measure and compute all the physics behind a dice roll.
But this line of thinking moves the needle too far towards epistemic uncertainty, because then almost everything would be epistemic uncertainty, except maybe for some things happening at the quantum level. Discussions would start about whether the universe is deterministic.
This is not a productive definition of aleatoric and epistemic uncertainty.
A pragmatic way to distinguish aleatoric and epistemic uncertainty
A way out of this dilemma is provided by Gruber et al. (2023), based on probability theory.
This requires first defining the output (Y) and the input features (X). Then you can define the aleatoric uncertainty through the probabilistic relationship between X and Y. Mathematically, this is simply described by the conditional probability distribution:

P(Y | X = x)
This conditional distribution captures all of the aleatoric uncertainty, which we can quantify as the conditional variance Var(Y|X=x). Any remaining uncertainty is epistemic under this definition.
The trick lies in the conditioning on X. For the dice roll, if you condition on all the physical features, meaning they are part of X, then there would be little to no aleatoric uncertainty because Var(Y|X=x) would be low.
But if the dice roll is modeled without the physical features, maybe with no features at all, then we get P(Y), and Var(Y) is just the usual variance of dice rolls, i.e., the usual aleatoric uncertainty. And we can still use dice as a random number generator when we play games.
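Here is a toy sketch of that argument (the “physics” is a made-up deterministic rule, not real dice dynamics): without features, the variance of the outcome is the familiar variance of a die; conditioning on a precise enough measurement of the physical state makes the conditional variance shrink toward zero.

```python
# Toy illustration of how conditioning on X changes the aleatoric
# uncertainty of a die roll. The "rules" are invented for illustration.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200_000

state = rng.uniform(0, 1, n)                          # hidden physical state of the throw
face = (np.floor(1000 * state) % 6 + 1).astype(int)   # deterministic rules -> face 1..6

df = pd.DataFrame({"face": face})

# No features: P(Y), with the familiar variance of a (near-)fair die, ~2.9.
print("Var(Y):", df["face"].var())

# Coarse measurement of the state: conditioning barely helps,
# Var(Y | X = x) stays close to the marginal variance.
df["coarse_x"] = np.round(state, 1)
print("avg Var(Y | coarse X):", df.groupby("coarse_x")["face"].var().mean())

# Precise measurement of the state: within each group the outcome is
# (almost) fully determined, so the average Var(Y | X = x) drops toward zero.
df["precise_x"] = np.round(state, 4)
print("avg Var(Y | precise X):", df.groupby("precise_x")["face"].var().mean())
```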
Which uncertainty does conformal prediction quantify?
Techniques such as conformal prediction quantify uncertainty. But which one do they capture? Aleatoric or epistemic?
Conformal prediction is a method that turns predictions into prediction regions (intervals for regression, sets of classes for classification) that come with formal guarantees of covering the true outcome. The method works for any machine learning model because it relies on a calibration step: on held-out calibration data, it measures how far the model's predictions typically deviate from the truth and uses that to generate a range of potential outcomes.
If you want to dive deeper into conformal prediction, check out my book Introduction To Conformal Prediction With Python or check out my free course.
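For the regression case, a minimal sketch of split conformal prediction could look like this (placeholder data and model, not code from the book):

```python
# Split conformal prediction for regression: fit on one split,
# calibrate on another, then form intervals with a quantile of the residuals.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=5, noise=10, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_cal, X_test, y_cal, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Calibration: nonconformity scores = absolute residuals on held-out data.
scores = np.abs(y_cal - model.predict(X_cal))

# Quantile with the finite-sample correction, targeting 95% coverage.
alpha = 0.05
n_cal = len(scores)
q = np.quantile(scores, np.ceil((n_cal + 1) * (1 - alpha)) / n_cal, method="higher")

# Prediction intervals: point prediction +/- q.
pred = model.predict(X_test)
lower, upper = pred - q, pred + q
print("empirical coverage:", np.mean((y_test >= lower) & (y_test <= upper)))
```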
Conformal prediction captures both aleatoric and epistemic uncertainty.
For example, an image classifier might classify a photo as a “cat” with a score of 30%. But we can’t trust the score since classifiers are often not calibrated.
Using conformal prediction, we might get a prediction set containing {cat, dog, mouse, lion}. Let's say it's a 95% prediction set, meaning that, in expectation, it covers the true class with a probability of 95%. The more classes the set contains, the less certain the model was. The set size is driven by both aleatoric and epistemic uncertainty. For real data, there is no way for us to say what the set would look like if only aleatoric uncertainty were present.
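Mechanically, such a set could be built like this (a toy sketch with made-up softmax scores and a made-up calibration threshold q_hat, using the simple “1 minus predicted probability” nonconformity score):

```python
# Sketch of how a conformal prediction set is formed for classification.
# The probabilities and the threshold are toy numbers, not a real classifier.
import numpy as np

classes = np.array(["cat", "dog", "mouse", "lion", "car"])

# Toy softmax output for one photo (the "30% cat" example above).
probs = np.array([0.30, 0.25, 0.20, 0.15, 0.10])

# Threshold from the calibration step (placeholder value here); in practice
# it is the corrected (1 - alpha) quantile of 1 - p(true class) on calibration data.
q_hat = 0.88

# Keep every class whose nonconformity score 1 - p(class) is below the threshold.
prediction_set = classes[1 - probs <= q_hat]
print(prediction_set)   # -> ['cat' 'dog' 'mouse' 'lion']
```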
In fact, that’s a general problem in machine learning: we can’t distinguish between aleatoric and epistemic uncertainty in real data.
Short note: In practice, we don't have access to P(Y|X=x); it's a theoretical model. So we can only quantify aleatoric uncertainty exactly in simulated or theoretical cases.