Last week’s post was about aleatoric and epistemic uncertainty. Recap:
Sources of uncertainty in machine learning models can be categorized as aleatoric (inherent noise in the system) and epistemic (flaws in the model).
However, this view of uncertainty is very model-centric and neglects two other sources: data uncertainty and deployment uncertainty, which we discuss in this post. Like last week's post, this one was inspired by the paper by Gruber et al. (2023).
Data Uncertainty
When we talk about the aleatoric and epistemic uncertainty of a model, we often assume that the data is fixed and not a source of uncertainty. But everyone who has collected and worked with real data knows that it is far from "certain".
Raw numbers may seem like objective truths. But even a simple measurement, such as the length of a screw, is subject to uncertainty: it might be measured with some error, for example introduced by rounding. Or two people measure it and one includes the head of the screw in the length while the other doesn't.
Data has many sources of uncertainty:
Unobserved variables
Omitted variables
Errors in X
Errors in Y
Missing data
These errors can be difficult to quantify, especially when you know little about the data-collection process. If you just download a poorly documented Kaggle dataset, there's little chance you will learn how uncertain the data are.
But it helps to keep these error sources in mind, and sometimes you can even reduce them, at least if you have access to the collection process. Sometimes it's possible to hunt down missing values, include omitted variables, and quantify measurement errors.
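To make one of these sources concrete, here's a minimal simulation of errors in X (a toy sketch of my own, not from the paper): measurement noise in a feature biases a regression slope toward zero, an effect known as attenuation bias.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# True relationship: y = 2 * x + noise
x_true = rng.normal(0, 1, n)
y = 2 * x_true + rng.normal(0, 0.5, n)

# The same feature, but measured with error (e.g., a noisy sensor)
x_noisy = x_true + rng.normal(0, 1, n)

# Least-squares slope with clean vs. noisy feature
slope_clean = np.polyfit(x_true, y, 1)[0]
slope_noisy = np.polyfit(x_noisy, y, 1)[0]

print(f"slope with clean X: {slope_clean:.2f}")  # ~2.0
print(f"slope with noisy X: {slope_noisy:.2f}")  # ~1.0, attenuated toward zero
```

Here the measurement noise has the same variance as the feature itself, so the estimated slope shrinks by roughly half. If you can quantify the measurement error, you can at least anticipate (and sometimes correct) this kind of bias.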
Deployment Uncertainty
Machine learning models are typically deployed in an application where they are fed new data to produce new predictions.
And this step, too, holds uncertainty: uncertainty about the new data. Not only can the new data have the same problems as the training data (measurement errors, missing data, etc.), but the uncertainty due to distribution shift always looms.
When a distribution shift happens, the new data differs from the training data. This can be a change in the distribution of X (covariate shift), in Y (label shift), or in the relationship between the two (concept drift).
This potentially worsens your model's performance, and it's a very common occurrence. You can monitor for distribution shifts, but they can be hard to detect. Detecting shifts in Y or in the X-Y relationship usually requires knowing the ground truth, and in many applications you never get the Y for free.
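One simple way to monitor the distribution of X, sketched below with made-up data and an arbitrary threshold, is a two-sample test comparing training data against new data, for example with scipy's Kolmogorov-Smirnov test:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)

# Feature values seen during training vs. at deployment time
x_train = rng.normal(0, 1, 5_000)
x_new = rng.normal(0.3, 1, 5_000)  # the mean has drifted

stat, p_value = ks_2samp(x_train, x_new)
print(f"KS statistic: {stat:.3f}, p-value: {p_value:.2g}")
if p_value < 0.01:
    print("Warning: feature distribution may have shifted")
```

Note that this only catches shifts in the marginal distribution of a single feature; shifts in Y or in the X-Y relationship still require ground-truth labels.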
The deployment uncertainty of the model's predictions grows with each day, as the new data may drift further and further from the training data.
Even if you quantified uncertainty beforehand, even with a calibrated approach such as conformal prediction, the predictions can gain additional uncertainty that is not reflected in your numbers.
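To illustrate (again a toy sketch of my own, not an example from the paper): split conformal prediction promises roughly 90% coverage when the new data is exchangeable with the calibration data, but a distribution shift silently breaks that promise.

```python
import numpy as np

rng = np.random.default_rng(2)

def make_data(n):
    x = rng.normal(0, 1, n)
    y = 2 * x + rng.normal(0, 1, n)
    return x, y

# "Model": fit a slope on training data
x_tr, y_tr = make_data(2_000)
slope = np.polyfit(x_tr, y_tr, 1)[0]

# Split conformal: residual quantile on a held-out calibration set
x_cal, y_cal = make_data(1_000)
alpha = 0.1
scores = np.abs(y_cal - slope * x_cal)
q = np.quantile(scores, (1 - alpha) * (1 + 1 / len(scores)))

def coverage(x, y):
    """Fraction of points inside the interval prediction +/- q."""
    return np.mean(np.abs(y - slope * x) <= q)

# New data from the same distribution: coverage holds
x_iid, y_iid = make_data(5_000)
print(f"coverage, no shift:    {coverage(x_iid, y_iid):.2f}")  # ~0.90

# Shifted deployment data: the noise level has doubled
x_s = rng.normal(0, 1, 5_000)
y_s = 2 * x_s + rng.normal(0, 2, 5_000)
print(f"coverage, under shift: {coverage(x_s, y_s):.2f}")  # well below 0.90
```

The intervals stay the same width, but reality has changed underneath them.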
Summary: Think beyond just aleatoric and epistemic uncertainty — think of data and deployment uncertainty as well.
For a deeper dive, have a look at the paper that inspired this post.