How I made peace with quantile regression
Understanding quantile regression through the lens of loss optimization
I remember when I first learned about quantile regression. I hated it. I couldn't wrap my head around it for years. The worst part is that I can’t articulate why I had such a hard time.
But today quantile regression and I are at peace and I’ll tell you how we came to this peace agreement.
Quantile regression is typically motivated by distributions: In the great journey of becoming a statistician, you first learn about linear regression and how it models the conditional mean of a distribution. Later you learn about quantile regression, which doesn’t target the mean but other locations in that distribution. But you see, I’m already drifting into the statisticians’ narrative of quantile regression, which even many machine learning intros use. Forget all of the distribution stuff for now, because quantile regression is simple to understand if you view it from a different modeling mindset.
From a machine learning perspective, quantile regression is absurdly simple: It's regression with a specific loss function called the pinball loss. That's it.
This framing opens a path to understanding quantile regression: Let’s study the pinball loss.
Understanding the pinball loss
To get your model to do quantile regression, it should optimize the pinball loss:
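In one standard formulation (the notation matches the worked example further below), the loss for a single observation is:

$$
L_\tau\bigl(y, f(x)\bigr) =
\begin{cases}
\tau \cdot \bigl(y - f(x)\bigr) & \text{if } y \ge f(x) \quad \text{(underprediction)} \\
(1 - \tau) \cdot \bigl(f(x) - y\bigr) & \text{if } y < f(x) \quad \text{(overprediction)}
\end{cases}
$$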
Let’s take this loss function apart to demystify quantile regression.
At the loss function’s core, we find the absolute difference abs(y - f(x)) between the ground truth and the model’s prediction. If we ignore the parameter 𝜏 and just use the absolute differences, it’s the same as optimizing the L1 loss, which is often chosen for training models that are robust against outliers, as in least absolute deviations (LAD) regression. Why do absolute differences protect against outliers? Many losses use squared differences between target and prediction, where outliers can have an immense effect. But with absolute differences, the loss grows only linearly for extreme mismatches between target and prediction.
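To make the robustness argument concrete, here is a tiny sketch (with made-up residuals) showing how a single outlier inflates the squared error far more than the absolute error:

```python
import numpy as np

# Made-up residuals y - f(x); the last one is an outlier.
residuals = np.array([1.0, -2.0, 0.5, 1.5, 100.0])

print("mean absolute error:", np.mean(np.abs(residuals)))  # 21.0
print("mean squared error: ", np.mean(residuals ** 2))     # 2001.5
```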
A model with perfect predictions has a pinball loss of zero regardless of the choice of 𝜏. It’s not surprising since the loss function uses absolute differences. But for some reason emphasizing this fact helped me better grasp quantile regression.
The pinball loss is asymmetric: The loss attaches different costs to underpredicting and overpredicting the target. The absolute differences are multiplied by either 𝜏 or 1 - 𝜏. The parameter 𝜏 is picked by the modeler, must be a number between 0 and 1, and steers how expensive each kind of mistake is. Values of 𝜏 < 0.5 make overprediction more expensive, so the model is incentivized to underpredict. If 𝜏 > 0.5, the model has an incentive to overpredict the target. The further the parameter 𝜏 is from 0.5, the stronger the incentive to under- or overpredict.
The asymmetry is best understood with an example. Here I picked 𝜏 = 0.1 and the ground truth is y = 103.
The pinball loss is zero if we predict f(x) = 103 perfectly
If the model underpredicts by 6 with f(x) = 97, the loss is 0.1 · (103 - 97) = 0.6
If the model overpredicts by 6 with f(x) = 109, the loss is (1 - 0.1) · (109 - 103) = 5.4
The effect of 𝜏=0.1 on model training is to bias the model toward underpredicting the target since overpredicting is 9 times more expensive than underpredicting when the absolute differences are the same.
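Here is a minimal Python sketch of the pinball loss (the function name is mine, not from any particular library) that reproduces the numbers above:

```python
import numpy as np

def pinball_loss(y_true, y_pred, tau):
    """Pinball loss for a single observation or arrays of observations."""
    diff = y_true - y_pred
    # Underprediction (y_true > y_pred) is weighted by tau,
    # overprediction (y_true < y_pred) by 1 - tau.
    return np.where(diff >= 0, tau * diff, (1 - tau) * (-diff))

tau = 0.1
print(pinball_loss(103, 103, tau))  # 0.0 (perfect prediction)
print(pinball_loss(103, 97, tau))   # ~0.6 (underprediction by 6)
print(pinball_loss(103, 109, tau))  # ~5.4 (overprediction by 6)
```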
Summarizing these insights, the pinball loss is a cost-sensitive loss function based on absolute differences. As a consequence, quantile regression is a regression task that favors robust models and we can steer how strongly the model should under- or overpredict with parameter 𝜏.
What the pinball loss tells us about quantiles
The asymmetry parameter 𝜏 dictates how often we should expect the target to be under- or overpredicted.
For 𝜏=0.1, we would expect the model to underpredict 90% of the time and overpredict 10% of the time. And now we have already bridged the gap to distributions: the prediction that achieves this is the 10% quantile.
To see how the pinball loss and quantiles relate to each other, I took 10 data points and the simplest model possible: a constant value. Picking 𝜏=0.1, what would be the best constant model for these 10 data points? The following figure shows the mean pinball loss across the 10 data points for different constant models:
The pinball loss is lowest between the 1st and 2nd data points. So the best constant model is any value between the first and the second point, for example -2.3. Here we can see quantile regression in action: Picking 𝜏=0.1 finds the spot where 10% (here 1/10) of the data are lower and 90% (here 9/10) are larger.
Of course, this visualization is simplified because quantile regression targets the conditional distribution, and the “best constant quantile regression model” is equivalent to a quantile of the marginal distribution. However, the principle of under- and overprediction remains the same for the conditional distribution.
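To make the constant-model picture concrete, here is a sketch with ten made-up values (not the data points from the figure) that checks where the mean pinball loss for 𝜏=0.1 is minimized and compares it to the empirical 10% quantile:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=10)  # ten made-up data points
tau = 0.1

def mean_pinball_loss(y, c, tau):
    # Constant model: predict the single value c for every data point.
    diff = y - c
    return np.mean(np.where(diff >= 0, tau * diff, (1 - tau) * (-diff)))

# Brute-force search over candidate constants.
grid = np.linspace(y.min() - 1, y.max() + 1, 2001)
losses = np.array([mean_pinball_loss(y, c, tau) for c in grid])
best_c = grid[losses.argmin()]

print("sorted data:           ", np.round(np.sort(y), 2))
print("best constant model:   ", round(best_c, 2))
print("empirical 10% quantile:", round(np.quantile(y, tau), 2))
```

Note that the minimum is flat: with 𝜏=0.1 and 10 points, every constant between the smallest and second-smallest value (one point below, nine above) achieves the same mean loss, so the brute-force search and np.quantile (with its default interpolation) may return slightly different values from within that interval.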
For completeness, here’s what the mean pinball loss for the 50% quantile looks like:
And here’s 90%:
The reason I revisited quantile regression is that I’m currently participating in a machine learning challenge that requires estimating the 10%, 50%, and 90% quantiles of water supply. Next week I’ll share a few tips and caveats when it comes to estimating quantile models.