What's the bigger difference?

A jump from 80.0% to 90.0%?

Or from 98.0% to 99.9%?

Well, that depends on which space you ask the question in.

The obvious answer would be that the first jump is larger, because it's a difference of 0.10, while the second is only 0.019.

If you thought so too, then it's because you calculated the difference in probability space, which is also the space in which I presented the question.

But you can also answer the question in log odds or logit space.

The logit of a probability is: logit(p) = log(p/(1-p)). We also call this the log odds. You’ll find the term “log odds” more often in classical statistics and “logits” more often in machine learning, at least that’s my impression so far.

And in logit space, the second jump is larger! Because in logit space we have:

log(0.9/0.1) - log(0.8/0.2) ≈ 0.8

log(0.999/0.001) - log(0.98/0.02) ≈ 3

The second jump (0.98 -> 0.999) also remains larger if we look at the ratio of the logits instead of their difference:

log(0.9/0.1) / log(0.8/0.2) ≈ 1.58

and

log(0.999/0.001) / log(0.98/0.02) ≈ 1.77
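These numbers are easy to verify with a few lines of Python, using only the standard library:

```python
import math

def logit(p):
    """Log odds of a probability p in (0, 1)."""
    return math.log(p / (1 - p))

# Additive differences in logit space
jump1 = logit(0.9) - logit(0.8)      # 80% -> 90%
jump2 = logit(0.999) - logit(0.98)   # 98% -> 99.9%
print(round(jump1, 2), round(jump2, 2))  # 0.81 3.01

# Ratios of the logits
ratio1 = logit(0.9) / logit(0.8)
ratio2 = logit(0.999) / logit(0.98)
print(round(ratio1, 2), round(ratio2, 2))  # 1.58 1.77
```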

But why is the second jump larger in logit space?

**When probabilities are near 0 or 1, the logit function stretches out the values. So the logit amplifies small differences in probabilities close to 0 or close to 1.**
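You can see the stretching directly by taking the same 0.01 step in probability space at different starting points (a small illustrative sketch):

```python
import math

def logit(p):
    """Log odds of a probability p in (0, 1)."""
    return math.log(p / (1 - p))

# The same 0.01 step in probability space becomes a
# larger and larger step in logit space near the edge:
for p in (0.50, 0.90, 0.98, 0.989):
    step = logit(p + 0.01) - logit(p)
    print(f"{p:.3f} -> {p + 0.01:.3f}: logit step = {step:.3f}")
```

Near 0.5 the step is tiny in logit space, while the step from 0.989 to 0.999 is huge.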

# When logit versus probability matters

Many have probably had their first encounter with logits when using logistic regression, because interpreting its coefficients requires the odds, which is just p/(1-p).

The β in logistic regression can be interpreted as the additive change in the log odds when the feature value is increased by 1 unit.
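A quick numeric check of that interpretation; the intercept β₀ = -2, the coefficient β = 0.7, and the feature value x = 1.5 are all made up for illustration:

```python
import math

def sigmoid(z):
    """Logistic function: maps log odds z to a probability."""
    return 1 / (1 + math.exp(-z))

def logit(p):
    """Log odds of a probability p in (0, 1)."""
    return math.log(p / (1 - p))

beta0, beta = -2.0, 0.7   # made-up coefficients
x = 1.5                   # arbitrary feature value

p_before = sigmoid(beta0 + beta * x)
p_after = sigmoid(beta0 + beta * (x + 1))

# Increasing x by 1 unit adds exactly beta to the log odds ...
print(round(logit(p_after) - logit(p_before), 6))  # 0.7

# ... which multiplies the odds by exp(beta)
odds_ratio = (p_after / (1 - p_after)) / (p_before / (1 - p_before))
print(round(odds_ratio, 4), round(math.exp(beta), 4))
```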

I always forget the interpretation, which is why I created a cheat sheet:

That’s just how logistic regression works, because it expresses the probability through the logistic function:

p = 1/(1 + exp(-(β₀ + β₁x₁ + … + βₚxₚ)))

To get the interpretation of β, you have to invert the logistic function, which gives you the logit function.
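A minimal numerical check that the logistic (sigmoid) function and the logit really are inverses of each other:

```python
import math

def sigmoid(z):
    """Logistic function: maps log odds z to a probability."""
    return 1 / (1 + math.exp(-z))

def logit(p):
    """Log odds of a probability p in (0, 1)."""
    return math.log(p / (1 - p))

# Round-tripping through both functions returns the input
for z in (-3.0, 0.0, 2.5):
    assert abs(logit(sigmoid(z)) - z) < 1e-9
for p in (0.1, 0.5, 0.98):
    assert abs(sigmoid(logit(p)) - p) < 1e-9
print("sigmoid and logit are inverses")
```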

But you do have a choice: you can also interpret so-called marginal effects. Here you make small changes to an input variable and observe the change in the model output. Since the output of logistic regression is a probability (at least it lies between 0 and 1), the marginal effects show how feature changes influence the prediction on the probability scale instead of the log odds scale.
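A rough sketch of marginal effects for a one-feature logistic model; the coefficients and the step size eps are made up for illustration:

```python
import math

def sigmoid(z):
    """Logistic function: maps log odds z to a probability."""
    return 1 / (1 + math.exp(-z))

beta0, beta = -2.0, 0.7   # made-up model
eps = 1e-4                # small change forced on the input

def predict(x):
    """Predicted probability for feature value x."""
    return sigmoid(beta0 + beta * x)

# Marginal effect: change in predicted probability per unit change in x
for x in (0.0, 3.0, 8.0):
    me = (predict(x + eps) - predict(x)) / eps
    print(f"x = {x}: marginal effect ≈ {me:.4f}")
```

Note how the marginal effect depends on where you are: it is largest near p = 0.5 and shrinks towards the edges, while β is constant on the log odds scale.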

But the distinction also matters for other interpretation techniques. When you use Shapley values on a classifier, you have the option to apply a link function. Well, one is always applied, but by default it’s the identity link f(x) = x. If you pick the logit link instead, you get the interpretation on the log odds scale.

And that can make sense, especially since on the logit (log odds) scale, the feature effects are additive rather than multiplicative.

But it also means that feature effects that push the probabilities towards the extremes will get more emphasis.
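To illustrate the additive-versus-multiplicative point with a toy two-feature logistic model (all coefficients and feature values made up): the feature effects add on the log odds scale and multiply on the odds scale.

```python
import math

def sigmoid(z):
    """Logistic function: maps log odds z to a probability."""
    return 1 / (1 + math.exp(-z))

# Made-up two-feature model
beta0, beta1, beta2 = -1.0, 0.8, 1.2
x1, x2 = 1.0, 0.5

# On the log odds scale, the feature effects simply add up:
log_odds = beta0 + beta1 * x1 + beta2 * x2

# On the odds scale, the very same effects multiply:
odds = math.exp(beta0) * math.exp(beta1 * x1) * math.exp(beta2 * x2)

print(round(log_odds, 4))           # 0.4
print(round(math.log(odds), 4))     # 0.4 -- same model, multiplicative view
print(round(sigmoid(log_odds), 4))  # the probability: neither a sum nor a product
```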

# When to use which

Pro log odds:

- Log odds are additive, while odds are multiplicative. Since many explanation methods express the prediction as a linear sum, the log odds might be the more natural choice.
- In many situations, a jump at the edges, like from 0.001 to 0.01, is more important than a jump in the middle, like from 0.5 to 0.6.
- Log odds avoid the extreme values of 0 and 1: the logit maps probabilities in (0, 1) to the entire real line.

Pro probabilities:

- Intuitive interpretation: Probabilities represent the likelihood of an event occurring, which we humans can grasp more easily than anything on a logarithmic scale.
- Making decisions: The output of the classifier might be used in decision-making. Here it’s often easier to work with probabilities to make a decision, calculate expected costs, and so on.
- Familiarity: Probabilities feel more natural to interpret, since we at least believe we understand them better than log odds. We hear about probabilities every day, like the probability of winning the lottery or of getting cancer, but we rarely have to work with log odds in daily life.