The Strategic Use of Custom Metrics in Machine Learning
But have you considered writing your own evaluation metric?
Metrics like mean squared error or F1 can be convenient.
But we should be using custom evaluation metrics much more often.
Last week I wrote about an equivalence between loss functions and distributions.¹ Today’s post goes a step further: the evaluation metric is your lever for steering the model.
You invest considerable energy in training the model, shaping the data, tuning the parameters, and engineering the features. Yet, the most impactful aspect may be the performance evaluation. The tricky part is this: If you choose the wrong evaluation metric, you won't notice during training. The model's performance might appear stellar, but it's irrelevant if it optimizes the wrong thing.
This is an instance of the quote often attributed to Einstein: “If I had an hour to solve a problem, I’d spend 55 minutes thinking about the problem and five minutes thinking about solutions.”
Off-the-shelf metrics are not judgment-free
Evaluation metrics come with assumptions, and off-the-shelf metrics are no exception.
Let’s take the mean squared error:
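For $n$ observations with targets $y_i$ and predictions $\hat{y}_i$, it is defined as:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$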
I’d call it an off-the-shelf metric. It might seem like a “neutral” choice. But it carries lots of assumptions:
Symmetry: Errors in both directions are weighted equally.
Quadratic penalty: An error twice as large counts four times as much.
Focus on large errors: The biggest residuals dominate the metric.
Outlier sensitivity: A few extreme data points can dictate the score.
Scale dependence: The value depends on the scale of the data.
Equal weighting: All data points contribute equally to the average.
Depending on your use case, these assumptions might be appropriate, but they might not be.
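For example, if under-predictions are more costly than over-predictions in your application, you can drop the symmetry assumption. Here is a minimal sketch in Python; the 3x penalty factor is a made-up placeholder for whatever a domain expert would actually tell you:

```python
import numpy as np

def asymmetric_squared_error(y_true, y_pred, under_weight=3.0):
    """Mean squared error with a heavier penalty for under-predictions.

    under_weight is an invented factor standing in for domain knowledge,
    e.g. "running out of stock costs 3x as much as overstocking".
    """
    residuals = np.asarray(y_true) - np.asarray(y_pred)
    # residual > 0 means the model predicted too low (under-prediction)
    weights = np.where(residuals > 0, under_weight, 1.0)
    return np.mean(weights * residuals**2)

# Two errors of the same magnitude, opposite directions:
y_true = np.array([10.0, 10.0])
y_pred = np.array([12.0, 8.0])
print(asymmetric_squared_error(y_true, y_pred))  # (1*4 + 3*4) / 2 = 8.0
```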
In our new book, "Supervised Machine Learning for Science," Timo and I discuss how evaluation metrics can incorporate domain knowledge into the model.
A few tips to get started:
You can start with an off-the-shelf metric that suits your task, and then modify it based on domain knowledge.
For classification, cost-sensitive metrics are a flexible class that you can adapt to your use case (see the sketch after these tips).
Make sure to communicate with domain experts. A good question to ask: How “expensive” should different types of errors be?
The closer you can match the evaluation metric to the use case, the better.
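To illustrate the cost-sensitive idea, here is a minimal sketch of a cost-matrix-based metric. The cost values are invented for illustration; in practice they would come from the domain experts mentioned above:

```python
import numpy as np

# Hypothetical costs for a binary screening task:
# rows = true class, columns = predicted class.
# Missing a positive case (false negative) is assumed to be
# ten times as expensive as a false alarm.
COSTS = np.array([
    [0.0,  1.0],   # true 0: correct, false positive
    [10.0, 0.0],   # true 1: false negative, correct
])

def expected_cost(y_true, y_pred, costs=COSTS):
    """Average misclassification cost per sample (lower is better)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return costs[y_true, y_pred].mean()

y_true = np.array([1, 1, 0, 0, 1])
y_pred = np.array([0, 0, 0, 1, 1])
print(expected_cost(y_true, y_pred))  # (10 + 10 + 0 + 1 + 0) / 5 = 4.2
```

Changing a single number in the cost matrix changes what the metric rewards, which is exactly the conversation to have with domain experts.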
It's important to distinguish between the loss that the model directly optimizes and the evaluation metric that represents your overall goal in making modeling decisions. Ideally, these are the same, but that's not always possible, especially as some model classes or implementations do not allow picking custom loss functions (e.g., CART).
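CART is a good example: you can't swap out its internal splitting criterion in scikit-learn, but a custom metric can still steer model selection. A sketch, reusing the asymmetric metric from above:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

def asymmetric_squared_error(y_true, y_pred, under_weight=3.0):
    residuals = y_true - y_pred
    weights = np.where(residuals > 0, under_weight, 1.0)
    return np.mean(weights * residuals**2)

# The trees are still grown with CART's built-in splitting criterion,
# but hyperparameter selection is steered by the custom metric.
scorer = make_scorer(asymmetric_squared_error, greater_is_better=False)

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5, 20]},
    scoring=scorer,
    cv=5,
)
search.fit(X, y)
print(search.best_params_)
```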