4 Comments

In neural networks, averaging the weights of the five models can sometimes work quite well as an alternative to ensembling (especially if you use techniques like Git Re-Basin).
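For what it's worth, here is a minimal PyTorch sketch of plain (uniform) weight averaging, assuming the five models share exactly the same architecture; the Git Re-Basin permutation-alignment step is not shown:

```python
import copy
import torch

def average_weights(models):
    """Return a copy of the first model whose floating-point parameters are the
    element-wise mean of the corresponding parameters across all given models."""
    avg_model = copy.deepcopy(models[0])
    avg_state = avg_model.state_dict()
    for key, value in avg_state.items():
        if value.is_floating_point():  # skip integer buffers, e.g. BatchNorm counters
            avg_state[key] = torch.stack(
                [m.state_dict()[key] for m in models]
            ).mean(dim=0)
    avg_model.load_state_dict(avg_state)
    return avg_model
```

Without alignment, naive averaging can land in a poor region of the loss landscape, so treat this as illustrative rather than a recipe.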

Keeping a held-out test set to pick the best method among the ones you presented can also help.


Concerning the "The F*-It Approach": I was very unpleasantly surprised to see you toss profanity into your post. I really love your blog but I would respectfully ask you to please keep the language professional.


Please excuse the lengthy comment, but this is a rather important issue.

I have thought a lot about this issue and I am now convinced that what you call "parameter donation" is the most appropriate approach, though your description of it does not make it clear why. So, let me reframe the considerations differently.

Fundamentally, I believe that most of the confusion lies in the conflicting ways that we use the term "model" without realizing it. There are at least three distinct meanings when we use the term "model" in ML (the short code sketch after this list makes the distinction concrete):

* A model is an algorithm for how to map predictors to a prediction of the target outcome, e.g., linear regression, neural network, decision tree, etc. Rather than "model", it is clearer to call this an ML "algorithm".

* A model is a specific set of hyperparameters used to configure an ML algorithm, regardless of the dataset. For example, a random forest with 200 trees of max depth 5 would be such a "model". But it would be clearer to call this a "hyperparameter configuration".

* A model is a specific mapping of a specific hyperparameter configuration trained on a specific dataset. From this perspective, if any hyperparameter changes or even one line of data is different, then you have a different model. In my view, this is the most consistent use of the term "model". However, to clearly distinguish it from the other confusing uses, this particular meaning could be called a "trained model".
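Here is a rough illustration of the three meanings in scikit-learn terms; the estimator, hyperparameter values, and toy dataset are my own choices, purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 1) "Algorithm": the recipe for mapping predictors to predictions,
#    independent of any settings or data.
algorithm = RandomForestClassifier

# 2) "Hyperparameter configuration": the algorithm plus a fixed set of
#    hyperparameters, still independent of any dataset.
configuration = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=0)

# 3) "Trained model": that configuration fit to one specific dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
trained_model = configuration.fit(X, y)
# Change any hyperparameter or any row of (X, y) and, by this definition,
# you have a different trained model.
```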

A second issue of confusion is that the model development and model deployment stages have two distinct, albeit related, goals. The goal of model development is NOT to produce a trained model. Its purpose is to produce the best performing hyperparameter configuration, regardless of what data that configuration might be trained on. The goal of model deployment is to put into production the best performing trained model as a tool that can be used to predict future data.

From this perspective, if you accept my definition of a trained model as hyperparameters + dataset, then the best trained model for deployment is the best hyperparameter configuration trained on the best-quality data we have. Framed this way, the best hyperparameter configuration is the specific ML algorithm with its specific set of hyperparameters that the model development process you described has yielded. And the best-quality data we have is simply the full dataset without any splits--using all the available data is necessarily the best we can do. (Of course, I take for granted that any needed data cleansing has been carried out.) So the best deployment model is clearly the trained model based on the hyperparameters from the development stage, trained on the full dataset. This is what you call the "parameter donation" approach. (I don't really like that term, though. I prefer to call it something like "full model training", where "full" mainly refers to the full dataset without any splits.)
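If it helps, a minimal sketch of what I mean by full model training; the estimator, hyperparameter values, and toy data are only placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in for the full, cleaned dataset (no splits at this stage).
X_full, y_full = make_classification(n_samples=1000, n_features=20, random_state=0)

# Output of the model development stage: the winning hyperparameter configuration.
best_params = {"n_estimators": 200, "max_depth": 5}

final_model = RandomForestClassifier(**best_params, random_state=0)
final_model.fit(X_full, y_full)  # trained on everything; its own performance is not re-estimated here
```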

The primary objection to full model training probably arises from yet another confusion--the confusion between model evaluation and model deployment. These are two different steps with different goals. The goal of model evaluation is to estimate the performance of a trained model. To avoid overfitting, we need to split the data, whether with simple splitting or cross-validation. That way, we can get a [hopefully] unbiased estimate of the performance of a trained model so that we can infer which hyperparameter configuration works best on the kind of data we have. But this is NOT a direct evaluation of a hyperparameter configuration. Only a model trained on a specific data subset can be directly evaluated. At best, a hyperparameter configuration is evaluated only indirectly by evaluating models trained on specific data using that hyperparameter configuration.
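To make the "indirect" part concrete, here is a small sketch (same illustrative estimator and toy data as above): cross-validation never scores the configuration itself, only the models trained on data subsets with that configuration, and the mean across folds is then used as a proxy for the configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
configuration = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=0)

# Each fold trains and scores a *different* trained model; only those models
# are evaluated directly.
fold_scores = cross_val_score(configuration, X, y, cv=5)
print(fold_scores, fold_scores.mean())  # the mean is the proxy estimate for the configuration
```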

This brings me to the crux of the misunderstanding. We must fully realize that if we deploy the best trained model (best hyperparameters + full dataset), then at the point of deployment, that trained model is NOT evaluated at all, at least, not directly. Since we do not split the data in any way, we have no way to evaluate that trained model's performance without overfitting. So, we just don't even try (that is, not at the time of deployment). We deploy it not because we have an accurate measure of its performance but because we have an accurate measure of the performance of other trained models that used that same hyperparameter configuration. And because we are using the full dataset this time, we fully expect the true performance of this final trained model to be at least as good as that of the trained models from the development process.

As long as we accept that we cannot directly evaluate the performance of the model produced by full model training, there is no problem. It is altogether reasonable to deploy a trained model because we have verified that it uses the best hyperparameter configuration and we know that it is trained on the most complete data that we have. Even though we cannot directly evaluate its performance, there is no reason to expect that any other option would yield better results--we would expect any other option to perform worse.

That said, once the final model is trained with full model training, MLOps then takes over. With MLOps, we DO have very well-defined ways to evaluate the performance of this final trained model based on future data as it comes in. That is, we track the performance of the predictions over time. The deployed model should perform at least as well as the best properly evaluated trained models from the model development stage. This can be verified over time during the production phase of the trained model, but not at the time of its initial deployment. This is normal. We should accept that.
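A rough sketch of that MLOps-side evaluation, assuming labels for new data eventually arrive; the metric, threshold, and names here are all my own illustrative choices:

```python
from sklearn.metrics import accuracy_score

DEV_ESTIMATE = 0.87   # e.g., the mean cross-validation score from the development stage
TOLERANCE = 0.05      # how much degradation we tolerate before raising a flag

def monitor_batch(deployed_model, X_new, y_new, history):
    """Score the deployed model on the latest batch of labeled production data
    and compare against the estimate obtained during model development."""
    score = accuracy_score(y_new, deployed_model.predict(X_new))
    history.append(score)
    if score < DEV_ESTIMATE - TOLERANCE:
        print(f"Live score {score:.3f} is well below the development estimate {DEV_ESTIMATE:.3f}")
    return history
```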


In high-stakes situations, where small performance gains might mean a lot of money, here is what I did: instead of evaluating a model, I evaluated the whole training-validation strategy.

Say you are comparing inside-out vs. parameter donation: you can simulate how those two strategies would perform out of sample and out of time by replaying your dataset over time, like a time-series validation. Apply each strategy to all the data available up to a given point, then test the resulting final model on the next period of time. You can break the time intervals by week, month, quarter, or year, depending on your problem.
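A rough sketch of that strategy-level backtest; `fit_strategy` stands for a hypothetical function that runs one whole candidate strategy (e.g., the inside-out or parameter-donation procedure) on the past data and returns a fitted model, and the cutoff granularity is up to you:

```python
import numpy as np

def backtest_strategies(X, y, timestamps, cutoffs, strategies, metric):
    """Replay the data in time order. For each consecutive pair of cutoffs, fit every
    strategy on all data before the first cutoff and score its final model on the
    window between the two cutoffs. Returns the mean score per strategy."""
    results = {name: [] for name in strategies}
    for start, end in zip(cutoffs[:-1], cutoffs[1:]):
        train = timestamps < start
        test = (timestamps >= start) & (timestamps < end)
        if not train.any() or not test.any():
            continue
        for name, fit_strategy in strategies.items():
            model = fit_strategy(X[train], y[train])   # e.g., inside-out or parameter donation
            results[name].append(metric(y[test], model.predict(X[test])))
    return {name: float(np.mean(scores)) for name, scores in results.items()}
```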

If parameter donation wins, then you can be confident it's the best strategy to use, even if you cannot make a meaningful claim about the test error! You can even quantify how much better you expect it to be than the other strategies. This lets you trade off the risk of uncertain test results against the potential performance gains of training with more or fresher data.
