We’ve all been there: You’ve set up a machine learning pipeline with tuning, model selection, and evaluation. Tons of data splitting, maybe with cross-validation. You are done modeling and need a final model to deploy.
Except, it’s not clear what the “final model” should be, since you have several options on how to proceed.
Let’s say you use 5-fold cross-validation to evaluate the model. The model is a gradient-boosted classification model (such as XGBoost or CatBoost). In each cross-validation iteration, the training data is further split into 80% for training and 20% for validation, and the validation part is used for tuning hyperparameters with random search. For more efficient data use you could alternatively tune with an inner cross-validation, but let’s keep it simple for this example.
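To make the setup concrete, here is a minimal sketch of that evaluation loop. Everything specific in it is an assumption for illustration: the synthetic data, the search space, the number of random-search candidates, and the use of scikit-learn’s `GradientBoostingClassifier` as a stand-in for XGBoost or CatBoost.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, ParameterSampler, train_test_split

# Synthetic stand-in data (assumption); replace with your own X, y.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Hypothetical search space, chosen only for illustration.
SEARCH_SPACE = {
    "n_estimators": [25, 50, 100],
    "learning_rate": [0.03, 0.1, 0.3],
    "max_depth": [2, 3, 4],
}


def tune(X_dev, y_dev, n_candidates=5, seed=0):
    """Random search on a single 80/20 train/validation split."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_dev, y_dev, test_size=0.2, random_state=seed
    )
    best_model, best_params, best_acc = None, None, -1.0
    for params in ParameterSampler(SEARCH_SPACE, n_iter=n_candidates, random_state=seed):
        model = GradientBoostingClassifier(random_state=seed, **params).fit(X_tr, y_tr)
        acc = accuracy_score(y_val, model.predict(X_val))
        if acc > best_acc:
            best_model, best_params, best_acc = model, params, acc
    return best_model, best_params, best_acc


# Outer 5-fold CV: tune on the 80% development part, score on the held-out 20%.
outer = KFold(n_splits=5, shuffle=True, random_state=0)
fold_accuracies = []
for dev_idx, test_idx in outer.split(X):
    model, _, _ = tune(X[dev_idx], y[dev_idx])
    fold_accuracies.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

cv_accuracy = float(np.mean(fold_accuracies))
print(f"Estimated accuracy: {cv_accuracy:.3f}")
```

The number printed at the end plays the role of the 84% in this post: it is an average over five different tuned models, none of which was trained on all of the data.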
Let’s say that you get 84% accuracy, which is estimated by averaging the performance over the 5 folds. 84% is good enough to deploy the model. But the pragmatic question is: What exactly do you deploy now?
And there’s a more philosophical side to the question: What does the 84% accuracy refer to? Ideally, whatever model we deploy in the end, we want it to be (at least) 84% accurate, but we also want to make good use of all of our data to make the model more robust and maybe even more accurate.
Let’s go through the options:
The Inside-Out Approach: Repeat what you did inside the cross-validation, but now use the entire data. In the example, each CV iteration split its training data into 80% training and 20% validation for tuning. Now we apply the same 80/20 split to all of the data, run the random search, and deploy the model with the best hyperparameters (highest validation accuracy). Note that this validation accuracy is not an unbiased performance estimate, because it’s based on the same data we used to pick the hyperparameters. This approach is consistent with saying that the 84% accuracy applies to the entire model tuning process. Problems with this approach: The final model may have higher variance as it depends on a single training and validation split. We also didn’t use all the data for training the model, since 20% is held back for validation. And because the final model is trained on more data than the models inside the CV, it’s unclear whether the 84% still holds, but typically, performance should be the same or better.
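As a rough sketch, the inside-out approach could look like the following. The data, the search space, and the number of candidates are hypothetical, and `GradientBoostingClassifier` from scikit-learn stands in for XGBoost or CatBoost.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ParameterSampler, train_test_split

# Synthetic stand-in for "all of the data" (assumption).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Hypothetical search space, chosen only for illustration.
SEARCH_SPACE = {
    "n_estimators": [25, 50, 100],
    "learning_rate": [0.03, 0.1, 0.3],
    "max_depth": [2, 3, 4],
}

# Same 80/20 tuning split as inside the CV, but applied to the full data set.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

candidates = []
for params in ParameterSampler(SEARCH_SPACE, n_iter=5, random_state=0):
    model = GradientBoostingClassifier(random_state=0, **params).fit(X_tr, y_tr)
    candidates.append((accuracy_score(y_val, model.predict(X_val)), params, model))

# The winner of the random search is the model that gets deployed.
val_acc, best_params, final_model = max(candidates, key=lambda c: c[0])
print(f"Validation accuracy of the winner (optimistic): {val_acc:.3f}")
print(f"Rows used to fit the final model: {len(X_tr)} of {len(X)}")
```

The validation accuracy printed here was used to choose the winner, so it should not replace the cross-validated estimate.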
The Parameter Donation Approach: Use all of the available data to train the final model. For hyperparameters, use the best hyperparameter configuration from within the cross-validation. Ideally, the random search used the same random seed in all folds so that each hyperparameter configuration is evaluated across all 5 folds and you can pick the one with the best average validation accuracy. While this approach allows you to train the model with the most data, there’s a conceptual problem: the hyperparameter selection is based on training with only 64% of the overall data (0.8 × 0.8: each CV iteration trains on 80% of the data, and 80% of that is used for fitting during tuning). Many hyperparameters control how strongly the model is regularized, but the amount of regularization needed depends on the size of the training data. It’s unclear what happens with the performance, and it might even get worse than the 84%.
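Here is one way parameter donation could be sketched, assuming the same candidate configurations were evaluated in every fold. The data, the search space, and the choice of `GradientBoostingClassifier` as a stand-in for XGBoost or CatBoost are illustrative assumptions, as is picking the configuration by mean validation accuracy.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, ParameterSampler, train_test_split

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Hypothetical search space, chosen only for illustration.
SEARCH_SPACE = {
    "n_estimators": [25, 50, 100],
    "learning_rate": [0.03, 0.1, 0.3],
    "max_depth": [2, 3, 4],
}

# Fixed seed: every fold evaluates the exact same candidate configurations.
configs = list(ParameterSampler(SEARCH_SPACE, n_iter=5, random_state=0))
val_scores = np.zeros((5, len(configs)))

outer = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (dev_idx, _) in enumerate(outer.split(X)):
    X_tr, X_val, y_tr, y_val = train_test_split(
        X[dev_idx], y[dev_idx], test_size=0.2, random_state=0
    )
    for j, params in enumerate(configs):
        model = GradientBoostingClassifier(random_state=0, **params).fit(X_tr, y_tr)
        val_scores[fold, j] = accuracy_score(y_val, model.predict(X_val))

# Donate the configuration with the best mean validation accuracy ...
donated = configs[int(val_scores.mean(axis=0).argmax())]
# ... and refit it on 100% of the data.
final_model = GradientBoostingClassifier(random_state=0, **donated).fit(X, y)

# Share of the data each tuning fit actually saw: 0.8 * 0.8 of the total.
tuning_fraction = len(X_tr) / len(X)
print(f"Donated hyperparameters: {donated}")
print(f"Tuned on {tuning_fraction:.0%} of the data, refit on 100%")
```

The last line makes the mismatch visible: the hyperparameters were selected at one training size and are then reused at a larger one.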
The Ensemble Approach: The 5 models from cross-validation are already your final model. To predict a new data point, simply take the predictions from the five models and average them. This approach is consistent with saying that the evaluation was specific to these five models. Problems with this approach: Deployment becomes more difficult as you now have to deploy an ensemble. It’s also unclear whether the accuracy is still 84% after averaging the predictions for new data, but I would expect the ensemble performance to be equal to or better than for a single model.
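A minimal sketch of the averaging step, assuming the five fold models expose `predict_proba`. To keep it short, the per-fold tuning is skipped and each model uses fixed, hypothetical hyperparameters. The data is synthetic, and `GradientBoostingClassifier` again stands in for XGBoost or CatBoost.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold

# Synthetic stand-in data (assumption); the last 100 rows play the role of new data.
X_all, y_all = make_classification(n_samples=500, n_features=10, random_state=0)
X, y, X_new = X_all[:400], y_all[:400], X_all[400:]

# One model per CV fold, each trained on that fold's 80% training part.
fold_models = []
for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = GradientBoostingClassifier(n_estimators=50, random_state=0)
    fold_models.append(model.fit(X[train_idx], y[train_idx]))


def ensemble_predict_proba(models, X_new):
    """Average the predicted class probabilities of all fold models."""
    return np.mean([m.predict_proba(X_new) for m in models], axis=0)


def ensemble_predict(models, X_new):
    """Predict the class with the highest averaged probability."""
    proba = ensemble_predict_proba(models, X_new)
    return models[0].classes_[proba.argmax(axis=1)]


proba = ensemble_predict_proba(fold_models, X_new)
labels = ensemble_predict(fold_models, X_new)
print(proba[:3])
print(labels[:3])
```

Averaging probabilities rather than hard labels is a design choice here; a majority vote over the five predicted labels would be another reasonable reading of “average the predictions”.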
There are even more approaches:
Lazy approach: just deploy one of the 5 models from cross-validation.
A combination of inside-out and parameter donation: First you follow the inside-out approach and tune on all the data with the same 80/20 training/validation split you used within the CV. Then you take the best hyperparameter configuration and retrain a model using all data for training.
The Fuck-It Approach: Ignore all previous results. Train a random forest using all data. No evaluation. Deploy it. Call it a day.
None of these approaches is perfect, and each has its own trade-offs. It gets even more annoying with more complicated splitting setups like repeated nested cross-validation or conformal prediction.
So unfortunately I don’t have a clear answer on the best way to go from modeling to the final model.
What’s your approach to getting from evaluation to final model?
In neural networks, weight averaging the five models can sometimes work pretty well as an alternative to ensembling (especially if you use techniques like Git Re-Basin).
Keeping a held-out test set to pick the best method among the ones you presented can also help.
Concerning the "The F*-It Approach": I was very unpleasantly surprised to see you toss profanity into your post. I really love your blog but I would respectfully ask you to please keep the language professional.