A Grandmaster's Guide to Machine Learning Challenges
Picking the Right Challenges and Succeeding
It’s been a while since I last participated in a machine learning challenge. But I needed a break from writing, and when I came across a competition to predict water supply, I immediately dove in.
The thing is: I’ve never been that good at these competitions, and I always felt that I approached them the wrong way. But what is the right way?
Fortunately, there are many people who are really good at these challenges. I came across an interview with Christof Henkel on the AI Stories podcast. He is a Kaggle grandmaster with a current rank of #2, and was ranked #1 in the world not so long ago. He has participated in 75 challenges on Kaggle and won first place in 7 of them, which is quite the achievement.
There’s a lot to learn from him.
I recommend listening to the entire podcast, but here are the two points that resonated most with me: how to succeed in a competition and how to pick one.
How to succeed in a competition
First, he takes a high-level look at the data, which also allows him to decide whether to work on the challenge at all (more on that later).
For modeling, he divides his approach into three steps and only moves on to the next step once the current one is finished:
Step 1: Create an end-to-end pipeline
Create a very simple pipeline that reads in the data, creates features, trains a (simple) model, and computes the competition-specific metric. An emphasis on the last point: it’s important to replicate the leaderboard’s validation setup as closely as possible, because this is what allows you to run many experiments without overfitting to the leaderboard.
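To make this concrete, here is a minimal sketch of such a pipeline. Everything specific in it is an assumption on my part, not a detail from the podcast: a hypothetical tabular competition with a train.csv file, a numeric "target" column, and RMSE as the competition metric.

```python
# Minimal end-to-end pipeline: read data, build features, train a
# simple model, and compute the (assumed) competition metric via CV.
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error


def competition_metric(y_true, y_pred):
    """Replicate the leaderboard metric exactly (RMSE assumed here)."""
    return np.sqrt(mean_squared_error(y_true, y_pred))


df = pd.read_csv("train.csv")  # hypothetical file name
X = df.drop(columns=["target"]).select_dtypes("number").fillna(0)
y = df["target"]

# Validation setup: mirror the leaderboard split as closely as possible.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, valid_idx in cv.split(X):
    model = Ridge()  # deliberately simple baseline
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    preds = model.predict(X.iloc[valid_idx])
    scores.append(competition_metric(y.iloc[valid_idx], preds))

print(f"CV score: {np.mean(scores):.4f} +/- {np.std(scores):.4f}")
```

The point is not the model (Ridge is a placeholder) but that the whole loop, from raw file to metric, runs in seconds and scores exactly what the leaderboard scores.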
Step 2: Experiment and research
With the simple model in place, iterate through many ideas. Read research papers, check other competitions, look at the data and maybe even external data, reduce noise, augment the data, use different losses, post-process the predictions, etc. The more you experiment, the better. It’s crucial to stick with the simple model so you can experiment quickly. Rely on your validation setup to evaluate every experiment.
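Continuing the sketch from step 1, an experiment loop could look like this: each idea is a cheap transformation of the data, scored with the exact same validation setup as the baseline. The feature ideas here are hypothetical placeholders.

```python
# Experiment loop: compare ideas with the same CV split and metric.
# Reuses X, y, cv, Ridge, and competition_metric from the step 1 sketch.
def evaluate(X, y):
    """Score one candidate feature set with the fixed validation setup."""
    scores = []
    for train_idx, valid_idx in cv.split(X):
        model = Ridge()  # keep the model simple so experiments stay cheap
        model.fit(X.iloc[train_idx], y.iloc[train_idx])
        preds = model.predict(X.iloc[valid_idx])
        scores.append(competition_metric(y.iloc[valid_idx], preds))
    return np.mean(scores)


experiments = {
    "baseline": lambda X: X,
    "log_features": lambda X: np.log1p(X.clip(lower=0)),
    "drop_constant_cols": lambda X: X.loc[:, X.std() > 1e-6],
}

for name, transform in experiments.items():
    print(name, evaluate(transform(X), y))
```

Because the model and split never change, any score difference between two runs can be attributed to the idea itself.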
Step 3: Scale your approach
At this point, you are converging on your final model and are no longer experimenting. Instead, he talks about “scaling up” the model: using all of the data (if you used a subset before), tuning the model, using a deeper model (he mostly does deep learning competitions), and so on. This only happens in the last 2-3 weeks of a competition.
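As a rough illustration of what scaling up could look like in the tabular sketch above (he mostly scales deep learning models, so this is an analogy, not his method): swap in a stronger model, train on all of the data, and tune hyperparameters against the same validation setup. The model choice and parameter grid are illustrative.

```python
# Step 3 sketch: stronger model, full data, hyperparameter tuning,
# still scored with the same CV split and competition metric.
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

# Lower is better for RMSE, so negate the score for GridSearchCV.
scorer = make_scorer(competition_metric, greater_is_better=False)

search = GridSearchCV(
    HistGradientBoostingRegressor(random_state=42),
    param_grid={"max_depth": [4, 8, None], "learning_rate": [0.05, 0.1]},
    scoring=scorer,
    cv=cv,  # the same split used in steps 1 and 2
)
search.fit(X, y)  # the full dataset, no subsampling at this stage

print(search.best_params_, -search.best_score_)
```

The key detail is that the validation setup stays fixed: only the model capacity and the amount of data grow.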
Hearing about his approach was sobering for me. My former approach was almost the opposite: I often started with an overly complex model and jumped straight to step 3 to optimize it. I would still experiment, but since the model was already complex, each experiment was expensive, so I didn’t run many. I also did not always have a good validation setup, so I optimized against the leaderboard alone. This is problematic: it makes you addicted to climbing the leaderboard and chasing small improvements instead of exploring and experimenting more widely.
Christof Henkel also has, IMHO, good advice on how to pick a challenge. He says that success in a challenge is correlated with how much time you spend on it, meaning you should pick a challenge that motivates you, not the one that seems easiest. It can also make sense to investigate a challenge’s data before deciding on it.
To summarize: Pick something that motivates you. First goal: create a feedback loop that allows you to experiment. Then experiment. Optimize in the end.
Feels like there’s a metaphor for life in that.