4 Comments

I got my start with Kaggle and I have a soft spot for it (plus two t-shirts, a mug and $500). However, although it's a great way to hone core ML skills, Kaggle competitions have 3 big weaknesses as a training curriculum. You address one here, which I'll call "building infrastructure", where infrastructure in this case includes code, hardware and, perhaps most importantly, people.

However, in my experience [admittedly a bit dated at this point] there are also two areas where Kaggle can encourage blind spots: data and metrics. Coming out of a diet of Kaggle competitions can leave you with the idea that data is something that just exists. If you're lucky, as in Kaggle, someone just hands it to you. If not, you just need to shake it out of Bob. However, this is often not how things work: maybe the data doesn't exist. Or if it does, it's poor quality, there isn't enough of it, it has the wrong fields, etc. It's not uncommon that the best thing you can do to improve your model is go get more or better data.

Metrics are another area where Kaggle can give you an oversimplified view of the world. It does expose you to a number of different metrics, which is helpful. But since you are handed the final evaluation metric, you never have to think about how you are going to evaluate your model. In my experience, this can be one of the trickier parts of the modeling process.

Which isn't to say Kaggle is bad. It's great at what it is, but it only trains you on a portion of the modeling process.

In recent competitions, you do need to get external or synthetic data in order to do well, especially in language-related competitions. I do agree with your point about metrics. Selecting the right metric is undoubtedly one of the most challenging aspects, and it's something that doesn't receive enough practice.

Interesting! It's been a while since I participated, but back then you were only permitted to use the datasets that Kaggle provided or, occasionally, those plus a few specified outside datasets. You definitely couldn't go mine your own data or grab arbitrary datasets.

Sounds like an improvement.

Thanks for the insight here. I get a lot of questions about whether those new to ML should try Kaggle.
