Get started with Kaggle and submit a (good) first solution
Kaggle is a fun platform hosting a variety of data science and machine learning competitions — covering topics such as sports, energy or autonomous driving.
In this post we will give an introduction to Kaggle, and tackle the introductory “Titanic” challenge. We will explain how to approach and solve such a challenge, and demonstrate this with a top 7% solution for “Titanic”.
You can find the full code on GitHub, so you can follow along while reading this article and reproduce my exact score. In it, we follow some things I consider best practice for Python and use helpful tools such as mypy and poetry. With that said, let’s dive right in.
Kaggle
Kaggle offers a wide variety of data science / machine learning competitions (see the intro for examples). It is a great way to test and improve your data science / ML knowledge and learn how to solve problems hands-on. Plus, you can even win monetary prizes! However, Kaggle is populated by some of the best data scientists and ML practitioners out there, and prizes are only awarded to the top few solutions (out of several hundred or several thousand). Winning is therefore extremely hard and rare, and should not be your main motivation when starting out.
Most competitions come with a story (a purpose) and a dataset. You are then tasked with understanding the data and solving the stated problem. If you want, you can submit your solutions to the platform and get ranked on a public leaderboard; that is, your solution is scored on a held-out test set. However, to prevent cheating or tuning against this leaderboard by simply spamming submissions, all competitors / teams are ranked against a private test set once the competition period (usually a few weeks to months) has ended, and this ranking decides the final winners.
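To make the public / private distinction concrete, here is a minimal toy sketch of how one submission can receive two different scores. Everything in it is an assumption made for illustration: the helper names `split_test_ids` and `accuracy`, the 50% split, the random partition and the accuracy metric are hypothetical and not taken from Kaggle, which does not expose its hidden labels or its exact split.

```python
import random


def split_test_ids(ids, public_fraction=0.5, seed=0):
    """Randomly partition hidden test ids into a public and a private subset.

    The fraction and the random split are assumptions for this toy example.
    """
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * public_fraction)
    return set(shuffled[:cut]), set(shuffled[cut:])


def accuracy(predictions, truth, ids):
    """Fraction of the given ids whose prediction matches the hidden label."""
    ids = list(ids)
    return sum(predictions[i] == truth[i] for i in ids) / len(ids)


# Toy hidden labels that only the platform would know.
truth = {i: i % 2 for i in range(10)}

# A naive submission that predicts 0 for every row.
submission = {i: 0 for i in range(10)}

public_ids, private_ids = split_test_ids(truth.keys())

# Visible on the leaderboard while the competition is running.
public_score = accuracy(submission, truth, public_ids)

# Only revealed after the deadline; this one decides the final ranking.
private_score = accuracy(submission, truth, private_ids)

print(f"public: {public_score:.2f}, private: {private_score:.2f}")
```

Because the two subsets do not overlap, a solution tuned heavily against the public score can still do worse on the private one.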
In the following, we will show how to understand the data, create a model, and submit to Kaggle, using the introductory Titanic competition as our example.