Seven Common Causes of Data Leakage in Machine Learning

Key steps in data preprocessing, feature engineering, and train-test splitting to prevent data leakage

When I was evaluating AI tools like ChatGPT, Claude, and Gemini for machine learning use cases in my last article, I encountered a critical pitfall: data leakage. These AI models created new features using the entire dataset before splitting it into training and test sets, which is a common cause of data leakage. However, this is not just an AI mistake; humans often make it too.

Data leakage in machine learning happens when information from outside the training dataset seeps into the model-building process. This leads to inflated performance metrics and models that fail to generalize to unseen data. In this article, I’ll walk through seven common causes of data leakage, so that you don’t make the same mistakes as AI 🙂
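To make this concrete, here is a minimal sketch of the split-before-transform rule, using scikit-learn's StandardScaler on synthetic data (the arrays, sizes, and random seed are illustrative assumptions, not from the original experiment):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))       # hypothetical feature matrix
y = rng.integers(0, 2, size=1000)    # hypothetical binary labels

# Leaky: the scaler is fitted on ALL rows, so its mean and std
# encode statistics from the future test set.
X_scaled = StandardScaler().fit_transform(X)
X_train_leaky, X_test_leaky, _, _ = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

# Correct: split first, fit the scaler on the training set only,
# then apply the already-fitted transform to the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
```

The same split-first rule applies to any transformation that is fitted on the data, including imputers, encoders, and engineered aggregate features.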


Problem Setup

To better explain data leakage, let’s consider a hypothetical machine learning use case:

Imagine you’re a data scientist at a major credit card company like American Express. Each day, millions of transactions are processed, and inevitably, some of them are fraudulent. Your job is to build a model that can detect fraud in real time…
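As a running sketch of the data involved, the raw input might look something like the toy table below (the schema and column names are purely hypothetical; real transaction tables will differ):

```python
import pandas as pd

# Hypothetical transactions table for the fraud-detection example.
transactions = pd.DataFrame({
    "transaction_id": [1, 2, 3],
    "timestamp": pd.to_datetime(
        ["2024-01-01 09:15", "2024-01-01 09:17", "2024-01-01 09:20"]
    ),
    "amount": [42.50, 1999.99, 8.75],
    "merchant_category": ["grocery", "electronics", "coffee"],
    "is_fraud": [0, 1, 0],  # label: 1 = fraudulent, 0 = legitimate
})
```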