Data Leakage in Preprocessing, Explained: A Visual Guide with Code Examples

DATA PREPROCESSING

10 sneaky ways your preprocessing pipeline leaks

In my experience teaching machine learning, students often come to me with the same problem: “My model was performing great — over 90% accuracy! But when I submitted it for testing on the hidden dataset, it performed much worse. What went wrong?” This situation almost always points to data leakage.

Data leakage happens when information from test data sneaks (or leaks) into your training data during data preparation steps. This often happens during routine data processing tasks without you noticing it. When this happens, the model learns from test data it wasn’t supposed to see, making the test results misleading.

Let’s look at common preprocessing steps and see exactly what happens when data leaks — hopefully, you can avoid these “pipeline issues” in your own projects.

All visuals: Author-created using Canva Pro.

Definition

Data leakage is a common problem in machine learning that occurs when data that’s not supposed to be seen by a model (like test data or future data) is accidentally used to train the model. This makes evaluation results misleadingly optimistic: the model looks strong on the contaminated test set but performs worse on genuinely new, unseen data.

Now, let’s focus on data leakage during the following data preprocessing steps. For each step, we’ll also name the specific scikit-learn preprocessing methods involved, and full code examples appear at the very end of this article.

Missing Value Imputation

When working with real data, you often run into missing values. Rather than removing these incomplete data points, we can fill them in with reasonable estimates. This helps us keep more data for analysis.

Simple ways to fill missing values include:

  1. Using SimpleImputer(strategy='mean') or SimpleImputer(strategy='median') to fill with the average or middle value from that column
  2. Using KNNImputer() to look at similar data points and use their values
  3. Using pandas’ ffill() or bfill() to fill with the value that comes before or after in the data (scikit-learn’s SimpleImputer does not offer these strategies)
  4. Using SimpleImputer(strategy='constant', fill_value=value) to replace all missing spots with the same number or text

This process is called imputation, and while it’s useful, we need to be careful about how we calculate these replacement values to avoid data leakage.

Data Leakage Case: Simple Imputation (Mean)

When you fill missing values using the mean from all your data, the mean value itself contains information from both training and test sets. This combined mean value is different from what you would get using just the training data. Since this different mean value goes into your training data, your model learns from test data information it wasn’t supposed to see. To summarize:

🚨 THE ISSUE
Computing mean values using complete dataset

What We’re Doing Wrong
Calculating fill values using both training and test set statistics

💥 The Consequence
Training data contains averaged values influenced by test data

Mean imputation leakage occurs when filling missing values using the average (4) calculated from all data rows, instead of correctly using only the training data’s average (3), leading to wrong fill values.
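
To make this concrete, here’s a minimal sketch with SimpleImputer; the toy values below are invented for illustration and aren’t the ones from the figure:

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy column with missing entries (values invented for illustration)
X = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0], [np.nan]])
X_train, X_test = train_test_split(X, test_size=0.33, random_state=42)

# Leaky: the fill value is the mean of ALL rows, including test rows
leaky_imputer = SimpleImputer(strategy='mean').fit(X)

# Correct: learn the mean from the training rows only, then reuse it on the test set
imputer = SimpleImputer(strategy='mean').fit(X_train)
X_train_filled = imputer.transform(X_train)
X_test_filled = imputer.transform(X_test)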

Data Leakage Case: KNN Imputation

When you fill missing values using KNN on all your data, the algorithm finds similar data points from both training and test sets. The replacement values it creates are based on these nearby points, which means test set values directly influence what goes into your training data. Since KNN looks at actual nearby values, this mixing of training and test information is even more direct than using simple mean imputation. To summarize:

🚨 THE ISSUE
Finding neighbors across complete dataset

What We’re Doing Wrong
Using test set samples as potential neighbors for imputation

💥 The Consequence
Missing values filled using direct test set information

KNN imputation leakage occurs when finding nearest neighbors using both training and test data (resulting in values 3.5 and 4.5), instead of correctly using only training data patterns to impute missing values (resulting in values 6 and 6).
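
Here’s a minimal sketch of the leak-free way to do this with KNNImputer (the values and n_neighbors=2 are arbitrary choices for illustration):

import numpy as np
from sklearn.impute import KNNImputer

# Toy feature matrix (values invented for illustration); np.nan marks missing entries
X_train = np.array([[1.0, 2.0], [2.0, np.nan], [8.0, 8.0], [9.0, 9.0]])
X_test = np.array([[np.nan, 3.0], [7.0, 7.0]])

# Correct: neighbors are searched only among training rows
imputer = KNNImputer(n_neighbors=2).fit(X_train)
X_train_filled = imputer.transform(X_train)
X_test_filled = imputer.transform(X_test)

# Leaky alternative: KNNImputer(...).fit(np.vstack([X_train, X_test]))
# would let test rows act as neighbors when filling training rows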

Categorical Encoding

Some data comes as categories instead of numbers — like colors, names, or types. Since models can only work with numbers, we need to convert these categories into numerical values.

Common ways to convert categories include:

  1. Using OneHotEncoder() to create separate columns of 1s and 0s for each category (also known as dummy variables)
  2. Using OrdinalEncoder() or LabelEncoder() to assign each category a number (like 1, 2, 3)
  3. Using OrdinalEncoder(categories=[ordered_list]) with custom category orders to reflect natural hierarchy (like small=1, medium=2, large=3)
  4. Using TargetEncoder() to convert categories to numbers based on their relationship with the target variable we’re trying to predict

The way we convert these categories can affect how well our model learns, and we need to be careful about using information from test data during this process.

Data Leakage Case: Target Encoding

When you convert categorical values using target encoding on all your data, the encoded values are calculated using the target information from both training and test sets. The numbers that replace each category are averages of target values that include test data. This means your training data gets assigned values that already contain information about the target values from the test set that it wasn’t supposed to know about. To summarize:

🚨 THE ISSUE
Computing category means using complete dataset

What We’re Doing Wrong
Calculating category replacements using all target values

💥 The Consequence
Training features contain future target information

Target encoding leakage occurs when replacing categories with their average target values (A=3, B=4, C=2) using all the data, instead of correctly using only training data averages (A=2, B=5, C=1), leading to wrong category values.
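
Here’s a minimal sketch using scikit-learn’s TargetEncoder (available since scikit-learn 1.3); the categories and target values are made up for illustration:

import pandas as pd
from sklearn.preprocessing import TargetEncoder  # requires scikit-learn >= 1.3

# Toy data (categories and targets invented for illustration)
X_train = pd.DataFrame({'color': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C']})
y_train = [1, 0, 1, 1, 1, 0, 0, 0, 1, 0]
X_test = pd.DataFrame({'color': ['A', 'C']})

# Correct: category-to-target-mean mappings are learned from training rows only
# (fit_transform also cross-fits internally to reduce target leakage within the training set)
encoder = TargetEncoder()
X_train_enc = encoder.fit_transform(X_train, y_train)
X_test_enc = encoder.transform(X_test)

# Leaky alternative: calling fit_transform on the full dataset would bake
# test-set target values into the numbers used as training features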

Data Leakage Case: One-Hot Encoding

When you fit one-hot encoding on all your data, the set of binary columns is determined by every unique category found in both training and test sets. Categories that appear only in the test set still get their own columns, so your training feature space is partly defined by values the model was never supposed to see. The encoder should instead learn its categories from the training data alone and handle unseen test categories explicitly. To summarize:

🚨 THE ISSUE
Determining categories from complete dataset

What We’re Doing Wrong
Creating binary columns based on all unique values

💥 The Consequence
Training feature columns shaped by test set categories

One-hot encoding leakage occurs when creating category columns using all unique values (A,B,C,D) from the full dataset, instead of correctly using only categories present in training data (A,B,C), leading to wrong encoding patterns.
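
A minimal sketch with OneHotEncoder; the 'snow' category is an invented example of a value that appears only in the test set:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X_train = pd.DataFrame({'outlook': ['sunny', 'rain', 'overcast', 'sunny']})
X_test = pd.DataFrame({'outlook': ['rain', 'snow']})  # 'snow' never appears in training

# Correct: the column set is defined by training categories only;
# unseen test categories are encoded as all zeros instead of getting new columns
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False).fit(X_train)
X_train_enc = encoder.transform(X_train)
X_test_enc = encoder.transform(X_test)

# Leaky alternative: fitting on the full dataset would create a 'snow' column
# that exists only because of the test set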

Data Scaling

Different features in your data often have very different ranges — some might be in thousands while others are tiny decimals. We adjust these ranges so all features have similar scales, which helps models work better.

Common ways to adjust scales include:

  1. Using StandardScaler() to make values center around 0 with most falling between -1 and 1 (mean=0, variance=1)
  2. Using MinMaxScaler() to squeeze all values between 0 and 1, or MinMaxScaler(feature_range=(min, max)) for a custom range
  3. Using FunctionTransformer(np.log1p) or PowerTransformer(method='box-cox') to handle very large numbers and make distributions more normal
  4. Using RobustScaler() to adjust scales using statistics that aren’t affected by outliers (using quartiles instead of mean/variance)

While scaling helps models compare different features fairly, we need to calculate these adjustments using only training data to avoid leakage.

Data Leakage Case: Standard Scaling

When you standardize features using all your data, the average and spread values used in the calculation come from both training and test sets. These values are different from what you would get using just the training data. This means every standardized value in your training data is adjusted using information about the distribution of values in your test set. To summarize:

🚨 THE ISSUE
Computing statistics using complete dataset

What We’re Doing Wrong
Calculating mean and standard deviation using all values

💥 The Consequence
Training features scaled using test set distribution

Standard scaling leakage occurs when using the full dataset’s average (μ=0) and spread (σ=3) to normalize data, instead of correctly using only training data’s statistics (μ=2, σ=2), leading to wrong standardized values.
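
A minimal sketch with StandardScaler; the two toy training values are chosen to echo the figure’s training statistics (μ=2, σ=2), and the test values are invented:

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[0.0], [4.0]])   # toy training values: mean 2, std 2
X_test = np.array([[-5.0], [5.0]])

# Correct: the mean and standard deviation come from the training set only
scaler = StandardScaler().fit(X_train)
print(scaler.mean_, scaler.scale_)   # statistics learned from X_train alone
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Leaky alternative: StandardScaler().fit(np.vstack([X_train, X_test]))
# would shift and scale training rows using test-set statistics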

Data Leakage Case: Min-Max Scaling

When you scale features using minimum and maximum values from all your data, these boundary values might come from your test set. The scaled values in your training data are calculated using these bounds, which could be different from what you’d get using just training data. This means every scaled value in your training data is adjusted using the full range of values from your test set. To summarize:

🚨 THE ISSUE
Finding bounds using complete dataset

What We’re Doing Wrong
Determining min/max values from all data points

💥 The Consequence
Training features normalized using test set ranges

Min-max scaling leakage occurs when using the full dataset’s minimum (-5) and maximum (5) values to scale data, instead of correctly using only training data’s range (min=-1, max=5), leading to wrong scaling of values.
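
A minimal sketch with MinMaxScaler, using toy values that echo the figure (training range -1 to 5, full-data range -5 to 5):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[-1.0], [2.0], [5.0]])  # training range is [-1, 5]
X_test = np.array([[-5.0], [5.0]])          # the test set would widen the range to [-5, 5]

# Correct: the min and max come from the training set only
scaler = MinMaxScaler().fit(X_train)
print(scaler.data_min_, scaler.data_max_)   # [-1.] [5.]
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)    # test values outside [-1, 5] simply fall outside [0, 1]

# Leaky alternative: fitting on the full dataset would squeeze training values
# into a range dictated partly by the test set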

Discretization

Sometimes it’s better to group numbers into categories rather than use exact values. This helps machine learning models to process and analyze the data more easily.

Common ways to create these groups include:

  1. Using KBinsDiscretizer(strategy='uniform') to make each group cover the same size range of values
  2. Using KBinsDiscretizer(strategy='quantile') to make each group contain the same number of data points
  3. Using KBinsDiscretizer(strategy='kmeans') to find natural groupings in the data using clustering
  4. Using QuantileTransformer(n_quantiles=n, output_distribution='uniform') to create groups based on percentiles in your data

While grouping values can help models find patterns better, the way we decide group boundaries needs to use only training data to avoid leakage.

Data Leakage Case: Equal Frequency Binning

When you create bins with equal numbers of data points using all your data, the cutoff points between bins are determined using both training and test sets. These cutoff values are different from what you’d get using just training data. This means when you assign data points to bins in your training data, you’re using dividing points that were influenced by your test set values. To summarize:

🚨 THE ISSUE
Setting thresholds using complete dataset

What We’re Doing Wrong
Determining bin boundaries using all data points

💥 The Consequence
Training data binned using test set distributions

Equal frequency binning leakage occurs when setting bin cutoff points (-0.5, 2.5) using all the data, instead of correctly using only training data to set boundaries (-0.5, 2.0), leading to wrong grouping of values.
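
A minimal sketch of leak-free equal frequency binning with KBinsDiscretizer(strategy='quantile'); the values and bin count are arbitrary illustration choices:

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X_train = np.array([[-2.0], [-0.5], [1.0], [2.0], [3.0], [4.0]])  # toy training values
X_test = np.array([[2.5], [6.0]])

# Correct: the quantile (equal-frequency) cutoffs come from training data only
binner = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile').fit(X_train)
print(binner.bin_edges_)                   # edges computed from X_train alone
X_train_binned = binner.transform(X_train)
X_test_binned = binner.transform(X_test)   # test values are mapped onto training-derived bins

# Leaky alternative: fitting on the combined data would move the cutoffs
# to balance counts across training AND test points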

Data Leakage Case: Equal Width Binning

When you create bins of equal size using all your data, the range used to determine bin widths comes from both training and test sets. This total range could be wider or narrower than what you’d get using just training data. This means when you assign data points to bins in your training data, you’re using bin boundaries that were calculated based on the full spread of your test set values. To summarize:

🚨 THE ISSUE
Calculating ranges using complete dataset

What We’re Doing Wrong
Setting bin widths based on full data spread

💥 The Consequence
Training data binned using test set boundaries

Equal width binning leakage occurs when splitting data into equal-size groups using the full dataset’s range (-3 to 6), instead of correctly using only the training data’s range (-3 to 3), leading to wrong groupings.
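
The equal width case looks almost identical, only with strategy='uniform'; the values here echo the figure’s training range (-3 to 3) but are otherwise illustrative:

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X_train = np.array([[-3.0], [0.0], [3.0]])  # training range is -3 to 3
X_test = np.array([[6.0]])                  # test value outside the training range

# Correct: the equal-width edges span only the training range (-3 to 3)
binner = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform').fit(X_train)
X_test_binned = binner.transform(X_test)    # out-of-range test values land in the last bin

# Leaky alternative: fitting on the full range (-3 to 6) would widen every bin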

Resampling

When some categories in your data have many more examples than others, we can balance them using resampling techniques from imblearn by either creating new samples or removing existing ones. This helps models learn all categories fairly.

Common ways to add samples (Oversampling):

  1. Using RandomOverSampler() to make copies of existing examples from smaller categories
  2. Using SMOTE() to create new, synthetic examples for smaller categories using interpolation
  3. Using ADASYN() to create more examples in areas where the model struggles most, focusing on decision boundaries

Common ways to remove samples (Undersampling):

  1. Using RandomUnderSampler() to randomly remove examples from larger categories
  2. Using NearMiss(version=1) or NearMiss(version=2) to remove examples from larger categories based on their distance to smaller categories
  3. Using TomekLinks() or EditedNearestNeighbours() to carefully select which examples to remove based on their similarity to other categories

While balancing your data helps models learn better, the process of creating or removing samples should only use information from training data to avoid leakage.

Data Leakage Case: Oversampling (SMOTE)

When you create synthetic data points using SMOTE on all your data, the algorithm picks nearby points from both training and test sets to create new samples. These new points are created by mixing values from test set samples with training data. This means your training data gets new samples that were directly created using information from your test set values. To summarize:

🚨 THE ISSUE
Generating samples using complete dataset

What We’re Doing Wrong
Creating synthetic points using test set neighbors

💥 The Consequence
Training augmented with test-influenced samples

Oversampling leakage occurs when duplicating data points based on class counts from the entire dataset (A×4, B×3, C×2), instead of correctly using only the training data (A×1, B×2, C×2) to decide how many times to duplicate each class.
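
A minimal sketch of leak-free oversampling with SMOTE; the toy points, class labels, and k_neighbors=3 are invented for illustration:

import numpy as np
from imblearn.over_sampling import SMOTE

# Toy imbalanced training data (values invented for illustration)
X_train = np.array([[1, 1], [2, 1], [1, 2], [2, 2], [3, 3], [8, 8], [9, 9], [8, 9], [9, 8]])
y_train = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1])

# Correct: synthetic minority samples are interpolated between training neighbors only,
# and only the training set is resampled
smote = SMOTE(k_neighbors=3, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# The test set is never resampled; resampling before the train/test split (or resampling
# the combined data) would let test points shape the synthetic training samples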

Data Leakage Case: Undersampling (Tomek Links)

When you remove data points using Tomek Links on all your data, the algorithm finds pairs of points from both training and test sets that are closest to each other but have different labels. The decision to remove points from your training data is based on how close they are to test set points. This means your final training data is shaped by its relationship with test set values. To summarize:

🚨 THE ISSUE
Removing samples using complete dataset

What We’re Doing Wrong
Identifying pairs using test set relationships

💥 The Consequence
Training reduced based on test set patterns

Undersampling leakage occurs when removing data points based on class ratios from the entire dataset (A×4, B×3, C×2), instead of correctly using only the training data (A×1, B×2, C×2) to decide how many samples to keep from each class.
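
And a minimal sketch of leak-free undersampling with TomekLinks; the toy points and labels are invented for illustration:

import numpy as np
from imblearn.under_sampling import TomekLinks

# Toy training data where one majority point sits right next to a minority point
X_train = np.array([[1.0], [1.5], [2.0], [5.0], [5.2], [8.0]])
y_train = np.array([0, 0, 0, 0, 1, 1])

# Correct: Tomek links are searched among training points only,
# so removals never depend on test-set neighbors
tl = TomekLinks()
X_cleaned, y_cleaned = tl.fit_resample(X_train, y_train)

# As with SMOTE, keep this step inside an imblearn Pipeline (or apply it after the split)
# so the test set is never involved in deciding which samples to drop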

Final Remarks

When preprocessing data, you need to keep training and test data completely separate. Any time you use information from all your data to transform values — whether you’re filling missing values, converting categories to numbers, scaling features, creating bins, or balancing classes — you risk mixing test data information into your training data. This makes your model’s test results unreliable because the model already learned from patterns it wasn’t supposed to see.

The solution is simple: fit every transformation on your training data only, save the learned parameters (means, ranges, category mappings, bin edges), and then apply them unchanged to your test data.

🌟 Data Preprocessing + Classification (with Leakage) Code Summary

Let’s see how leakage could happen when predicting a simple golf-play dataset. The following is the bad example: it should not be followed, and is shown purely for demonstration and education purposes.

import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, KBinsDiscretizer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE

# Create dataset
dataset_dict = {
'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast', 'sunny', 'sunny', 'rain', 'sunny', 'overcast', 'overcast', 'rain', 'sunny', 'overcast', 'rain', 'sunny', 'sunny', 'rain', 'overcast', 'rain', 'sunny', 'overcast', 'sunny', 'overcast', 'rain', 'overcast'],
'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}
df = pd.DataFrame(dataset_dict)
X, y = df.drop('Play', axis=1), df['Play']

# Preprocess AND apply SMOTE to ALL data first (causing leakage)
preprocessor = ColumnTransformer(transformers=[
    ('temp_transform', Pipeline([
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler()),
        ('discretizer', KBinsDiscretizer(n_bins=4, encode='ordinal'))
    ]), ['Temperature']),
    ('humid_transform', Pipeline([
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler()),
        ('discretizer', KBinsDiscretizer(n_bins=4, encode='ordinal'))
    ]), ['Humidity']),
    ('outlook_transform', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1),
     ['Outlook']),
    ('wind_transform', Pipeline([
        ('imputer', SimpleImputer(strategy='constant', fill_value=False)),
        ('scaler', StandardScaler())
    ]), ['Wind'])
])

# Transform all data and apply SMOTE before splitting (leakage!)
X_transformed = preprocessor.fit_transform(X)
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_transformed, y)

# Split the already transformed and resampled data
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.5, shuffle=False)

# Train a classifier
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

print(f"Testing Accuracy (with leakage): {accuracy_score(y_test, clf.predict(X_test)):.2%}")

The code above uses ColumnTransformer, a scikit-learn utility that lets us apply different preprocessing steps to different columns in a dataset.

Here’s a breakdown of the preprocessing strategy for each column in the dataset:

Temperature:
– Mean imputation to handle any missing values
– Standard scaling to normalize the values (mean=0, variance=1)
– Discretization into 4 bins; note that KBinsDiscretizer defaults to strategy='quantile', so each bin holds roughly the same number of points rather than covering an equal-width interval

Humidity:
– Same strategy as Temperature: mean imputation → standard scaling → quantile discretization (4 bins)

Outlook (categorical):
– Ordinal encoding: converts categorical values into numerical ones
– Unknown values are handled by setting them to -1

Wind (binary):
– Constant imputation with False for missing values
– Standard scaling to normalize the 0/1 values

Play (target):
– Kept as Yes/No strings; scikit-learn estimators and SMOTE accept string class labels directly, so no explicit label encoding is applied in the code
– SMOTE applied after preprocessing to balance classes by creating synthetic examples of the minority class
– A simple decision tree is used to predict the target

The entire pipeline demonstrates data leakage because all transformations see the entire dataset during fitting, which would be inappropriate in a real machine learning scenario where we need to keep test data completely separate from the training process.

This approach will also likely show artificially higher test accuracy because the test data characteristics were used in the preprocessing steps!

🌟 Data Preprocessing + Classification (without leakage) Code Summary

Here’s the version without data leakage:

import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, KBinsDiscretizer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE

# Create dataset
dataset_dict = {
'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast', 'sunny', 'sunny', 'rain', 'sunny', 'overcast', 'overcast', 'rain', 'sunny', 'overcast', 'rain', 'sunny', 'sunny', 'rain', 'overcast', 'rain', 'sunny', 'overcast', 'sunny', 'overcast', 'rain', 'overcast'],
'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}
df = pd.DataFrame(dataset_dict)
X, y = df.drop('Play', axis=1), df['Play']

# Split first (before any processing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, shuffle=False)

# Create pipeline with preprocessing, SMOTE, and classifier
pipeline = Pipeline([
    ('preprocessor', ColumnTransformer(transformers=[
        ('temp_transform', Pipeline([
            ('imputer', SimpleImputer(strategy='mean')),
            ('scaler', StandardScaler()),
            ('discretizer', KBinsDiscretizer(n_bins=4, encode='ordinal'))
        ]), ['Temperature']),
        ('humid_transform', Pipeline([
            ('imputer', SimpleImputer(strategy='mean')),
            ('scaler', StandardScaler()),
            ('discretizer', KBinsDiscretizer(n_bins=4, encode='ordinal'))
        ]), ['Humidity']),
        ('outlook_transform', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1),
         ['Outlook']),
        ('wind_transform', Pipeline([
            ('imputer', SimpleImputer(strategy='constant', fill_value=False)),
            ('scaler', StandardScaler())
        ]), ['Wind'])
    ])),
    # k_neighbors is reduced from the default 5 because the training half of this small
    # dataset has only 5 minority-class samples (SMOTE needs k_neighbors < minority count)
    ('smote', SMOTE(k_neighbors=4, random_state=42)),
    ('classifier', DecisionTreeClassifier(random_state=42))
])

# Fit pipeline on training data only
pipeline.fit(X_train, y_train)

print(f"Training Accuracy: {accuracy_score(y_train, pipeline.predict(X_train)):.2%}")
print(f"Testing Accuracy: {accuracy_score(y_test, pipeline.predict(X_test)):.2%}")

Key differences from the leakage version

  1. Split data first, before any processing
  2. All transformations (preprocessing, SMOTE) are inside the pipeline
  3. Pipeline ensures:
    – Preprocessing parameters learned only from training data
    – SMOTE applies only to training data
    – Test data remains completely unseen until prediction

This approach gives more realistic performance estimates as it maintains proper separation between training and test data.

Technical Environment

This article uses Python 3.7, scikit-learn 1.5, and imblearn 0.12. While the concepts discussed are generally applicable, specific code implementations may vary slightly with different versions.

About the Illustrations

Unless otherwise noted, all images are created by the author, incorporating licensed design elements from Canva Pro.