Dummy Regressor, Explained: A Visual Guide with Code Examples for Beginners

REGRESSION ALGORITHM

Naively choosing the best number for all of your predictions

My students often come to me wanting to try the most sophisticated model out there for their machine learning tasks, and sometimes I jokingly ask, “Have you tried the best ever model first?” Especially in regression (where we don’t have that “100% accuracy” goal), some machine learning models seem to get a good low error score, but when you compare them with the dummy model, the score is actually… not that great.

So, here’s the dummy regressor. Just like classification, regression has its own baseline model: the first model you should try, to get a rough idea of how much better your machine learning model could be.

All visuals: Author-created using Canva Pro. Optimized for mobile; may appear oversized on desktop.

Definition

A dummy regressor is a simple machine learning model that predicts numerical values using basic rules, without actually learning from the input data. Like its classification counterpart, it serves as a baseline for comparing the performance of more complex regression models. The dummy regressor helps us understand if our models are actually learning useful patterns or just making naive predictions.

Dummy Regressor is the simplest machine learning model imaginable.

📊 Dataset & Libraries

Throughout this article, we’ll use this simple artificial golf dataset (again, inspired by [1]) as an example. The task is to predict the number of golfers visiting our golf course. The dataset includes features like outlook, temperature, humidity, and wind, with the target variable being the number of golfers.

Columns: ‘Outlook’, ‘Temperature’ (in Fahrenheit), ‘Humidity’ (in %), ‘Wind’ (Yes/No) and ‘Number of Players’ (numerical, target variable)
# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Create dataset
dataset_dict = {
'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast', 'sunny', 'sunny', 'rain', 'sunny', 'overcast', 'overcast', 'rain', 'sunny', 'overcast', 'rain', 'sunny', 'sunny', 'rain', 'overcast', 'rain', 'sunny', 'overcast', 'sunny', 'overcast', 'rain', 'overcast'],
'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
'Num_Players': [52,39,43,37,28,19,43,47,56,33,49,23,42,13,33,29,25,51,41,14,34,29,49,36,57,21,23,41]
}
df = pd.DataFrame(dataset_dict)

# One-hot encode 'Outlook' column
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)

# Convert 'Wind' column to binary
df['Wind'] = df['Wind'].astype(int)

# Split data into features and target, then into training and test sets
X, y = df.drop(columns='Num_Players'), df['Num_Players']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

Evaluating Regression Results

Before getting into the dummy regressor itself, let’s recap how to evaluate a regression result. In classification, checking the accuracy of a model is very intuitive (just take the ratio of matching values), but in regression it is a bit different.

RMSE (root mean squared error) is like a score for regression models. It is the square root of the average squared difference between predictions and actual values, so it tells us how far off our predictions typically are. Just as we want high accuracy in classification to get more right answers, we want a low RMSE in regression to be closer to the true values.

People like using RMSE because its value is in the same unit as what we’re trying to predict.

An RMSE of 3 can be read as predictions typically being off by about 3, although individual errors can be larger or smaller.
from sklearn.metrics import mean_squared_error

y_true = np.array([10, 15, 20, 15, 10]) # True values
y_pred = np.array([15, 11, 18, 14, 10]) # Predicted values

# Calculate RMSE as the square root of scikit-learn's MSE
# (the squared=False argument is deprecated since scikit-learn 1.4)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

print(f"RMSE = {rmse:.2f}")
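To make the metric less of a black box, here is a minimal sketch that recomputes the same RMSE by hand with only the Python standard library. The variable names are illustrative. It also shows that RMSE describes a typical error size rather than a hard limit: the largest single miss in this example is bigger than the RMSE.

```python
import math

y_true = [10, 15, 20, 15, 10]  # True values (same numbers as the example above)
y_pred = [15, 11, 18, 14, 10]  # Predicted values

# Squared difference between each true value and its prediction
squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]

# Mean of the squared errors, then the square root
mse = sum(squared_errors) / len(squared_errors)
rmse = math.sqrt(mse)

# Largest individual miss, for comparison with the RMSE
max_abs_error = max(abs(t - p) for t, p in zip(y_true, y_pred))

print(f"Squared errors = {squared_errors}")  # [25, 16, 4, 1, 0]
print(f"RMSE = {rmse:.2f}")                  # 3.03
print(f"Largest error = {max_abs_error}")    # 5
```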

With that in mind, let’s get into the algorithm.

Main Mechanism

Dummy Regressor makes predictions based on simple rules, such as always returning the mean or median of the target values in the training data.

For our golf dataset, a dummy regressor might always predict “40.5” for the number of players, as that is the median of the training labels.
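The mechanism is small enough to write from scratch. The sketch below is not scikit-learn's implementation; `MedianBaseline` is a hypothetical class name used only for illustration, and the placeholder features are an assumption made to show that the inputs play no role.

```python
import statistics


class MedianBaseline:
    """Hypothetical stand-in for a dummy regressor using the median strategy."""

    def fit(self, X, y):
        # The features X are accepted but never looked at
        self.constant_ = statistics.median(y)
        return self

    def predict(self, X):
        # One identical prediction per row, whatever the row contains
        return [self.constant_ for _ in X]


# Training labels: first half of the golf dataset's Num_Players column
y_train = [52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13]
X_train = [[0]] * len(y_train)  # Placeholder features; their values do not matter

model = MedianBaseline().fit(X_train, y_train)
predictions = model.predict([[85.0, 85.0], [64.0, 65.0]])
print(predictions)  # [40.5, 40.5]
```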

Training Steps

It’s a bit of a lie to say that there’s any training process in a dummy regressor, but anyway, here’s a general outline:

1. Select Strategy

Choose one of the following strategies:

  • Mean: Always predicts the mean of the training target values.
  • Median: Always predicts the median of the training target values.
  • Constant: Always predicts a constant value provided by the user.
Depending on the strategy, the dummy regressor makes a different numerical prediction.
from sklearn.dummy import DummyRegressor

# Choose a strategy for your DummyRegressor ('mean', 'median', 'constant')
strategy = 'median'
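To get a feel for how the strategies differ, the following sketch computes the single number each one would predict for the training half of the golf dataset, using only the standard library. The constant value of 40 is a hypothetical user choice and does not come from the article.

```python
import statistics

# Training labels: first half of the golf dataset's Num_Players column
y_train = [52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13]

# The single number each strategy would predict for every input
candidates = {
    "mean": statistics.mean(y_train),
    "median": statistics.median(y_train),
    "constant": 40,  # Hypothetical user-chosen value, for illustration only
}

for name, value in candidates.items():
    print(f"{name:>8}: {value:.2f}")
# Prints 37.43 for mean, 40.50 for median, 40.00 for constant
```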

2. Calculate the Statistic

Calculate either the mean or the median of the training targets, depending on your strategy.

The algorithm simply calculates the median of the training targets; in this case we get 40.5.
# Initialize the DummyRegressor
dummy_reg = DummyRegressor(strategy=strategy)

# "Train" the DummyRegressor (although no real training happens)
dummy_reg.fit(X_train, y_train)
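After fitting, the only thing the model has "learned" is one number. As a small self-contained sketch, the code below fits a DummyRegressor on the training labels and reads that number back from its fitted `constant_` attribute. The all-zero feature matrix is a placeholder assumption, chosen to underline that the feature values are irrelevant.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Training labels: first half of the golf dataset's Num_Players column
y_train = np.array([52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13])
# Placeholder features: DummyRegressor ignores the values entirely
X_train = np.zeros((len(y_train), 1))

dummy_reg = DummyRegressor(strategy="median")
dummy_reg.fit(X_train, y_train)

# constant_ holds the fitted value with shape (1, n_outputs)
stored_value = float(dummy_reg.constant_.ravel()[0])
print(f"Stored prediction: {stored_value}")  # 40.5
```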

3. Apply Strategy to Test Data

Use the chosen strategy to generate a list of predicted numerical labels for your test data.

If we choose the “median” strategy, the calculated median (40.5) will simply be the prediction for everything.
# Use the DummyRegressor to make predictions
y_pred = dummy_reg.predict(X_test)
# tolist() converts NumPy scalars to plain Python numbers for clean printing
print("Label     :", y_test.tolist())
print("Prediction:", y_pred.tolist())

Evaluate the Model

The dummy regressor with this strategy gives an RMSE of 13.28, which serves as the baseline for future models.
# Evaluate the Dummy Regressor's error
from sklearn.metrics import mean_squared_error

# RMSE is the square root of the MSE
# (the squared=False argument is deprecated since scikit-learn 1.4)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Dummy Regression Error: {rmse:.2f}")

Key Parameters

There are only two key parameters in the dummy regressor:

  1. Strategy: This determines how the regressor makes predictions. Common options include:
    mean: Provides an average baseline, commonly used for general scenarios.
    median: More robust against outliers, good for skewed target distributions.
    constant: Useful when domain knowledge suggests a specific constant prediction.
  2. Constant: When using the ‘constant’ strategy, this parameter specifies the value to always predict.
Regardless of the strategy used, the results are all similarly bad, but our next regression model should certainly reach an RMSE below these baselines: 13.28 for the median strategy and about 12.2 for the mean strategy on this split.
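As a rough sketch of how these parameters behave in practice, the code below fits one DummyRegressor per strategy on the same half-and-half split and prints each RMSE. The constant value of 40 is a hypothetical guess chosen for illustration, and the all-zero features are placeholders, since the model never reads them.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Num_Players from the golf dataset, split in half without shuffling
y_train = np.array([52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13])
y_test = np.array([33, 29, 25, 51, 41, 14, 34, 29, 49, 36, 57, 21, 23, 41])
# Placeholder features: DummyRegressor never reads their values
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

configs = {
    "mean": DummyRegressor(strategy="mean"),
    "median": DummyRegressor(strategy="median"),
    # 40 is a hypothetical user-supplied guess, not a value from the article
    "constant": DummyRegressor(strategy="constant", constant=40),
}

results = {}
for name, model in configs.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    # Square root of the MSE gives the RMSE
    results[name] = float(np.sqrt(mean_squared_error(y_test, y_pred)))
    print(f"{name:>8}: RMSE = {results[name]:.2f}")
# Prints 12.20 for mean, 13.28 for median, 13.06 for constant
```

Which strategy scores best depends on how the test targets happen to be distributed, so it is reasonable to treat the whole range as the baseline rather than a single number.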

Pros and Cons

As a lazy predictor, the dummy regressor certainly has its strengths and limitations.

Pros:

  1. Easy Benchmark: Quickly shows the minimum performance other models should beat.
  2. Fast: Takes no time to set up and run.

Cons:

  1. Doesn’t Learn: Just uses simple rules, so it’s often outperformed by real models.
  2. Ignores Features: Doesn’t consider any input data when making predictions.

Final Remarks

Using a dummy regressor should be the first step whenever we have a regression task. It provides a standard baseline, so we can be sure that a more complex model actually gives a better result than a naive constant prediction. As you learn more advanced techniques, never forget to compare your models against these simple baselines; these naive predictions might be just what you need first!
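One way such a comparison could look is sketched below. The choice of LinearRegression as the "real" model is an assumption made for illustration, as is dropping the Outlook column to keep the example short; neither comes from the article. No expected output is shown for the candidate model, because the point is the comparison itself.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Golf dataset columns, numeric features only (Outlook omitted for brevity)
temperature = [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0,
               81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0]
humidity = [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0,
            88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0]
wind = [False, True, False, False, False, True, True, False, False, False, True, True, False, True,
        True, False, False, True, False, True, True, False, True, False, False, True, False, False]
num_players = [52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13,
               33, 29, 25, 51, 41, 14, 34, 29, 49, 36, 57, 21, 23, 41]

X = np.column_stack([temperature, humidity, wind])
y = np.array(num_players)

# Same split as the article: first half for training, second half for testing
half = len(y) // 2
X_train, X_test, y_train, y_test = X[:half], X[half:], y[:half], y[half:]


def rmse(model):
    """Fit on the training half and return the RMSE on the test half."""
    y_pred = model.fit(X_train, y_train).predict(X_test)
    return float(np.sqrt(mean_squared_error(y_test, y_pred)))


baseline_rmse = rmse(DummyRegressor(strategy="median"))
candidate_rmse = rmse(LinearRegression())

print(f"Baseline RMSE : {baseline_rmse:.2f}")
print(f"Candidate RMSE: {candidate_rmse:.2f}")
print("Beats baseline" if candidate_rmse < baseline_rmse else "Does not beat baseline")
```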

🌟 Dummy Regressor Code Summarized

# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.dummy import DummyRegressor

# Create dataset
dataset_dict = {
'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast', 'sunny', 'sunny', 'rain', 'sunny', 'overcast', 'overcast', 'rain', 'sunny', 'overcast', 'rain', 'sunny', 'sunny', 'rain', 'overcast', 'rain', 'sunny', 'overcast', 'sunny', 'overcast', 'rain', 'overcast'],
'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
'Num_Players': [52,39,43,37,28,19,43,47,56,33,49,23,42,13,33,29,25,51,41,14,34,29,49,36,57,21,23,41]
}
df = pd.DataFrame(dataset_dict)

# One-hot encode 'Outlook' column
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)

# Convert 'Wind' column to binary
df['Wind'] = df['Wind'].astype(int)

# Split data into features and target, then into training and test sets
X, y = df.drop(columns='Num_Players'), df['Num_Players']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

# Initialize and train the model
dummy_reg = DummyRegressor(strategy='median')
dummy_reg.fit(X_train, y_train)

# Make predictions
y_pred = dummy_reg.predict(X_test)

# Calculate and print RMSE (square root of the MSE;
# the squared=False argument is deprecated since scikit-learn 1.4)
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.2f}")

Further Reading

For a detailed explanation of the DummyRegressor and its implementation in scikit-learn, readers can refer to the official documentation [2], which provides comprehensive information on its usage and parameters.

Technical Environment

This article uses Python 3.7 and scikit-learn 1.5. While the concepts discussed are generally applicable, specific code implementations may vary slightly with different versions.

About the Illustrations

Unless otherwise noted, all images are created by the author, incorporating licensed design elements from Canva Pro.

Reference

[1] T. M. Mitchell, Machine Learning (1997), McGraw-Hill Science/Engineering/Math, p. 59

[2] F. Pedregosa et al., Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011. [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyRegressor.html