REGRESSION ALGORITHM
Naively choosing the best number for all of your predictions
My students often come to me saying they want to try the most sophisticated model out there for their machine learning tasks, and sometimes I jokingly reply, “Have you tried the best-ever model first?” Especially in regression (where we don’t have that “100% accuracy” goal to aim for), a machine learning model can seemingly achieve a nice low error score, but when you compare it with a dummy model, it’s actually… not that great.
So, here’s the dummy regressor. Just like in classification, the regression task has its own baseline model — the first model you should try to get a rough idea of how much better your machine learning model could be.
Definition
A dummy regressor is a simple machine learning model that predicts numerical values using basic rules, without actually learning from the input data. Like its classification counterpart, it serves as a baseline for comparing the performance of more complex regression models. The dummy regressor helps us understand if our models are actually learning useful patterns or just making naive predictions.
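To make that concrete, here’s a minimal hand-rolled sketch of a mean-strategy baseline (the numbers below are made up purely for illustration); scikit-learn’s DummyRegressor, which we’ll use later, packages this kind of rule behind the usual fit/predict interface.
# A hand-rolled "mean" baseline: predict the average of the training
# targets for every new sample, ignoring the features entirely
import numpy as np

y_train_toy = np.array([40, 35, 50, 45, 30])   # made-up training targets
baseline_value = y_train_toy.mean()            # 40.0

n_new_samples = 3                              # pretend we have 3 new samples
y_pred_toy = np.full(n_new_samples, baseline_value)
print(y_pred_toy)                              # [40. 40. 40.]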
📊 Dataset & Libraries
Throughout this article, we’ll use this simple artificial golf dataset (again, inspired by [1]) as an example. The task is to predict the number of golfers visiting our golf course. The dataset includes features like outlook, temperature, humidity, and wind, with the target variable being the number of golfers.
# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Create dataset
dataset_dict = {
'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast', 'sunny', 'sunny', 'rain', 'sunny', 'overcast', 'overcast', 'rain', 'sunny', 'overcast', 'rain', 'sunny', 'sunny', 'rain', 'overcast', 'rain', 'sunny', 'overcast', 'sunny', 'overcast', 'rain', 'overcast'],
'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
'Num_Players': [52,39,43,37,28,19,43,47,56,33,49,23,42,13,33,29,25,51,41,14,34,29,49,36,57,21,23,41]
}
df = pd.DataFrame(dataset_dict)
# One-hot encode 'Outlook' column
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
# Convert 'Wind' column to binary
df['Wind'] = df['Wind'].astype(int)
# Split data into features and target, then into training and test sets
X, y = df.drop(columns='Num_Players'), df['Num_Players']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)
Evaluating Regression Results
Before getting into the dummy regressor itself, let’s recap how we evaluate regression results. While in classification it is very intuitive to check the accuracy of the model (just check the ratio of matching values), in regression it is a bit different.
RMSE (root mean squared error) is like a score for regression models. It tells us how far off our predictions are from the actual values. Just as we want high accuracy in classification to get more right answers, we want a low RMSE in regression to be closer to the true values.
People like using RMSE because its value is in the same unit as the quantity we’re trying to predict.
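For reference, RMSE is simply the square root of the average squared difference between the predictions and the true values:

\[ \text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \]

where \(y_i\) are the true values, \(\hat{y}_i\) the predictions, and \(n\) the number of samples.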
from sklearn.metrics import mean_squared_error

y_true = np.array([10, 15, 20, 15, 10]) # True labels
y_pred = np.array([15, 11, 18, 14, 10]) # Predicted values
# Calculate RMSE using scikit-learn
rmse = mean_squared_error(y_true, y_pred, squared=False)
print(f"RMSE = {rmse:.2f}")
With that in mind, let’s get into the algorithm.
Main Mechanism
Dummy Regressor makes predictions based on simple rules, such as always returning the mean or median of the target values in the training data.
Training Steps
It’s a bit of a stretch to say that there is any real training process in a dummy regressor, but anyway, here’s a general outline:
1. Select Strategy
Choose one of the following strategies:
- Mean: Always predicts the mean of the training target values.
- Median: Always predicts the median of the training target values.
- Constant: Always predicts a constant value provided by the user.
from sklearn.dummy import DummyRegressor

# Choose a strategy for your DummyRegressor ('mean', 'median', 'constant')
strategy = 'median'
2. Calculate the Metric
Calculate either the mean or the median of the training target values, depending on your strategy.
# Initialize the DummyRegressor
dummy_reg = DummyRegressor(strategy=strategy)

# "Train" the DummyRegressor (although no real training happens)
dummy_reg.fit(X_train, y_train)
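If you’re curious what that “training” produced, the fitted DummyRegressor stores the value it will return in its constant_ attribute, which here should simply equal the median of the training targets:
# Inspect the single value the DummyRegressor will use for every prediction
print("Stored value      :", dummy_reg.constant_)
print("Median of y_train :", y_train.median())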
3. Apply Strategy to Test Data
Use the chosen strategy to generate a list of predicted values for your test data.
# Use the DummyRegressor to make predictions
y_pred = dummy_reg.predict(X_test)
print("Label :",list(y_test))
print("Prediction:",list(y_pred))
Evaluate the Model
# Evaluate the Dummy Regressor's error
from sklearn.metrics import mean_squared_error

rmse = mean_squared_error(y_test, y_pred, squared=False)
print(f"Dummy Regression Error: {rmse.round(2)}")
Key Parameters
A dummy regressor has only two key parameters:
- Strategy: This determines how the regressor makes predictions. Common options include:
– mean: Provides an average baseline, commonly used for general scenarios.
– median: More robust against outliers, good for skewed target distributions.
– constant: Useful when domain knowledge suggests a specific constant prediction.
- Constant: When using the ‘constant’ strategy, this parameter specifies which value to always predict (see the short sketch below).
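As a quick sketch of the ‘constant’ strategy (the value 30 below is an arbitrary number chosen purely for illustration, not something from the dataset):
from sklearn.dummy import DummyRegressor

# Always predict 30 players, no matter what the features say
constant_reg = DummyRegressor(strategy='constant', constant=30)
constant_reg.fit(X_train, y_train)   # y is required by the API, but the prediction comes from `constant`
print(constant_reg.predict(X_test))  # an array of 30s, one per test sample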
Pros and Cons
As a lazy predictor, the dummy regressor certainly has its strengths and limitations.
Pros:
- Easy Benchmark: Quickly shows the minimum performance other models should beat.
- Fast: Takes no time to set up and run.
Cons:
- Doesn’t Learn: Just uses simple rules, so it’s often outperformed by real models.
- Ignores Features: Doesn’t consider any input data when making predictions, as the quick check below shows.
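Reusing the dummy_reg fitted earlier, we can make that second point visible: wiping out every feature doesn’t change the predictions at all.
# The dummy regressor's output stays the same even if we zero out the features
print(dummy_reg.predict(X_test)[:5])
print(dummy_reg.predict(X_test * 0)[:5])  # identical output: features are ignored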
Final Remarks
Using a dummy regressor should be the first step whenever we have a regression task. It provides a standard baseline, so that we can be sure a more complex model actually gives better results than a naive prediction. As you learn more advanced techniques, never forget to compare your models against these simple baselines — these naive predictions might be exactly what you need first!
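As a rough sketch of what that comparison can look like (using a plain LinearRegression as the “more advanced” model, purely for illustration):
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Fit a simple linear model on the same split used for the dummy regressor
lin_reg = LinearRegression().fit(X_train, y_train)

lin_rmse = mean_squared_error(y_test, lin_reg.predict(X_test), squared=False)
dummy_rmse = mean_squared_error(y_test, dummy_reg.predict(X_test), squared=False)

print(f"Linear Regression RMSE: {lin_rmse:.2f}")
print(f"Dummy Regressor RMSE  : {dummy_rmse:.2f}")
# If the 'real' model's RMSE isn't clearly lower than the dummy's,
# it hasn't learned anything useful beyond the naive baseline.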
🌟 Dummy Regressor Code Summarized
# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.dummy import DummyRegressor

# Create dataset
dataset_dict = {
'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast', 'sunny', 'sunny', 'rain', 'sunny', 'overcast', 'overcast', 'rain', 'sunny', 'overcast', 'rain', 'sunny', 'sunny', 'rain', 'overcast', 'rain', 'sunny', 'overcast', 'sunny', 'overcast', 'rain', 'overcast'],
'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
'Num_Players': [52,39,43,37,28,19,43,47,56,33,49,23,42,13,33,29,25,51,41,14,34,29,49,36,57,21,23,41]
}
df = pd.DataFrame(dataset_dict)
# One-hot encode 'Outlook' column
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
# Convert 'Wind' column to binary
df['Wind'] = df['Wind'].astype(int)
# Split data into features and target, then into training and test sets
X, y = df.drop(columns='Num_Players'), df['Num_Players']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)
# Initialize and train the model
dummy_reg = DummyRegressor(strategy='median')
dummy_reg.fit(X_train, y_train)
# Make predictions
y_pred = dummy_reg.predict(X_test)
# Calculate and print RMSE
print(f"RMSE: {mean_squared_error(y_test, y_pred, squared=False)}")
Further Reading
For a detailed explanation of the DummyRegressor and its implementation in scikit-learn, readers can refer to the official documentation [2], which provides comprehensive information on its usage and parameters.
Technical Environment
This article uses Python 3.7 and scikit-learn 1.5. While the concepts discussed are generally applicable, specific code implementations may vary slightly with different versions.
About the Illustrations
Unless otherwise noted, all images are created by the author, incorporating licensed design elements from Canva Pro.
Reference
[1] T. M. Mitchell, Machine Learning (1997), McGraw-Hill Science/Engineering/Math, p. 59
[2] F. Pedregosa et al., Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011. [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyRegressor.html