Encoding Categorical Data, Explained: A Visual Guide with Code Example for Beginners

DATA PREPROCESSING

Six ways of matchmaking categories and numbers

Ah, categorical data — the colorful characters in our datasets that machines just can’t seem to understand. This is where “red” becomes 1, “blue” 2, and data scientists turn into language translators (or more like matchmakers?).

Now, I know what you’re thinking: “Encoding? Isn’t that just assigning numbers to categories?” Oh, if only it were that simple! We’re about to explore six different encoding methods, all on (again) a single, tiny dataset (with visuals, of course!). From simple labels to mind-bending cyclic transformations, you’ll see why choosing the right encoding can be as important as picking the perfect algorithm.

All visuals: Author-created using Canva Pro. Optimized for mobile; may appear oversized on desktop.

What Is Categorical Data and Why Does It Need Encoding?

Before we jump into our dataset and encoding methods, let’s take a moment to understand what categorical data is and why it needs special treatment in the world of machine learning.

What Is Categorical Data?

Categorical data is like the descriptive labels we use in everyday life. It represents characteristics or qualities that can be grouped into categories.

Why Does Categorical Data Need Encoding?

Here’s the catch: most machine learning algorithms are like picky eaters — they only digest numbers. They can’t directly understand that “sunny” is different from “rainy”. That’s where encoding comes in. It’s like translating these categories into a language that machines can understand and work with.

Types of Categorical Data

Not all categories are created equal. We generally have two types:

  1. Nominal: These are categories with no inherent order.
    Ex: “Outlook” (sunny, overcast, rainy) is nominal. There’s no natural ranking between these weather conditions.
  2. Ordinal: These categories have a meaningful order.
    Ex: “Temperature” (Low, High, Extreme) is ordinal. There’s a clear progression from coldest to hottest.
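This nominal/ordinal distinction can be made explicit in pandas with the Categorical dtype. Here’s a minimal sketch (the column values follow the dataset introduced below):

```python
import pandas as pd

# Nominal: no order among categories
outlook = pd.Categorical(['sunny', 'rainy', 'overcast'])
print(outlook.ordered)  # False

# Ordinal: an explicit order from coldest to hottest
temperature = pd.Categorical(
    ['High', 'Low', 'Extreme'],
    categories=['Low', 'High', 'Extreme'],
    ordered=True,
)
print(temperature.ordered)  # True
print(temperature.min())    # 'Low' -- comparisons are now meaningful
```

Once a categorical is declared ordered, pandas lets you compare and sort its values meaningfully, which is exactly the information ordinal encoding tries to preserve.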

Why Care About Proper Encoding?

  1. It preserves important information in your data.
  2. It can significantly impact your model’s performance.
  3. Incorrect encoding can introduce unintended biases or relationships.

Imagine if we encoded “sunny” as 1 and “rainy” as 2. The model might think rainy days are “greater than” sunny days, which isn’t what we want!

Now that we understand what categorical data is and why it needs encoding, let’s take a look at our dataset and see how we can tackle its categorical variables using six different encoding methods.

The Dataset

Let’s use a simple golf dataset to illustrate our encoding methods (and it has mostly categorical columns). This dataset records various weather conditions and the resulting crowdedness at a golf course.

import pandas as pd
import numpy as np

data = {
'Date': ['03-25', '03-26', '03-27', '03-28', '03-29', '03-30', '03-31', '04-01', '04-02', '04-03', '04-04', '04-05'],
'Weekday': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri'],
'Month': ['Mar', 'Mar', 'Mar', 'Mar', 'Mar', 'Mar', 'Mar', 'Apr', 'Apr', 'Apr', 'Apr', 'Apr'],
'Temperature': ['High', 'Low', 'High', 'Extreme', 'Low', 'High', 'High', 'Low', 'High', 'Extreme', 'High', 'Low'],
'Humidity': ['Dry', 'Humid', 'Dry', 'Dry', 'Humid', 'Humid', 'Dry', 'Humid', 'Dry', 'Dry', 'Humid', 'Dry'],
'Wind': ['No', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No', 'Yes'],
'Outlook': ['sunny', 'rainy', 'overcast', 'sunny', 'rainy', 'overcast', 'sunny', 'rainy', 'sunny', 'overcast', 'sunny', 'rainy'],
'Crowdedness': [85, 30, 65, 45, 25, 90, 95, 35, 70, 50, 80, 45]
}
# Create a DataFrame from the dictionary
df = pd.DataFrame(data)

As we can see, we have a lot of categorical variables. Our task is to encode these variables so that a machine learning model can use them to predict, say, the Crowdedness of the golf course.

Let’s get into it.

Method 1: Label Encoding

Label Encoding assigns a unique integer to each category in a categorical variable.

Common Use 👍: It’s often used for ordinal variables where there’s a clear order to the categories, such as education levels (e.g., primary, secondary, tertiary) or product ratings (e.g., 1 star, 2 stars, 3 stars).

In Our Case: We could use Label Encoding for the ‘Weekday’ column in our golf dataset. Each day of the week would be assigned a unique number (e.g., Monday = 0, Tuesday = 1, etc.). However, we need to be careful as this might imply that Sunday (6) is “greater than” Saturday (5), which may not be meaningful for our analysis.

# 1. Label Encoding for Weekday
df['Weekday_label'] = pd.factorize(df['Weekday'])[0]
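Handily, pd.factorize also returns the array of unique values, which doubles as a decoding table. A quick sketch on a toy series:

```python
import pandas as pd

weekdays = pd.Series(['Mon', 'Tue', 'Wed', 'Mon', 'Tue'])
codes, uniques = pd.factorize(weekdays)

print(list(codes))    # [0, 1, 2, 0, 1] -- integers in order of first appearance
print(list(uniques))  # ['Mon', 'Tue', 'Wed']

# Decode back to the original labels
print(uniques[codes].tolist())  # ['Mon', 'Tue', 'Wed', 'Mon', 'Tue']
```

Note that factorize numbers categories by order of first appearance, not alphabetically, so always keep the uniques array around if you need to map back.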

Method 2: One-Hot Encoding

One-Hot Encoding creates a new binary column for each category in a categorical variable.

Common Use 👍: It’s typically used for nominal variables where there’s no inherent order to the categories. It’s particularly useful when dealing with variables that have a relatively small number of categories.

In Our Case: One-Hot Encoding would be ideal for our ‘Outlook’ column. We’d create three new columns: ‘Outlook_sunny’, ‘Outlook_overcast’, and ‘Outlook_rainy’. Each row would have a 1 in one of these columns and 0 in the others, representing the weather condition for that day.

# 2. One-Hot Encoding for Outlook
df = pd.get_dummies(df, columns=['Outlook'], prefix='Outlook', dtype=int)
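One practical wrinkle: get_dummies only creates columns for categories it actually sees, so new data with a missing or unseen category produces mismatched columns. A pandas-only sketch of one way to align new data to the training columns (the train/new names are illustrative):

```python
import pandas as pd

train = pd.DataFrame({'Outlook': ['sunny', 'rainy', 'overcast']})
new = pd.DataFrame({'Outlook': ['sunny', 'foggy']})  # 'foggy' never seen in training

train_enc = pd.get_dummies(train, columns=['Outlook'], prefix='Outlook', dtype=int)

# Encode the new data, then align it to the training columns:
# missing columns are filled with 0, unseen categories are dropped
new_enc = pd.get_dummies(new, columns=['Outlook'], prefix='Outlook', dtype=int)
new_enc = new_enc.reindex(columns=train_enc.columns, fill_value=0)

print(new_enc)
```

The unseen ‘foggy’ row ends up all-zero, which is one deliberate policy for handling it; whether that is acceptable depends on your model.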

Method 3: Binary Encoding

Binary Encoding represents each category as a single binary digit (0 or 1).

Common Use 👍: It’s typically used when a variable has exactly two categories, such as yes/no flags.

In Our Case: While our ‘Wind’ column only has two categories (Yes and No), we could use Binary Encoding to demonstrate the technique. It would result in a single binary column, where one category (e.g., No) is represented as 0 and the other (Yes) as 1.

# 3. Binary Encoding for Wind
df['Wind_binary'] = (df['Wind'] == 'Yes').astype(int)
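An equivalent, arguably more explicit, route is to spell the mapping out with map; this makes the choice of which category gets the 1 self-documenting. A quick sketch on a toy series:

```python
import pandas as pd

wind = pd.Series(['No', 'Yes', 'Yes', 'No'])

# Explicit mapping documents the 0/1 assignment in the code itself
wind_binary = wind.map({'No': 0, 'Yes': 1})
print(wind_binary.tolist())  # [0, 1, 1, 0]
```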

Method 4: Target Encoding

Target Encoding replaces each category with the mean of the target variable for that category.

Common Use 👍: It’s used when there’s likely a relationship between the categorical variable and the target variable. It’s particularly useful for high-cardinality features in datasets with a reasonable number of rows.

In Our Case: We could apply Target Encoding to our ‘Humidity’ column, using ‘Crowdedness’ as the target. Each ‘Dry’ or ‘Humid’ in the ‘Humidity’ column would be replaced with the average crowdedness observed for dry and humid days respectively.

# 4. Target Encoding for Humidity
df['Humidity_target'] = df.groupby('Humidity')['Crowdedness'].transform('mean')
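A caveat worth knowing: computing category means on the very rows you train on leaks the target into the features, and rare categories get noisy means. A common remedy is smoothing each category’s mean toward the global mean. Here’s a minimal sketch on toy data (the smoothing strength m is an illustrative choice, not a recommendation):

```python
import pandas as pd

demo = pd.DataFrame({
    'Humidity': ['Dry', 'Humid', 'Dry', 'Humid', 'Dry'],
    'Crowdedness': [85, 30, 65, 25, 95],
})

m = 5  # smoothing strength: higher pulls small categories toward the global mean
global_mean = demo['Crowdedness'].mean()
stats = demo.groupby('Humidity')['Crowdedness'].agg(['mean', 'count'])

# Blend each category's mean with the global mean, weighted by category size
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)
demo['Humidity_smoothed'] = demo['Humidity'].map(smoothed)

print(demo[['Humidity', 'Humidity_smoothed']])
```

For serious use you would also compute the encoding on held-out folds (out-of-fold target encoding) rather than on the full training set.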

Method 5: Ordinal Encoding

Ordinal Encoding assigns ordered integers to ordinal categories based on their inherent order.

Common Use 👍: It’s used for ordinal variables where the order of categories is meaningful and you want to preserve this order information.

In Our Case: Ordinal Encoding is perfect for our ‘Temperature’ column. We could assign integers to represent the order: Low = 1, High = 2, Extreme = 3. This preserves the natural ordering of temperature categories.

# 5. Ordinal Encoding for Temperature
temp_order = {'Low': 1, 'High': 2, 'Extreme': 3}
df['Temperature_ordinal'] = df['Temperature'].map(temp_order)
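The same idea can be expressed with pandas’ ordered Categorical dtype, which also guards against typos: any value not in the declared list becomes NaN (code -1) instead of silently passing through. A sketch (note the codes are 0-based rather than the 1-based mapping above):

```python
import pandas as pd

temps = pd.Series(['High', 'Low', 'Extreme', 'High'])

# Declare the order once; .cat.codes yields 0-based ordered integers
temp_dtype = pd.CategoricalDtype(['Low', 'High', 'Extreme'], ordered=True)
temps_cat = temps.astype(temp_dtype)
print(temps_cat.cat.codes.tolist())  # [1, 0, 2, 1]
```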

Method 6: Cyclic Encoding

Cyclic Encoding transforms a cyclical categorical variable into two numerical features that preserve the variable’s cyclical nature. It typically uses sine and cosine transformations to represent the cyclical pattern. For example, for the column “Month” we’d make it numerical first (1–12) then create two new features:

  • Month_sin = sin(2π(m - 1) / 12)
  • Month_cos = cos(2π(m - 1) / 12)

where m is a number from 1 to 12 representing January to December.

Imagine the encoding as the (x, y) coordinates of the numbers 1–12 on a clock face. To preserve the cyclical order, we need to represent each month with two columns instead of one.

Common Use 👍: It’s used for categorical variables that have a natural cyclical order, such as days of the week, months of the year, or hours of the day. Cyclic encoding is particularly useful when the “distance” between categories matters and wraps around (e.g., the distance between December and January should be small, just like the distance between any other consecutive months).

In Our Case: In our golf dataset, the best column for cyclic encoding would be the ‘Month’ column. Months have a clear cyclical pattern that repeats every year. This could be particularly useful for our golf dataset, as it would capture seasonal patterns in golfing activity that might repeat annually. Here’s how we could apply it:

# 6. Cyclic Encoding for Month
month_order = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}
df['Month_num'] = df['Month'].map(month_order)
df['Month_sin'] = np.sin(2 * np.pi * (df['Month_num']-1) / 12)
df['Month_cos'] = np.cos(2 * np.pi * (df['Month_num']-1) / 12)
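A quick check that this encoding really wraps around: in (sin, cos) space, December and January end up exactly as close as any other pair of consecutive months.

```python
import numpy as np

def month_xy(m):
    """Map a month number (1-12) to a point on the unit circle."""
    angle = 2 * np.pi * (m - 1) / 12
    return np.array([np.sin(angle), np.cos(angle)])

dist_dec_jan = np.linalg.norm(month_xy(12) - month_xy(1))
dist_jun_jul = np.linalg.norm(month_xy(6) - month_xy(7))

print(round(dist_dec_jan, 4), round(dist_jun_jul, 4))  # identical distances
```

With a plain numeric Month column (1–12), the December-to-January distance would be 11; on the circle it collapses to the same small step as every other month boundary.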

Conclusion: The Power of Transformation (and Understanding)

So, there you have it! Six different ways to encode categorical data, all applied to our golf course dataset. Now, all categories are transformed into numbers!

Let’s recap how each method tackled our data:

  1. Label Encoding: Turned our ‘Weekday’ into numbers, making Monday 0 and Sunday 6 — simple but potentially misleading.
  2. One-Hot Encoding: Gave ‘Outlook’ its own columns, letting ‘sunny’, ‘overcast’, and ‘rainy’ stand independently.
  3. Binary Encoding: Mapped our ‘Wind’ column to 0s and 1s, a compact yes/no representation.
  4. Target Encoding: Replaced ‘Humidity’ categories with the average ‘Crowdedness’ for each, capturing hidden relationships.
  5. Ordinal Encoding: Respected the natural order of ‘Temperature’, from ‘Low’ to ‘Extreme’.
  6. Cyclic Encoding: Transformed ‘Month’ into sine and cosine components, preserving its circular nature.

There’s no one-size-fits-all solution in categorical encoding. The best method depends on your specific data, the nature of your categories, and the requirements of your machine learning model.

Encoding categorical data might seem like a small step in the grand scheme of a machine learning project, but it’s often these seemingly minor details that can make or break a model’s performance.

⚠️ Caution: Key Considerations in Categorical Encoding

As we wrap up our encoding discussion, let’s highlight some critical points to keep in mind:

  1. Information Loss: Some encoding methods can lead to loss of information. For example, label encoding might impose an unintended ordinal relationship.
  2. The New Category Issue: Most encoding techniques stumble when faced with categories in your test data that weren’t present during training. Always have a strategy for handling these unexpected guests.
  3. Curse of Dimensionality: Techniques like one-hot encoding can dramatically increase the number of features (imagine hundreds of different categories, like countries or cities!). You might want to limit which categories get their own encoding, for example by grouping the rare ones into an “Others” bucket.
  4. Document, Document, Document: Your future self (and your colleagues) will thank you for clearly recording your encoding decisions. This transparency is for reproducibility and for understanding any potential biases in your results.
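The rare-category grouping mentioned in point 3 takes only a few lines of pandas. A sketch on toy data (the frequency threshold is an illustrative choice):

```python
import pandas as pd

cities = pd.Series(['NYC', 'NYC', 'NYC', 'LA', 'LA', 'Oslo', 'Lima'])

min_count = 2  # keep categories seen at least this many times
counts = cities.value_counts()
keep = counts[counts >= min_count].index

# Lump everything rare into a single 'Other' bucket before encoding
cities_grouped = cities.where(cities.isin(keep), 'Other')
print(cities_grouped.value_counts().to_dict())  # {'NYC': 3, 'LA': 2, 'Other': 2}
```

Doing this before one-hot encoding caps the number of new columns, and it conveniently gives unseen test-time categories a natural home in ‘Other’.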

So, well, encoding is about translating your categorical data into a language that machines can understand, while preserving as much meaning as possible. It’s not about finding a perfect encoding, but about choosing the method that best suits your specific needs and constraints. Approach it thoughtfully, and you’ll set a strong foundation for your machine learning works.

🌟 Categorical Encoding Code Summarized

import pandas as pd
import numpy as np

# Create a DataFrame from the dictionary
data = {
'Date': ['03-25', '03-26', '03-27', '03-28', '03-29', '03-30', '03-31', '04-01', '04-02', '04-03', '04-04', '04-05'],
'Weekday': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri'],
'Month': ['Mar', 'Mar', 'Mar', 'Mar', 'Mar', 'Mar', 'Mar', 'Apr', 'Apr', 'Apr', 'Apr', 'Apr'],
'Temperature': ['High', 'Low', 'High', 'Extreme', 'Low', 'High', 'High', 'Low', 'High', 'Extreme', 'High', 'Low'],
'Humidity': ['Dry', 'Humid', 'Dry', 'Dry', 'Humid', 'Humid', 'Dry', 'Humid', 'Dry', 'Dry', 'Humid', 'Dry'],
'Wind': ['No', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No', 'Yes'],
'Outlook': ['sunny', 'rainy', 'overcast', 'sunny', 'rainy', 'overcast', 'sunny', 'rainy', 'sunny', 'overcast', 'sunny', 'rainy'],
'Crowdedness': [85, 30, 65, 45, 25, 90, 95, 35, 70, 50, 80, 45]
}

df = pd.DataFrame(data)

# 1. Label Encoding for Weekday
df['Weekday_label'] = pd.factorize(df['Weekday'])[0]

# 2. One-Hot Encoding for Outlook
df = pd.get_dummies(df, columns=['Outlook'], prefix='Outlook', dtype=int)

# 3. Binary Encoding for Wind
df['Wind_binary'] = (df['Wind'] == 'Yes').astype(int)

# 4. Target Encoding for Humidity
df['Humidity_target'] = df.groupby('Humidity')['Crowdedness'].transform('mean')

# 5. Ordinal Encoding for Temperature
temp_order = {'Low': 1, 'High': 2, 'Extreme': 3}
df['Temperature_ordinal'] = df['Temperature'].map(temp_order)

# 6. Cyclic Encoding for Month
month_order = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}
df['Month_num'] = df['Month'].map(month_order)
df['Month_sin'] = np.sin(2 * np.pi * (df['Month_num'] - 1) / 12)
df['Month_cos'] = np.cos(2 * np.pi * (df['Month_num'] - 1) / 12)

# Select and rearrange numerical columns
numerical_columns = [
'Date','Weekday_label',
'Month_sin', 'Month_cos',
'Temperature_ordinal',
'Humidity_target',
'Wind_binary',
'Outlook_sunny', 'Outlook_overcast', 'Outlook_rainy',
'Crowdedness'
]

# Display the rearranged numerical columns
print(df[numerical_columns].round(3))

Technical Environment

This article uses Python 3.7, pandas 2.1, and numpy 1.26. While the concepts discussed are generally applicable, specific code implementations may vary slightly with different versions.

About the Illustrations

Unless otherwise noted, all images are created by the author, incorporating licensed design elements from Canva Pro.

For a concise visual summary of these encoding methods, check out the companion Instagram post.