Unsupervised change detection using reference windows, with a Python example
In a previous article, we explored the basics of concept drift. Concept drift occurs when the distribution of a dataset changes.
This post continues to explore this topic. Here, you’ll learn how to detect concept drift in problems where you don’t have access to labels. This task is challenging because without labels we can’t evaluate models’ performance.
Let’s dive in.
Introduction
Datasets that evolve over time are prone to concept drift. Changes in distributions can undermine models and the accuracy of their predictions. So, it’s important to detect and adapt to these changes to keep models up to date.
Most change detection approaches rely on tracking the model’s error. The idea is to trigger an alarm when this error increases significantly. Then, some adaptation mechanism kicks in, such as retraining the model.
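To make the error-tracking idea concrete, here is a minimal sketch. It assumes labels arrive together with each prediction, and it is not a specific published detector. The class name `ErrorRateMonitor`, the helper `first_alarm`, and the parameters `baseline_error`, `margin`, and `window_size` are all illustrative choices.

```python
from collections import deque


class ErrorRateMonitor:
    """Raise an alarm when the recent error rate exceeds a baseline by a margin.

    Illustrative only: baseline_error, margin and window_size are
    assumed parameters, not values taken from the article.
    """

    def __init__(self, baseline_error, margin=0.1, window_size=50):
        self.baseline_error = baseline_error
        self.margin = margin
        # Sliding window holding 1 for a wrong prediction, 0 for a correct one
        self.errors = deque(maxlen=window_size)

    def update(self, y_true, y_pred):
        """Record one labelled prediction; return True if an alarm is raised."""
        self.errors.append(int(y_true != y_pred))
        if len(self.errors) < self.errors.maxlen:
            return False  # wait until the window is full
        recent_error = sum(self.errors) / len(self.errors)
        return recent_error > self.baseline_error + self.margin


def first_alarm(stream, monitor):
    """Return the index of the first alarm over (y_true, y_pred) pairs, or None."""
    for i, (y_true, y_pred) in enumerate(stream):
        if monitor.update(y_true, y_pred):
            return i
    return None
```

In practice, the alarm would trigger the adaptation step, for example retraining the model on recent data. Note that every call to `update` needs the true label, which is exactly the requirement that breaks down when labels are delayed or unavailable.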
In the previous article, we argued that having access to labels may be difficult in some cases. Examples appear in many domains, such as fraud detection or credit risk assessment. In the latter, it can take several years for a person to default (and thereby provide a label for their assessment).
In these cases, you have to detect changes using approaches that do not depend on performance.
Change detection without labels
In general, you have two options to detect changes without labels:
- Track the model’s predictions.
- Track the input data (explanatory variables).
In both cases, a change is flagged when the distribution of the tracked variable shifts significantly.
How does this work exactly?
Change detection without labels is done by comparing two samples of data. One sample represents the most recent data, also referred to as the detection window. The other contains data from the original distribution (the reference window).
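As a rough illustration of comparing the two windows, the sketch below uses a two-sample Kolmogorov-Smirnov test on a single numeric variable. The choice of test is an assumption made for this example, and the function names `ks_statistic` and `drift_detected` are hypothetical. The critical value uses the standard large-sample approximation, so it is only indicative for small windows.

```python
import math
from bisect import bisect_right


def ks_statistic(reference, detection):
    """Largest gap between the empirical CDFs of the two windows."""
    ref_sorted = sorted(reference)
    det_sorted = sorted(detection)
    n, m = len(ref_sorted), len(det_sorted)
    max_gap = 0.0
    # The gap between two step functions can only change at observed values
    for x in ref_sorted + det_sorted:
        cdf_ref = bisect_right(ref_sorted, x) / n
        cdf_det = bisect_right(det_sorted, x) / m
        max_gap = max(max_gap, abs(cdf_ref - cdf_det))
    return max_gap


def drift_detected(reference, detection, alpha=0.05):
    """Return True if the windows differ significantly at level alpha.

    Uses the asymptotic critical value c(alpha) * sqrt((n + m) / (n * m)),
    with c(alpha) = sqrt(-ln(alpha / 2) / 2).
    """
    n, m = len(reference), len(detection)
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    critical_value = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, detection) > critical_value
```

Here, `reference` would hold values of one input variable (or the model’s predictions) from the original distribution, and `detection` would hold the most recent values of the same variable. With several variables, one option is to run the check on each variable separately, keeping in mind that repeated tests increase the chance of false alarms.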
So, the detection process is split into two parts:
- Building the two samples