Using unavailable data at prediction time and mixing magic numbers with real numbers
Welcome back to this series on easily missed mistakes in machine learning workflows! For those who haven’t read the first article, the series focuses mainly on procedural errors that are not always obvious but can seriously degrade model performance if they slip into the development pipeline.
In the first article, we explored common pitfalls like misusing numerical identifiers, mishandling data splits, and overfitting the model to rare feature values.
In this edition, we’ll continue with errors related to data handling, focusing on two topics:
- Training with data not available at prediction time
- Mixing magic numbers with real numbers