A Python tutorial on how to implement oversampling and how to make custom variations
Synthetic Minority Oversampling Technique (SMOTE) is commonly used to handle class imbalances in datasets. Suppose there are two classes and one class has far more samples (majority class) than the other (minority class). In that case, SMOTE will generate more synthetic samples in the minority class so that it’s on par with the majority class.
In the real world, we’re not going to have balanced datasets for classification problems. Take for example a classifier that predicts whether a patient has sickle cell disease. If a patient has abnormal hemoglobin levels (6–11 g/dL), then that’s a strong predictor of sickle cell disease. If a patient has normal hemoglobin levels (12 mg/dL), then that predictor alone doesn’t indicate whether the patient has sickle cell disease.
However, about 100,000 patients in the USA are diagnosed with sickle cell disease. There are currently 334.9 million US citizens. If we have a dataset of every US citizen and label or not the patient has sickle cell disease, we have 0.02% of people who have the disease. We have a major class imbalance. Our model can’t pick up meaningful features to predict this anomaly.