
When building classification models in machine learning, one of the common issues we encounter is data imbalance. For instance, when predicting credit card fraud or a rare disease, the ratio of positive to negative cases can be 1 to 10 or even 1 to 100. This can cause a significant problem: the model may tend to predict every record as negative and still achieve a seemingly high accuracy, even though this accuracy is deceptive.
How do we address it? There are three commonly used approaches: changing the threshold, changing the evaluation metric, and changing the sampling method.
Changing Threshold
Let’s take logistic regression as an example. By default, the threshold between positive (1) and negative (0) is set at 0.5. This means that when the predicted model score is greater than 0.5, the data point is labeled as 1, and if not, it’s labeled as 0.
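When dealing with imbalanced data, it’s highly likely that most of the predicted scores on new data will fall below 0.5, so very few records are ever labeled as positive. To address this, we can lower the threshold, for example from 0.5 to 0.1, so that more records are labeled as positive. This catches more of the minority class at the cost of more false positives, so the exact value is best chosen on a validation set rather than fixed in advance.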
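To make the effect of the threshold concrete, here is a minimal sketch in plain Python. The function name `label_scores` and the score values are made up for illustration; in practice the scores would come from a fitted model.

```python
def label_scores(scores, threshold=0.5):
    """Turn predicted scores into 0/1 labels: 1 if score > threshold, else 0."""
    return [1 if s > threshold else 0 for s in scores]


# Hypothetical scores from a model trained on imbalanced data.
# Most of them sit well below 0.5.
scores = [0.02, 0.08, 0.15, 0.30, 0.45, 0.62]

default_labels = label_scores(scores)        # default cutoff of 0.5
lowered_labels = label_scores(scores, 0.1)   # lowered cutoff of 0.1

print(default_labels)  # one record labeled positive
print(lowered_labels)  # four records labeled positive
```

With the default cutoff only one record is flagged, while the lowered cutoff flags four. Whether those extra positives are worth it depends on how costly false alarms are compared with missed cases.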
Changing Evaluation Metric
For imbalanced datasets, accuracy is not the best metric for evaluation. Depending on the business need, a better choice can be the F1 score, which combines precision and recall into a single number (their harmonic mean).
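The gap between accuracy and F1 is easy to see on a small made-up example. The sketch below assumes a 1-to-9 dataset of 100 records and two hypothetical sets of predictions; the helper names (`confusion_counts`, `accuracy`, `f1`) are illustrative, not from any library.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def f1(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Made-up data: 10 positives followed by 90 negatives.
y_true = [1] * 10 + [0] * 90

# Model A: predicts everything as negative.
all_negative = [0] * 100

# Model B: finds 8 of the 10 positives but raises 12 false alarms.
some_positive = [1] * 8 + [0] * 2 + [1] * 12 + [0] * 78

print(accuracy(y_true, all_negative), f1(y_true, all_negative))
print(accuracy(y_true, some_positive), f1(y_true, some_positive))
```

Model A has the higher accuracy but an F1 of zero, because it never finds a single positive case. Model B has slightly lower accuracy but a much higher F1, which is closer to what we care about when the positive class is the rare one.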
Changing Sampling Method
This is the main topic of this article, and I will break the sampling approach down into the following buckets:
Over-Sampling
Essentially, over-sampling draws more minority records to make the training set more balanced. Since it simply repeats existing minority records, it puts more weight on the minority cases. However, if some of the minority records are wrong or noisy, those errors are amplified too.
In short, the biggest risk of using over-sampling is overfitting on minority data points.
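Random over-sampling as described above can be sketched in a few lines of plain Python. This is a simplified illustration assuming binary labels with 1 as the minority class; the function name `random_oversample` and the toy data are made up for this example.

```python
import random


def random_oversample(records, labels, minority_label=1, seed=0):
    """Repeat randomly chosen minority records until both classes are the same size.

    Assumes binary labels. Returns new lists; the inputs are left unchanged.
    """
    rng = random.Random(seed)
    minority = [r for r, y in zip(records, labels) if y == minority_label]
    if not minority:
        raise ValueError("no minority records to sample from")

    majority_count = len(labels) - len(minority)
    extra_needed = max(majority_count - len(minority), 0)

    # Draw with replacement from the existing minority records.
    extras = [rng.choice(minority) for _ in range(extra_needed)]

    new_records = list(records) + extras
    new_labels = list(labels) + [minority_label] * len(extras)
    return new_records, new_labels


# Toy training set: records 0 and 1 are the minority, records 2..11 the majority.
records = list(range(12))
labels = [1, 1] + [0] * 10

balanced_records, balanced_labels = random_oversample(records, labels)

print(len(balanced_records))       # total records after over-sampling
print(sum(balanced_labels))        # number of minority records
print(set(balanced_records[12:]))  # the added rows are copies of records 0 and 1
```

Every added row is an exact copy of one of the two original minority records, which is why the overfitting risk is real: the model sees the same few points again and again. Over-sampling should also be applied only to the training set, so that the evaluation data keeps the original class ratio.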
Under-Sampling
This follows a similar mindset as oversampling, but in reverse. Instead of repeating minority cases, we only select a portion of the…