The Art and Science of Dealing with Imbalanced Datasets

Machine learning models are typically developed in an iterative loop. At a high level, the model is trained on the available training dataset, its prediction performance is evaluated, and the process is repeated as more data becomes available. The choice of evaluation technique depends on factors such as the type of target variable and the type of algorithm. As more data points become available, the algorithm is expected to learn better and its performance to improve. The catch, however, is that more data alone is not enough; the data must also be meaningful and diverse. For instance, an image recognition model trained on numerous images of labradors in meadows ended up classifying images of green grass as labradors. Predictions like these are not merely trivial mistakes; they also erode the end user’s trust in the model’s recommendations. Thus, understanding the distribution of the training dataset and applying appropriate data manipulation techniques where required is paramount to deriving meaningful predictions. This article focuses on one such challenge with training datasets for supervised algorithms, namely imbalanced class labels, and outlines a few techniques to tackle it.

In machine learning, supervised algorithms are often evaluated in terms of their prediction accuracy, in other words, the percentage of records that were correctly predicted as belonging to their respective classes. This approach works well if the dataset is evenly distributed across its class labels. However, that is not always the case. Moreover, a major challenge with most machine learning algorithms is that they are biased towards predicting the most abundant class. For imbalanced datasets in particular, assessing the model on accuracy alone is misleading: a model that always predicts the majority class scores highly without ever identifying a single minority class record.
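To make the point about accuracy concrete, here is a minimal sketch. The labels are invented for illustration (990 legitimate records and 10 fraudulent ones), and the "model" is simply a rule that always predicts the majority class. Recall on the positive class is used as one possible complementary metric; it is my choice of example rather than the only option.

```python
# Hypothetical labels: 990 legitimate transactions (0) and 10 fraudulent ones (1).
y_true = [0] * 990 + [1] * 10

# A "model" that always predicts the majority class.
y_pred = [0] * len(y_true)


def accuracy(y_true, y_pred):
    """Share of records whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Share of actual positive records that were predicted as positive."""
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0


print(f"accuracy: {accuracy(y_true, y_pred):.3f}")  # accuracy: 0.990
print(f"recall:   {recall(y_true, y_pred):.3f}")    # recall:   0.000
```

An accuracy of 99% looks excellent, yet the rule never flags a single fraudulent record, which is exactly the failure that accuracy hides on an imbalanced dataset.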

More interestingly, what if your real interest lies in identifying the records that belong to the rare positive class? Now that’s a needle-in-a-haystack problem. Unlike the general case, simply collecting more data is unlikely to improve the predictions here, since the proportion of positive records remains just as small. Can you think of any such applications? Well, the go-to example for this scenario is credit card fraud detection. As one would expect, fraudulent transactions are rare relative to the volume of transactions that banks handle. However, it is critical that banks be alerted to fraudulent transactions promptly, before things get out of control. In the subsequent sections of this article, I provide an overview of some techniques that help in handling imbalanced datasets. These approaches can be broadly classified into two buckets: modifying the dataset on which the model is trained, or using algorithms that can handle class imbalance.

Modifying the dataset to address class imbalance can be done in a few different ways and typically involves some form of resampling. Randomly removing majority class records until the classes balance out is called undersampling. Because this approach removes records from the training dataset, it risks losing information, which might lead to poor model training. Conversely, increasing the proportion of the minority class through random replication is referred to as oversampling. The problem with this approach is that an algorithm trained on exact copies of the same records can overfit and hence make inaccurate predictions on test data.

A more sophisticated version of oversampling is the Synthetic Minority Oversampling Technique, commonly known as SMOTE. In this approach, synthetically generated records of the minority class are added to the dataset. Typically, a k-nearest-neighbors search (not to be confused with k-means clustering) is used to identify the k nearest minority class neighbors (say 5) of each minority class record. Based on the degree of oversampling required (say 200%), a proportional number of those neighbors is selected at random (2 out of the 5). One synthetic record is then generated at a random point on the line segment joining the minority class record to each selected neighbor. Since the new records are not exact copies of existing minority class records, this approach is less prone to the overfitting seen with plain random oversampling.
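The three resampling ideas above can be sketched in a few lines of standard-library Python. This is an illustrative sketch under stated assumptions, not a reference implementation: the function names, the two-dimensional toy data, and the parameter `n_per_record` (2 synthetic records per minority record, standing in for 200% oversampling) are all hypothetical, and Euclidean distance is assumed for the neighbor search.

```python
import math
import random


def random_undersample(majority, minority, rng):
    """Drop majority records at random until both classes have the same size."""
    return rng.sample(majority, len(minority)), list(minority)


def random_oversample(majority, minority, rng):
    """Replicate minority records at random until both classes have the same size."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority), list(minority) + extra


def smote(minority, k=5, n_per_record=2, rng=None):
    """Generate n_per_record synthetic points for every minority record."""
    rng = rng or random.Random()
    synthetic = []
    for i, point in enumerate(minority):
        # k nearest minority class neighbors of this record (excluding itself).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(point, p))[:k]
        # Pick some of them at random, e.g. 2 out of the 5.
        for neighbor in rng.sample(neighbors, min(n_per_record, len(neighbors))):
            gap = rng.random()  # position along the segment between the two records
            synthetic.append(tuple(a + gap * (b - a) for a, b in zip(point, neighbor)))
    return synthetic


# Toy data (assumed): 100 majority points around (0, 0), 10 minority points around (3, 3).
rng = random.Random(0)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
minority = [(rng.gauss(3, 0.5), rng.gauss(3, 0.5)) for _ in range(10)]

synthetic = smote(minority, k=5, n_per_record=2, rng=rng)
print(len(minority), "->", len(minority) + len(synthetic))  # 10 -> 30
```

Each synthetic point is a convex combination of two real minority records, so it stays inside the region the minority class already occupies while not duplicating any existing record.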

A completely different approach to handling imbalanced datasets is to adopt algorithms that cope better with class imbalance. Tree-based ensemble techniques are often a good fit for this purpose. Tree-based ensemble models work by combining predictions from multiple models (trees) trained on samples from the original dataset. Hence, for a given record, the class label is typically determined by a majority vote across the individual trees. Tree-based ensemble models can be broadly classified into two buckets: bagging and boosting. Bagging draws as many bootstrap samples (random samples taken with replacement) from the data as there are trees to be combined (say n), and then trains the model of choice on each of the n samples. In general, the model of choice can be any technique, such as regression or neural networks, to name a few. With bagged trees, however, every tree considers the same predictor variables, so the trees tend to split on the same strong predictors in a similar order, which makes them less diverse. This is where random forests fit in. A random forest is similar to bagging except in how predictor variables are picked: only a random subset of the predictor variables is considered when growing each tree (classically, at each split), so not all predictors are used by every tree.

Boosting, on the other hand, builds trees sequentially and improves prediction accuracy with each iteration by concentrating on the records that the previous iterations misclassified. Depending on how the misclassified records are treated, boosting can be of different types. For instance, adaptive boosting (AdaBoost) assigns higher weights to the misclassified records, while gradient boosting fits each new tree to the errors left over from the previous iterations.
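The building blocks described above can be sketched as follows. This is a simplified illustration rather than a full ensemble: no trees are actually trained, the helper names are hypothetical, and the reweighting step follows the classic binary AdaBoost update, which I am assuming here as the concrete form of "assigning higher weight to the misclassified records".

```python
import math
import random
from collections import Counter


def bootstrap_sample(records, rng):
    """Bagging: draw len(records) records with replacement."""
    return [rng.choice(records) for _ in records]


def majority_vote(predictions):
    """Combine the class labels predicted by the individual trees."""
    return Counter(predictions).most_common(1)[0][0]


def adaboost_reweight(weights, y_true, y_pred):
    """One AdaBoost round: return the learner's vote weight and the new record weights."""
    total = sum(weights)
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / total
    err = min(max(err, 1e-10), 1 - 1e-10)      # guard against log(0)
    alpha = 0.5 * math.log((1 - err) / err)    # say of this weak learner
    # Misclassified records are scaled up, correctly classified ones scaled down.
    raw = [w * math.exp(alpha if t != p else -alpha)
           for w, t, p in zip(weights, y_true, y_pred)]
    norm = sum(raw)
    return alpha, [w / norm for w in raw]


print(majority_vote([1, 0, 1]))  # 1

# Hypothetical round: the weak learner predicts the majority class everywhere
# and therefore misses both positive records.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [0] * 10
weights = [0.1] * 10

alpha, new_weights = adaboost_reweight(weights, y_true, y_pred)
print([round(w, 4) for w in new_weights[:3]])  # [0.25, 0.25, 0.0625]
```

After a single round, the two missed positive records carry half of the total weight between them, so the next weak learner can no longer afford to ignore the minority class.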

This article provides only an overview of the different techniques for dealing with imbalanced datasets; each of the techniques mentioned above is a topic for much more detailed discussion in its own right. I hope this article serves as a reference and a good starting point when dealing with imbalanced datasets.