What do datasets from domains as different as fraud detection in banking, real-time bidding on ads, and network intrusion detection have in common?
Each contains only a small fraction of rare but “interesting” events (e.g., fraudulent credit card activity or a compromised server scanning its network), and most machine learning algorithms don’t work very well on such imbalanced data. Luckily, these seven techniques can help you train a classifier to detect the abnormal class.
1. Use the right evaluation metrics
Imbalanced datasets are tricky to evaluate, and a misleading metric can lead us badly astray. Imagine a test set in which only 0.2% of samples belong to the positive class: a model that predicts “0” for every sample scores 99.8% accuracy while providing no meaningful information at all. Depending on your situation, the following alternative evaluation metrics may be more appropriate:
- Precision/Specificity: how many of the selected instances are relevant
- Recall/Sensitivity: how many of the relevant instances are selected
- F1 score: the harmonic mean of precision and recall
- MCC: the Matthews correlation coefficient, a correlation between the observed and predicted classifications
- AUC: the area under the ROC curve, relating the true-positive rate to the false-positive rate
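As a minimal sketch, here is how these metrics can be computed with scikit-learn (the toy labels and scores are purely illustrative):

```python
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             matthews_corrcoef, roc_auc_score)

# Toy ground truth, hard predictions, and predicted probabilities
y_true  = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred  = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.6, 0.9, 0.4]

print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))
print("MCC:      ", matthews_corrcoef(y_true, y_pred))
print("AUC:      ", roc_auc_score(y_true, y_score))  # AUC uses scores, not hard labels
```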
2. Resample the training set
When you’re trying to create a balanced dataset from an unbalanced one, there are two ways to go about it (a sketch of both follows the list):
- Under-sampling: balance the dataset by shrinking the abundant class while keeping all samples of the rare class. This works well when you have plenty of data.
- Over-sampling: balance the dataset by growing the rare class, e.g., through repetition, bootstrapping, or generating synthetic samples (such as SMOTE). This works well when you don’t have much data to begin with.
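A minimal sketch of both approaches using sklearn.utils.resample (the DataFrame and its `label` column are illustrative assumptions):

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced dataset: 95 "normal" rows, 5 "rare" rows
df = pd.DataFrame({
    "feature": range(100),
    "label": [0] * 95 + [1] * 5,
})
rare = df[df["label"] == 1]
abundant = df[df["label"] == 0]

# Under-sampling: shrink the abundant class to the size of the rare class
abundant_down = resample(abundant, replace=False, n_samples=len(rare), random_state=42)
balanced_down = pd.concat([rare, abundant_down])

# Over-sampling: grow the rare class (sampling with replacement)
rare_up = resample(rare, replace=True, n_samples=len(abundant), random_state=42)
balanced_up = pd.concat([rare_up, abundant])
```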
3. Use K-fold cross-validation in the right way
Cross-validation must be applied carefully when over-sampling is used to address an imbalance problem. Over-sampling creates new artificial observations from the observed rare samples, using bootstrapping or heuristics such as SMOTE. If you over-sample before splitting into folds, copies or near-duplicates of the same rare observation end up in both the training and validation folds, leaking information and producing overly optimistic scores. That is why cross-validation should always come before over-sampling: split the data first, then over-sample only inside each training fold, just as feature selection should be fitted on the training fold alone.
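A minimal sketch of over-sampling inside each fold, so validation data stays untouched (the classifier and dataset are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

X, y = make_classification(n_samples=1000, weights=[0.95], random_state=42)

scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=42).split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]
    X_val, y_val = X[val_idx], y[val_idx]

    # Over-sample the rare class *inside* the training fold only
    n_abundant = int((y_tr == 0).sum())
    rare_up = resample(X_tr[y_tr == 1], replace=True, n_samples=n_abundant, random_state=42)
    X_bal = np.vstack([X_tr[y_tr == 0], rare_up])
    y_bal = np.array([0] * n_abundant + [1] * len(rare_up))

    model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
    scores.append(f1_score(y_val, model.predict(X_val)))  # score on untouched data

print("Mean F1:", np.mean(scores))
```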
4. Ensemble different resampled datasets
The easiest way to make a model generalize is to use more data. The problem is that out-of-the-box classifiers like logistic regression or random forest tend to generalize by discarding the rare class. One effective best practice is to build n models that each use all the samples of the rare class plus n differing samples of the abundant class. If you wanted to ensemble 10 models, you would keep, e.g., the 1,000 cases of the rare class and randomly sample 10,000 cases of the abundant class, then split the 10,000 cases into 10 chunks and train 10 different models. Although ensemble models are usually harder to train, they generalize better, are less sensitive to overfitting, and can be scaled horizontally by training each model on a different worker in a cluster.
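A minimal sketch of this ensemble, assuming a simple majority vote over the per-chunk models (all names and the base classifier are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# ~1,000 rare cases and ~10,000 abundant cases
X, y = make_classification(n_samples=11000, weights=[10 / 11], random_state=0)
X_rare, X_abun = X[y == 1], X[y == 0]

# One model per chunk of the abundant class, always with all rare samples
n_models = 10
chunks = np.array_split(np.random.default_rng(0).permutation(len(X_abun)), n_models)
models = []
for idx in chunks:
    X_train = np.vstack([X_rare, X_abun[idx]])
    y_train = np.concatenate([np.ones(len(X_rare)), np.zeros(len(idx))])
    models.append(DecisionTreeClassifier(random_state=0).fit(X_train, y_train))

def predict(X_new):
    """Majority vote across all trained models."""
    votes = np.stack([m.predict(X_new) for m in models])
    return (votes.mean(axis=0) >= 0.5).astype(int)
```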
5. Resample with different ratios
The previous approach can be fine-tuned by playing with the ratio between the rare and the abundant class. The best ratio depends heavily on your data and the model you use, but instead of training every model in the ensemble with the same ratio, it is worth trying different ones. So if you train 10 models, it might make sense to have one with a ratio of 1:1 (rare:abundant), another with 1:3, or even 2:1. Depending on the model used, this influences how much weight one class gets versus the other.
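A short self-contained sketch of training models at different ratios (the 1:1, 1:2, and 1:3 choices are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, weights=[0.9], random_state=0)
X_rare, X_abun = X[y == 1], X[y == 0]

models = []
for ratio, seed in [(1, 0), (2, 1), (3, 2)]:  # rare:abundant of 1:1, 1:2, 1:3
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X_abun), size=ratio * len(X_rare), replace=False)
    X_train = np.vstack([X_rare, X_abun[idx]])
    y_train = np.concatenate([np.ones(len(X_rare)), np.zeros(len(idx))])
    models.append(DecisionTreeClassifier(random_state=0).fit(X_train, y_train))
```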
6. Cluster the abundant class
Instead of relying on random samples to cover the variety of the training samples, cluster the abundant class into r groups, with r being the number of cases in the rare class. For each group, only the medoid (the centre of the cluster) is kept. The model is then trained on the rare class and the medoids only.
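A minimal sketch that approximates each medoid by the real sample closest to its k-means centroid (scikit-learn has no built-in k-medoids, so this substitution is an assumption):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.metrics import pairwise_distances_argmin

X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
X_rare, X_abun = X[y == 1], X[y == 0]

# Cluster the abundant class into r groups, r = number of rare cases
r = len(X_rare)
km = KMeans(n_clusters=r, n_init=10, random_state=0).fit(X_abun)

# Keep only the real sample nearest to each centroid (an approximate medoid)
medoid_idx = pairwise_distances_argmin(km.cluster_centers_, X_abun)
X_medoids = X_abun[medoid_idx]

# Train on the rare class plus the medoids only
X_train = np.vstack([X_rare, X_medoids])
y_train = np.concatenate([np.ones(len(X_rare)), np.zeros(len(X_medoids))])
```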
7. Design your models
All the previous techniques focus on the data and keep the model fixed, but there is no need to resample at all if the model itself is suited to imbalanced data. XGBoost, for example, is a good starting point because it internally takes care that the bags it trains on are not too skewed (the data is still resampled, it just happens under the hood). More generally, by designing a cost function that penalizes wrong classifications of the rare class more heavily than wrong classifications of the abundant class, you can make many models naturally generalize in favour of the rare class. For example, tweak an SVM to penalize misclassifications of the rare class by the same ratio by which that class is underrepresented.
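A minimal sketch using scikit-learn's class_weight to penalize rare-class errors in proportion to the imbalance (the 19:1 ratio matches the toy data below and is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Toy data: class 1 is ~5% of samples, i.e., underrepresented roughly 19:1
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

# Penalize misclassifying the rare class 19x more than the abundant class
model = SVC(class_weight={0: 1, 1: 19}).fit(X, y)

# Equivalent shortcut: SVC(class_weight="balanced") sets weights
# inversely proportional to class frequencies automatically.
```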
Summary
This list is not comprehensive and should only be used as a starting point, but it's a great place to begin if you're having trouble with imbalanced data. There is no single best approach that applies to all problems, so try different techniques and models to see what works best for you. Be creative when combining approaches, and don't forget that in many industries (e.g., fraud detection, real-time bidding) the rules change as time goes on, so verify that your past data still reflects reality.