The Suitable Way to Handle Imbalanced Datasets in an Organization
Tips for handling imbalanced datasets.
Resample the training set
Apart from using different evaluation criteria, one can also work on getting a different dataset. Two approaches to making a balanced dataset out of an imbalanced one are under-sampling and over-sampling.
Under-sampling balances the dataset by reducing the size of the abundant class. This method is used when the quantity of data is sufficient. By keeping all samples in the rare class and randomly selecting an equal number of samples in the abundant class, a balanced new dataset can be retrieved for further modelling.
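As a minimal sketch of this idea (standard library only, with an illustrative toy dataset), random under-sampling keeps every rare sample and draws an equal-sized random subset of the abundant class:

```python
import random

random.seed(0)

# Toy labelled dataset: 3 rare and 12 abundant samples.
# The (label, feature) pairs and class names are illustrative only.
rare = [("rare", i) for i in range(3)]
abundant = [("abundant", i) for i in range(12)]

# Keep all rare samples; randomly select an equal number of abundant ones.
undersampled = rare + random.sample(abundant, k=len(rare))
random.shuffle(undersampled)

print(len(undersampled))  # 6 samples, 3 per class
```

In practice the same technique is packaged, for example, as `RandomUnderSampler` in the imbalanced-learn library.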
On the contrary, over-sampling is used when the quantity of data is insufficient. It tries to balance the dataset by increasing the number of rare samples. Rather than getting rid of abundant samples, new rare samples are generated by using, e.g., repetition, bootstrapping or SMOTE (Synthetic Minority Over-Sampling Technique).
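The simplest of these variants, repetition via bootstrapping, can be sketched with the standard library alone (toy data, illustrative names); SMOTE, which synthesizes new points rather than repeating existing ones, is available in the imbalanced-learn package:

```python
import random

random.seed(0)

# Toy labelled dataset: 3 rare and 12 abundant samples.
rare = [("rare", i) for i in range(3)]
abundant = [("abundant", i) for i in range(12)]

# Bootstrap the rare class: draw with replacement until it matches
# the abundant class in size (simple repetition-style over-sampling).
oversampled_rare = random.choices(rare, k=len(abundant))
balanced = abundant + oversampled_rare

print(len(balanced))  # 24 samples, 12 per class
```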
Use K-fold Cross-Validation in the right way
It is noteworthy that cross-validation should be applied properly when using an over-sampling method to address class-imbalance problems. Keep in mind that over-sampling takes observed rare samples and applies bootstrapping to generate new random data based on a distribution function. If cross-validation is applied after over-sampling, basically what we are doing is overfitting our model to a specific artificial bootstrapping result. That is why cross-validation should always be done before over-sampling the data, just as feature selection should be. Only by repeatedly resampling the data, inside each fold, can randomness be introduced into the dataset to make sure that there won't be an overfitting problem.
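The rule above can be sketched as follows: split into folds first, then over-sample only the training portion of each fold, so the held-out fold never contains bootstrapped copies of its own samples (toy data and a naive k-fold split, standard library only; in real projects a stratified splitter such as scikit-learn's `StratifiedKFold` would be preferable):

```python
import random

random.seed(0)

# Toy dataset: 10 rare and 40 abundant samples (labels are illustrative).
data = [("rare", i) for i in range(10)] + [("abundant", i) for i in range(40)]
random.shuffle(data)

k = 5
folds = [data[i::k] for i in range(k)]  # simple k-fold split of the raw data

fold_stats = []
for i in range(k):
    held_out = folds[i]
    train = [s for j in range(k) if j != i for s in folds[j]]
    # Over-sample ONLY the training portion of this fold.
    rare_train = [s for s in train if s[0] == "rare"]
    abundant_train = [s for s in train if s[0] == "abundant"]
    if rare_train:  # guard against a fold swallowing every rare sample
        rare_train = rare_train + random.choices(
            rare_train, k=len(abundant_train) - len(rare_train)
        )
    balanced_train = abundant_train + rare_train
    fold_stats.append((len(held_out), len(balanced_train)))
    # ... fit a model on balanced_train, evaluate on held_out ...
```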
Ensemble different resampled datasets
The easiest way to successfully generalize a model is to use more data. The problem is that out-of-the-box classifiers like logistic regression or random forest tend to generalize by discarding the rare class. One easy best practice is to build n models that each use all the samples of the rare class and a different sample of the abundant class. Given that you want to ensemble 10 models, you would keep, e.g., the 1,000 cases of the rare class and randomly sample 10,000 cases of the abundant class. Then you just split the 10,000 cases into 10 chunks and train 10 different models. This approach is simple and perfectly horizontally scalable if you have a lot of data, since you can train and run the models on different cluster nodes. Ensemble models also tend to generalize better, which makes this approach easy to work with.
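The construction of those n training sets can be sketched as follows (standard library only; integers stand in for actual samples, and the counts are scaled down from the 1,000/10,000 example for brevity):

```python
import random

random.seed(0)

rare = list(range(100))        # stand-ins for the rare-class cases
abundant = list(range(1000))   # stand-ins for the abundant-class cases

n_models = 10

# Draw n_models * len(rare) abundant cases without replacement,
# then cut them into disjoint chunks, one per model.
sampled = random.sample(abundant, k=n_models * len(rare))
chunks = [sampled[i * len(rare):(i + 1) * len(rare)] for i in range(n_models)]

# Each model trains on ALL rare samples plus its own abundant chunk.
training_sets = [rare + chunk for chunk in chunks]
```

At prediction time the 10 models would be combined, e.g. by majority vote or by averaging predicted probabilities.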
Resample with different ratios
The previous approach can be fine-tuned by playing with the ratio between the rare and the abundant class. The best ratio depends heavily on the data and the models that are used. Instead of training all models in the ensemble with the same ratio, it is worth trying different ratios. So if 10 models are trained, it might make sense to have one model with a ratio of 1:1 (rare:abundant) and another with 1:3, or even 2:1. Depending on the model used, this can influence the weight that one class gets.
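Building one training set per ratio can be sketched like this (standard library only; the ratios and counts are illustrative, reusing the 1:1, 1:3 and 2:1 examples above):

```python
import random

random.seed(0)

rare = list(range(100))        # stand-ins for the rare-class cases
abundant = list(range(1000))   # stand-ins for the abundant-class cases

# One ensemble member per (rare, abundant) ratio.
ratios = [(1, 1), (1, 3), (2, 1)]

training_sets = []
for r, a in ratios:
    # Keep all rare samples; pick enough abundant ones to hit the ratio.
    n_abundant = len(rare) * a // r
    subset = random.sample(abundant, k=min(n_abundant, len(abundant)))
    training_sets.append(rare + subset)

print([len(t) for t in training_sets])  # [200, 400, 150]
```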