ROC curves are a common way to evaluate the performance of binary classification models. However, when dealing with imbalanced datasets, they can give over-optimistic and not especially meaningful results.
A brief overview of ROC and Precision-Recall curves: both plot classification metrics against each other across a range of decision thresholds. We commonly measure the area under the curve (AUC) to give us an indication of the model's performance. Follow the links to learn more about ROC and Precision-Recall curves.
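As a minimal sketch of how these curves are produced, scikit-learn sweeps the decision threshold for us; the labels and scores below are toy placeholders rather than anything from the fraud model:

```python
import numpy as np
from sklearn.metrics import roc_curve, precision_recall_curve, auc

# Toy labels and predicted probabilities purely for illustration;
# in practice these come from a trained model's predictions.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 1])
y_scores = np.array([0.05, 0.1, 0.2, 0.3, 0.3, 0.4, 0.55, 0.7, 0.8, 0.9])

# Each curve is built by sweeping the decision threshold over the scores
fpr, tpr, _ = roc_curve(y_true, y_scores)
precision, recall, _ = precision_recall_curve(y_true, y_scores)

print("AUC-ROC:", auc(fpr, tpr))
print("AUC-PR:", auc(recall, precision))
```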
To illustrate how ROC curves can be over-optimistic, I have built a classification model on a credit card fraud dataset taken from Kaggle. The dataset comprises 284,807 transactions, of which only 492 (around 0.17%) are fraudulent.
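If you want to follow along, the sketch below is one way to set this up. The file name, the "Class" target column, and the choice of a random forest are my assumptions here; the exact model used for the charts may differ (see the notebook linked at the end).

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumes the Kaggle CSV is saved locally as creditcard.csv with a binary
# "Class" column (1 = fraud). Model choice is illustrative only.
df = pd.read_csv("creditcard.csv")
X, y = df.drop(columns="Class"), df["Class"]

# Stratify so the 492 fraud cases appear in both splits at the same rate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Probability of fraud for the held-out transactions, used for both curves
y_scores = model.predict_proba(X_test)[:, 1]
```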
Note: The data is free to use for commercial and non-commercial purposes without permission, as outlined in the Open Data Commons license attributed to the data.
Upon examining the ROC curve, we might be led to believe the model performs better than it actually does, since the area under this curve is 0.97. As we have previously seen, the false positive rate can be overly optimistic for imbalanced classification problems: with roughly 284,000 legitimate transactions, even a few thousand false alarms barely move the false positive rate, despite swamping the 492 genuine fraud cases.
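To make that intuition concrete, here is a hypothetical operating point on this dataset; the true and false positive counts are illustrative, not the model's actual confusion matrix:

```python
# 284,807 transactions, of which 492 are fraudulent
negatives = 284_807 - 492      # legitimate transactions
positives = 492                # fraudulent transactions

true_positives = 443           # suppose we catch ~90% of the fraud
false_positives = 2_843        # while wrongly flagging ~1% of legitimate transactions

fpr = false_positives / negatives                                 # ~0.01, looks excellent
precision = true_positives / (true_positives + false_positives)   # ~0.13, far less flattering

print(f"FPR: {fpr:.3f}, precision: {precision:.3f}")
```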
A more informative approach is to utilise the precision-recall curve, which provides a far more robust estimate of our model's performance. Here we can see the area under the precision-recall curve (AUC-PR) is much more conservative at 0.71.
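Continuing from the training sketch above (reusing y_test and y_scores), one way to compare the two views of the same predictions is shown below. Note that average precision is a close relative of the trapezoidal AUC-PR, so the exact numbers may differ slightly from the charts.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import (
    RocCurveDisplay,
    PrecisionRecallDisplay,
    roc_auc_score,
    average_precision_score,
)

# AUC-ROC looks flattering on the imbalanced test set;
# average precision tells a more conservative story
print("AUC-ROC:", roc_auc_score(y_test, y_scores))
print("AUC-PR (average precision):", average_precision_score(y_test, y_scores))

# Plot both curves from the same predictions
RocCurveDisplay.from_predictions(y_test, y_scores)
PrecisionRecallDisplay.from_predictions(y_test, y_scores)
plt.show()
```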
Taking a balanced version of the dataset, where fraudulent and non-fraudulent transactions are split 50:50, we can see that the AUC-ROC and AUC-PR are much closer together.
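The article does not specify how the balanced version was constructed, but random undersampling of the majority class is one simple way to get there, reusing the df loaded earlier:

```python
import pandas as pd

# Keep every fraud case and sample an equal number of legitimate transactions
fraud = df[df["Class"] == 1]
legit = df[df["Class"] == 0].sample(n=len(fraud), random_state=42)

# Concatenate and shuffle to produce a 50:50 dataset
balanced = pd.concat([fraud, legit]).sample(frac=1, random_state=42)
print(balanced["Class"].value_counts())
```

Retraining and re-evaluating on this balanced data is what brings the two AUC figures closer together.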
The notebook for generating these charts is available in my GitHub repo.
There are ways to improve the performance of classification models on imbalanced datasets; I explore these in my article on synthetic data.