Performance Metrics For Binary Classifiers

Discussion On ROC Metric

Both of the core projects that drove most of AIBMod’s research were binary classification problems.

In the case of the Credit Loan model, the task was to predict DEFAULT or NO_DEFAULT.

For the Credit Card Fraud Detection model, it was to predict FRAUD or NOT_FRAUD.

In each case, the model actually predicts a probability for each of the two classes, with the two probabilities summing to 1.0.
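
As a minimal illustration (this is not AIBMod's actual model code), a single sigmoid output is enough to produce both class probabilities, since one is simply the complement of the other:

```python
import numpy as np

def class_probabilities(logit):
    """Turn a raw model score (logit) into the two class probabilities.

    A single sigmoid output gives p(class 1); p(class 0) is its complement,
    so the two probabilities always sum to 1.0.
    """
    p_class1 = 1.0 / (1.0 + np.exp(-logit))
    return 1.0 - p_class1, p_class1

p0, p1 = class_probabilities(0.8)   # hypothetical raw score
print(p0, p1, p0 + p1)              # the pair sums to 1.0
```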

In both of the core problems, the classes were highly imbalanced: roughly 8.1% of the Credit Loan dataset were default examples, and the Credit Card Fraud Detection dataset was even more skewed, with roughly 3.5% of its examples being fraudulent.

Interestingly, both of the above problems were judged with the AUC ROC (“ROC”) performance metric, which is a ranking statistic.

An ROC metric typically ranges between 50% (random class separation, the expected score from an untrained model or from arbitrarily guessing the class) and 100% (perfect class separation), so the higher the score the better. An ROC score of less than 50% usually means the classes have effectively been swapped around in the model.

The ROC has an important statistical property – the ROC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance.
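
This equivalence can be checked empirically. The sketch below uses purely hypothetical labels and probabilities and compares scikit-learn's AUC ROC against the fraction of positive / negative pairs in which the positive instance is ranked higher:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical labels and predicted class-1 probabilities, for illustration only.
y_true = rng.integers(0, 2, size=2000)
y_prob = np.clip(0.35 + 0.2 * y_true + rng.normal(0.0, 0.15, size=2000), 0.0, 1.0)

auc = roc_auc_score(y_true, y_prob)

# Empirical check of the ranking interpretation: the fraction of
# (positive, negative) pairs in which the positive instance receives the
# higher probability (ties counted as half).
pos = y_prob[y_true == 1]
neg = y_prob[y_true == 0]
wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
pairwise = wins / (len(pos) * len(neg))

print(auc, pairwise)   # the two numbers agree
```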

So the respective model would be applied to the relevant validation / test dataset and the probabilities of each class would be determined for each loan application / credit card transaction. These probabilities would then be compared to the actual classes and the ROC statistic calculated.

The ROC metric, whilst it is good at testing the relative order of the predicted probabilities against each other, is independent of the scale of the probabilities themselves. For example, if the model generated a set of probabilities for class 1 that ranged between 0% and 50%, and each predicted probability were then multiplied by 2.0 so that they ranged between 0% and 100%, the ROC score would be identical for both sets of probabilities.
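
The sketch below (again with hypothetical data) illustrates this scale independence: rescaling the predicted probabilities leaves the AUC ROC unchanged because the ranking of the predictions is preserved.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

# Hypothetical labels, and class-1 probabilities confined to the 0%-50% range.
y_true = rng.integers(0, 2, size=1000)
p_narrow = np.clip(0.20 + 0.10 * y_true + rng.normal(0.0, 0.08, size=1000), 0.0, 0.5)

# Multiply every probability by 2.0 so the range becomes 0%-100%.
# The relative ordering of the predictions is unchanged, so the AUC ROC is identical.
p_wide = 2.0 * p_narrow

print(roc_auc_score(y_true, p_narrow))
print(roc_auc_score(y_true, p_wide))    # same value as above
```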

So, if all one cared about was the ranking of predictions on a basket of ‘data’, the ROC metric would be sufficient.

But most users probably would like more ‘feedback’.

For instance, if the model gives a probability of default on a loan of 10%, it would be beneficial if that were a ‘real life’ expectation of default. The model could then be used for pricing the interest that would be appropriate for the loan. The model could also be used to ascertain the riskiness of loans already made: the interest on the originated loan might be 5%, but the model now says that the loan interest should be 3%, meaning the risk in the loan has reduced. Conversely, a loan originated at 5% may now be pricing at 10%, meaning there has been a deterioration in the creditworthiness of the loan holder and intervention / reserving may be appropriate.

Furthermore, having a model where the probabilities could be considered as ‘real life’ expectations is essential for any type of feature importance analysis. If the feature analysis is determining changes in probabilities that are in no way related to ‘real life’ then the model is merely ascertaining sensitivities to ‘non real world’ outputs.

Discussion On Logloss Metric

As the ability to create feature importance analysis was a high priority for AIBMod, it was determined very early in the development process that cross entropy logloss (“Logloss”) would be as important as ROC.

Logloss is a loss metric that compares the model-generated probability for each prediction to the actual class: the negative log of the probability assigned to the actual class is calculated and then averaged across the validation / test dataset. Interestingly, this statistic is usually the training loss used by deep learning models for classification problems, namely the loss that the model is trying to minimise.
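
A minimal sketch of the calculation, using hypothetical classes and probabilities, comparing a manual implementation against scikit-learn's log_loss:

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical actual classes and predicted class-1 probabilities.
y_true = np.array([1, 0, 0, 1, 0])
y_prob = np.array([0.80, 0.10, 0.35, 0.55, 0.05])

# Negative log of the probability assigned to the actual class,
# averaged across the dataset.
manual = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

print(manual, log_loss(y_true, y_prob))   # the two values agree
```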

During the very early days of development, AIBMod noticed that it was possible to train a model which had a gently improving validation ROC (i.e. as the training epochs increased the validation ROC would gradually increase) yet whose validation Logloss would begin to overtrain quite quickly.

This was in contrast to gradient boosted trees, where the Logloss and ROC would both tend to validate well without overtraining; a good validation ROC score would go hand in hand with a good Logloss result.

AIBMod gradually came to understand which hyperparameters in the deep learning model it was developing were helpful for balanced and stable ROC scores, and which were good for balanced and stable Logloss scores; this resulted in a model that was overall very stable and well regularised.

Eventually, AIBMod only considered a model to be better than another if both the ROC and Logloss validation scores improved; a model that improved on just one of these two metrics was not considered better.

Part of the art of the training process was to build a model whose optimal validation ROC and Logloss scores occurred at the same training epoch, so that a single trained model was optimal, or very close to optimal, for both of these validation metrics at once. This took considerable time to master.
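
Purely as an illustration of the selection rule described above (this is not AIBMod's training code, and the per-epoch scores are hypothetical), a candidate checkpoint might only be accepted when both validation metrics improve:

```python
def is_improvement(prev_roc, prev_logloss, new_roc, new_logloss):
    """A candidate model only counts as better when the validation ROC
    increases AND the validation Logloss decreases; improving just one
    of the two metrics is not enough."""
    return new_roc > prev_roc and new_logloss < prev_logloss

# Hypothetical per-epoch validation scores, for illustration only.
history = [
    (0.78, 0.31),   # (ROC, Logloss) at epoch 1
    (0.80, 0.29),   # both improve -> accept
    (0.81, 0.30),   # ROC improves but Logloss worsens -> reject
    (0.82, 0.28),   # both improve -> accept
]

best_roc, best_logloss = history[0]
best_epoch = 1
for epoch, (roc, logloss) in enumerate(history[1:], start=2):
    if is_improvement(best_roc, best_logloss, roc, logloss):
        best_roc, best_logloss, best_epoch = roc, logloss, epoch

print(best_epoch, best_roc, best_logloss)   # epoch 4, 0.82, 0.28
```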

Discussion On Accuracy Metric

In the last quarter of 2023, when AIBMod had completed its research on the binary classifier deep learning model, it wanted to ascertain the state of deep learning research on tabular data. There were a couple of papers that looked relevant, but the performance metric they used was accuracy. These papers are discussed here.

To be more precise, one paper used accuracy and the other used balanced accuracy.

This choice was based on the principle that most research papers that had previously analysed the datasets had used accuracy as the performance metric.

One of the major problems with accuracy is its overly simplistic, black or white, nature.

In a binary class problem, if the predicted probability is over 50% the prediction is assigned to class 1, otherwise to class 0. These predicted classes are then compared to the actual classes and the number of correct classifications is divided by the total number of predictions.
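
A minimal sketch of the calculation, with hypothetical probabilities and classes:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical predicted class-1 probabilities and actual classes.
y_prob = np.array([0.51, 0.49, 0.90, 0.10, 0.60])
y_true = np.array([1, 1, 1, 0, 0])

# Threshold at 50%: over 50% -> class 1, otherwise class 0.
y_pred = (y_prob > 0.5).astype(int)

# Number of correct classifications divided by the total number of predictions.
manual = (y_pred == y_true).mean()
print(manual, accuracy_score(y_true, y_pred))   # both 0.6
```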

This is a very simplistic measure – consider the following situations:

  1. If a model prediction is 51% but the real probability should be 90%, an accuracy-based metric would give it full recognition, whereas a Logloss metric would penalise the 51% prediction much more heavily, and an ROC statistic would probably penalise it as well if it fell out of its correct order relative to the other predictions.
  2. If a model prediction was 49.9%, the raw classification would be class 0. However, if the actual class was class 1 with a real world probability of 51%, the accuracy metric would give no recognition at all, whereas Logloss would treat it only marginally worse than a prediction just above 50% (both situations are illustrated in the sketch after this list).
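
A small numeric sketch of the two situations above (the probabilities are hypothetical):

```python
import numpy as np

def logloss_single(actual_class, p_class1):
    """Logloss contribution of a single prediction, where p_class1 is the
    predicted probability of class 1."""
    return -(actual_class * np.log(p_class1) + (1 - actual_class) * np.log(1 - p_class1))

# Situation 1: for a class-1 example, accuracy treats a 51% prediction and a
# 90% prediction as equally correct; Logloss penalises the 51% prediction far more.
print(logloss_single(1, 0.51), logloss_single(1, 0.90))    # ~0.673 vs ~0.105

# Situation 2: for a class-1 example, a 49.9% prediction is a complete miss
# under accuracy, yet its Logloss is only marginally larger than that of a
# 51% prediction.
print(logloss_single(1, 0.499), logloss_single(1, 0.51))   # ~0.695 vs ~0.673
```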