Pitfalls of Comparing Model Results
Discussion on Comparing Model Results
Most of the development time was spent comparing the results of the AIBMod model with those of models published by winning teams on Kaggle.
AIBMod quickly learned that the only way to compare two different models meaningfully was to ensure that the training and validation / test splits were identical in both. When the training data is shuffled differently, comparing K-fold cross-validation results on a fold-by-fold basis is almost meaningless, and even statistics aggregated across all the folds can show discrepancies between differently shuffled training data sets.
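For fold-level comparisons to be meaningful, both pipelines must be driven by exactly the same fold assignments. The sketch below illustrates the general idea using scikit-learn's KFold and placeholder data; it is not AIBMod's actual implementation.

```python
# A minimal sketch (placeholder data) of sharing identical fold assignments
# between two pipelines so that fold-level metrics can be compared directly.
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # placeholder features
y = rng.normal(size=200)               # placeholder target

# Fixing shuffle=True with the same random_state reproduces identical folds,
# so both models are trained and scored on exactly the same indices.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
shared_folds = list(kf.split(X))

for fold_id, (train_idx, valid_idx) in enumerate(shared_folds):
    X_tr, X_va = X[train_idx], X[valid_idx]
    y_tr, y_va = y[train_idx], y[valid_idx]
    # ... fit model A and model B on X_tr / y_tr and score both on X_va / y_va ...
```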
Below is a list of other issues that have been encountered whilst comparing models published on Kaggle:
- When creating dummy variables from categorical features, the training, validation and test datasets would be merged first. This creates the potential for information leakage between these datasets (the first sketch after this list illustrates the pattern);
- Manually engineered features, such as count features or averages, would likewise be created by joining the training, validation and test datasets together, which again creates an opportunity for information leakage;
- Some teams have used a technique called semi-supervised learning, whereby predictions are first made on the test set. The test set, with these predictions attached as labels, is then combined with the training data and everything is retrained together before the final predictions are produced. This technique can reduce variance, producing more stable models (the second sketch after this list illustrates the idea).
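The first two items describe the same underlying pattern: an encoding or statistic is derived from the combined training and test data, so information about the test set flows back into training. The sketch below is a hypothetical pandas illustration (the column names and data are placeholders, not taken from any particular Kaggle kernel) contrasting the leaky merge with a leak-free version in which the encoding is derived from the training data alone.

```python
# Hypothetical illustration of the leakage pattern in the first two bullets,
# contrasted with a leak-free alternative. Column names are placeholders.
import pandas as pd

train = pd.DataFrame({"city": ["A", "A", "B"], "y": [1, 0, 1]})
test = pd.DataFrame({"city": ["B", "C"]})

# --- Leaky: encodings are derived from train + test combined ------------------
combined = pd.concat([train.drop(columns="y"), test], ignore_index=True)
leaky_dummies = pd.get_dummies(combined["city"])               # test categories leak in
leaky_counts = combined["city"].map(combined["city"].value_counts())

# --- Leak-free: encodings are derived from the training data only -------------
city_dtype = pd.CategoricalDtype(categories=train["city"].astype("category").cat.categories)
train_dummies = pd.get_dummies(train["city"].astype(city_dtype))
test_dummies = pd.get_dummies(test["city"].astype(city_dtype))  # unseen "C" becomes all zeros

train_counts = train["city"].value_counts()
train["city_count"] = train["city"].map(train_counts)
test["city_count"] = test["city"].map(train_counts).fillna(0)   # unseen categories get 0
```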
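The third item, often called pseudo-labelling, is only possible because the test features are available up front. A rough sketch of the idea is shown below; GradientBoostingRegressor and the data are stand-ins, not the models actually used by winning teams.

```python
# Rough sketch of the semi-supervised / pseudo-labelling step described above.
# The estimator and data are placeholders for illustration only.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
X_train, y_train = rng.normal(size=(100, 4)), rng.normal(size=100)
X_test = rng.normal(size=(50, 4))          # test features available, targets are not

# Step 1: fit on the labelled training data and predict the unlabelled test set.
base = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
pseudo_labels = base.predict(X_test)

# Step 2: append the test set with its pseudo-labels to the training data
# and refit before producing the final submission predictions.
X_aug = np.vstack([X_train, X_test])
y_aug = np.concatenate([y_train, pseudo_labels])
final = GradientBoostingRegressor(random_state=0).fit(X_aug, y_aug)
final_predictions = final.predict(X_test)
```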
The three cases above highlight a fundamental problem with Kaggle competitions: because the test data (albeit without the values to be predicted) is available, it is possible to incorporate aspects of the test data within the training process. In real life this would not be possible, since the point of a model is to make predictions on data that has never been seen before.
The techniques and approaches used within Kaggle competitions may be useful for winning such competitions, but they are not necessarily the most useful in real-life scenarios.
When developing its model framework, AIBMod took great care to ensure that there was no information leakage whatsoever between the training data and validation / test data sets. Any transformations were derived only by looking at the training data and these were then applied to the validation / test data.
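As a minimal sketch of this fit-on-training-only convention (using a scikit-learn StandardScaler purely as an example transformation; AIBMod's actual transformations are not reproduced here):

```python
# Minimal sketch of the fit-on-training-only convention described above.
# StandardScaler is an example transformation, not AIBMod's actual code.
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X_train = rng.normal(loc=5.0, size=(100, 3))
X_valid = rng.normal(loc=5.0, size=(30, 3))
X_test = rng.normal(loc=5.0, size=(30, 3))

scaler = StandardScaler().fit(X_train)   # parameters derived from training data only
X_train_s = scaler.transform(X_train)
X_valid_s = scaler.transform(X_valid)    # the same fitted parameters are reused,
X_test_s = scaler.transform(X_test)      # so no validation / test information leaks back
```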
Hence, when benchmarking Kaggle models against the AIBMod model, any shuffling of the training data was first synchronised between the two (i.e. the shuffling seeds were matched, along with the shuffling functions themselves), and any information leakage in the Kaggle model was removed. Where the Kaggle model used feature engineering, this was still applied to the Kaggle model (not to the AIBMod model), but only in a way that allowed no information leakage between the training and validation data.
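As a concrete, hypothetical illustration (the original benchmarking code is not reproduced here), synchronising the shuffle amounts to calling the same split function with the same seed in both pipelines:

```python
# Hypothetical illustration of synchronising the shuffle between two pipelines:
# the same split function with the same seed yields identical train / validation sets.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X, y = rng.normal(size=(500, 6)), rng.normal(size=500)

SEED = 2024  # placeholder seed; the point is that both pipelines share it

# Split used for the Kaggle-style pipeline ...
X_tr_a, X_va_a, y_tr_a, y_va_a = train_test_split(X, y, test_size=0.2, random_state=SEED)
# ... and reproduced unchanged for the AIBMod pipeline, so the splits are identical.
X_tr_b, X_va_b, y_tr_b, y_va_b = train_test_split(X, y, test_size=0.2, random_state=SEED)

assert np.array_equal(X_va_a, X_va_b)
```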
Therefore, models have been compared on a like-for-like basis and with no information leakage from the validation data.
