How to know if you are actually overfitting
An explanation based on my experience in a Fraud Detection competition
Understanding overfitting
There are usually two reasons why the training score and the test score differ:
- Overfitting
- Out-of-domain data, i.e., the test and train data come from different times, from different clients, etc.
There is a nice trick to see what is causing the difference in scores between the training and the test data: computing OOF (out-of-fold) scores. The OOF score is essentially a score on unseen data that still lies within the training-data domain, so it controls for the effect of "out-of-domain" data (a minimal code sketch follows the list below). So:
- OOF (out-of-fold) score < Train score ==> overfitting
- Test score < OOF score ==> out-of-domain data
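To make the comparison concrete, here is a minimal sketch of how the three scores can be computed. It assumes numpy arrays `X_train`, `y_train`, `X_test`, `y_test` and a generic scikit-learn classifier; the actual competition model and features are not shown in this post, and labeled test data is assumed purely for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def diagnose(X_train, y_train, X_test, y_test, n_splits=5):
    """Compare train, OOF, and test AUC to separate overfitting
    from out-of-domain effects (illustrative sketch)."""
    oof_pred = np.zeros(len(X_train))
    test_pred = np.zeros(len(X_test))

    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    for trn_idx, val_idx in skf.split(X_train, y_train):
        model = GradientBoostingClassifier()
        model.fit(X_train[trn_idx], y_train[trn_idx])
        # OOF predictions: unseen rows, but still from the training domain
        oof_pred[val_idx] = model.predict_proba(X_train[val_idx])[:, 1]
        # Average the fold models' predictions on the test set
        test_pred += model.predict_proba(X_test)[:, 1] / n_splits

    # Train score: fit on all training rows and score those same rows
    full_model = GradientBoostingClassifier().fit(X_train, y_train)
    train_auc = roc_auc_score(y_train, full_model.predict_proba(X_train)[:, 1])
    oof_auc = roc_auc_score(y_train, oof_pred)
    test_auc = roc_auc_score(y_test, test_pred)

    print(f"train {train_auc:.4f} | OOF {oof_auc:.4f} | test {test_auc:.4f}")
    if oof_auc < train_auc:
        print("OOF < train  ==> overfitting")
    if test_auc < oof_auc:
        print("test < OOF   ==> likely out-of-domain test data")
    return train_auc, oof_auc, test_auc
```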
Example of overfitting only
From the beginning I struggled mainly with overfitting rather than with the out-of-domain data issue. My OOF << Train (indicating overfitting) while Test > OOF (not indicating an "out-of-domain" issue):
| | train | OOF | test (public) |
|---|---|---|---|
| baseline AUC | 0.99 | 0.9292 | 0.9384 |
Example of both overfitting and an out-of-domain issue
In another case, where I accidentally turned some column values in the test dataset into NaNs, I saw the following. Here OOF << Train (indicating overfitting) and also Test << OOF (indicating an "out-of-domain" issue):
| | train | OOF | test (public) |
|---|---|---|---|
| AUC | 0.9971 | 0.9426 | 0.9043 |
Looking at the most important AV (adversarial validation) columns and probing into those columns allowed me to fix the issue.
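For readers unfamiliar with the term, adversarial validation means training a classifier to distinguish train rows from test rows and then inspecting which columns it relies on. Below is a minimal, hypothetical sketch of that idea, assuming pandas DataFrames `train` and `test` with identical numeric feature columns; the function name, model choice, and parameters are illustrative, not the ones used in the competition.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

def adversarial_validation(train: pd.DataFrame, test: pd.DataFrame, top_n=10):
    """Train a classifier to tell train rows from test rows and
    report the columns that separate the two domains (sketch)."""
    combined = pd.concat([train, test], ignore_index=True)
    origin = np.r_[np.zeros(len(train)), np.ones(len(test))]  # 0 = train, 1 = test

    clf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)

    # AUC near 0.5 means train and test look alike; a high AUC means
    # some columns make the test data recognisably "out of domain".
    pred = cross_val_predict(clf, combined, origin, cv=5,
                             method="predict_proba")[:, 1]
    print("adversarial AUC:", roc_auc_score(origin, pred))

    # Refit on all rows and rank the columns driving the separation
    clf.fit(combined, origin)
    importances = pd.Series(clf.feature_importances_, index=combined.columns)
    return importances.sort_values(ascending=False).head(top_n)
```

Columns that rank high here are the ones worth probing; in my case that is how the accidentally NaN-ed test columns showed up.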