I’ve done it many times myself — hitting run on some model training code and having a “WOW” moment when the error scoring comes out great. Suspiciously great. Digging through the feature engineering code reveals a calculation that baked future data into the training data, and fixing that feature pumps those mean squared errors back up to reality. Now where’s that whiteboard again…
Time series problems have a number of unique pitfalls. Luckily, with some diligence and a little practice, you’ll be accounting for these pitfalls long before typing from sklearn import into your notebook. Here are three things to look out for, and some scenarios where you might run into them.
Look-Ahead Bias
This one’s almost certainly the first hazard you’ll encounter with time series, and overwhelmingly the most frequent one I see in entry-level portfolios (looking at you, generic stock market forecasting project). The good news is that it’s generally the easiest to avoid.
The Problem: Simply put, look-ahead bias is when your model is trained using future data it would not have access to in reality.
The typical way you’d introduce this issue into your code is by randomly splitting your data into training and test sets of a predetermined ratio (e.g. 80/20). Random sampling means both your training and test data cover the same time period, so you’ll have “leaked” knowledge of the future into your model.
When it comes time to validate with the test data, the model already knows what happens. You’ll inevitably get some pretty stellar, yet bogus error scores this way.
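To see the leak concretely, here’s a minimal sketch using only the standard library. The dates, seed, and 80/20 ratio are illustrative placeholders, not anything from a real project — the point is just that a shuffled split mixes time periods:

```python
import random
from datetime import date, timedelta

# Hypothetical daily series covering roughly 2013-2023 (illustrative only).
dates = [date(2013, 1, 1) + timedelta(days=i) for i in range(365 * 11)]

# The tempting-but-wrong approach: shuffle, then carve off an 80/20 split.
random.seed(0)
shuffled = dates[:]
random.shuffle(shuffled)
cut = int(len(shuffled) * 0.8)
train, test = shuffled[:cut], shuffled[cut:]

# The leak: the training set contains dates later than the earliest test
# date, so the model is trained on the future it will be asked to predict.
print(max(train) > min(test))  # True -- future data has leaked into training
```

Any error score computed on that test set is measuring the model’s memory of the period it already saw, not its forecasting skill.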
The Fix: Split your dataset using a cutoff in time rather than holding out a percentage of the data.
For example, if I have data that covers 2013–2023, I might set 2013–2021 as my training data and 2022–2023 as my test data. In a simple use case, the test data then covers a time period the model is completely naive to, and your error scoring should be accurate. Remember, the same logic applies to cross-validation: standard k-fold shuffles observations across time, so reach for a time-aware splitter (such as scikit-learn’s TimeSeriesSplit) instead.
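The cutoff-based split described above can be sketched in a few lines of standard-library Python. The dates and the 2022 cutoff mirror the example in the text but are otherwise arbitrary:

```python
from datetime import date, timedelta

# The same hypothetical daily series, covering 2013 through 2023.
dates = [date(2013, 1, 1) + timedelta(days=i) for i in range(365 * 11)]

# Split on a calendar cutoff instead of a random percentage:
# everything before 2022 trains the model, 2022-2023 tests it.
cutoff = date(2022, 1, 1)
train = [d for d in dates if d < cutoff]
test = [d for d in dates if d >= cutoff]

# Every test date is now strictly in the model's future.
print(max(train) < min(test))  # True
```

The same idea generalizes to cross-validation: each fold’s validation window should sit entirely after its training window, which is exactly the expanding-window scheme TimeSeriesSplit implements.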