In time series analysis, it is valuable to understand whether one series influences another. For example, it is useful for commodity traders to know if an increase in commodity A leads to an increase in commodity B. Originally, this relationship was measured using linear regression. However, in 1974 Clive Granger and Paul Newbold showed that this approach can yield misleading results, particularly for non-stationary time series. Granger went on to develop the concept of cointegration in the 1980s, later together with Robert Engle, and this work earned him a share of the 2003 Nobel prize in economics. In this post, I want to discuss the need for cointegration, how it is applied, and why it is an important concept data scientists should understand.
Before we discuss cointegration, let’s discuss the need for it. Historically, statisticians and economists used linear regression to determine the relationship between different time series. However, Granger and Newbold showed that this approach is unreliable for non-stationary series and can produce something called spurious correlation.
A spurious correlation is where two time series look correlated but have no underlying causal relationship. It is the classic ‘correlation does not mean causation’ statement. It is dangerous because even standard statistical tests may well report a significant relationship where none exists.
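To make this concrete, here is a minimal simulation sketch of how a naive significance test can be fooled. It regresses one series on another, independently generated series many times and counts how often an ordinary t-test on the slope (using the usual |t| > 1.96 cutoff) declares a relationship. The function name `naive_rejection_rate`, the number of simulations, the series length, and the cutoff are all illustrative assumptions of mine, not something taken from Granger and Newbold's work.

```python
import numpy as np


def naive_rejection_rate(random_walk, n_sims=500, n_obs=200, seed=0):
    """Fraction of simulations in which a naive OLS t-test (|t| > 1.96)
    reports a significant slope between two INDEPENDENT series.

    random_walk=True  -> both series are random walks (non-stationary)
    random_walk=False -> both series are white noise (stationary)
    """
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        a = rng.standard_normal(n_obs)
        b = rng.standard_normal(n_obs)
        if random_walk:
            # Cumulative sums of noise give non-stationary random walks.
            a, b = np.cumsum(a), np.cumsum(b)

        # Ordinary least squares fit of b on a, with an intercept.
        X = np.column_stack([np.ones(n_obs), a])
        beta, *_ = np.linalg.lstsq(X, b, rcond=None)

        # Textbook standard error of the slope, which assumes
        # well-behaved (stationary, uncorrelated) residuals.
        resid = b - X @ beta
        sigma2 = resid @ resid / (n_obs - 2)
        cov = sigma2 * np.linalg.inv(X.T @ X)
        t_stat = beta[1] / np.sqrt(cov[1, 1])

        rejections += int(abs(t_stat) > 1.96)
    return rejections / n_sims


if __name__ == "__main__":
    print("white noise :", naive_rejection_rate(random_walk=False))
    print("random walks:", naive_rejection_rate(random_walk=True))
```

With stationary white noise, the test should flag a relationship only about as often as its nominal 5% error rate. With independent random walks, it flags one far more often, even though the series are unrelated by construction. That gap is the problem with applying plain regression to non-stationary data.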
An example of a spurious relationship is shown in the plots below:
Here we have two time series, A(t) and B(t), plotted as a function of time (left) and plotted against each other (right). Notice from the plot on the right that there is some correlation between the series, as shown by the regression line. However, looking at the left plot, we see this correlation is spurious because B(t) consistently increases while A(t) fluctuates erratically. Furthermore, the average distance between the two time series is also increasing…