17  Spurious Regression

17.4 I(1) process

When we are dealing with integrated processes of order one, there is an additional complication. Even if the two series have means that are not trending, a simple regression involving two independent \(\mathrm I(1)\) series will often result in a significant \(t\) statistic. To be more precise, let \(\{x_t\}\) and \(\{y_t\}\) be random walks generated by

\[ x_t = x_{t-1} + a_t, \; t=2,3,\ldots, \] and \[ y_t = y_{t-1} + e_t, \; t=2,3,\ldots, \] where \(\{a_t\}\) and \(\{e_t\}\) are iid innovations, with mean zero and variances \(\sigma^2_a\) and \(\sigma^2_e,\) respectively.

Assume further that \(\{a_t\}\) and \(\{e_t\}\) are independent processes. This implies that \(\{x_t\}\) and \(\{y_t\}\) are also independent. But if we run the simple regression

\[ y_t = \beta_0 + \beta_1 x_t + u_t \] and obtain the usual \(t\) statistic for \(\hat{\beta}_1\) and the usual R-squared, the results cannot be interpreted as if the variables are stationary (\(\mathrm I(0)\)).

  • The \(t\) statistic will be significant far more often than the nominal significance level suggests. \(\rightarrow\) This is called the spurious regression problem: \(y_t\) and \(x_t\) are independent, yet an OLS regression delivers a significant \(t\) statistic.

    For the \(t\) statistic of \(\hat{\beta}_1\) to have an approximate standard normal distribution in large samples, at a minimum, \(\{u_t\}\) should be a mean zero, serially uncorrelated process. But under \[ \mathrm H_0: \beta_1=0, \quad y_t = \beta_0 + u_t, \] this assumption is violated.

    To see this, assume the initial value \(y_1=0\). Then under \(\mathrm H_0\) we have \(\beta_0=0\) and \[ u_t = y_t = \sum_{j=2}^t e_j. \] In other words, \(\{u_t\}\) is a random walk under \(\mathrm H_0\), which clearly violates even the asymptotic version of the Gauss-Markov assumptions.

  • Including a time trend does NOT necessarily change the conclusion.

    If \(y_t\) or \(x_t\) is a random walk with drift and a time trend is NOT included, then the spurious regression problem is even worse.

  • The behavior of R-squared is nonstandard.

    In regressions with \(\mathrm I(0)\) time series variables, the R-squared converges in probability to the population R-squared: \(1-\sigma^2_u/\sigma^2_y.\)

    But in regressions with \(\mathrm I(1)\) processes, rather than the R-squared having a well-defined plim, it actually converges to a random variable. Formalizing this notion is well beyond the scope of this text.

    The implication is that the R-squared is large with high probability, even though \(y_t\) and \(x_t\) are independent time series processes.
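The rejection-rate claim above is easy to verify by Monte Carlo simulation. A minimal sketch in R (the sample size and number of replications are arbitrary choices):

```r
# Spurious regression by simulation: regress one random walk on an
# independent one and record how often the usual 5% t test on the
# slope rejects H0: beta_1 = 0.
set.seed(42)
n_reps <- 1000
n_obs <- 100
reject <- replicate(n_reps, {
    x <- cumsum(rnorm(n_obs)) # x_t = x_{t-1} + a_t
    y <- cumsum(rnorm(n_obs)) # y_t = y_{t-1} + e_t, independent of x
    t_stat <- coef(summary(lm(y ~ x)))["x", "t value"]
    abs(t_stat) > 1.96
})
mean(reject) # far above the nominal 0.05
```

With these settings the rejection rate is typically above 70%, even though the two series are independent by construction.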

The same considerations arise with multiple independent variables, each of which may be \(\mathrm I(1)\) or some of which may be \(\mathrm I(0)\). If \(y_t\) is \(\mathrm I(1)\) and at least some of the explanatory variables are \(\mathrm I(1)\), the regression results may be spurious.

This does not mean that regressions involving any \(\mathrm I(1)\) process are problematic. There are remedies for working with \(\mathrm I(1)\) processes:

  • The first difference of an \(\mathrm I(1)\) process is weakly dependent (and often stationary); the differenced regression can be augmented with a lagged levels term, giving the Error Correction Model.

    For example \[ \Delta y_t = \mu + \beta_1 \Delta x_t + \delta (y_{t-1} - \theta x_{t-1}) + \varepsilon_t, \]

    where \(\delta<0.\)

    See Section 18.4 for more details on error correction models.

  • Regressing an \(\mathrm I(1)\) dependent variable on an \(\mathrm I(1)\) independent variable can be informative, but only if these variables share a common stochastic trend; in that case the series are said to be cointegrated.
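The first-difference remedy can be illustrated with the same kind of simulation: differencing two independent random walks yields iid series, so the usual \(t\) test now rejects at roughly its nominal rate. A sketch, with arbitrary sample size and replication count:

```r
# First-differencing remedy: the differences of independent random walks
# are iid, so the t test on the slope has approximately correct size.
set.seed(7)
reject_diff <- replicate(1000, {
    x <- cumsum(rnorm(100))
    y <- cumsum(rnorm(100))
    t_stat <- coef(summary(lm(diff(y) ~ diff(x))))["diff(x)", "t value"]
    abs(t_stat) > 1.96
})
mean(reject_diff) # close to the nominal 0.05
```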

17.4.1 Real World Example

Example 1

We have two time series:

  • UNRATE: the U.S. unemployment rate.
  • LN_JAP_IP: logged Japanese industrial production.

library(tidyverse)
library(readxl)
library(xts) # also loads zoo, which provides as.yearqtr(), autoplot.zoo(), and scale_x_yearqtr()
data <- read_xlsx("data/spurious_reg.xlsx")
colnames(data)[1] <- "year-quarter"
data
# A tibble: 232 × 3
   `year-quarter`   UNRATE LN_JAP_IP
   <chr>             <dbl>     <dbl>
 1 1960:01        5.133333  2.500159
 2 1960:02        5.233333  2.539725
 3 1960:03        5.533333  2.581925
 4 1960:04        6.266667  2.628351
 5 1961:01        6.8       2.682107
 6 1961:02        7         2.720609
 7 1961:03        6.766667  2.764591
 8 1961:04        6.2       2.798435
 9 1962:01        5.633333  2.831171
10 1962:02        5.533333  2.831171
# ℹ 222 more rows
qtr_index <- as.yearqtr(data$`year-quarter`, format = "%Y:%q")
qtr_xts <- xts(data[c(2, 3)], qtr_index)
autoplot(qtr_xts, facets = Series ~ .) +
    facet_grid(Series ~ ., scales = "free_y") +
    geom_vline(
        xintercept = as.yearqtr("1985 Q4"),
        linetype = "dashed"
    ) +
    scale_color_manual(values = c("UNRATE" = "darkred", "LN_JAP_IP" = "#3B4992FF")) +
    scale_x_yearqtr(format = "%Y-Q%q") +
    theme(legend.position = "none")

data1 <- data %>% filter(`year-quarter` <= "1985:4")
lm(UNRATE ~ LN_JAP_IP, data = data1) %>% summary()

Call:
lm(formula = UNRATE ~ LN_JAP_IP, data = data1)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.6120 -1.1004  0.0557  0.8909  3.7861 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.4576     1.0096   0.453    0.651    
LN_JAP_IP     1.5148     0.2678   5.656 1.43e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.5 on 102 degrees of freedom
Multiple R-squared:  0.2387,    Adjusted R-squared:  0.2313 
F-statistic: 31.98 on 1 and 102 DF,  p-value: 1.425e-07

Running the regression on data from 1960 to 1985, we obtain the following estimates.

\[ \begin{aligned} \widehat{\text{UNRATE}}_t &= 0.46 + 1.51 \times \text{LN\_JAP\_IP}, R^2 = 0.24 \\ &\phantom{=} \;(1.01) \;\; (0.27) \end{aligned} \tag{17.6}\]

The \(t\)-statistic on the slope coefficient is 5.66, indicating a statistically significant positive relationship.

data2 <- data %>% filter(`year-quarter` > "1985:4")
lm(UNRATE ~ LN_JAP_IP, data = data2) %>% summary()

Call:
lm(formula = UNRATE ~ LN_JAP_IP, data = data2)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.8231 -0.9812 -0.3774  0.6837  3.8549 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   42.371      7.745   5.471 2.32e-07 ***
LN_JAP_IP     -7.923      1.686  -4.698 6.76e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.363 on 126 degrees of freedom
Multiple R-squared:  0.1491,    Adjusted R-squared:  0.1423 
F-statistic: 22.07 on 1 and 126 DF,  p-value: 6.756e-06

Running the regression using data from 1986 to 2017, we obtain

\[ \begin{aligned} \widehat{\text{UNRATE}}_t &= 42.37 - 7.92 \times \text{LN\_JAP\_IP}, R^2 = 0.15 \\ &\phantom{=} \;(7.74) \;\; (1.69) \end{aligned} \tag{17.7}\]

which shows a significant negative relationship between the two series.

Q: Why do Equation 17.6 and Equation 17.7 generate such different, yet both significant, coefficient estimates?
A: The source of these conflicting results is that both series have stochastic trends. These trends happened to align from 1960 through 1985 but were reversed from 1986 through 2017. There is, in fact, no compelling economic or political reason to think that the trends in these two series are related. In short, these regressions are spurious.

Main Takeaway: OLS can be misleading when the series contain stochastic trends. More generally, if \(\{y_t\}\) is \(\mathrm I(1)\) and at least some of the explanatory variables are \(\mathrm I(1)\), the regression results may be spurious.


Example 2

Figure 17.3 plots the logs of consumption and GDP (i.e., income) for 1950:Q1 to 2000:Q4.

  • REALGDP: \(\ln \text{GDP}\)
  • REALCONS: \(\ln \text{Cons}\)

Both variables are clearly nonstationary and appear to be \(\mathrm I(1)\). The figure suggests a highly significant relationship, but the relationship may simply reflect the fact that both series are trending over time, in which case the regression results are spurious.

# data from 1950Q1 to 2000Q4
data <- read_csv("data/Greene_consump_income.csv")
data <- data %>%
    mutate(
        REALGDP = log(REALGDP),
        REALCONS = log(REALCONS),
        year_quarter = paste(YEAR, QTR, sep = "-")
    )
qtr_index <- as.yearqtr(data$year_quarter, format = "%Y-%q")
qtr_xts <- xts(data[c("REALGDP", "REALCONS")], qtr_index)
autoplot(qtr_xts, facets = NULL) +
    aes(linetype = Series) +
    scale_color_manual(values = c("REALGDP" = "black", "REALCONS" = "black")) +
    scale_linetype_manual(values = c("REALGDP" = "solid", "REALCONS" = "dashed")) +
    scale_x_yearqtr(format = "%Y-Q%q") +
    labs(y = "Logs") + 
    theme(
        legend.position = c(0.1, 0.8),
        legend.title = element_blank(),
        axis.title.x = element_blank(),
    )
Figure 17.3

Unrestricted model:

\[ y_t = \mu + \gamma_1 y_{t-1} + \beta_0 x_t + \beta_1 x_{t-1} + \varepsilon_t, \]

The Error Correction Model:

\[ \Delta y_t = \mu + \beta_0 \Delta x_{t} + (\gamma_1 - 1)(y_{t-1} - \theta x_{t-1}) + \varepsilon_t. \]

  • Equilibrium relationship: \(\Delta y_t = \mu + \beta_0 \Delta x_{t} + \varepsilon_t,\) which holds when the equilibrium error \(y_{t-1} - \theta x_{t-1}\) is zero.
  • Error correction term: \((\gamma_1 - 1)(y_{t-1} - \theta x_{t-1})\)
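The ECM is an exact reparameterization of the unrestricted model: subtracting \(y_{t-1}\) from both sides and collecting the lagged levels gives

\[ \begin{aligned} \Delta y_t &= \mu + (\gamma_1 - 1) y_{t-1} + \beta_0 x_t + \beta_1 x_{t-1} + \varepsilon_t \\ &= \mu + \beta_0 \Delta x_t + (\gamma_1 - 1) y_{t-1} + (\beta_0 + \beta_1) x_{t-1} + \varepsilon_t \\ &= \mu + \beta_0 \Delta x_t + (\gamma_1 - 1)(y_{t-1} - \theta x_{t-1}) + \varepsilon_t, \end{aligned} \]

where \(\theta = (\beta_0 + \beta_1)/(1 - \gamma_1).\)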

The standard approach is the Engle-Granger two-step method:

  1. Run the cointegrating regression of \(y_t\) on \(x_t\) and save the residuals, \(\hat{s}_t = y_t - \hat{\beta}_0 - \hat{\theta} x_t\).
  2. Run the regression of \(\Delta y_t\) on \(\Delta x_t\) and \(\hat{s}_{t-1}\) (the lagged residual serves as the error-correction term).
# Step 1: cointegrating regression
coint_reg <- lm(REALCONS ~ REALGDP, data = data)
data$ecm_term <- residuals(coint_reg) # s_hat_t: residuals from the cointegrating regression
summary(coint_reg)

Call:
lm(formula = REALCONS ~ REALGDP, data = data)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.038885 -0.010280  0.000642  0.009570  0.065623 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.904107   0.019974  -45.26   <2e-16 ***
REALGDP      1.056765   0.002399  440.53   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.01667 on 202 degrees of freedom
Multiple R-squared:  0.999, Adjusted R-squared:  0.999 
F-statistic: 1.941e+05 on 1 and 202 DF,  p-value: < 2.2e-16

The cointegrating regression shows that the estimated cointegrating parameter is 1.057, given by

\[ \begin{aligned} \widehat{\ln\text{Cons}}_t &= -0.904 + 1.057 \times \ln\text{GDP}_t, \\ &\phantom{=} \;\;(0.020) \quad\; (0.002) \end{aligned} \]

which captures the long-run relationship between consumption and income.

Hence, the residuals from the cointegrating regression can be obtained by

\[ \hat{s}_t = \ln\text{Cons}_t - (-0.904 + 1.057 \times \ln\text{GDP}_t) . \]

# Plot cointegrating regression residuals
ecm_xts <- xts(data$ecm_term, qtr_index)
autoplot(ecm_xts) +
    geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
    labs(x = NULL, y = NULL, title = "Cointegrating Regression Residuals") +
    scale_x_yearqtr(format = "%Y-Q%q") +
    theme_minimal(base_size = 12) +
    theme(legend.position = "none")

The residual series shows no time trend, but it remains to verify that it is stationary.

We use the ADF test to check the stationarity of the residuals.

library(urca)
adf_res <- ur.df(data$ecm_term, type = "drift", lags = 1) # drift: intercept, no time trend
summary(adf_res)

############################################### 
# Augmented Dickey-Fuller Test Unit Root Test # 
############################################### 

Test regression drift 


Call:
lm(formula = z.diff ~ z.lag.1 + 1 + z.diff.lag)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.041195 -0.004861 -0.000272  0.004721  0.021621 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.0003257  0.0005817  -0.560    0.576    
z.lag.1     -0.1615118  0.0370205  -4.363 2.06e-05 ***
z.diff.lag  -0.0985225  0.0677316  -1.455    0.147    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.008261 on 199 degrees of freedom
Multiple R-squared:   0.11, Adjusted R-squared:  0.1011 
F-statistic:  12.3 on 2 and 199 DF,  p-value: 9.186e-06


Value of test-statistic is: -4.3628 9.6418 

Critical values for test statistics: 
      1pct  5pct 10pct
tau2 -3.46 -2.88 -2.57
phi1  6.52  4.63  3.81

The ADF regression of the least squares residuals on a constant, the lagged level, and the lagged first difference yields

\[ \widehat{\Delta \hat{s}}_t = -0.0003 - 0.162\, \hat{s}_{t-1} - 0.099\, \Delta \hat{s}_{t-1}. \]

The coefficient on the lagged level of the residuals is \(-0.162\), and the ADF test statistics are \(\tau_2 = -4.36\) and \(\phi_1 = 9.64\), both significant at the 1% level, so we reject the null of a unit root in the residuals. This suggests that the residuals are stationary and thus that there is a long-run relationship between consumption and income. We conclude that the two series are cointegrated, and we can proceed to estimate the error correction model.
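The mechanics of ur.df(type = "drift", lags = 1) can be sketched with base R. Here it is applied to a simulated stationary AR(1) series (hypothetical data, purely illustrative, not the residuals above):

```r
# ADF test regression with drift and one lagged difference:
# delta z_t = alpha + rho * z_{t-1} + phi * delta z_{t-1} + e_t.
# The tau statistic is the t value on the lagged level z_{t-1}.
set.seed(1)
z <- as.numeric(arima.sim(list(ar = 0.5), n = 200)) # stationary AR(1)
dz <- diff(z)
adf_lm <- lm(dz[-1] ~ z[2:(length(z) - 1)] + dz[-length(dz)])
tau <- coef(summary(adf_lm))[2, "t value"]
tau # strongly negative for a stationary series; compare with -2.88 (5% critical value)
```

Because the simulated series is stationary, the tau statistic falls well below the critical value and the unit root null is rejected, just as with the consumption-income residuals.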

# Step 2: ECM
data <- data %>%
    mutate(
        d_REALCONS = REALCONS - lag(REALCONS),
        d_REALGDP  = REALGDP - lag(REALGDP),
        ecm_lag    = lag(ecm_term) # s_hat_{t-1}
    )

ecm_fit <- lm(d_REALCONS ~ d_REALGDP + ecm_lag, data = data)
summary(ecm_fit)

Call:
lm(formula = d_REALCONS ~ d_REALGDP + ecm_lag, data = data)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.039858 -0.003276  0.000140  0.004081  0.028833 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.0037844  0.0006561   5.768 3.01e-08 ***
d_REALGDP    0.5823641  0.0510053  11.418  < 2e-16 ***
ecm_lag     -0.0951092  0.0305075  -3.118  0.00209 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.006925 on 200 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.3947,    Adjusted R-squared:  0.3886 
F-statistic:  65.2 on 2 and 200 DF,  p-value: < 2.2e-16

The estimated error correction model, with estimated standard errors in parentheses, is

\[ \begin{aligned} \widehat{\Delta \ln\text{Cons}}_t &= 0.004 + 0.582 \times \Delta \ln\text{GDP}_t + (0.905-1) \times (\ln\text{Cons}_{t-1} + 0.904 - 1.057 \times \ln\text{GDP}_{t-1}) \\ &\phantom{=} \;(0.00066) \;\; (0.051) \qquad\qquad\qquad\quad (0.031) \end{aligned} \]

When the deviation is positive, multiplying it by \((\gamma_1 - 1) = -0.095\) gives a negative number, indicating a negative adjustment. This means that if consumption is above its equilibrium level, it will tend to decrease in the next period.

By contrast, if consumption is below its equilibrium level (negative deviation), it will tend to increase in the next period.
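As a numeric illustration of this adjustment (the 2% deviation is an assumed value, not taken from the data):

```r
# Error-correction adjustment: a positive deviation from equilibrium times
# the negative adjustment coefficient gives a negative contribution to growth.
gamma1_minus_1 <- -0.095 # estimated speed of adjustment
deviation <- 0.02        # assume consumption is 2% above its equilibrium level
gamma1_minus_1 * deviation # contribution to next period's change in ln(Cons)
```

So a 2% positive deviation drags next period's consumption growth down by about 0.19 percentage points, other things equal.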

The estimated error correction term is statistically significant, which suggests that there is a long-run relationship between consumption and income, and that deviations from this long-run relationship are corrected over time.


References:

  • Chapter 18.3 Spurious Regression, Stock, James H., and Mark W. Watson. 2019. Introduction to Econometrics. 4th ed. Pearson Education Limited.