11.2 Lag and Lead

By default, lag() in R refers to stats::lag(), which is designed for time series objects. Applied to a plain vector, or to a column of a data frame or tibble, it does not shift the values: it returns the original values with a time series attribute attached, which is easy to overlook.

‼️ To avoid this issue, you should use dplyr::lag() instead of stats::lag(), especially when working with tibbles or data frames.

Applying the default lag() to a numeric vector raises no error, but it does not shift the values either. → This is dangerous precisely because it fails silently: the values come back unchanged, so you only notice the problem if you inspect the output.

> x <- 1:5
> lag(x, 1)
[1] 1 2 3 4 5
attr(,"tsp")
[1] 0 4 1
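
One way to catch this silent failure is to compare the result with the input. The following is only a minimal sketch; the names shifted and values_unchanged are made up for this illustration.

```r
x <- 1:5
shifted <- stats::lag(x, 1)

# Drop the attributes and compare: the data values are untouched.
values_unchanged <- identical(as.vector(shifted), x)
values_unchanged

# Only a "tsp" attribute (start, end, frequency of the implied time index) was added.
attr(shifted, "tsp")
```

If values_unchanged is TRUE, nothing was actually lagged.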

stats::lag() on time series objects:

> ts_vector <- ts(1:10)
> ts_vector
Time Series:
Start = 1 
End = 10 
Frequency = 1 
 [1]  1  2  3  4  5  6  7  8  9 10


> lag(ts_vector, 1)
Time Series:
Start = 0 
End = 9 
Frequency = 1 
 [1]  1  2  3  4  5  6  7  8  9 10

> lag(ts_vector, -1)
Time Series:
Start = 2 
End = 11 
Frequency = 1 
 [1]  1  2  3  4  5  6  7  8  9 10

💡 Note that stats::lag() does NOT shift the actual data values. It changes the time index instead.
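
The shift only becomes visible once series are lined up on their time index. A minimal sketch, relying on the fact that cbind() on ts objects aligns them by time and pads with NA; the name aligned is made up for this illustration.

```r
ts_vector <- ts(1:10)

# Combine the original series with a copy whose time index is moved one period later.
aligned <- cbind(ts_vector, stats::lag(ts_vector, -1))
aligned

# At time 2, the second column holds the value the original series had at time 1.
aligned[2, ]
```

The combined object spans times 1 to 11, with NA wherever one of the two series has no observation.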

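stats::lag() leads to an error when used on a tibble column, which is good because it alerts you that something is wrong. Note in the output below that the error is raised when the result is printed (the tibble header appears first), not by mutate() itself.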

> library(dplyr)  # provides tibble(), mutate(), and %>%
> df_tibble <- tibble(
  year = 2000:2025,
  value = 1:26
)
> df_tibble %>% mutate(value_lag1 = stats::lag(value, 1))
# A tibble: 26 × 3
Error in split_decimal(x, sigfig = sigfig, digits = digits) : 
  invalid time series parameters specified

stats::lag() does NOT throw an error when applied to a column of a regular data frame, but it does NOT shift the values either.

> df <- data.frame(
  year = 2000:2025,
  value = 1:26
)
> df %>% 
    mutate(value_lag1 = stats::lag(value, 1)) %>% 
    data.table::data.table() %>% 
    print(topn = 5)

     year value value_lag1
    <int> <int>      <int>
 1:  2000     1          1
 2:  2001     2          2
 3:  2002     3          3
 4:  2003     4          4
 5:  2004     5          5
---                       
22:  2021    22         22
23:  2022    23         23
24:  2023    24         24
25:  2024    25         25
26:  2025    26         26

Use dplyr::lag() to get the expected lagged values.

> dplyr::lag(1:5, 1)
[1] NA  1  2  3  4

# Create a lead variable
> dplyr::lead(1:5, 1)
[1]  2  3  4  5 NA

> df %>% 
    mutate(value_lag1 = dplyr::lag(value, 1)) %>%
    data.table::data.table() %>%
    print(topn = 5)

     year value value_lag1
    <int> <int>      <int>
 1:  2000     1         NA
 2:  2001     2          1
 3:  2002     3          2
 4:  2003     4          3
 5:  2004     5          4
---                       
22:  2021    22         21
23:  2022    23         22
24:  2023    24         23
25:  2024    25         24
26:  2025    26         25

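Note that dplyr::lag(x, n = 1) does NOT accept a negative n to create a lead variable; use dplyr::lead() instead. This is in contrast to stats::lag(x, k = 1), which accepts either sign for k. Mind the sign convention, though: as the time series output above shows, a positive k moves the series earlier in time (Start goes from 1 to 0), so the value at time t is the one originally observed at t + k, which is a lead in the usual sense. A negative k gives the usual lag.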
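
The behavior of dplyr::lag() with a negative n can be checked without interrupting a script by capturing the error. This is a small sketch; the name result is made up, and the exact error message depends on the installed dplyr version, so it is not reproduced here.

```r
# dplyr::lag() rejects a negative n; tryCatch() returns the message instead of stopping.
result <- tryCatch(
  dplyr::lag(1:5, -1),
  error = function(e) conditionMessage(e)
)
is.character(result)  # TRUE means an error was raised and its message captured

# The supported way to look ahead is dplyr::lead().
dplyr::lead(1:5, 1)
```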

stats::lag() is designed for time series objects, while dplyr::lag() works with regular data frames and tibbles.

Always use dplyr::lag() when working with data frames or tibbles to avoid unexpected results.

To avoid typing dplyr:: every time, you can assign dplyr::lag to the name lag in your session, so that a bare lag() call resolves to the dplyr version.

lag <- dplyr::lag
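
To confirm which version a bare lag() call resolves to, one option is to look at the environment the function comes from. This is a small sketch rather than the only way to check; lag_source and lagged are names made up for this illustration.

```r
lag <- dplyr::lag

# The function now found under the name `lag` belongs to the dplyr namespace.
lag_source <- environmentName(environment(lag))
lag_source

# A bare call now shifts the values and pads with NA.
lagged <- lag(1:5)
lagged

# Remove the alias to go back to whatever `lag` the search path provides.
rm(lag)
```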

11.2.1 Unbalanced panel

In unbalanced panel data, some units have no rows for some time periods. dplyr::lag() knows nothing about the time structure: it simply takes the value from the previous row. Where a period is missing, the "lagged" value therefore comes from an earlier period than intended.

To deal with time gaps in unbalanced panel data, you can first make a balanced panel by filling in the missing time periods with NA values, and then apply dplyr::lag() to get the correct lagged values.

# Create an unbalanced panel data frame
df <- tibble(
  firm = c(rep("A", 3), rep("B", 4)),
  year = c(2018, 2019, 2021, 2018, 2019, 2020, 2021),
  value = c(10, 12, 20, seq(1,4))
)
df
# A tibble: 7 × 3
  firm   year value
  <chr> <dbl> <dbl>
1 A      2018    10
2 A      2019    12
3 A      2021    20
4 B      2018     1
5 B      2019     2
6 B      2020     3
7 B      2021     4

df_wrong <- df %>%
  arrange(firm, year) %>%
  group_by(firm) %>%
  mutate(lag_value = dplyr::lag(value)) %>% 
  ungroup()
df_wrong
# A tibble: 7 × 4
  firm   year value lag_value
  <chr> <dbl> <dbl>     <dbl>
1 A      2018    10        NA
2 A      2019    12        10
3 A      2021    20        12
4 B      2018     1        NA
5 B      2019     2         1
6 B      2020     3         2
7 B      2021     4         3

Firm A has no row for 2020, so the lagged value for 2021 is incorrect: it is 12, the 2019 value, whereas the true one-year lag (the 2020 value) is unobserved and should be NA.
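
Before lagging, it can help to check whether a panel has such gaps at all. The following is a minimal sketch, assuming yearly data; the column years_since_prev and the object gaps are made up for this illustration.

```r
library(dplyr)

df <- tibble(
  firm = c(rep("A", 3), rep("B", 4)),
  year = c(2018, 2019, 2021, 2018, 2019, 2020, 2021),
  value = c(10, 12, 20, 1, 2, 3, 4)
)

# Within each firm, measure the distance to the previous row's year
# and keep the rows where more than one year has passed.
gaps <- df %>%
  arrange(firm, year) %>%
  group_by(firm) %>%
  mutate(years_since_prev = year - dplyr::lag(year)) %>%
  ungroup() %>%
  filter(years_since_prev > 1)
gaps
```

Any row returned here is one where a row-based lag would reach back further than one period.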

To fix this, we first create a balanced panel:

library(plm)
df_balanced <- df %>%
  pdata.frame(index = c("firm", "year")) %>%
  make.pbalanced() 
df_balanced
#        firm year value
# A-2018    A 2018    10
# A-2019    A 2019    12
# A-2020    A 2020    NA
# A-2021    A 2021    20
# B-2018    B 2018     1
# B-2019    B 2019     2
# B-2020    B 2020     3
# B-2021    B 2021     4

Now we can apply dplyr::lag() to the balanced panel to get the correct lagged values.

df_balanced <- df_balanced %>%
  group_by(firm) %>%
  mutate(lag_value = dplyr::lag(value)) %>% 
  ungroup()
df_balanced
# A tibble: 8 × 4
  firm  year  value lag_value
  <fct> <fct> <dbl>     <dbl>
1 A     2018     10        NA
2 A     2019     12        10
3 A     2020     NA        12
4 A     2021     20        NA
5 B     2018      1        NA
6 B     2019      2         1
7 B     2020      3         2
8 B     2021      4         3

Notice that year becomes a factor variable after using pdata.frame(). You can convert it back to numeric if needed.

# convert year to numeric
df_balanced %>% 
  mutate(year = as.numeric(as.character(year)))
# A tibble: 8 × 4
  firm   year value lag_value
  <fct> <dbl> <dbl>     <dbl>
1 A      2018    10        NA
2 A      2019    12        10
3 A      2020    NA        12
4 A      2021    20        NA
5 B      2018     1        NA
6 B      2019     2         1
7 B      2020     3         2
8 B      2021     4         3
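
As an alternative that keeps year numeric throughout, the panel can also be balanced with tidyr instead of plm. This is a sketch of a different technique than the one used above, assuming yearly data; df_complete is a name made up for this illustration.

```r
library(dplyr)
library(tidyr)

df <- tibble(
  firm = c(rep("A", 3), rep("B", 4)),
  year = c(2018, 2019, 2021, 2018, 2019, 2020, 2021),
  value = c(10, 12, 20, 1, 2, 3, 4)
)

# complete() adds a row for every firm-year combination in the observed
# year range; firm-years that were absent get NA for value.
df_complete <- df %>%
  complete(firm, year = full_seq(year, 1)) %>%
  arrange(firm, year) %>%
  group_by(firm) %>%
  mutate(lag_value = dplyr::lag(value)) %>%
  ungroup()
df_complete
```

Here full_seq(year, 1) fills in every year between the smallest and largest observed year, so firm A gets a 2020 row and its 2021 lag becomes NA.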