11.2 Lag and Lead
By default, lag() in R resolves to stats::lag(), which is designed for time series objects. Applied to a plain vector or a data frame column, it does not shift the values: it returns the original values with only the time index changed, which is easy to miss.
‼️ To avoid this issue, you should use dplyr::lag() instead of stats::lag(), especially when working with tibbles or data frames.
Applying the default lag to a numeric vector raises no error, but it does not shift the values either. → This is dangerous precisely because it fails silently: you have to inspect the output carefully to see that nothing was lagged.
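A minimal sketch of this silent failure on a plain vector, using only base R. The names x and shifted are illustrative and not from the original text. stats::lag() attaches a tsp (time series parameters) attribute and shifts that, leaving the data untouched:

```r
# A plain integer vector, not a time series
x <- 1:5

# stats::lag() returns the same values; only a "tsp" attribute is added
shifted <- stats::lag(x, 1)

# Dropping the attributes shows the data were never shifted
identical(as.vector(shifted), x)

# The shift is recorded only in the time index (start, end, frequency)
attr(shifted, "tsp")
```

Checking attr(shifted, "tsp") is a quick way to see whether a vector has passed through stats::lag() by accident.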
stats::lag() on time series objects:
> ts_vector <- ts(1:10)
> ts_vector
Time Series:
Start = 1
End = 10
Frequency = 1
[1] 1 2 3 4 5 6 7 8 9 10
> lag(ts_vector, 1)
Time Series:
Start = 0
End = 9
Frequency = 1
[1] 1 2 3 4 5 6 7 8 9 10
> lag(ts_vector, -1)
Time Series:
Start = 2
End = 11
Frequency = 1
[1] 1 2 3 4 5 6 7 8 9 10
💡 Note that stats::lag() does NOT shift the actual data values. It changes the time index instead.
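The shifted time index only becomes visible as a lag once two series are aligned on time. A small sketch using ts.union() from the stats package; the column names original and lagged are illustrative:

```r
ts_vector <- ts(1:10)

# Align the original series with a copy whose time index is moved forward by 1
aligned <- ts.union(
  original = ts_vector,
  lagged   = stats::lag(ts_vector, -1)
)

# At time 1 the lagged column is NA; at time 2 it holds the value from time 1
aligned
```

This is why stats::lag() is useful for ts objects but has no visible effect on a data frame column, where rows are matched by position rather than by time.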
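The default lag leads to an error when applied to a tibble. The error surfaces when the result is printed (the tibble header appears before it), not inside mutate() itself. This is still good, because it alerts you that something is wrong.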
> df_tibble <- tibble(
year = 2000:2025,
value = 1:26
)
> df_tibble %>% mutate(value_lag1 = stats::lag(value, 1))
# A tibble: 26 × 3
Error in split_decimal(x, sigfig = sigfig, digits = digits) :
invalid time series parameters specified
The default lag does NOT throw an error when applied to a regular data frame, but it does NOT shift the values as expected either.
> df <- data.frame(
year = 2000:2025,
value = 1:26
)
> df %>%
mutate(value_lag1 = stats::lag(value, 1)) %>%
data.table::data.table() %>%
print(topn=5)
year value value_lag1
<int> <int> <int>
1: 2000 1 1
2: 2001 2 2
3: 2002 3 3
4: 2003 4 4
5: 2004 5 5
---
22: 2021 22 22
23: 2022 23 23
24: 2023 24 24
25: 2024 25 25
26: 2025 26 26
Use dplyr::lag() to get the expected lagged values.
> dplyr::lag(1:5, 1)
[1] NA 1 2 3 4
# Create a lead variable
> dplyr::lead(1:5, 1)
[1] 2 3 4 5 NA
> df %>%
mutate(value_lag1 = dplyr::lag(value, 1)) %>%
data.table::data.table() %>%
print(topn = 5)
year value value_lag1
<int> <int> <int>
1: 2000 1 NA
2: 2001 2 1
3: 2002 3 2
4: 2003 4 3
5: 2004 5 4
---
22: 2021 22 21
23: 2022 23 22
24: 2023 24 23
25: 2024 25 24
26: 2025 26 25
Note that dplyr::lag(x, n = 1) does NOT accept a negative n to create a lead variable. Use dplyr::lead() instead.
This is in contrast to stats::lag(x, k=1) which can take a negative k to create a lead variable.
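A short sketch of that difference, assuming dplyr is installed. The error is captured with tryCatch() so the script keeps running; the exact error message depends on the dplyr version, so it is not reproduced here:

```r
# A negative n is rejected by dplyr::lag(); capture the error instead of stopping
result <- tryCatch(
  dplyr::lag(1:5, -1),
  error = function(e) e
)
inherits(result, "error")

# The supported way to look ahead is dplyr::lead()
dplyr::lead(1:5, 1)
```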
stats::lag() is designed for time series objects, while dplyr::lag() works with regular data frames and tibbles. Always use dplyr::lag() when working with data frames or tibbles to avoid unexpected results.
You can assign dplyr::lag to the name lag (lag <- dplyr::lag) so that you do not have to type dplyr:: every time.
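One way to do this, sketched under the assumption that dplyr is installed, is a pair of assignments near the top of the script:

```r
# Make the bare names point to the dplyr versions in this session
lag  <- dplyr::lag
lead <- dplyr::lead

lag(1:5)   # shifts values back, padding with NA at the start
lead(1:5)  # shifts values forward, padding with NA at the end
```

The explicit assignment keeps the choice visible in the script and does not depend on the order in which packages are loaded.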
11.2.1 Unbalanced panel
In an unbalanced panel, some units have no rows for some time periods. dplyr::lag() ignores the time structure and simply shifts values by row position, so the value it returns may come from more than one period back. This can lead to incorrect lagged values.
To deal with time gaps in unbalanced panel data, you can first make a balanced panel by filling in the missing time periods with NA values, and then apply dplyr::lag() to get the correct lagged values.
# Create an unbalanced panel data frame
df <- tibble(
firm = c(rep("A", 3), rep("B", 4)),
year = c(2018, 2019, 2021, 2018, 2019, 2020, 2021),
value = c(10, 12, 20, seq(1,4))
)
df
# A tibble: 7 × 3
firm year value
<chr> <dbl> <dbl>
1 A 2018 10
2 A 2019 12
3 A 2021 20
4 B 2018 1
5 B 2019 2
6 B 2020 3
7 B 2021 4
df_wrong <- df %>%
arrange(firm, year) %>%
group_by(firm) %>%
mutate(lag_value = dplyr::lag(value)) %>%
ungroup()
df_wrong
# A tibble: 7 × 4
firm year value lag_value
<chr> <dbl> <dbl> <dbl>
1 A 2018 10 NA
2 A 2019 12 10
3 A 2021 20 12
4 B 2018 1 NA
5 B 2019 2 1
6 B 2020 3 2
7 B 2021 4 3
Firm A has no observation for 2020, so its lagged value for 2021 is incorrect: it takes the 2019 value (12) instead of the missing 2020 value.
To fix this, we first create a balanced panel:
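Before fixing the problem, it can help to locate the gaps. The sketch below rebuilds the same example data and flags rows whose previous observation within the firm is more than one year back. The names year_gap and gaps are illustrative, and the one-year spacing is an assumption about this panel:

```r
library(dplyr)

# Same unbalanced panel as above: firm A has no row for 2020
df <- tibble(
  firm  = c(rep("A", 3), rep("B", 4)),
  year  = c(2018, 2019, 2021, 2018, 2019, 2020, 2021),
  value = c(10, 12, 20, seq(1, 4))
)

# Distance in years to the previous row within each firm
gaps <- df %>%
  arrange(firm, year) %>%
  group_by(firm) %>%
  mutate(year_gap = year - dplyr::lag(year)) %>%
  ungroup() %>%
  filter(!is.na(year_gap), year_gap > 1)

gaps
```

Any row that appears in gaps would receive a wrong value from a row-based lag.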
library(plm)
df_balanced <- df %>%
pdata.frame(index = c("firm", "year")) %>%
make.pbalanced()
df_balanced
# firm year value
# A-2018 A 2018 10
# A-2019 A 2019 12
# A-2020 A 2020 NA
# A-2021 A 2021 20
# B-2018 B 2018 1
# B-2019 B 2019 2
# B-2020 B 2020 3
# B-2021 B 2021 4
Now we can apply dplyr::lag() to the balanced panel to get the correct lagged values.
df_balanced <- df_balanced %>%
group_by(firm) %>%
mutate(lag_value = dplyr::lag(value)) %>%
ungroup()
df_balanced
# A tibble: 8 × 4
firm year value lag_value
<fct> <fct> <dbl> <dbl>
1 A 2018 10 NA
2 A 2019 12 10
3 A 2020 NA 12
4 A 2021 20 NA
5 B 2018 1 NA
6 B 2019 2 1
7 B 2020 3 2
8 B 2021 4 3
Notice that year becomes a factor variable after using pdata.frame(). You can convert it back to numeric if needed.
# convert year to numeric
df_balanced %>%
mutate(year = as.numeric(as.character(year)))
# A tibble: 8 × 4
firm year value lag_value
<fct> <dbl> <dbl> <dbl>
1 A 2018 10 NA
2 A 2019 12 10
3 A 2020 NA 12
4 A 2021 20 NA
5 B 2018 1 NA
6 B 2019 2 1
7 B 2020 3 2
8 B 2021 4 3
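As an alternative sketch, the same balancing step can be done with tidyr::complete() instead of plm. This swaps in a different tool from the one used above, and it assumes that tidyr is installed and that the periods are one year apart. It keeps year numeric, so no conversion from factor is needed. The name df_lagged is illustrative:

```r
library(dplyr)
library(tidyr)

# Same unbalanced panel as above: firm A has no row for 2020
df <- tibble(
  firm  = c(rep("A", 3), rep("B", 4)),
  year  = c(2018, 2019, 2021, 2018, 2019, 2020, 2021),
  value = c(10, 12, 20, seq(1, 4))
)

df_lagged <- df %>%
  # Add a row for every firm-year combination; new rows get value = NA
  complete(firm, year = full_seq(year, 1)) %>%
  arrange(firm, year) %>%
  group_by(firm) %>%
  mutate(lag_value = dplyr::lag(value)) %>%
  ungroup()

df_lagged
```

The lagged values should match the plm-based result: firm A gets NA for 2021 because its 2020 value is missing.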