16 Vector Autoregressions
16.1 Multiple Equation Time Series Models
To motivate vector autoregressions, consider the autoregressive distributed lag (AR-DL) model for two series \(\by_t = (y_{1t}, y_{2t})'\) with one lag.
An AR-DL model for \(y_{1t}\):
\[ y_{1t} = a_{10} + a_{11} y_{1, t-1} + a_{12} y_{2, t-1} + \varepsilon_{1t} \]
An AR-DL model for \(y_{2t}\):
\[ y_{2t} = a_{20} + a_{21} y_{1, t-1} + a_{22} y_{2, t-1} + \varepsilon_{2t} \]
Each variable is a linear function of its own lag and the lag of the other variable. The two equations can be stacked together:
\[ \by_t = \bmu_0 + \bA_1 \by_{t-1} + \bvarepsilon_t \tag{16.1}\]
Where:
\(\bmu_0 = (a_{10}, a_{20})'\) is \(2 \times 1\)
\(\bA_1\) is \(2 \times 2\), the coefficient matrix for the lagged variable \(\by_{t-1}\):
\[ \begin{aligned} \bA_1 \by_{t-1} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} y_{1, t-1} \\ y_{2, t-1} \end{bmatrix} \end{aligned} \]
\(\bvarepsilon_t = (\varepsilon_{1t}, \varepsilon_{2t})'\)
This is a bivariate VAR(1) model for \(\by_t.\)
The vector representation is given by:
\[ \begin{bmatrix} y_{1t} \\ y_{2t} \end{bmatrix} = \begin{bmatrix}a_{10} \\ a_{20} \end{bmatrix} + \begin{bmatrix}a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix}y_{1, t-1} \\ y_{2, t-1} \end{bmatrix} + \begin{bmatrix}\varepsilon_{1t} \\ \varepsilon_{2t} \end{bmatrix} \]
The variance of the error term \(\bvarepsilon_t\) is given by
\[ \bSigma = \var \begin{bmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{bmatrix} = \begin{bmatrix}\sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{bmatrix} \]
Alternative representation of VAR(1) model (grouped by variables):
\[ \begin{bmatrix} \by_1 \\ \by_2 \end{bmatrix} = \begin{bmatrix}\bmu_1 \\ \bmu_2 \end{bmatrix} + \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \begin{bmatrix}\by_{1, -1} \\ \by_{2, -1} \end{bmatrix} + \begin{bmatrix}\bvarepsilon_{1} \\ \bvarepsilon_{2} \end{bmatrix} \tag{16.2}\]
where \(\by_j = (y_{j1}, \ldots, y_{jn})'\) stacks the \(n\) observations of the \(j\)-th variable, \(\by_{j,-1}\) is its one-period lag, and
\[ \begin{aligned} A_{11} &= a_{11} \bI_n, \quad A_{12} = a_{12} \bI_n, \\ A_{21} &= a_{21} \bI_n, \quad A_{22} = a_{22} \bI_n \end{aligned} \]
Note that Equation 16.2 is NOT the common representation. It is more common to group by time as in \(\by_t = \bmu_0 + \bA_1 \by_{t-1} + \bvarepsilon_t.\)
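To make the bivariate system concrete, here is a minimal simulation sketch of the VAR(1) above; the intercept vector, coefficient matrix, and error covariance below are hypothetical illustration values, chosen so that the system is stationary.

```python
import numpy as np

rng = np.random.default_rng(0)
mu0 = np.array([0.5, 1.0])                  # (a10, a20)'
A1 = np.array([[0.5, 0.1],
               [0.2, 0.4]])                 # eigenvalues 0.6 and 0.3: stationary
Sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])              # contemporaneous error covariance
P = np.linalg.cholesky(Sigma)               # Sigma = P P'

n = 500
y = np.zeros((n, 2))
for t in range(1, n):
    eps = P @ rng.standard_normal(2)        # contemporaneously correlated errors
    y[t] = mu0 + A1 @ y[t - 1] + eps        # y_t = mu0 + A1 y_{t-1} + eps_t
```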
General specification of \(\text{VAR}(p)\) model
This model extends naturally to more than two variables.
With \(p\) lags and an \(m\)-dimensional vector \(\by_t,\) the \(\text{VAR}(p)\) model is:
\[ \by_t = \bmu_0 + \bA_1 \by_{t-1} + \bA_2 \by_{t-2} + \cdots + \bA_p \by_{t-p} + \bvarepsilon_t . \tag{16.3}\]
where:
- \(\by_t\) is an \(m \times 1\) vector of variables at time \(t\)
- \(\bA_\ell\) are \(m \times m\), for \(\ell = 1, \ldots, p\), autoregressive coefficient matrices
- \(\bvarepsilon_t\) is \(m \times 1\)
Matrix notation for \(A_\ell\) (the coefficient for the \(\ell\)-th lag, \(\by_{t-\ell}\)):
\[ A_\ell = \begin{bmatrix} a_{11,\ell} & a_{12,\ell} & \cdots & a_{1m,\ell} \\ a_{21,\ell} & a_{22,\ell} & \cdots & a_{2m,\ell} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1,\ell} & a_{m2,\ell} & \cdots & a_{mm,\ell} \end{bmatrix} \]
The individual equation for \(y_{mt},\) the \(m\)-th variable, can be written as
\[ \begin{aligned} y_{mt} &= \mu_m + \underbrace{ a_{m1,1} y_{1, t-1} + a_{m2,1} y_{2, t-1} + \cdots + a_{mm,1} y_{m, t-1} }_{\bA_1 \by_{t-1} \text{ terms}} \\ &\phantom{=} \quad + \underbrace{a_{m1,2} y_{1, t-2} + a_{m2,2} y_{2, t-2} + \cdots + a_{mm,2} y_{m, t-2}}_{\bA_2 \by_{t-2} \text{ terms}} \\ &\phantom{=} \quad + \cdots + \underbrace{a_{m1,p} y_{1, t-p} + a_{m2,p} y_{2, t-p} + \cdots + a_{mm,p} y_{m, t-p}}_{\bA_p \by_{t-p} \text{ terms}} + \varepsilon_{mt} \\ &= \mu_m + \underbrace{\sum_{\ell=1}^p [A_\ell]_{m1} y_{1, t-\ell} + \sum_{\ell=1}^p [A_\ell]_{m2} y_{2, t-\ell} + \cdots + \sum_{\ell=1}^p [A_\ell]_{mm} y_{m, t-\ell}}_{\text{Collect by components in }A_\ell,\, m\times m \times p \text{ terms}} + \varepsilon_{mt} \end{aligned} \]
where \([A_\ell]_{ij}\) denotes the \((i,j)\)-th element of the matrix \(A_\ell.\)
A major disadvantage of VAR models is that the number of parameters grows quadratically with the number of variables and linearly with the number of lags. For \(m\) variables and \(p\) lags, there are \(m\times m \times p\) parameters in the autoregressive coefficient matrices, and \(m\) parameters in the intercept vector \(\bmu_0.\) For example, with \(m=5\) variables and \(p=4\) lags there are \(5 \times 5 \times 4 = 100\) autoregressive coefficients plus \(5\) intercepts.
The error term \(\bvarepsilon_t = (\varepsilon_{1t}, ..., \varepsilon_{mt})'\) is unforecastable at time \(t-1\) (i.e., no serial correlation), but the components may be contemporaneously correlated.
The contemporaneous covariance matrix
\[ \bSigma = \mathbb{E}[\bvarepsilon_t \bvarepsilon_t'] \]
is in general non-diagonal.
We can write the \(\text{VAR}(p)\) model Equation 16.3 using the lag operator notation as
\[ \begin{aligned} \by_t &= \bmu_0 + \bA(L) \by_t + \bvarepsilon_t \\ \left(\bI - \bA(L)\right) \by_t &= \bmu_0 + \bvarepsilon_t \\ \bGamma(L) \by_t &= \bmu_0 + \bvarepsilon_t \end{aligned} \tag{16.4}\]
where
\[ \bGamma(L) = \bI - \bA(L) = \bI_m - \bA_1 L - \bA_2 L^2 - \cdots - \bA_p L^p \]
The condition for stationarity of the system can be expressed as a restriction on the roots of the characteristic polynomial \(\det\left(\bGamma(z)\right).\)
Theorem 16.1 (VAR stationarity) If all roots \(\lambda\) of \(\det\left(\bGamma(z)\right) = 0\) satisfy \(\abs{\lambda}>1,\) then the \(\text{VAR}(p)\) process \(\by_t\) is strictly stationary and ergodic.
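Numerically, the root condition is usually checked via the companion matrix: all roots of \(\det(\bGamma(z)) = 0\) lie outside the unit circle if and only if all eigenvalues of the companion matrix lie strictly inside it. A minimal sketch, assuming the lag matrices are given as a list of \(m \times m\) numpy arrays:

```python
import numpy as np

def is_stationary(A_list):
    """Check the VAR(p) stationarity condition via the companion matrix."""
    m = A_list[0].shape[0]
    p = len(A_list)
    F = np.zeros((m * p, m * p))
    F[:m, :] = np.hstack(A_list)                 # top block row [A_1 ... A_p]
    F[m:, :-m] = np.eye(m * (p - 1))             # shift-down identity blocks
    return np.max(np.abs(np.linalg.eigvals(F))) < 1
```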
If \(\by_t\) is covariance stationary, then it has the projection equation
\[ \by_t = \Pcal_{t-1} [\by_{t}] + \bvarepsilon_t. \]
The innovation \(\bvarepsilon_t\) satisfies \(\E(\bvarepsilon_t)=0,\) and \(\E[\bvarepsilon_{t-j} \bvarepsilon_t']=0\) for all \(j>0,\) and \(\bSigma = \E[\bvarepsilon_t \bvarepsilon_t'] < \infty .\) If \(\by_t\) is strictly stationary, then \(\bvarepsilon_t\) is strictly stationary.
16.1.1 Multivariate Wold Decomposition
With the stability condition, we have
\[ \begin{aligned} \by_t &= \left(\bI - \bA(L)\right)^{-1} (\bmu_0 + \bvarepsilon_t) \\ \by_t &= \left(\bI - \bA(L)\right)^{-1}\bmu_0 + \sum_{j=0}^\infty \bPsi_j \bvarepsilon_{t-j} \\ \Rightarrow \by_t &= \bar{\by} + \sum_{j=0}^\infty \bPsi_j \bvarepsilon_{t-j} \\ \by_t &= \bar{\by} + \bPsi(L) \bvarepsilon_t \end{aligned} \tag{16.5}\]
where \(\bPsi(L) = \sum_{j=0}^\infty \bPsi_j L^j.\)
Equation 16.5 is the multivariate Wold decomposition of \(\by_t,\) also known as the moving average representation.
\(\bPsi(L)\) can be computed recursively using the relation \(\bPsi(L) = \bGamma(L)^{-1},\) where
\[ \bGamma(L) = \bI_m - \bA_1 L - \bA_2 L^2 - \cdots - \bA_p L^p \]
We follow the same procedure as in the inverse of lag polynomials to compute \(\bPsi(L)\) recursively.
\[ \begin{aligned} L^0:&\quad \bPsi_0 = \bI_m \\ L^1: &\quad \bPsi_1 - \bPsi_0\bA_1 = 0 \Longrightarrow \bPsi_1 = \bPsi_0\bA_1 = \bA_1 \\ L^2: &\quad \bPsi_2 - \bPsi_1\bA_1 - \bPsi_0\bA_2 = 0 \Longrightarrow \bPsi_2 = \bPsi_1\bA_1 + \bPsi_0\bA_2 = \bA_1^2 + \bA_2 \\ \vdots \\ L^p: &\quad \bPsi_p - \bPsi_{p-1}\bA_1 - \bPsi_{p-2}\bA_2 - \cdots - \bPsi_{1}\bA_{p-1} - \bPsi_{0}\bA_p = 0 \\ L^{p+1}: &\quad \bPsi_{p+1} - \bPsi_{p}\bA_1 - \bPsi_{p-1}\bA_2 - \cdots - \bPsi_{2}\bA_{p-1} - \bPsi_{1}\bA_p = 0 \\ \vdots \end{aligned} \]
More generally, for \(j\ge 1,\)
\[ \bPsi_j = \sum_{\ell=1}^j \bPsi_{j-\ell} \bA_\ell , \tag{16.6}\]
with the convention that \(\bA_\ell = 0\) for \(\ell > p.\)
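A minimal sketch of this recursion, assuming the lag matrices \(\bA_1, \ldots, \bA_p\) are given as numpy arrays; `ma_coefficients` is a hypothetical helper name. Section 16.2 below shows that \(\text{IRF}_{ij}(h)\) is simply `Psi[h][i, j]`.

```python
import numpy as np

def ma_coefficients(A_list, H):
    """Compute Psi_0, ..., Psi_H from the VAR matrices A_1, ..., A_p (Eq. 16.6)."""
    m = A_list[0].shape[0]
    p = len(A_list)
    Psi = [np.eye(m)]                            # Psi_0 = I_m
    for j in range(1, H + 1):
        # Psi_j = sum over ell of Psi_{j-ell} A_ell, with A_ell = 0 for ell > p
        Pj = sum(Psi[j - l] @ A_list[l - 1] for l in range(1, min(j, p) + 1))
        Psi.append(Pj)
    return Psi
```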
The \(i\)-th component of \(\by_t\) can be written as
\[ y_{it} = \bar{y}_i + \sum_{h=0}^\infty \bpsi_{ih}' \bvarepsilon_{t-h} = \bar{y}_i + \sum_{h=0}^\infty \sum_{j=1}^m \psi_{ih, j} \varepsilon_{j, t-h} \]
The \(i\)-th row in \(\bPsi_h\) corresponds to the coefficients of the lagged innovations \(\bvarepsilon_{t-h}\) in the \(i\)-th component of \(\by_t.\)
16.2 Impulse Response Function
The Impulse Response Function (IRF) is defined as the change in \(\by_t\) due to a change in an innovation or shock.
Mathematically, the IRF at horizon \(h\) is given by
\[ \begin{aligned} \text{IRF}_{ij}(h) &= \frac{\partial}{\partial\varepsilon_{jt}} \Pcal_t [y_{i,t+h}]\\ &= \frac{\partial}{\partial \varepsilon_{jt}} \E(y_{i,t+h} \mid \Fcal_{t}) \end{aligned} \tag{16.7}\]
which captures the effect of a shock to the \(j\)-th variable at time \(t\) on the \(i\)-th variable at time \(t+h.\)
Using the multivariate Wold representation, we can express the IRF as
\[ \begin{aligned} \text{IRF}_{ij}(h) &= \frac{\partial}{\partial \varepsilon_{jt}} \E\left(\bar{y}_i + \sum_{k=0}^\infty [\bPsi_k]_{i\cdot}\, \bvarepsilon_{t+h-k} \mid \Fcal_t\right) \\ &= \frac{\partial}{\partial \varepsilon_{jt}} \left( \bar{y}_i + \sum_{k=0}^\infty [\bPsi_k]_{i\cdot} \, \E \left(\bvarepsilon_{t+h-k} \mid \Fcal_t\right) \right) \\ \end{aligned} \]
For \(k \ge h\), \(\E \left(\varepsilon_{t+h-k} \mid \Fcal_t\right) = \varepsilon_{t+h-k},\) and for \(k< h,\) \(\E \left(\varepsilon_{t+h-k} \mid \Fcal_t\right) = 0.\)
Hence, we have
\[ \begin{aligned} \text{IRF}_{ij}(h) &= \frac{\partial}{\partial \varepsilon_{jt}} \left( \sum_{k=h}^\infty [\bPsi_k]_{i\cdot}\, \bvarepsilon_{t+h-k} \right) \\ &= \frac{\partial}{\partial \varepsilon_{jt}} \left( \sum_{\ell=0}^\infty [\bPsi_{h+\ell}]_{i\cdot} \,\bvarepsilon_{t-\ell} \right) \\ &= \frac{\partial}{\partial \varepsilon_{jt}} \left( \begin{array}{l} \text{}\lbrack\bPsi_{h}\rbrack_{i1}\varepsilon_{1, t} + [\bPsi_{h}]_{i2}\varepsilon_{2, t} + \cdots + [\bPsi_{h}]_{im}\varepsilon_{m, t} \\ +\lbrack \bPsi_{h+1}\rbrack_{i1} \varepsilon_{1, t-1} + [\bPsi_{h+1}]_{i2}\varepsilon_{2, t-1} + \cdots + [\bPsi_{h+1}]_{im}\varepsilon_{m, t-1} \\ +\lbrack \bPsi_{h+2}\rbrack_{i1} \varepsilon_{1, t-2} + [\bPsi_{h+2}]_{i2}\varepsilon_{2, t-2} + \cdots + [\bPsi_{h+2}]_{im}\varepsilon_{m, t-2} \\ +\cdots \end{array} \right) \end{aligned} \]
where \([\bPsi_k]_{i\cdot}\) is the \(i\)-th row of the matrix \(\bPsi_k,\) and \([\bPsi_h]_{ij}\) is the \((i,j)\)-th element of the matrix \(\bPsi_h.\)
Taking the derivative, the only term that survives is the \((i,j)\)-th element of the coefficient matrix \(\bPsi_h:\)
\[ \text{IRF}_{ij}(h) = [\bPsi_h]_{ij}, \]
which captures the effect of a shock to the \(j\)-th variable at time \(t\) on the \(i\)-th variable at time \(t+h.\)
We often plot the IRF as a function of the horizon \(h\) for each pair \((i, j);\) this shows how the \(i\)-th variable responds over time to the \(j\)-th innovation.
Note that in a linear VAR model,
- The IRF is symmetric in negative and positive innovations: the impact on \(y_{i, t+h}\) of a positive innovation \(\varepsilon_{j,t} = 1\) is \(\text{IRF}_{ij}(h),\) and the impact of a negative innovation \(\varepsilon_{j,t} = -1\) is \(-\text{IRF}_{ij}(h).\)
- The magnitude of the impact is linear in the magnitude of the innovation: the impact of the innovation \(\varepsilon_{j,t} = 2\) is \(2\times \text{IRF}_{ij}(h).\)
This means that the shape of the impulse response function is unaffected by the magnitude of the innovation. Be aware that these are consequences of the linearity of the VAR model, not necessarily features of the true world.
16.2.1 Cumulative Impulse Response Function
The CIRF (Cumulative Impulse Response Function) is the accumulated (summed) response of \(\by\) from time \(t\) to \(t+h:\)
\[ \begin{aligned} \text{CIRF}(h) &= \sum_{\ell=0}^h \frac{\partial}{\partial \bvarepsilon_t'} \Pcal_t[\by_{t+\ell}] \\ &= \sum_{\ell=0}^h \text{IRF}(\ell) \\ &= \sum_{\ell=0}^h \bPsi_\ell \end{aligned} \tag{16.8}\]
The limit of \(\text{CIRF}(h)\) as \(h\to\infty\) is the long-run impulse response matrix:
\[ \bC = \lim_{h\to\infty} \text{CIRF}(h) = \sum_{\ell=0}^\infty \bPsi_\ell = \bPsi(1) \]
To show the last equality:
\[ \bPsi(L) = \sum_{\ell=0}^\infty \bPsi_\ell L^\ell \]
Plugging in \(L=1,\) we have
\[ \bPsi(1) = \sum_{\ell=0}^\infty \bPsi_\ell . \]
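In code, the long-run matrix can be computed directly from \(\bC = \bPsi(1) = \bGamma(1)^{-1} = (\bI_m - \bA_1 - \cdots - \bA_p)^{-1},\) which holds under the stationarity condition. A short sketch, reusing `ma_coefficients` from the recursion above:

```python
import numpy as np

def long_run_response(A_list):
    """Long-run matrix C = (I - A_1 - ... - A_p)^{-1} = Psi(1)."""
    m = A_list[0].shape[0]
    return np.linalg.inv(np.eye(m) - sum(A_list))

def cirf(A_list, h):
    """CIRF(h) = Psi_0 + Psi_1 + ... + Psi_h (Equation 16.8)."""
    return sum(ma_coefficients(A_list, h))       # reuses the recursion sketch above
```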
16.2.2 Orthogonalized Impulse Response Function
Since the innovations \(\bvarepsilon_t\) in a VAR model are contemporaneously correlated, the IRF is difficult to interpret: we cannot describe the impact of a shock to \(\varepsilon_{jt}\) while “holding” \(\varepsilon_{st}\) constant for \(s\neq j,\) because \(\varepsilon_{jt}\) and \(\varepsilon_{st}\) are correlated.
This is why we introduce the orthogonalized impulse response function, which is based on orthogonalized innovations, so that the shocks are uncorrelated. The orthogonalized innovations are obtained by applying the Cholesky decomposition to the covariance matrix \(\bSigma = \E(\bvarepsilon_t \bvarepsilon_t')\) of the innovations \(\bvarepsilon_t.\)
We can factor \(\bSigma\) as
\[ \bSigma = \bP \bP', \]
where \(\bP\) is a lower triangular matrix, called a “square root” of \(\bSigma.\)
\(\bP\) can be obtained by the Cholesky decomposition of \(\bSigma:\)
\[ \bP = \text{chol}(\bSigma) . \]
Let \(m=3,\) then we have
\[ \bP = \begin{bmatrix} p_{11} & 0 & 0 \\ p_{21} & p_{22} & 0 \\ p_{31} & p_{32} & p_{33} \end{bmatrix} = \begin{bmatrix} \bp_1 & \bp_2 & \bp_3 \end{bmatrix} \]
with non-negative diagonal elements.
Then we can define the Orthogonalized Innovations, \(\tilde{\bvarepsilon}_t,\) as
\[ \begin{aligned} \tilde{\bvarepsilon}_t &= \bP^{-1} \bvarepsilon_t , \\ \Rightarrow \bvarepsilon_t &= \bP \tilde{\bvarepsilon}_t . \end{aligned} \]
We can rewrite \(\bvarepsilon_t\) as a linear combination of the orthogonalized innovations \(\tilde{\bvarepsilon}_t\):
\[ \bvarepsilon_t = \bp_1 \tilde{\varepsilon}_{1t} + \bp_2 \tilde{\varepsilon}_{2t} + \cdots + \bp_m \tilde{\varepsilon}_{mt} , \]
where \(\bp_j\) is the \(j\)-th column of \(\bP.\)
\(\tilde{\bvarepsilon}_t\) has mean zero and covariance matrix
\[ \begin{aligned} \E(\tilde{\bvarepsilon}_t \tilde{\bvarepsilon}_t') &= \E(\bP^{-1} \bvarepsilon_t \bvarepsilon_t' \bP'^{-1}) \\ &= \bP^{-1} \E(\bvarepsilon_t \bvarepsilon_t') \bP'^{-1} \\ &= \bP^{-1} \bSigma \bP'^{-1} \\ &= \bI_m . \end{aligned} \]
That is, the elements of \(\tilde{\bvarepsilon}_t = (\tilde{\varepsilon}_{1t}, \ldots, \tilde{\varepsilon}_{mt})'\) are mutually uncorrelated.
Now we can define the Orthogonalized Impulse Response Function (OIRF) as
\[ \text{OIRF}(h) = \frac{\partial}{\partial \tilde{\bvarepsilon}_t'} \Pcal_t [\by_{t+h}] = \frac{\partial}{\partial \tilde{\bvarepsilon}_t'} \E(\by_{t+h} \mid \Fcal_t) . \tag{16.9}\]
Given \(\bvarepsilon_t = \bP\tilde{\bvarepsilon}_t,\) we can rewrite the multivariate Wold representation as
\[ \begin{aligned} \by_t &= \bar{\by} + \sum_{j=0}^\infty \bPsi_j \bvarepsilon_{t-j} \\ &= \bar{\by} + \sum_{j=0}^\infty \bPsi_j \bP \tilde{\bvarepsilon}_{t-j} . \end{aligned} \]
Taking the derivative, we have
\[ \text{OIRF}(h) = \bPsi_h \bP = \text{IRF}(h) \bP , \]
which is the non-orthogonalized IRF multiplied by the Cholesky factor \(\bP.\)
Each \((i,j)\)-element of \(\text{OIRF}(h)\) can be obtained as \[ \text{OIRF}_{ij}(h) = [\bPsi_h \bP]_{ij} = \sum_{s=1}^m \text{IRF}_{is}(h) p_{sj} , \]
where \(\bPsi_h\) can be written as a matrix of row vectors:
\[ \bPsi_h = \begin{bmatrix} \bpsi_{1h}' \\ \bpsi_{2h}' \\ \vdots \\ \bpsi_{mh}' \end{bmatrix}, \]
and \(\bP\) can be written as a matrix of column vectors:
\[ \bP = \begin{bmatrix} \bp_1 & \bp_2 & \cdots & \bp_m \end{bmatrix} \]
Then the \((i,j)\)-element of \(\text{OIRF}(h)\) is given by
\[ \text{OIRF}_{ij}(h) = \bpsi_{ih}' \bp_{j} = \sum_{s=1}^m \psi_{ih, s} p_{sj} , \]
where \(\bp_{j}\) is the \(j\)-th column of \(\bP.\)
Taking \(m=3\) as an example,
\[ \text{OIRF}(h) = \bPsi_h \bP = \begin{bmatrix} \bpsi_{1h}' \\ \bpsi_{2h}' \\ \bpsi_{3h}' \end{bmatrix} \begin{bmatrix} \bp_1 & \bp_2 & \bp_3 \end{bmatrix} = \begin{bmatrix} \bpsi_{1h}' \bp_1 & \bpsi_{1h}' \bp_2 & \bpsi_{1h}' \bp_3 \\ \bpsi_{2h}' \bp_1 & \bpsi_{2h}' \bp_2 & \bpsi_{2h}' \bp_3 \\ \bpsi_{3h}' \bp_1 & \bpsi_{3h}' \bp_2 & \bpsi_{3h}' \bp_3 \end{bmatrix} \]
The Cumulative Orthogonalized Impulse Response Function (COIRF) is given by
\[ \text{COIRF}(h) = \sum_{\ell=0}^h \text{OIRF}(\ell) = \sum_{\ell=0}^h \bPsi_\ell \bP = \text{CIRF}(h) \bP . \]
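A short sketch of the orthogonalized responses, again reusing `ma_coefficients` from Section 16.1.1; `numpy.linalg.cholesky` returns exactly the lower triangular factor \(\bP\) with \(\bSigma = \bP\bP'.\) Accumulating the returned matrices over \(h\) gives the COIRF.

```python
import numpy as np

def oirf(A_list, Sigma, H):
    """OIRF(h) = Psi_h P for h = 0, ..., H, with P = chol(Sigma)."""
    P = np.linalg.cholesky(Sigma)                # lower triangular, Sigma = P P'
    Psi = ma_coefficients(A_list, H)             # non-orthogonalized IRF matrices
    return [Ph @ P for Ph in Psi]
```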
16.3 Regression Notation
The \(\text{VAR}(p)\) system of equations can be written as
\[ \by_t = \bA'\bx_{t} + \bvarepsilon_t , \]
where \(\bx_t\) is the \(\left(mp+1\right)\times 1\) vector of regressors:
\[ \underbrace{\bx_t}_{\left(mp+1\right)\times 1} = \begin{pmatrix} 1 \\ \by_{t-1} \\ \by_{t-2} \\ \vdots \\ \by_{t-p} \end{pmatrix} \]
and
\[ \begin{split} \underbrace{\bA'}_{m \times (mp+1)} &= \left(\begin{array}{c|cccc|cccc|c|cccc} a_{01} & a_{11,1} & a_{12,1} & \cdots & a_{1m,1} & a_{11,2} & a_{12,2} & \cdots & a_{1m,2} & \cdots & a_{11,p} & a_{12,p} & \cdots & a_{1m,p} \\ a_{02} & a_{21,1} & a_{22,1} & \cdots & a_{2m,1} & a_{21,2} & a_{22,2} & \cdots & a_{2m,2} & \cdots & a_{21,p} & a_{22,p} & \cdots & a_{2m,p} \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ a_{0m} & a_{m1,1} & a_{m2,1} & \cdots & a_{mm,1} & a_{m1,2} & a_{m2,2} & \cdots & a_{mm,2} & \cdots & a_{m1,p} & a_{m2,p} & \cdots & a_{mm,p} \end{array}\right) \\ &= \begin{pmatrix} \bmu_0 & \bA_1' & \bA_2' & \cdots & \bA_p' \end{pmatrix} \end{split} \]
The error has the covariance matrix
\[ \bSigma = \mathbb{E}[\bvarepsilon_t \bvarepsilon_t'] \]
We can also write the coefficient matrix as
\[ \bA = (\balpha_1,\balpha_2,\dots,\balpha_m) \]
where \(\balpha_j\) is the \(j\)th column of \(\bA\): the vector of coefficients for the \(j\)th variable in the \(\text{VAR}(p)\) model.
Thus
\[ \balpha_j = \begin{pmatrix} \mu_j \\ a_{j1,1} \\ a_{j2,1} \\ \vdots \\ a_{jm,1} \\ a_{j1,2} \\ a_{j2,2} \\ \vdots \\ a_{jm,2} \\ \vdots \\ a_{j1,p} \\ a_{j2,p} \\ \vdots \\ a_{jm,p} \end{pmatrix}, \quad j=1,\ldots,m \]
and
\[ y_{jt} = \balpha_j' \bx_t + \varepsilon_{jt} \]
16.4 Estimation
The systems estimator of a multivariate regression is least squares:
\[ \hat{\bA} = \left(\sum_{t=1}^n \bx_t \bx_t'\right)^{-1} \left(\sum_{t=1}^n \bx_t \by_t'\right) \]
Alternatively, the coefficient estimator for the \(j\)th equation is
\[ \hat{\balpha}_j = \left(\sum_{t=1}^n \bx_t \bx_t'\right)^{-1} \left(\sum_{t=1}^n \bx_t y_{jt}\right) \]
The least squares residual vector is
\[ \hat{\bvarepsilon}_t = \by_t - \hat{\bA}' \bx_t \]
The covariance matrix of the residuals is estimated as
\[ \hat{\bSigma} = \frac{1}{n-mp-1} \sum_{t=1}^n \hat{\bvarepsilon}_t \hat{\bvarepsilon}_t' \]
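The estimators above translate directly into code. A hedged numpy sketch, assuming `y` is an \((n+p) \times m\) data matrix so that \(n\) effective observations remain after forming \(p\) lags; `var_ols` is a hypothetical helper name.

```python
import numpy as np

def var_ols(y, p):
    """Least squares estimation of a VAR(p); y is an (n + p) x m data matrix."""
    n = y.shape[0] - p                           # effective sample size
    m = y.shape[1]
    # rows of X are x_t' = (1, y_{t-1}', ..., y_{t-p}')
    X = np.hstack([np.ones((n, 1))] +
                  [y[p - l: p - l + n] for l in range(1, p + 1)])
    Y = y[p:]                                    # rows are y_t'
    A_hat = np.linalg.solve(X.T @ X, X.T @ Y)    # (mp+1) x m, columns are alpha_j
    resid = Y - X @ A_hat                        # rows are residuals eps_hat_t'
    Sigma_hat = resid.T @ resid / (n - m * p - 1)
    return A_hat, Sigma_hat, resid
```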
16.4.1 Statistical Properties
If \(\by_t\) is strictly stationary and ergodic with finite variances, then we can apply the Ergodic Theorem to deduce that
\[ \frac{1}{n} \sum_{t=1}^n \bx_t \bx_t' \xrightarrow{p} \E\left(\bx_t\bx_t'\right) = \bQ \]
and
\[ \frac{1}{n} \sum_{t=1}^n \bx_t \by_t' \xrightarrow{p} \E\left(\bx_t \by_t'\right) . \]
Hence, we conclude that \(\hat{\bA}\) is consistent for \(\bA\):
\[ \hat{\bA} \xrightarrow{p} \bA \quad \text{as } n\to\infty. \]
\(\hat{\bSigma}\) is consistent for \(\bSigma\):
\[ \hat{\bSigma} \xrightarrow{p} \bSigma \quad \text{as } n\to\infty. \]
VAR models can be estimated in Stata using the `var` command.
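In Python, an assumed alternative is statsmodels' VAR class, which implements the same equation-by-equation least squares estimator; `data` here is a hypothetical \(n \times m\) array or DataFrame.

```python
from statsmodels.tsa.api import VAR

results = VAR(data).fit(2)        # VAR(2) estimated equation by equation via OLS
print(results.summary())
irf = results.irf(10)             # moving average matrices Psi_0, ..., Psi_10
irf.plot(orth=True)               # orthogonalized (Cholesky) impulse responses
fevd = results.fevd(10)           # forecast error variance decomposition (Section 16.9)
```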
16.5 Asymptotic Distribution
Set
\[ \balpha = \text{vec}(\bA) = \begin{pmatrix} \balpha_1 \\ \balpha_2 \\ \vdots \\ \balpha_m \end{pmatrix}, \quad \hat{\balpha} = \text{vec}(\hat{\bA}) = \begin{pmatrix} \hat{\balpha}_1 \\ \hat{\balpha}_2 \\ \vdots \\ \hat{\balpha}_m \end{pmatrix} \]
Theorem 16.2 (VAR asymptotic distribution) If \(\by_t\) follows the \(\text{VAR}(p)\) model with \(\E(\bvarepsilon_t\mid \Fcal_{t-1})=0,\) then as \(n\to\infty,\)
\[ \sqrt{n}(\hat{\balpha} - \balpha) \xrightarrow{d} \mathcal{N}(0, \bV) \]
where
\[ \begin{split} \bV &= \overline{\bQ}^{-1} \bOmega\, \overline{\bQ}^{-1} \\ \overline{\bQ} &= \bI_m \otimes \bQ \\ \bQ &= \E\left(\bx_t \bx_t'\right) \\ \bOmega &= \E \left(\bvarepsilon_t\bvarepsilon_t' \otimes \bx_t \bx_t' \right) \end{split} \]
\(\E(\bvarepsilon_t\mid \Fcal_{t-1})=0\) means that the innovation is a martingale difference sequence, i.e., the error term is unforecastable at time \(t-1\).
This means that this distributional result assumes that the \(\text{VAR}(p)\) model is the correct conditional mean for each variable. In words, these are the correct lags and there is no omitted nonlinearity.
16.5.1 Conditional Homoskedasticity
If we further strengthen the MDS assumption and assume that \(\bvarepsilon_t\) is conditionally homoskedastic, i.e.,
\[ \E(\bvarepsilon_t \bvarepsilon_t' \mid \Fcal_{t-1}) = \bSigma , \tag{16.10}\]
then the asymptotic variance simplifies to
\[ \begin{split} \bOmega &= \bSigma \otimes \bQ \\ \bV &= \bSigma \otimes \bQ^{-1} \end{split} \]
Theorem 16.3 (VAR asymptotic distribution, homoskedastic) If \(\by_t\) follows the \(\text{VAR}(p)\) model with \(\E(\bvarepsilon_t \mid \Fcal_{t-1}) = 0\) and \(\E(\bvarepsilon_t\bvarepsilon_t'\mid \Fcal_{t-1})=\bSigma,\) then as \(n\to\infty,\)
\[ \sqrt{n}(\hat{\balpha} - \balpha) \xrightarrow{d} \mathcal{N}(0, \bV) \]
where
\[ \bV = \bSigma \otimes \bQ^{-1}, \quad \bQ = \E\left(\bx_t \bx_t'\right) . \]
16.6 Covariance Matrix Estimation
The classical homoskedastic estimator of the covariance matrix for \(\hat{\balpha}\) equals
\[ \begin{split} \hat{\bV}^0_{\hat{\balpha}} &= \hat{\bSigma} \otimes \left(\sum_{t=1}^n \bx_t \bx_t'\right)^{-1} \\ &= \hat{\bSigma} \otimes (\bX'\bX)^{-1} . \end{split} \]
The heteroskedasticity-robust covariance matrix estimator is given by
\[ \begin{split} \hat{\bV}_{\hat{\balpha}} &= \left(\bI_m \otimes\sum_{t=1}^n \bx_t \bx_t'\right)^{-1} \left(\sum_{t=1}^n \hat{\bvarepsilon}_t \hat{\bvarepsilon}_t' \otimes \bx_t \bx_t' \right) \left(\bI_m \otimes\sum_{t=1}^n \bx_t \bx_t'\right)^{-1} \\ &= (\bI_m \otimes \bX'\bX)^{-1} \left(\sum_{t=1}^n \hat{\bvarepsilon}_t \hat{\bvarepsilon}_t' \otimes \bx_t \bx_t' \right) (\bI_m \otimes \bX'\bX)^{-1} \end{split} \]
The Newey-West covariance matrix estimator is given by
\[ \begin{split} \hat{\bV}_{\hat{\balpha}} &= (\bI_m \otimes \bX'\bX)^{-1} \hat{\bOmega}_M (\bI_m \otimes \bX'\bX)^{-1} \end{split} \]
where
\[ \begin{split} \hat{\bOmega}_M &= \sum_{\ell=-M}^M w_\ell \left(\sum_{1 \le t,\, t-\ell \le n} \hat{\bvarepsilon}_{t-\ell}\hat{\bvarepsilon}_t' \otimes \bx_{t-\ell} \bx_t' \right) \\ w_\ell &= 1-\frac{\abs{\ell}}{M+1} \text{ for } \ell = -M, \ldots, M. \end{split} \]
The number \(M\) is called the lag truncation number. An unweighted estimator is obtained by setting \(w_\ell = 1\) for all \(\ell.\) The Newey-West estimator generalizes the heteroskedasticity-robust covariance matrix estimator (the special case \(M=0\)) by adding weighted cross-product terms that account for serial correlation in the errors. The Newey-West estimator does not require that the \(\text{VAR}(p)\) is correctly specified.
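A sketch of \(\hat{\bOmega}_M\) with Bartlett weights, written in terms of the stacked score vectors \(\bg_t = \hat{\bvarepsilon}_t \otimes \bx_t;\) the inputs match the `var_ols` sketch above.

```python
import numpy as np

def newey_west_omega(X, resid, M):
    """Omega_hat_M with Bartlett weights; X is n x (mp+1), resid is n x m."""
    m = resid.shape[1]
    # rows are g_t' with g_t = eps_hat_t kron x_t, since
    # (eps eps') kron (x x') = (eps kron x)(eps kron x)'
    G = np.hstack([resid[:, [j]] * X for j in range(m)])
    Omega = G.T @ G                              # ell = 0 term (weight 1)
    for ell in range(1, M + 1):
        w = 1 - ell / (M + 1)                    # Bartlett weight
        Gamma = G[ell:].T @ G[:-ell]             # sum over t of g_t g_{t-ell}'
        Omega += w * (Gamma + Gamma.T)           # ell and -ell terms together
    return Omega
```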
Caveat: Asymptotic approximations tend to be much less accurate under time series dependence than for independent observations.
Mitigation: Bootstrap methods are therefore popular. The most common are the recursive residual, moving block, and stationary bootstraps, which are designed to preserve the time series structure of the data while allowing for resampling. Refer to Section 14.45 (Bootstrap for Time Series) in Econometrics by Bruce Hansen for more details.
16.7 IRF Estimation
Given the estimated VAR model and the estimated coefficient matrices \(\hat{\bA}_\ell,\) we can obtain the impulse response estimator based on Equation 16.6 as
\[ \widehat{\text{IRF}}(h) = \widehat{\bPsi}_h = \sum_{\ell=1}^{\text{min}[h,p]} \widehat{\bA}_\ell \widehat{\bPsi}_{h-\ell} . \]
Note that \(\widehat{\text{IRF}}(h)\) is a nonlinear function of the VAR coefficient estimators \(\widehat{\bA}.\) In principle, the delta method can be used to derive the asymptotic distribution of the IRF estimator. However, since the asymptotic approximation for \(\widehat{\bA}\) is already poor under time series dependence, the nonlinearity of the IRF estimator makes the approximation for \(\widehat{\text{IRF}}(h)\) only worse. For example, in the simple AR(1) model with coefficient estimate \(\widehat{\alpha},\) the estimator \(\widehat{\text{IRF}}(h) = \widehat{\alpha}^h\) is highly nonlinear for even moderate horizons \(h.\) Consequently, the bootstrap is typically used to construct confidence intervals for the IRF estimator.
The orthogonalized impulse response estimator is given by
\[ \widehat{\text{OIRF}}(h) = \widehat{\bPsi}_h \widehat{\bP} , \]
where \(\widehat{\bP} = \text{chol}(\widehat{\bSigma})\) so that \(\widehat{\bP} \widehat{\bP}' = \widehat{\bSigma}.\)
16.8 Predictive Regression
A standard VAR model is a special case of a predictive regression model with a one-step-ahead forecast horizon. An \(h\)-step predictive \(\text{VAR}(p)\) is one where the dependent variable is dated \(h\) periods ahead of the right-hand-side variables:
\[ \by_{t+h} = \bb_0 + \bB_1 \by_{t} + \bB_2 \by_{t-1} + \cdots + \bB_p \by_{t-p+1} + \bu_t \tag{16.11}\]
The integer \(h\ge 1\) is called the forecast horizon. When \(h=1,\) the predictive VAR model reduces to the standard VAR model. The coefficients in Equation 16.11 should be viewed as the best linear predictors of \(\by_{t+h}\) based on the information in \((\by_t, \by_{t-1}, \ldots, \by_{t-p+1}).\)
An important relationship between a VAR model and the corresponding \(h\)-step predictive VAR model is given by the following theorem.
Theorem 16.4 (VAR and predictive VAR) If \(\by_t\) is a \(\text{VAR}(p)\) process, then its \(h\)-step predictive regression is a predictive \(\text{VAR}(p)\) model with
- \(\bu_t\) being a \(\text{MA}(h-1)\) process, and
- \(\bB_1 = \bPsi_h,\) which is the \(h\)-th impulse response matrix
Some important implications of this result are:
Given that \(\bu_t\) is a \(\text{MA}(h-1)\) process, the error term in the predictive VAR model is serially correlated.
Therefore, the OLS estimator of the predictive VAR model is consistent but inefficient; the Newey-West covariance matrix estimator can be used to account for the serial correlation in the errors when conducting inference on the coefficients in the predictive VAR model.
The principle for choosing the lag length \(p\) in the predictive VAR model is to simply use the same lag length as the standard VAR model for all horizons.
We cannot use standard selection criteria such as AIC or BIC to choose the lag length in the predictive VAR model, because the error term is serially correlated.
We can, however, apply the AIC to the one-step VAR model, and then use the same \(p\) for all horizons.
Based on the property that \(\bB_1 = \bPsi_h,\) we obtain a direct estimator of the impulse response function, called the Local Projection Estimator of the IRF:
\[ \widehat{\text{IRF}}(h) = \hat{\bB}_1 . \]
Since it is a direct estimator of the impulse response, it is possibly more robust than the conventional method. It is a linear estimator and thus likely to have a better-behaved asymptotic distribution.
A disadvantage of the local projection estimator is that it has serially correlated errors, making the estimator imprecise. Local projection estimators tend to be less smooth and more erratic than those produced by the conventional estimator, reflecting a possible lack of precision.
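A minimal sketch of the local projection estimator under the setup above: regress \(\by_{t+h}\) on \((1, \by_t', \ldots, \by_{t-p+1}')'\) by least squares and read off \(\hat{\bB}_1;\) `local_projection_irf` is a hypothetical helper name and `y` is an \(m\)-column data matrix.

```python
import numpy as np

def local_projection_irf(y, p, h):
    """Estimate IRF(h) = B_1 from the h-step predictive VAR (Equation 16.11)."""
    nobs, m = y.shape
    T = nobs - p + 1 - h                         # usable sample size
    # rows of X are (1, y_t', y_{t-1}', ..., y_{t-p+1}')
    X = np.hstack([np.ones((T, 1))] +
                  [y[p - 1 - l: p - 1 - l + T] for l in range(p)])
    Y = y[p - 1 + h: p - 1 + h + T]              # rows are y_{t+h}'
    B_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    return B_hat[1: m + 1].T                     # B_1: the h-th impulse response matrix
```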
16.9 Forecast Error Variance Decomposition
The Forecast Error Variance Decomposition (FEVD) is a tool to quantify the contribution of each shock to the forecast error variance of each variable in a VAR system.
We write the moving average representation of the \(i\)-th component of \(\by_{t+h}\) as a function of the orthogonalized innovations \(\tilde{\bvarepsilon}_t\):
\[ y_{i, t+h} = \bar{y}_i + \sum_{\ell=0}^\infty \bpsi_{i\ell}' \bP \tilde{\bvarepsilon}_{t+h-\ell} . \]
The best linear forecast of \(y_{i, t+h}\) based on the information at time \(t\) is given by
\[ \Pcal_t[y_{i,t+h}] = y_{i, t+h \mid t} = \bar{y}_i + \sum_{\ell=h}^\infty \bpsi_{i\ell}' \bP \tilde{\bvarepsilon}_{t+h-\ell} . \]
The \(h\)-step forecast error is the difference
\[ y_{i, t+h} - y_{i, t+h \mid t} = \sum_{\ell=0}^{h-1} \bpsi_{i\ell}' \bP \tilde{\bvarepsilon}_{t+h-\ell} . \]
The variance of this forecast error is
\[ \begin{aligned} \var\left[y_{i, t+h} - y_{i, t+h \mid t}\right] &= \sum_{\ell=0}^{h-1} \var \left[ \bpsi_{i\ell}' \bP \tilde{\bvarepsilon}_{t+h-\ell} \right] \\ &= \sum_{\ell=0}^{h-1} \bpsi_{i\ell}' \bP \E\left(\tilde{\bvarepsilon}_{t+h-\ell} \tilde{\bvarepsilon}_{t+h-\ell}'\right) \bP' \bpsi_{i\ell} \\ &= \sum_{\ell=0}^{h-1} \bpsi_{i\ell}' \bP \bP' \bpsi_{i\ell} \end{aligned} \tag{16.12}\]
To isolate the contribution of the \(j\)-th shock to the forecast error variance, we can write the \(h\)-step forecast error variance in Equation 16.12 as
\[ \var\left[y_{i, t+h} - y_{i, t+h \mid t} \right] = \sum_{j=1}^m \sum_{\ell=0}^{h-1} \bpsi_{i\ell}' \bp_j \bp_j' \bpsi_{i\ell} = \sum_{j=1}^m \sum_{\ell=0}^{h-1} \left(\bpsi_{i\ell}' \bp_j\right)^2, \]
using \(\bP\bP' = \sum_{j=1}^m \bp_j \bp_j'.\)
The forecast error variance decomposition is defined as the ratio of the \(j\)-th shock’s contribution to the total forecast error variance:
\[ \text{FEVD}_{ij}(h) = \frac{\sum_{\ell=0}^{h-1} \left(\bpsi_{i\ell}' \bp_j\right)^2}{\sum_{s=1}^m \sum_{\ell=0}^{h-1} \left(\bpsi_{i\ell}' \bp_s\right)^2} . \]
The \(\text{FEVD}_{ij}(h)\) lies in \([0,1]\) and varies across \(h.\) Small values indicate that \(\tilde{\varepsilon}_{jt}\) contributes only a small amount to the \(h\)-step forecast error variance of \(y_{it};\) large values indicate a large contribution.
A forecast error variance decomposition requires orthogonalized innovations. There is no non-orthogonalized version.
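A sketch of the FEVD computation, reusing `ma_coefficients` from Section 16.1.1; row \(i\) of the returned matrix decomposes the \(h\)-step forecast error variance of variable \(i\) across the \(m\) orthogonalized shocks and sums to one.

```python
import numpy as np

def fevd(A_list, Sigma, h):
    """FEVD at horizon h; entry (i, j) is shock j's share for variable i."""
    P = np.linalg.cholesky(Sigma)
    Psi = ma_coefficients(A_list, h - 1)         # Psi_0, ..., Psi_{h-1}
    Theta = np.array([Ph @ P for Ph in Psi])     # Theta[l, i, j] = psi_il' p_j
    contrib = (Theta ** 2).sum(axis=0)           # sum over l of (psi_il' p_j)^2
    return contrib / contrib.sum(axis=1, keepdims=True)
```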
16.10 Granger Causality
VAR models can be used to test for Granger causality. In plain language, causality is inferred when lagged values of a variable, say \(x_t,\) have explanatory power in a regression of another variable \(y_t\) on its own lags and the lags of \(x_t:\)
\[ y_t = \alpha_0 + \gamma y_{t-1} + \beta x_{t-1} + \varepsilon_t \]
We can test the null hypothesis of no Granger causality from \(x_t\) to \(y_t\) by testing \(H_0: \beta = 0.\)
More generally, we can say that \(x_t\) Granger-causes \(y_t\) if
\[ \E(y_t \mid y_{t-1}, x_{t-1}) \neq \E(y_t \mid y_{t-1}) . \]
In a two-variable VAR(1) model, let \(\by_t = [uemp_t, inf_t]'\) be a vector of unemployment and inflation. We have the VAR(1) model:
\[ \by_t = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} + \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \by_{t-1} + \begin{bmatrix}\varepsilon_{1t} \\ \varepsilon_{2t} \end{bmatrix} \]
To assert Granger causality from \(inf_t\) to \(uemp_t\) (lagged inflation helping to predict unemployment), we need to test the null hypothesis that
\[ H_0: a_{12} = 0 \quad \text{vs.} \quad H_1: a_{12} \neq 0. \]
We can do a \(t\)-test for the coefficient \(a_{12}\). When \(a_{12}\) is statistically significant, we can say that \(inf\) Granger-causes \(uemp.\)
If there are more lags, we can do an \(F\)-test for the joint significance of the coefficients on the lags of \(inf_t\) in the \(uemp_t\) equation.
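A hedged sketch of such a test using statsmodels' built-in Wald/F test; `data` is a hypothetical DataFrame with columns `uemp` and `inf`.

```python
from statsmodels.tsa.api import VAR

results = VAR(data[['uemp', 'inf']]).fit(1)   # bivariate VAR(1)
# H0: lags of inf have no explanatory power for uemp
# (inf does not Granger-cause uemp)
test = results.test_causality(caused='uemp', causing=['inf'], kind='f')
print(test.summary())
```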
The notion can be extended to test if a given variable is weakly exogenous. If lagged values of \(x_t\) have no explanatory power for any of the variables in a system, then we would view \(x\) as weakly exogenous to the system.
References
- William H. Greene, Econometric Analysis, 5th Edition, Prentice Hall, 2003. Chapter 19: Models with Lagged Variables, p. 686.
- Fumio Hayashi, Econometrics, Princeton University Press, 2000. Chapter 6.3: Serial Correlation: Estimating Autoregressions, p. 387.
- Bruce Hansen, Econometrics, 2022. Chapter 15: Multivariate Regression.