Convergence

  • Formal definition of convergence:
    A sequence $X_n$ converges to a constant $c$ iff for all $\delta>0$ there exists some value $n^{\ast}$ such that for all $n>n^{\ast}$ we have $\vert X_n-c \vert < \delta$. For example, $X_n=c+\frac{1}{n}$ for $n=1,2, \ldots$ converges to $c$: given $\delta$, take $n^{\ast}=\lfloor 1/\delta\rfloor + 1$ (see the numeric check after this list).

  • Convergence in Probability, $\lim_{n\to\infty}P(\vert X_n-c\vert < \varepsilon) =1$, denoted as $\text{plim}_{n\to\infty}X_n=c$, or $X_n \xrightarrow{p} c$.

  • Almost Sure Convergence, $P(\lim_{n\to\infty}\vert X_n-c\vert =0)=1$, denoted as $X_n \xrightarrow{a.s.} c$.

  • Convergence in Mean Square, $\lim_{n\to\infty}\mathbb{E}[(X_n-c)^2]=0$, denoted as $X_n \xrightarrow{m.s.} c$.

  • Convergence in Distribution, $P(X_n\le x) \rightarrow P(X\le x)$ as $n\rightarrow \infty$ at every continuity point $x$ of the limiting CDF, denoted $X_n\xrightarrow{d}X$. If $X\sim N(\mu, \sigma^2)$, we write $X_n \xrightarrow{d} N(\mu, \sigma^2)$.
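As a quick numeric check of the formal definition above (a minimal sketch; the helper name is ours), for $X_n = c + \frac{1}{n}$ one valid threshold is $n^{\ast} = \lfloor 1/\delta \rfloor + 1$:

```python
import math

def n_star(delta: float) -> int:
    """A valid threshold for X_n = c + 1/n: for all n > n*,
    |X_n - c| = 1/n < delta (not necessarily the smallest such n*)."""
    return math.floor(1 / delta) + 1

for delta in [0.1, 0.01, 0.001]:
    ns = n_star(delta)
    # spot-check the definition on a range of n beyond n*
    assert all(1 / n < delta for n in range(ns + 1, ns + 1000))
    print(f"delta={delta}: n* = {ns}")
```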

Note:

  • Convergence in probability is weaker than both a.s. and m.s. convergence: each of the latter implies the former, so establishing either one is sufficient for convergence in probability. In certain cases, proving one type can be easier than another.
  • Convergence in probability implies convergence in distribution. Convergence in probability means that two objects converge to each other; convergence in distribution means only that the distributions of two different objects become the same.
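A classic counterexample separating the two notions, sketched below (assuming NumPy; the setup is ours): take $X \sim N(0,1)$ and $X_n = -X$ for all $n$. By symmetry, $X_n$ has the same distribution as $X$, so $X_n \xrightarrow{d} X$ trivially, but $\vert X_n - X \vert = 2\vert X \vert$ never shrinks, so $X_n$ does not converge to $X$ in probability.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
x_n = -x  # same N(0,1) distribution as x, so X_n ->d X holds trivially

# the distributions match (compare a few quantiles)...
print(np.quantile(x,   [0.25, 0.50, 0.75]).round(2))
print(np.quantile(x_n, [0.25, 0.50, 0.75]).round(2))

# ...but X_n does not converge to X in probability:
eps = 0.5
print("P(|X_n - X| > eps) ≈", np.mean(np.abs(x_n - x) > eps).round(2))  # ≈ 0.80, not 0
```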

Consistent estimator

$\hat{\theta}$ is a consistent estimator of $\theta$ if

\[\hat{\theta} \xrightarrow{p} \theta.\]

Law of Large Numbers (LLN)

  • Chebyshev’s (weak) LLN
    Let $X_1, \ldots, X_n$ be an iid sequence with mean $\mathbb{E}(X_i)=\mu$ and $\text{Var}(X_i)=\sigma^2$. Then,

    \[\begin{align*} \frac{1}{n}\sum_{i=1}^nX_i &\xrightarrow{p} \mathbb{E}(X_i). \\ (\text{i.e., } \overline{X}_n &\xrightarrow{p} \mu ) \end{align*}\]

    The LLN tells us that the sample mean is a consistent estimator of the expected value.
    Note that three assumptions need to be satisfied:

    1. $X_1, \ldots, X_n$ are iid;
    2. the mean $\mu$ is finite; and
    3. the variance $\sigma^2$ is finite.
  • Kolmogorov’s (strong) LLN
    Let $X_1, \ldots, X_n$ be an iid sequence with mean $\mathbb{E}(X_i)=\mu$. Then

    \[\frac{1}{n}\sum_{i=1}^nX_i \xrightarrow{a.s.} \mathbb{E}(X_i).\]

    Note that the strong LLN requires only a finite mean.

    Kolmogorov’s LLN is much harder to prove. The weak LLN can be proved easily by showing m.s. convergence, which implies convergence in probability.

    Proof of weak LLN:

    Let \(\overline{X}_n = \frac{1}{n}\sum_{i=1}^n X_i\) be the sample average. To prove the weak LLN, we show that $\lim_{n\to\infty}\mathbb{E}[(\overline{X}_n-\mu)^2]=0$.

    \[\begin{align*} \lim_{n\to\infty}\mathbb{E}[(\overline{X}_n-\mu)^2] &= \lim_{n\to\infty} \text{Var}[\overline{X}_n] && (\text{since } \mathbb{E}[\overline{X}_n]=\mu) \\ &= \lim_{n\to\infty} \frac{\sigma^2}{n} && (\text{since Var}[\overline{X}_n]=\sigma^2/n \text{ under iid}) \\ &= 0. \end{align*}\]

    This shows $\overline{X}_n \xrightarrow{m.s.} \mu$, which implies $\overline{X}_n \xrightarrow{p} \mu$ (a simulation illustrating this appears after this list).

  • LLN for random vectors
    For $\boldsymbol{y}_i\in \mathbb{R}^m$, if $\boldsymbol{y}_i$ are independent and identically distributed and $E\,\vert\vert \boldsymbol{y}_i \vert\vert <\infty$, then as $n\to\infty$,

    \[\overline{\boldsymbol{y}} = \frac{1}{n}\sum_{i=1}^n\boldsymbol{y}_i \xrightarrow{p} E\,[\boldsymbol{y}_i].\]

    Note: convergence in probability of a vector can be defined as convergence in probability of all elements in the vector.

    $E\,\vert\vert \boldsymbol{y} \vert\vert <\infty \Leftrightarrow E\,\vert y_j \vert <\infty$ for each $j=1,\ldots,m$. Saying that the expected value of the Euclidean norm is finite is equivalent to saying that the expected value of each element is finite, since $\vert y_j \vert \le \vert\vert \boldsymbol{y} \vert\vert \le \sum_{j=1}^m \vert y_j \vert$.

    \[\overline{\boldsymbol{y}} = \frac{1}{n}\sum_{i=1}^n\boldsymbol{y}_i = \begin{pmatrix} \overline{y}_1 \\ \overline{y}_2 \\ \vdots \\ \overline{y}_m \\ \end{pmatrix}\]
    • $E\,\vert\vert \boldsymbol{y} \vert\vert <\infty$ indicates 1st moment finite;
    • $E\,\vert\vert \boldsymbol{y} \vert\vert^2 <\infty$ indicates 2nd moment finite.
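A simulation sketch of the weak LLN (assuming NumPy; the Exponential(1) population, with $\mu = 1$, is our choice): the probability that $\overline{X}_n$ misses $\mu$ by more than $\varepsilon$ shrinks toward zero as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(42)
mu, eps, reps = 1.0, 0.05, 1_000

for n in [10, 100, 1_000, 10_000]:
    # reps independent samples of size n from Exponential(mu)
    xbar = rng.exponential(scale=mu, size=(reps, n)).mean(axis=1)
    p_far = np.mean(np.abs(xbar - mu) > eps)
    print(f"n={n:>6}: P(|Xbar_n - mu| > {eps}) ≈ {p_far:.3f}")
# the printed probabilities decrease toward 0, i.e. Xbar_n ->p mu
```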

Ergodic Theorem

If $\boldsymbol{y}_t$ is strictly stationary, ergodic, and $E\,\vert\vert \boldsymbol{y}_t \vert\vert <\infty $, then as $n\to\infty$,

\[E\vert\vert \overline{\boldsymbol{y}}-\boldsymbol{\mu} \vert\vert \to 0\]

and

\[\overline{\boldsymbol{y}} \xrightarrow{p} \boldsymbol{\mu}\]

where $\boldsymbol{\mu}=E[\boldsymbol{y}_t]$.

The ergodic theorem shows that ergodicity is sufficient for consistent estimation of the mean.

Note that instead of requiring an iid sequence, the ergodic theorem imposes the weaker assumptions of strict stationarity and ergodicity. That is, serial dependence is allowed in the time series.
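As an illustration (a sketch, assuming NumPy; the Gaussian AR(1) process is our choice): a stationary AR(1) is strictly stationary and ergodic but serially dependent, and its sample mean still converges to $\mu$, as the ergodic theorem predicts.

```python
import numpy as np

rng = np.random.default_rng(7)
rho, mu = 0.8, 2.0  # AR(1): y_t = mu + rho*(y_{t-1} - mu) + e_t, e_t ~ N(0,1)

def ar1_path(n: int) -> np.ndarray:
    """One stationary AR(1) sample path of length n."""
    y = np.empty(n)
    # initialize from the stationary distribution N(mu, 1/(1-rho^2))
    y[0] = mu + rng.normal() / np.sqrt(1 - rho**2)
    for t in range(1, n):
        y[t] = mu + rho * (y[t - 1] - mu) + rng.normal()
    return y

for n in [100, 1_000, 10_000, 100_000]:
    print(f"n={n:>7}: ybar = {ar1_path(n).mean():.4f}")  # approaches mu = 2
```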


Unbiasedness vs Consistency

  • Unbiasedness, $\mathbb{E}(\hat{\theta})=\theta$, is a finite-sample property that holds for any sample size.
  • Consistency, $\hat{\theta}\xrightarrow{p}\theta$, is an asymptotic property that holds in the limit as $n\to \infty$.
  • Note: neither property implies the other (see the sketch after this list).
    • If we are interested in bias, we take the expectation on both sides of the equation.
    • If we are interested in consistency, we take the probability limit.
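A sketch of why neither property implies the other (assuming NumPy; the two estimators are our illustrative choices): $\hat{\mu} = X_1$ is unbiased for $\mu$ but inconsistent (it never concentrates), while the $1/n$ variance estimator $s_1^2$ is biased but consistent.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma2, reps = 3.0, 4.0, 5_000

for n in [10, 100, 1_000]:
    x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
    mu_hat = x[:, 0]               # X_1: unbiased, but never concentrates
    s1_sq = x.var(axis=1, ddof=0)  # divide by n: biased, but consistent
    print(f"n={n:>5}: E[X_1]≈{mu_hat.mean():.2f}, sd(X_1)≈{mu_hat.std():.2f}, "
          f"E[s1^2]≈{s1_sq.mean():.2f}, sd(s1^2)≈{s1_sq.std():.2f}")
# sd(X_1) stays at sigma for every n (inconsistent); E[s1^2] approaches
# sigma^2 = 4 and sd(s1^2) shrinks (consistent despite finite-sample bias)
```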

Central Limit Theorem (CLT)

Suppose that $X_1, \ldots, X_n$ is an iid sequence with mean $\mathbb{E}(X_i)=\mu$ and $\text{Var}(X_i)=\sigma^2$. Let $\overline{X}=\frac{1}{n}\sum_{i=1}^n X_i.$ Then,

\[\frac{\overline{X}-\mathbb{E}[\overline{X}]}{\sqrt{\text{Var}[\overline{X}]}} = \frac{\overline{X}-\mu}{\sqrt{\sigma^2/n}} = {\color{#008B45FF} \sqrt{n}} \cdot \frac{\overline{X}-\mu}{\sigma} \xrightarrow{d} N(0,1)\]

equivalently, we can write

\[\sqrt{n}\cdot\frac{\overline{X}-\mu}{\sigma} \overset{a}{\sim} N(0,1)\]

$\overset{\rm a}{\sim}$ means “approximately distributed as”.
Or we can also write as

\[\begin{align*} \begin{array}{rlrl} \sqrt{n} (\overline{X}-\mu) &\xrightarrow{d} N(0,\sigma^2), & \sqrt{n} (\overline{X}-\mu) &\overset{a}{\sim} N(0,\sigma^2) \\ \overline{X} -\mu &\xrightarrow{d} N(0,\sigma^2/n), & \overline{X} -\mu &\overset{a}{\sim} N(0,\sigma^2/n) \\ \overline{X} &\xrightarrow{d} N(\mu,\sigma^2/n), & {\color{#008B45FF} \overline{X}} & {\color{#008B45FF} \overset{a}{\sim} N(\mu,\sigma^2/n) }. \end{array} \end{align*}\]

Strictly speaking, only the first line of the left column is a formal limit statement (a limiting distribution cannot depend on $n$); the remaining lines should be read in the approximate ($\overset{a}{\sim}$) sense.

Note: The CLT is a very powerful result. $X_1, \ldots, X_n$ can come from any distribution (iid, with finite mean and variance), and still their normalized sample mean will be asymptotically standard normal.

  • In practice, we replace $\sigma$ with $\widehat{\sigma}$ because $\sigma$ is unobserved, while $\widehat{\sigma}$ can be computed from the sample.
  • Population variance estimators, $\widehat{\sigma}^2$. Two versions:
    • biased sample variance: \(s_1^2 = \frac{1}{n} \sum_{i=1}^n [(X_i-\overline{X})^2]\) and
    • unbiased sample variance: \(s_2^2 = \frac{1}{n-1} \sum_{i=1}^n [(X_i-\overline{X})^2]\)
    • Both options are consistent; using either is fine (see the sketch after this list).
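A simulation sketch of the studentized CLT (assuming NumPy; the skewed Exponential(1) population is our choice): even though the data are far from normal, $\sqrt{n}\,(\overline{X}-\mu)/\widehat{\sigma}$ behaves like a standard normal.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, n, reps = 1.0, 500, 20_000  # Exponential(1): skewed, mean 1, variance 1

x = rng.exponential(scale=mu, size=(reps, n))
xbar = x.mean(axis=1)
sigma_hat = x.std(axis=1, ddof=1)          # sqrt of s_2^2; ddof=0 (s_1^2) works too
z = np.sqrt(n) * (xbar - mu) / sigma_hat   # studentized sample mean

# compare empirical probabilities with standard normal values
for q, target in [(1.645, 0.95), (1.96, 0.975)]:
    print(f"P(Z <= {q}) ≈ {np.mean(z <= q):.3f}   (N(0,1): {target})")
```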

Define the partial sum $S_n := \sum_{i=1}^n X_i = n\overline{X}_n$. By the CLT, we have

\[\frac{\overline{X}-\mu}{\sigma/\sqrt{n}} \xrightarrow{d} N(0,1)\]

Multiplying the numerator and denominator by $n$, we have

\[\begin{align*} \frac{n\overline{X}-n\mu}{n\cdot\sigma/\sqrt{n}} = \frac{S_n-n\mu}{\sigma\sqrt{n}} \end{align*}\]

which yields the partial-sum form of the CLT:

\[\frac{S_n-n\mu}{\sigma \sqrt{n}} \xrightarrow{d} N(0,1) .\]
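A quick numeric check (assuming NumPy) that the standardized sample mean and the standardized partial sum are the same quantity:

```python
import numpy as np

rng = np.random.default_rng(9)
mu, sigma, n = 1.0, 1.0, 1_000
x = rng.exponential(scale=mu, size=n)     # any population with mean 1, sd 1

z_mean = np.sqrt(n) * (x.mean() - mu) / sigma      # sqrt(n)(Xbar - mu)/sigma
z_sum = (x.sum() - n * mu) / (sigma * np.sqrt(n))  # (S_n - n*mu)/(sigma*sqrt(n))
print(np.isclose(z_mean, z_sum))  # True: algebraically identical
```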

CLT for Martingale Differences

We are interested in an asymptotic approximation for the distribution of standardized sample means such as

\[\boldsymbol{S}_n = \frac{1}{\sqrt{n}} \sum_{t=1}^n \boldsymbol{u}_t\]

where $\boldsymbol{u}_t$ is a martingale difference sequence (MDS) with mean zero and finite variance $E[\boldsymbol{u}_t\boldsymbol{u}_t']=\boldsymbol{\Sigma}<\infty$.

The MDS CLT: If $\boldsymbol{u}_t$ is a strictly stationary and ergodic martingale difference sequence and $E[\boldsymbol{u}_t\boldsymbol{u}_t']=\boldsymbol{\Sigma}<\infty$, then as $n\to\infty,$ $$ \boldsymbol{S}_n = \frac{1}{\sqrt{n}} \sum_{t=1}^n \boldsymbol{u}_t \xrightarrow{d} N(\boldsymbol{0}, \boldsymbol{\Sigma}). $$

The conditions of the theorem are similar to those of the Lindeberg–Lévy CLT. The only difference is that the iid assumption has been replaced by the assumption of a strictly stationary and ergodic MDS.

Recall that an MDS has mean zero and is an uncorrelated process.
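A simulation sketch of the MDS CLT (assuming NumPy; the process is our choice): $u_t = \varepsilon_t \varepsilon_{t-1}$ with iid $\varepsilon_t \sim N(0,1)$ is a strictly stationary, ergodic MDS that is uncorrelated but not independent, with $\Sigma = E[u_t^2] = 1$, so its scaled sum is approximately $N(0,1)$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 1_000, 10_000

eps = rng.standard_normal((reps, n + 1))
u = eps[:, 1:] * eps[:, :-1]        # u_t = e_t * e_{t-1}: MDS with variance 1
s = u.sum(axis=1) / np.sqrt(n)      # S_n = n^{-1/2} * sum_t u_t

print("mean ≈", s.mean().round(3), "  var ≈", s.var().round(3))  # ≈ 0 and 1
print("P(S_n <= 1.96) ≈", np.mean(s <= 1.96).round(3))           # ≈ 0.975
```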


CLT for Correlated Observations

In classical settings, the Central Limit Theorem (CLT) assumes that the observations are independent or form a Martingale Difference Sequence (MDS). However, in time series data and many empirical applications, serial correlation (i.e., dependence across time) is common. Therefore, it is important to extend the CLT to account for such correlation in the data.

We now relax the MDS assumption and allow for serial correlation in $\bu_t.$

We aim to develop a CLT for the normalized mean $\bS_n$ defined as

\[\boldsymbol{S}_n = \frac{1}{\sqrt{n}} \sum_{t=1}^n \boldsymbol{u}_t ,\]

where $\bu_t$ is serially correlated.

Univariate Case

We first consider a univariate case where $u_t$ is a real-valued stationary process with:

\[\begin{split} \E(u_t) &= \mu \\ \var(u_t) &= \sigma^2 \\ \cov(u_t, u_{t-\ell}) &= \gamma(\ell) = \gamma(-\ell) . \end{split}\]

The variance of the normalized sum is:

\[\begin{split} \var(S_n) &= \sigma^2 + 2\sum_{\ell=1}^{n-1} \left(1-\frac{\ell}{n}\right) \gamma(\ell) \\ &= \sum_{\ell=1-n}^{n-1} \left(1-\frac{\abs{\ell}}{n}\right) \gamma(\ell) \end{split}\]

The weight $1-\frac{\abs{\ell}}{n}$ downweights autocovariances further from lag zero. This is often called a “triangular” weighting scheme, and it plays a key role in understanding long-run variance behavior.
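A numeric sketch of this formula (assuming NumPy; the AR(1) autocovariances $\gamma(\ell) = \sigma_u^2 \rho^{\vert\ell\vert}$ are our example): the triangular-weighted sum increases toward the long-run variance $\sum_{\ell=-\infty}^{\infty}\gamma(\ell) = \sigma_u^2(1+\rho)/(1-\rho)$ as $n$ grows.

```python
import numpy as np

rho, sigma_u2 = 0.7, 1.0

def gamma(ell):
    """AR(1) autocovariance function gamma(ell) = sigma_u^2 * rho^|ell|."""
    return sigma_u2 * rho ** np.abs(ell)

omega = sigma_u2 * (1 + rho) / (1 - rho)  # long-run variance, here 5.6667
for n in [10, 100, 1_000, 10_000]:
    ells = np.arange(1 - n, n)
    var_sn = np.sum((1 - np.abs(ells) / n) * gamma(ells))
    print(f"n={n:>6}: Var(S_n) = {var_sn:.4f}   (Omega = {omega:.4f})")
```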


Multivariate Case

In the vector case, where $\bu_t\in \R^m,$ we have

\[\begin{split} \E(\bu_t) &= \bmu \\ \var(\bu_t) &= \bSigma \\ \cov(\bu_t, \bu_{t-\ell}) &= \bGamma(\ell) = \bGamma(-\ell)' . \end{split}\]

The variance of the scaled sum $\bS_n$ is now a matrix:

\[\begin{split} \var(\bS_n) &= \bSigma + \sum_{\ell=1}^{n-1} \left(1-\frac{\ell}{n}\right) \left(\bGamma(\ell) + \bGamma(-\ell)\right) \\ &= \sum_{\ell=1-n}^{n-1} \left(1-\frac{\abs{\ell}}{n}\right) \bGamma(\ell) \\ &= \frac{1}{n} \sum_{\ell=1-n}^{n-1} \left(n-\abs{\ell}\right)\, \bGamma(\ell) \end{split}\]

This representation shows that lags closer to zero get higher weight, and lags with large absolute values contribute less due to the $n-\abs{\ell}$ weighting.


Cesàro-Type Representation

To analyze convergence, first consider the positive-lag portion of the sum:

\[\begin{equation}\label{eq-cesaro-mean} \frac{1}{n} \sum_{\ell=1}^{n-1} \left(n-\abs{\ell}\right) \,\bGamma(\ell) = \frac{1}{n} \sum_{\ell=1}^{n-1} \sum_{j=1}^\ell \bGamma(j) \end{equation}\]

which is in the form of a Cesàro mean of $a_\ell = \sum_{j=1}^\ell \bGamma(j)$: if $a_\ell$ converges, its Cesàro means converge to the same limit. Convergence of \eqref{eq-cesaro-mean} therefore requires

\[a_\ell = \sum_{j=1}^\ell \bGamma(j) \to \sum_{j=1}^\infty \bGamma(j) < \infty \quad \text{ as } \ell \to \infty ,\]

i.e., the autocovariances must decay sufficiently fast for the sum to converge.
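Continuing the AR(1) example with $\gamma(j) = \rho^j$ (a numeric sketch, assuming NumPy): the partial sums $a_\ell$ converge to $\rho/(1-\rho)$, and the Cesàro means in \eqref{eq-cesaro-mean} follow them to the same limit.

```python
import numpy as np

rho = 0.7
limit = rho / (1 - rho)  # sum_{j>=1} rho^j = 2.3333 (taking sigma_u^2 = 1)

for n in [10, 100, 1_000, 10_000]:
    a = np.cumsum(rho ** np.arange(1, n))  # a_ell = sum_{j<=ell} gamma(j), ell < n
    cesaro = a.sum() / n                   # (1/n) * sum_{ell=1}^{n-1} a_ell
    print(f"n={n:>6}: a_(n-1) = {a[-1]:.4f}, Cesaro mean = {cesaro:.4f}, "
          f"limit = {limit:.4f}")
```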


Long-Run Variance and Asymptotic Normality

Define the long-run covariance matrix:

\[\bOmega = \sum_{\ell=-\infty}^\infty \bGamma(\ell) < \infty .\]

This matrix captures the total contribution of autocorrelation across all lags.

If $\bOmega$ is finite (i.e., the sum of autocovariances is absolutely convergent), then:

\[\var(\bS_n) \to \bOmega \quad \text{ as } n\to\infty.\]
CLT for Correlated Processes: If $\bu_t$ is strictly stationary with $\E(\bu_t)=\bold{0}$ and $\cov(\bu_t, \bu_{t-\ell}) = \bGamma(\ell),$ and the long-run variance $\bOmega = \sum_{\ell=-\infty}^\infty \bGamma(\ell)$ is finite, then $$ \bS_n = \frac{1}{\sqrt{n}} \sum_{t=1}^n \boldsymbol{u}_t \xrightarrow{d} N(\bold{0}, \bOmega). $$

This result allows us to use asymptotic normality for estimation and inference even when the data exhibit serial dependence, as long as the dependence decays fast enough to ensure a finite long-run variance.
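A closing simulation sketch (assuming NumPy; a mean-zero Gaussian AR(1) with $\rho = 0.7$ and unit shocks is our choice, for which $\Omega = \sigma_e^2/(1-\rho)^2 \approx 11.11$): despite the serial correlation, the scaled sum $\bS_n$ is approximately $N(0, \Omega)$.

```python
import numpy as np

rng = np.random.default_rng(11)
rho, n, reps = 0.7, 2_000, 5_000
omega = 1 / (1 - rho) ** 2  # long-run variance of AR(1) with unit shocks

# reps independent stationary mean-zero AR(1) paths
e = rng.standard_normal((reps, n))
u = np.empty_like(e)
u[:, 0] = e[:, 0] / np.sqrt(1 - rho**2)  # start at the stationary distribution
for t in range(1, n):
    u[:, t] = rho * u[:, t - 1] + e[:, t]

s = u.sum(axis=1) / np.sqrt(n)  # S_n = n^{-1/2} * sum_t u_t
print(f"var(S_n) ≈ {s.var():.2f}   (Omega = {omega:.2f})")
print(f"P(S_n <= 1.96*sqrt(Omega)) ≈ {np.mean(s <= 1.96 * np.sqrt(omega)):.3f}"
      f"   (N(0,1): 0.975)")
```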