Econometric Notes
Basic Probability Properties
-
Probability of a union
\[P(A \cup B) = P(A) + P(B) - P(A \cap B)\]If $A$ and $B$ are mutually exclusive, then
\[\begin{split} P(A \cap B) &= 0 \\ P(A \cup B) &= P(A) + P(B). \end{split}\]This is also called the addition rule.
-
Conditional probability
\[P(A|B) = \frac{P(A \cap B)}{P(B)}\]If $A$ and $B$ are independent, then $P(A\vert B) = P(A)$ and $P(B\vert A) = P(B)$.
-
Multiplication rule
\[P(A \cap B) = P(A|B) P(B) = P(B|A) P(A)\]If $A$ and $B$ are independent, then
\[P(A \cap B) = P(A) P(B) .\] -
The partition rule
If $B_1, B_2, \ldots, B_m$ form a partition of the sample space $\Omega$, then
\[P(A) = \sum_{i=1}^m P(A \cap B_i)= \sum_{i=1}^m P(A|B_i) P(B_i)\]for any event $A$.
As a special case, $B$ and $\overline{B}$ partition $\Omega,$ so
\[\begin{split} P(A) &= P(A\cap B) + P(A\cap \overline{B}) \\ &= P(A|B) P(B) + P(A|\overline{B}) P(\overline{B}). \end{split}\] -
Bayes' theorem
\[P(B\vert A) = \frac{P(A\vert B)\, P(B)}{P(A)} = \frac{P(A\vert B)\, P(B)}{P(A\vert B)\,P(B) + P(A\vert \overline{B})\,P(\overline{B})}\]
-
Chains of events
If $A_1, A_2, \ldots, A_n$ are events in the sample space $\Omega$, then
\[P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1) P(A_2|A_1) P(A_3|A_1 \cap A_2) \cdots P(A_n|A_1 \cap A_2 \cap \cdots \cap A_{n-1}).\]
Expectation
Expectations are denoted by $\mathbb{E}(X)$, or $\mathbb{E}_X(X)$ when we want to emphasize that the expectation is taken over the RV $X.$
\[\mathbb{E}(X) = \begin{cases} \displaystyle \sum_{x}x f(x) & \text{for discrete } X \\ \displaystyle \int_{-\infty}^{\infty} x f(x)\, \mathrm dx & \text{for continuous } X \\ \end{cases}\]Conditional Expectations
The conditional expectation for $Y$, given that $X$ has a prescribed value, is defined as follows:
\[\mathbb E(Y|X=x) = \sum_j y_j P(Y=y_j|X=x) = \sum_j y_j f_{Y|X}(y_j|x) \, ,\]which is a function of $x.$
The conditional expectation for $X$ given $Y$, $\mathbb E(X\vert Y=y)$, is defined as follows:
\[\mathbb E(X\vert Y=y) = \sum_i x_i P(X=x_i\vert Y=y) = \sum_i x_i f_{X\vert Y}(x_i\vert y) \, ,\]which is a function of $y.$
We denote the conditional expectation for $Y$ given $X$ as follows:
\[\varphi_X(x) = \mathbb E(Y|X=x) \, .\]Technically, this is termed the regression function of $Y$ on $X$.
The expectation of the regression function is:
\[\begin{aligned} \mathbb E(\varphi_X(x)) &= \sum_i \varphi_X(x_i) f_X(x_i) \\ &= \sum_i \left\lbrace \sum_j y_j \, \frac{f_{XY}(x_i,y_j)}{f_X(x_i)} \right\rbrace f_X(x_i) \\ &= \sum_j y_j \sum_i f_{XY}(x_i,y_j) \\ &= \sum_j y_j f_Y(y_j) \\ &= \mathbb E(Y) \,. \end{aligned}\]Law of Iterated Expectations (LIE)
For any random variables $X\in \mathcal X$ and $Y\in \mathcal Y \subset \mathbb R$,
\[\begin{align*} \mathbb{E}_Y(Y) = \mathbb{E}_X\left[\mathbb{E}_{Y\vert X}(Y\vert X)\right] = \sum_{x\in\mathcal X} \mathbb{E}[Y\vert X=x]\, \mathbb{P}(X=x) . \end{align*}\](the subscripts in $\mathbb{E}_Y$ and $\mathbb{E}_X$ indicate which variable the expectation is taken over) or, more succinctly,
\[\begin{align*} \mathbb{E}[Y] = \mathbb{E}\left[\mathbb{E}(Y\vert X)\right]. \end{align*}\]It represents a transformation from conditional to unconditional expectation. The expected value (this expectation is with respect to $X$) of the conditional expectation of $Y$ given $X$ is the expected value of $Y$.
LIE is also called the law of total expectation, which can be derived from the law of total probability. We will see this in what follows. LIE is also referred to as "Adam's Law."
Suppose $\mathbb{E}(Y\vert X)=0$, then
\[\begin{align*} 1. \quad & \mathbb{E}[Y] = \mathbb{E}\left[\mathbb{E}(Y\vert X)\right] = 0 \\ 2. \quad & \mathbb{E}[g(X)\cdot Y] = \mathbb{E}\left[\mathbb{E}[g(X)\cdot Y\vert X]\right] = \mathbb{E}\left[g(X)\cdot \mathbb{E}[Y\vert X]\right] = \mathbb{E}[g(X)\cdot0]=0 \end{align*}\]Note that we start from the conditional expectation being $0$, and conclude that the unconditional expectation and its product with any function of $X$ are $0$ too.
A zero conditional expectation is a stronger condition than a zero unconditional expectation.
In other words, although conditional expectations can vary arbitrarily for different values of $X$, if you know what the conditional expectations are, the overall expected value of $Y$ is fully determined.
A simple example is one where $X$ takes only two values. Suppose we are interested in mean birthweight ($Y$) for children of mothers who either drank alcohol during their pregnancy ($X=1$) or who didn't drink alcohol during their pregnancy ($X=0$).
Suppose the following numbers:
\[\begin{aligned} \mathbb{E}[Y|X=1] &= 7, \;\;\; \mathbb{P}(X=1) = 0.1\\ \mathbb{E}[Y|X=0] &= 8, \;\;\; \mathbb{P}(X=0) = 0.9 \\ \end{aligned}\]The law of iterated expectation says that
\[\begin{aligned} \mathbb{E}[Y] &= \mathbb{E}_X \big[ \mathbb{E}[Y|X] \big] \\ &= \sum_{x \in \mathcal{X}} \mathbb{E}[Y|X=x] \cdot \mathbb{P}(X=x) \\ &= \mathbb{E}[Y|X=0] \cdot \mathbb{P}(X=0) + \mathbb{E}[Y|X=1] \cdot \mathbb{P}(X=1) \\ &= (8) \times (0.9) + (7) \times (0.1) \\ &= 7.9 \end{aligned}\]Expressed mathematically: suppose $\{A, A^c\}$ partitions the sample space; then
\[\mathbb E (Y) = \mathbb E (Y|A)\, P(A) + \mathbb E (Y|A^c)\, P(A^c)\]Identities for conditional expectations
- LIE (Adam's Law): $\mathbb{E}[Y] = \mathbb{E}\left[\mathbb{E}(Y\mid X)\right].$
-
Generalized Adam's Law
\[\E\left[ \E[Y \mid g(X)] \mid f(g(X)) \right] = \E[Y\mid f(g(X))]\]for any $f$ and $g$ with compatible domains and ranges. We also have that
\[\E\left[ \E[Y \mid g(X)] \mid f(g(X)=z) \right] = \E[Y\mid f(g(X))=z]\] -
Independence: $\E[Y\mid X] = \E[Y]$ if $X$ and $Y$ are independent.
-
Taking out what is known: $\E[h(X)Z\mid X]=h(X)\E[Z\mid X]$
-
Linearity: $\E[aX+bY\mid Z] = a\E[X\mid Z] + b\E[Y\mid Z],$ for any $a, b\in \R.$
-
Projection interpretation: $\E\left[\left(Y-\E[Y\mid X]\right)h(X)\right] = 0$ for any function $h: \Xcal \to \R.$
- Keeping just what is needed: $\E[XY] = \E[X\E[Y\mid X]]$ for $X, Y\in \R.$
More about LIE: https://davidrosenberg.github.io/ttml2021fall/background/conditional-expectation-notes.pdf
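As a quick numerical check of LIE, here is a minimal Python sketch that reuses the hypothetical birthweight numbers from the example above (the normal conditional distributions and their unit variances in the simulation are arbitrary assumptions, made only so that draws can be generated):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical numbers from the birthweight example above
p_x = {0: 0.9, 1: 0.1}          # P(X = x)
cond_mean = {0: 8.0, 1: 7.0}    # E[Y | X = x]

# LIE as a weighted average of conditional means
ey_lie = sum(cond_mean[x] * p for x, p in p_x.items())
print(ey_lie)                   # 7.9

# Monte Carlo check: draw X first, then Y | X
n = 1_000_000
x = rng.binomial(1, p_x[1], size=n)
y = np.where(x == 1,
             rng.normal(cond_mean[1], 1.0, size=n),
             rng.normal(cond_mean[0], 1.0, size=n))
print(y.mean())                 # approximately 7.9
```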
Partitioning and Conditioning
In the general case, consider a partition of the sample space $\{X_1, X_2, \ldots, X_n\}$, where each event $X_i$ has a corresponding probability $P(X_i)$, $i=1,\ldots,n$.
Given another event $Y$, then according to the partition rule we have:
\[\begin{aligned} P(Y) &= \sum_{i=1}^n P(Y \cap X_i) \\ &= \sum_{i=1}^n P(Y \vert X_i)\, P(X_i)\\ &= P(Y|X_1)P(X_1) + P(Y|X_2)P(X_2) + \cdots + P(Y|X_n)P(X_n) . \end{aligned}\]This is called the "law of total probability".
Then it follows that:
\[\begin{aligned} \mathbb{E}(Y) &= \sum_j y_j P(Y=y_j) \\ &= \sum_j y_j \left[P(Y=y_j|X_1)P(X_1) + P(Y=y_j|X_2)P(X_2) + \cdots + P(Y=y_j|X_n)P(X_n) \right] \\ &= \sum_j y_jP(Y=y_j|X_1)\,P(X_1) + \sum_j y_jP(Y=y_j|X_2)\,P(X_2) + \cdots + \sum_j y_jP(Y=y_j|X_n)\,P(X_n) \\ &= \mathbb{E}(Y|X_1)\,P(X_1) + \mathbb{E}(Y|X_2)\,P(X_2) + \cdots + \mathbb{E}(Y|X_n)\,P(X_n) \\ &= \mathbb{E}\left[\mathbb{E}(Y|X)\right] \end{aligned}\]A reverse proof starting from $\mathbb{E}\left[\mathbb{E}(Y\vert X)\right]$.
\[\begin{align*} \mathbb{E}\left[\mathbb{E}(Y|X)\right] &= \sum_{x\in\mathcal X} p(x)\, \mathbb{E}(Y|X=x) \\ &= \sum_{x\in\mathcal X} \left[p(x) \sum_{y\in\mathcal Y} y\, p(y|x) \right] \\ &= \sum_{y\in\mathcal Y} \left[ y \sum_{x\in\mathcal X} p(x,y) \right] \\ &= \sum_{y\in\mathcal Y} y\,p(y) \\ &= \mathbb E(Y) \tag*{\(\square\)} \end{align*}\]In case of continuous variables, we have
\[\begin{align} \mathsf E(Y) &= \int_\Bbb R y\,f_Y(y)\,\mathrm d y && \text{by definition of expectation} \\[1ex] &= \int_\Bbb R y\int_\Bbb R f_{Y\mid X}(y\mid x)~f_X(x)\,\mathrm d x\,\mathrm d y &&\text{Law of Total Probability} \\[1ex] &= \int_\Bbb R f_X(x)\int_\Bbb R y~f_{Y\mid X}(y\mid x)\,\mathrm d y\,\mathrm d x &&\text{Fubini's Theorem } \\[1ex] &= \int_\Bbb R f_X(x)\,\mathsf E(Y\mid X{\,=\,}x)\,\mathrm d x && \text{by definition of conditional expectation} \\[1ex] &= \mathsf E\left[\mathsf E(Y\mid X)\right] && \text{by definition of expectation} \end{align}\]Generalization of LIE in time series
\[\mathbb{E}\big[\mathbb{E}(y_{t+2}|x_{t+1}) |x_t \big] = \mathbb{E}[y_{t+2} |x_t]\]as $x_{t} \subset x_{t+1}$.
More generally, for any random variable $x$ and two information sets $\mathcal{J}$ and $\mathcal{I}$ with $\mathcal{J} \subset \mathcal{I}$, we have
\[\mathbb{E}\big[\mathbb{E}(x|\mathcal{I}) |\mathcal{J} \big] = \mathbb{E}[x |\mathcal{J}]\]Intuition behind the LIE
Think of $\mathbf{x}$ as a discrete vector taking on possible values $\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_M$ with probabilities $p_1, p_2, \ldots, p_M$. Then LIE says:
\[\mathbb{E}(y) = p_1 \mathbb{E}(y\vert \mathbf{x}=\mathbf{c}_1) + p_2 \mathbb{E}(y\vert \mathbf{x}=\mathbf{c}_2) + \cdots + p_M \mathbb{E}(y\vert \mathbf{x}=\mathbf{c}_M).\]That is, $\mathbb{E}(y)$ is simply a weighted average of the $\mathbb{E}(y\vert \mathbf{x}=\mathbf{c}_i)$, where the weight $p_i$ is the probability that $\mathbf{x}$ takes on the value of $\mathbf{c}_i$. In other words, a weighted average of averages.
E.g., suppose we are interested in average IQ generally, but we have measures of average IQ by gender. We could figure out the quantity of interest by weighting average IQ by the relative proportions of men and women.
Bayes' Theorem
Partition Theorem (total expectation theorem, law of total probability) Intuition
\[\mathbb{P}(\text{eventual goal}) = \sum_{\text{options}} \mathbb{P}(\text{eventual goal}|\text{option})\, \mathbb{P}(\text{option})\]Decomposition of variance
This is sometimes called the Law of Total Variance.
\[\text{Var}(Y) = \text{Var}_\mathbf{x}[\mathbb{E}(y\vert \mathbf{x})] + \mathbb{E}_\mathbf{x}[\text{Var}(y\vert \mathbf{x})]\]In plain language, the variance of $Y$ decomposes into the variance of the conditional mean plus the expected variance around the conditional mean.
Population and Sample
Population quantities require knowledge of the data-generating process (DGP). Anything you observe comes from a sample.
Expectation operator is a linear operator, meaning that we have
\[\begin{align*} \mathbb{E}(a+bX)=a+b\,\mathbb{E}(X) \end{align*}\]More generally, let $a_1, \ldots, a_n$ and $b_1, \ldots, b_n$ be sequences of non-random variables and let $X_1, \ldots, X_n$ be a sequence of random variables. Then,
\[\begin{align*} \mathbb{E}\left[\sum_{i=1}^n(a_i+b_iX_i)\right] = \sum_{i=1}^n \mathbb{E}(a_i+b_iX_i) = \sum_{i=1}^n\left(a_i+b_i\mathbb{E}[X_i]\right). \end{align*}\]Linear Transformations of a Random Vector
Let $Y=A+BX$, where $X$ is a $p\times 1$ random vector, $A$ is a $q \times 1$ non-random vector, and $B$ is a $q\times p$ non-random matrix.
The expected value of this transformation is given by
\[\mathbb{E}(Y) = A + B\, \mathbb{E}(X)\]The variance of this transformation is given by
\[\text{Var}(Y) = B\, \text{Var}(X)\, B'\]A $q \times q$ square matrix $\Sigma$ is called non-negative definite (or positive semi-definite) if for any non-zero $q \times 1$ vector $\boldsymbol{a}$ it holds that
\[\boldsymbol{a}'\Sigma\, \boldsymbol{a} \ge 0\]If the square matrix $\Sigma$ is non-negative definite, we write $\Sigma \ge 0$.
Note that the covariance matrix of any random vector is positive semi-definite.
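A minimal numpy sketch (simulated data; the dimensions, $A$, $B$, and the distribution of $X$ are all arbitrary assumptions) illustrating $\mathbb{E}(Y) = A + B\,\mathbb{E}(X)$ and $\text{Var}(Y) = B\,\text{Var}(X)\,B'$:

```python
import numpy as np

rng = np.random.default_rng(1)
p, q, n = 3, 2, 500_000

A = np.array([1.0, -2.0])                # q x 1 non-random vector
B = rng.normal(size=(q, p))              # q x p non-random matrix
X = rng.multivariate_normal(mean=[0.5, -1.0, 2.0],
                            cov=np.diag([1.0, 2.0, 0.5]),
                            size=n)      # n draws of the p-vector X
Y = X @ B.T + A                          # Y = A + B X, applied row by row

# Sample moments of Y vs. the population formulas
print(Y.mean(axis=0), A + B @ X.mean(axis=0))
print(np.cov(Y, rowvar=False))
print(B @ np.cov(X, rowvar=False) @ B.T)
```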
Expectations of Functions of RVs
If $X$ is a RV, then expected value of $g(X)$ is given by
\[\begin{align*} \mathbb{E}[g(X)] = \left\{ \begin{array}{ll} \int g(x)f(x)dx & \text{for continuous $X$} \\ \sum_xg(x)f(x) & \text{for discrete $X$} \end{array} \right. \end{align*}\]where $f(x)$ is the probability density (mass) function of continuous (discrete) $X$.
Variance is also an expectation by setting $g(X) = \left[X-\mathbb{E}(X)\right]^2$. In other words, $\text{Var}(X) = \mathbb{E}\left[(X-\mathbb{E}(X))^2\right]$.
\[\begin{align*} \text{Var}(X) = \mathbb{E}[(X-\mathbb{E}(X))^2] = \left\{ \begin{array}{ll} \int [x-\mathbb{E}(X)]^2 f(x)dx & \text{for continuous $X$} \\ \sum_x [x-\mathbb{E}(X)]^2 f(x) & \text{for discrete $X$} \end{array} \right. \end{align*}\]Example: Bernoulli
Let $X \sim \textrm{Bernoulli}(\theta)$, and recall that we have $\mathbb{E}(X)=\theta$. Then
\[\begin{align*} \text{Var}(X) &= \mathbb{E}(X-E[X])^2 \\ &= \sum_x \left(x-\mathbb{E}[X]\right)^2 f(x) \\ &= (0-\theta)^2 \times f(0) + (1-\theta)^2\times f(1) \\ &= \theta^2(1-\theta) + (1-\theta)^2\theta \\ &= \theta (1-\theta). \end{align*}\]Alternative derivation: Since $0^2=0$ and $1^2=1$, we have $X^2=X$ implying that $\mathbb{E}(X^2)=\mathbb{E}(X)=\theta$. Therefore,
\[\begin{align*} \text{Var}(X)=\mathbb{E}[X^2]-(\mathbb{E}[X])^2 = \theta-\theta^2 \end{align*}\]Sample mean Let $X_1, \ldots, X_n$ denote $n$ observations on a variable $X$, the sample mean is
\[\begin{align*} \overline{X}=\frac{1}{n}\sum_{i=1}^nX_i \end{align*}\]Sometimes, you add a subscript $n$ to denote the sample size, $\overline{X}_n .$
$\overline{X}$ is a random variable, as it is the average of random variables. This is in sharp contrast to $\mathbb{E}[X]$ which is non-random.
$\overline{X}$ varies with each sample. If we could repeatedly collect new samples of size $n$ from the same population and each time were able to estimate $\overline{X}$, these estimates would be different from each other. The distribution of a statistic, like $\overline{X}$, is called its sampling distribution.
One useful feature is $\mathbb{E}[\overline{X}] = \mathbb{E}[X]$. This doesn't mean that $\overline{X}$ itself is equal to $\mathbb{E}[X]$. Rather, it means that, if we could repeatedly obtain (a huge number of times) new samples of size $n$ and compute $\overline{X}$ each time, the average of $\overline{X}$ across repeated samples would be equal to $\mathbb{E}[X].$
Proof:
\[\begin{aligned} \mathbb{E}[\overline{X}] &= \mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^n X_i \right] \\ &= \frac{1}{n} \mathbb{E}\left[ \sum_{i=1}^n X_i \right] \\ &= \frac{1}{n} \sum_{i=1}^n \mathbb{E}[X_i] \\ &= \frac{1}{n} \sum_{i=1}^n \mathbb{E}[X] \;\;\;\; (X_i\text{ are iid with mean } \mathbb{E}[X]) \\ &= \frac{1}{n}\, n\, \mathbb{E}[X] \\ &= \mathbb{E}[X] \end{aligned}\]Two ways to compute the Sample variance
-
unadjusted sample variance, also called biased sample variance: \(\begin{align*} S_n^2 = \frac{1}{n}\sum_{i=1}^n(X_i-\overline X)^2 \end{align*}\)
-
adjusted sample variance, also called unbiased sample variance: \(\begin{align*} S^2 = \frac{1}{n-1}\sum_{i=1}^n(X_i-\overline X)^2 \end{align*}\)
The latter subtracts 1 from $n$ in the denominator, which is known as a degrees of freedom correction. See proof here.
Distinguish the sample variance $(S^2_n)$ from the variance of sample mean $(\text{Var}(\overline{X}))$.
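A short simulation sketch (the normal population and the sample size are arbitrary assumptions) contrasting the biased and unbiased sample variances and illustrating the sampling distribution of $\overline{X}$:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma2, n, reps = 5.0, 4.0, 30, 20_000

xbars, s2_biased, s2_unbiased = [], [], []
for _ in range(reps):
    x = rng.normal(mu, np.sqrt(sigma2), size=n)
    xbars.append(x.mean())
    s2_biased.append(x.var(ddof=0))    # divide by n
    s2_unbiased.append(x.var(ddof=1))  # divide by n - 1

print(np.mean(xbars))        # ~ mu: E[Xbar] = E[X]
print(np.var(xbars))         # ~ sigma2 / n: Var(Xbar), not the sample variance
print(np.mean(s2_biased))    # ~ sigma2 * (n - 1) / n  (biased downward)
print(np.mean(s2_unbiased))  # ~ sigma2                (unbiased)
```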
Covariance and Correlation
Covariance and correlation measure the linear association between two RVs $X$ and $Y$
\[\begin{aligned} \gamma &\equiv \text{Cov}(X, Y) = \mathbb{E}\big\{[X-\mathbb{E}(X)][Y-\mathbb{E}(Y)]\big\} \\ \rho &\equiv \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X)\text{Var}(Y)}} \end{aligned}\]where the expected value $\mathbb{E}[\cdot]$ is taken over the joint distribution of $(X,Y)$.
More formally
\[\begin{aligned} \textrm{Cov}(X, Y) = \begin{cases} \sum_y\sum_x[x-\mathbb{E}(X)][y-\mathbb{E}(Y)]f(x,y) & \textrm{for } X, Y \textrm{ discrete} \\ \iint[x-\mathbb{E}(X)][y-\mathbb{E}(Y)]f(x,y) dxdy & \textrm{for } X, Y \textrm{ continuous} \\ \end{cases} \end{aligned}\]where the sum and integral are over the supports of $X$ and $Y$.
For linear transformations, we have
\[\text{Cov}(a+bX, c+dY) = bd\,\text{Cov}(X,Y)\]for known constants $a,b,c,d$.
The additive law of covariance holds that the covariance of a random variable with a sum of random variables is just the sum of the covariances with each of the random variables.
\[\text{Cov}(X+Y, Z) = \text{Cov}(X,Z) + \text{Cov}(Y,Z)\]More generally,
\[\color{#008B45}\text{Cov}\left(\sum_{i=1}^m a_iX_i, \sum_{j=1}^n b_jY_j\right) = \sum_{i=1}^m\sum_{j=1}^n a_ib_j\text{Cov}(X_i, Y_j).\]One of the applications of covariance is finding the variance of a sum of several random variables. In particular, if $Z=X+Y$, then
\[\begin{aligned} \text{Var}(Z) &= \text{Cov}(Z,Z) \\ &= \text{Cov}(X+Y, X+Y) \\ &= \text{Cov}(X,X) + \text{Cov}(X,Y) + \text{Cov}(Y,X) + \text{Cov}(Y,Y) \\ &= \color{#008B45}\text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X,Y). \end{aligned}\]More generally, for $a_i\in \mathbb{R}, i=1,\ldots,n$, we conclude:
\[\color{#008B45} \text{Var}\left(\sum_{i=1}^n a_iX_i \right) = \sum_{i=1}^n a_i^2 \text{Var}(X_i) + \sum_{i=1}^n\sum_{j\neq i} a_ia_j \text{Cov}(X_i, X_j).\]Or equivalently,
\[\color{#008B45} \text{Var}\left(\sum_{i=1}^n a_iX_i \right) = \sum_{i=1}^n a_i^2 \text{Var}(X_i) + 2\sum_{i=2}^n\sum_{j=1}^{i-1} a_ia_j \text{Cov}(X_i, X_j).\]If we have either $\mathbb{E}(X)=0$ or $\mathbb{E}(Y)=0$ or both
\[\text{Cov}(X, Y) = \mathbb{E}{(XY)} - \mathbb{E}{(X)}\mathbb{E}{(Y)} = \mathbb{E}{(XY)}\]The sample covariance, $\hat{\gamma}$, in a sample of $n$ observations on $(X_i,Y_i)$ is
\[\hat{\gamma} = \frac{1}{n-1}\sum_{i=1}^n (X_i-\overline{X})(Y_i-\overline{Y})\]Division by $n-1$ rather than $n$ is called a degrees of freedom correction.
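A numpy sketch (simulated correlated draws; the covariance matrix and the coefficients $a, b$ are arbitrary assumptions) checking $\text{Var}(aX+bY) = a^2\text{Var}(X) + b^2\text{Var}(Y) + 2ab\,\text{Cov}(X,Y)$ with the $n-1$ sample estimators:

```python
import numpy as np

rng = np.random.default_rng(3)
n, a, b = 200_000, 2.0, -1.5

# Correlated pair (X, Y) with Cov(X, Y) = 0.6
cov_xy = np.array([[1.0, 0.6],
                   [0.6, 2.0]])
X, Y = rng.multivariate_normal([0.0, 0.0], cov_xy, size=n).T

Z = a * X + b * Y
lhs = Z.var(ddof=1)
rhs = (a**2 * X.var(ddof=1) + b**2 * Y.var(ddof=1)
       + 2 * a * b * np.cov(X, Y, ddof=1)[0, 1])
print(lhs, rhs)   # the two agree up to sampling noise
```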
Correlation is a scaled measure of covariance:
\[\textrm{Corr}(X,Y)=\frac{\textrm{Cov}(X, Y)}{\sqrt{\text{Var}(X)\text{Var}(Y)}}\]If $\textrm{Corr}(X,Y)=0$, we say that $X$ and $Y$ are uncorrelated or orthogonal, denoted by $X {\color{#008B45FF}\perp} Y$ (perpendicular symbol).
$X {\color{#008B45FF}\indep} Y$ (double perpendicular symbol) denotes $X$ and $Y$ are independent.
$X \indep Y \Rightarrow X \perp Y$, in plain language, independence implies zero correlation.
Dependent but uncorrelated example
Let $X\sim U(-1,1)$ (any distribution symmetric about zero works) and $Y=f(X)=X^2$. Then $E[X]=0$.
\[\begin{aligned} \text{Cov}(X, Y) &= E[Xf(X)] - E[X]E[f(X)] \\ &= E[Xf(X)] \quad (E[X]=0) \\ &= \int_{-1}^1 x\,f(x)\,p(x)\, dx \end{aligned}\]where $p(x)$ is the pdf of $X$.
\[p(x) = \left\{ \begin{array}{ll} \frac{1}{2} & x\in [-1,1] \\ 0 & \text{otherwise} \end{array} \right.\]Now $x$ is an odd function and $[-1,1]$ is a symmetric domain. That means, as long as $f(x)$ is an even function, $x\,f(x)\,p(x)$ is an odd function. It then follows that
\[\int_{-a}^a x\,f(x)\,p(x)\, dx = 0,\]and so $\text{Cov}(X, f(X))=0$.
$f(X)=X^2$ is an even function, which satisfies the condition.
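A quick simulation sketch of this example (the sample size is arbitrary): $X\sim U(-1,1)$ and $Y=X^2$ are clearly dependent, yet their sample covariance is approximately zero:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1.0, 1.0, size=1_000_000)
y = x**2

print(np.cov(x, y)[0, 1])          # ~ 0: uncorrelated
# Dependence: the conditional mean of Y changes sharply with |X|
print(y[np.abs(x) < 0.1].mean())   # ~ 0.0033  (E[Y | |X| < 0.1])
print(y[np.abs(x) > 0.9].mean())   # ~ 0.9033  (E[Y | |X| > 0.9])
```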
Properties of covariance
-
The covariance of a variable with itself is the variance of the random variable.
\[\cov(X, X) = \var(X)\] -
The covariance of a random variable $X$ with a constant $c$ is zero.
\[\cov(c, X) = 0\] -
The covariance is commutative.
\[\cov(X, Y) = \cov(Y,X)\] -
If $X$ and $Y$ are independent then
\[\cov(X,Y)=0\] -
Adding a constant to either or both random variables does not change their covariances.
\[\cov(X+c, Y+k) = \cov(X,Y)\] -
Multiplying a random variable by a constant multiplies the covariance by that constant.
\[\cov(cX, kY) = ck\times\cov(X,Y)\] -
The covariance of a random variable with a sum of random variables is the sum of the covariances with each of the random variables.
\[\cov(X+Y, Z) = \cov(X,Z) + \cov(Y,Z)\] -
More generally, covariance of sum of random variables:
\[\cov\left(\sum_{i=1}^ma_iX_i, \sum_{j=1}^nb_jY_j\right)= \sum_{i=1}^m\sum_{j=1}^na_ib_j\cov(X_i,Y_j)\]
Properties for correlation coefficients
-
Adding a constant to a random variable does not change their correlation coefficient.
\[\cor(X+c, Y+k) = \cor(X,Y)\] -
Multiplying the random variables by constants can only change the sign of their correlation coefficient: it is unchanged when $cd>0$ and flips sign when $cd<0$.
\[\cor(cX, dY) = \operatorname{sign}(cd)\, \cor(X,Y)\]
Joint Distributions and Independence
Joint Distribution Function (Joint CDF)
\[\begin{aligned} F(x_1, x_2) &= P(X_1\le x_1, X_2\le x_2) \\ &= \left\{ \begin{array}{ll} \displaystyle \int_{-\infty}^{x_1}\int_{-\infty}^{x_2} f(a, b)\, \mathrm db\, \mathrm da & \text{continuous} \\ \displaystyle \sum_{a\le x_1} \sum_{b\le x_2} f(a,b) & \text{discrete} \end{array} \right. \end{aligned}\]Joint Density Function (Joint PDF)
\[f\,(x_1, x_2) = \left\{ \begin{array}{ll} \displaystyle \frac{\partial^2F(x_1, x_2)}{\partial x_1\, \partial x_2} & \text{continuous} \\ \displaystyle P(X_1=x_1, X_2=x_2) & \text{discrete} \end{array} \right.\]In other words, for the continuous case, $f(x_1,x_2)$ is the function that satisfies
\[F(x_1, x_2) = \int_{-\infty}^{x_2}\int_{-\infty}^{x_1} f(a,b)\, \mathrm da\mathrm db\]Marginals
Consider a discrete random vector. When one of its entries is taken in isolation, its distribution can be characterized in terms of its probability mass function. This is called the marginal probability mass function, in order to distinguish it from the joint probability mass function (PMF), which instead characterizes the joint distribution of all the entries of the random vector considered together.
Formal Definition Let $X_1, \ldots, X_K$ be $K$ discrete random variables forming a $K\times 1$ random vector. Then, for each $i=1,\ldots,K,$ the probability mass function of the random variable $X_i$, denoted by $p_{X_i}(x_i)\, ,$ is called the marginal probability mass function.
$p_{X_i}(x_i)$ is a function: $\mathbb R \mapsto [0,1]$ such that
\[p_{X_i}(x_i) = P(X_i=x_i)\]where $p_{X_i}(x_i)$ is the probability that $X_i$ will be equal to $x_i$.
By contrast, the joint probability mass function of the vector $X$ is a function $p_X: \mathbb R^K \mapsto [0,1]$ such that
\[p_{X}(x) = p_{X_1,\ldots,X_K}(x_1, \ldots, x_K) = P(X_1=x_1, \ldots, X_K=x_K)\]where $P(X_1=x_1, \ldots, X_K=x_K)$ is the probability that $X_i$ will be equal to $x_i$, simultaneously for all $i=1,\ldots, K\,.$
Marginal PMFs
The marginal PMF of $X$ is given by
\[\begin{align*} P_X(x) &= P(X=x) \\ &= \sum_{y_j\in R_Y} P(X=x, Y=y_j) \quad \text{(Law of total probability)} \\ &= \sum_{y_j\in R_Y} P_{XY}(x, y_j) \end{align*}\]where $R_Y=\{y_1,y_2,\ldots\}$ is the range of $Y$.
Similarly, the marginal PMF of $Y$ is given by
\[P_Y(y) = \sum_{x_i\in R_X} P_{XY}(x_i, y)\]where $R_X=\{x_1,x_2,\ldots\}$ is the range of $X$.
We can define the joint range for $X$ and $Y$ as
\[R_{XY} = \{(x,y) | P_{XY}(x,y) >0 \} \, .\]Equivalently, we can also write
\[\begin{aligned} R_{XY} &\subset R_X \times R_Y \\ &= \{(x_i,y_j) | x_i\in R_X, y_j\in R_Y \} \, . \end{aligned}\]Note that
- The event $X=x$ can be written as $\{(x_i,y_j): x_i=x, y_j \in R_Y \}\, .$
- Also, the event $Y=y$ can be written as $\{(x_i,y_j): x_i\in R_X, y_j=y\}\, .$
This is called marginalization: integrating out $X_2$ gives the marginal of $X_1$.
The marginal of one variable can be obtained by integrating out the other variable(s). This applies in $n$ dimensions as well.
Marginalizing the joint density with respect to $X_1$ gives the marginal of $X_2$:
\[f_{X_2}(k) = \left\{ \begin{array}{ll} \displaystyle \int_{-\infty}^\infty f(x_1,k) \, \mathrm dx_1 & \text{continuous} \\ \displaystyle \sum_{x_1} f(X_1=x_1,X_2=k) & \text{discrete} \end{array} \right.\]where $\sum_{x_1}$ means summing over all possible values of $x_1$.
Marginal CDFs
If we know the joint CDF of $X_1$ and $X_2$, we can find the marginal CDFs, $F_{X_1}(x_1)$ and $F_{X_2}(x_2)$. Specifically, for any $x_1, x_2 \in \mathbb{R}$, we have
\[F_{X_1}(x_1) = P(X_1\le x_1) = P(X_1\le x_1, X_2\le \infty) = F_{X_1X_2}(x_1,\infty) = \lim_{x_2\to\infty} F_{X_1X_2}(x_1,x_2) \, .\]Similarly,
\[F_{X_2}(x_2) = P(X_2\le x_2) = P(X_1\le \infty, X_2\le x_2) = F_{X_1X_2}(\infty,x_2) = \lim_{x_1\to\infty} F_{X_1X_2}(x_1,x_2) \, .\]Example
Suppose we have a pair of discrete random variables $(X, Y)$, with an associated joint probability mass function, $f_{XY}(x,y)$, tabulated below.
|          | $X=1$ | $X=2$ | $X=3$ |
|----------|-------|-------|-------|
| $Y=-1$   | 0.1   | 0.1   | 0.0   |
| $Y=0$    | 0.2   | 0.0   | 0.3   |
| $Y=2$    | 0.1   | 0.2   | 0.0   |
The sum of all probabilities should be equal to 1.
\[\sum_{i}\sum_{j} f_{X,Y} (x_i, y_j) = 1\]To calculate the marginal masses, we sum along either the rows or the columns, respectively.
Summing along each row (i.e., over the values of $X$) gives the marginal probability of $Y$.
\[\begin{aligned} f_Y(Y=-1) &= \sum_{i=1}^3 f_{X,Y}(x_i, -1) = 0.1+0.1+0 = 0.2 \\ f_Y(Y=0) &= \sum_{i=1}^3 f_{X,Y}(x_i, 0) = 0.2+0+0.3 = 0.5 \\ f_Y(Y=2) &= \sum_{i=1}^3 f_{X,Y}(x_i, 2) = 0.1+0.2+0 = 0.3 \end{aligned}\]Summing down each column (i.e., over the values of $Y$) gives the marginal probability of $X$.
\[\begin{aligned} f_X(X=1) &= \sum_{j=1}^3 f_{X,Y}(1, y_j) = 0.1+0.2+0.1 = 0.4 \\ f_X(X=2) &= \sum_{j=1}^3 f_{X,Y}(2, y_j) = 0.1+0+0.2 = 0.3 \\ f_X(X=3) &= \sum_{j=1}^3 f_{X,Y}(3, y_j) = 0+0.3+0 = 0.3 \end{aligned}\]Then the calculation of expectations is straightforward:
\[\begin{aligned} \mathbb E(X) &= \sum_i x_i f_X(x_i) = 1\times 0.4 + 2\times 0.3 + 3\times 0.3 = 1.9 \\ \mathbb E(Y) &= \sum_i y_i f_Y(y_i) = -1\times 0.2 + 0\times 0.5 + 2\times 0.3 = 0.4 \end{aligned}\]so that, $\mathbb E (X)\, \mathbb E (Y)= 0.76\,.$
After some calculation, summing over the entire table:
\[\mathbb E(XY) = \sum_{i,j}x_i y_j f_{X,Y}(x_i, y_j) =0.7 \,.\]Since $\mathbb E(XY)=0.7\neq \mathbb E(X)\,\mathbb E(Y)=0.76$, $X$ and $Y$ are correlated.
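The calculations for this example can be reproduced with a few lines of numpy (a sketch; the array below simply encodes the table above):

```python
import numpy as np

# Rows index Y in (-1, 0, 2); columns index X in (1, 2, 3)
f = np.array([[0.1, 0.1, 0.0],
              [0.2, 0.0, 0.3],
              [0.1, 0.2, 0.0]])
x_vals = np.array([1, 2, 3])
y_vals = np.array([-1, 0, 2])

f_x = f.sum(axis=0)   # marginal of X: [0.4, 0.3, 0.3]
f_y = f.sum(axis=1)   # marginal of Y: [0.2, 0.5, 0.3]

e_x = x_vals @ f_x                 # 1.9
e_y = y_vals @ f_y                 # 0.4
e_xy = y_vals @ f @ x_vals         # sum_ij x_i y_j f(x_i, y_j) = 0.7
print(e_x, e_y, e_xy, e_x * e_y)   # 1.9 0.4 0.7 0.76
```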
Independence
Two random variables $X_1$ and $X_2$ are independent if and only if
\[F_{X_1X_2}(x_1, x_2) = F_{X_1}(x_1) F_{X_2}(x_2)\]where $F_{X_1X_2}(x_1, x_2)$ is their joint distribution function and $F_{X_1}(x_1)$ and $F_{X_2}(x_2)$ are their marginal distribution functions.
The joint probability density function, $f_{X_1X_2}(x_1, x_2),$ is also the product of their marginal probability density functions.
\[f_{X_1X_2}(x_1, x_2) = f_{X_1}(x_1) f_{X_2}(x_2)\]Conditioning
Conditional density:
\[\begin{aligned} f_{X_1\vert X_2}(X_1=x_1\vert X_2=x_2) &= \frac{f_{X_1 X_2}(X_1=x_1, X_2=x_2)}{f_{X_2}(X_2=x_2)} \\ f_{X_2\vert X_1}(X_2=x_2\vert X_1=x_1) &= \frac{f_{X_1 X_2}(X_1=x_1, X_2=x_2)}{f_{X_1}(X_1=x_1)} \end{aligned}\]Note that the numerator is the joint density and the denominator is the marginal density.
The conditional pdf can be written more succinctly by dropping the subscripts altogether:
\[\begin{aligned} f(x_1\vert x_2) &= \frac{f(x_1, x_2)}{f(x_2)} \\ f(x_2\vert x_1) &= \frac{f(x_1, x_2)}{f(x_1)} . \end{aligned}\]Here, we use lower-case letters to denote the actual values the RVs take.
Multiplication Rule
\[\begin{aligned} f(x_1, x_2) &= f(x_1\vert x_2)\, f(x_2) \\ &= f(x_2\vert x_1)\, f(x_1) \end{aligned}\]Conditional cdf is integrated based on conditional pdf:
\[\begin{aligned} F_{X_1\vert X_2}(x_1\vert X_2=x_2) &= P(X_1\le x_1 \vert X_2=x_2) \\ &= \left\{ \begin{array}{ll} \int_{-\infty}^{x_1} f_{X_1\vert X_2}(s {\color{#008B45FF}\vert X_2=x_2}) \, \mathrm ds & \text{continuous} \\ \sum_{s\le x_1} f_{X_1\vert X_2}(X_1=s {\color{#008B45FF}\vert X_2=x_2}) &\text{discrete} \end{array} \right. \end{aligned}\]or could be written more succinctly as
\[\begin{aligned} F(x_1\vert x_2) &= P(X_1\le x_1 \vert X_2=x_2) \\ &= \left\{ \begin{array}{ll} \int_{-\infty}^{x_1} f(s{\color{#008B45FF}\vert x_2})\, \mathrm ds & \text{continuous} \\ \sum_{s\le x_1} f(X_1=s {\color{#008B45FF}\vert X_2=x_2}) &\text{discrete} \end{array} \right. \end{aligned}\]Expectation
\[\mathbb{E}[h(X,Y)] = \int_{-\infty}^\infty\int_{-\infty}^\infty h(x,y)\, f_{XY}(x,y) \, \mathrm dx \mathrm dy\]Conditional expectation on $X$ is indicated using a subscript in $\mathbb{E}_X$, e.g.,
\[\mathbb{E}_X[h(X, Y)] = \mathbb{E}[h(X, Y)|X] = \int_{-\infty}^\infty h(x,y)\, f_{Y|X}(y|x)\, \mathrm dy\]Here, we "integrate out" the $Y$ variable, and we are left with a function of $X$.
It is also possible that the subscript indicates which marginal density the average is taken over.
\[\mathbb{E}_X[h(X, Y)] = \int_{-\infty}^\infty h(x,y) f_X(x) \mathrm dx\]Here, we "average over" the $X$ variable, and we are left with a function of $Y$.
Conditional expectation for several variables
We extend to three variables.
We have the following probability mass function:
\[P(X=x, Y=y, Z=z) = P(x,y,z) \, ,\]where we use the shorthand $P(X=x, Y=y \vert Z=z) = P(x,y\vert z)$.
Joint marginal mass is given by
\[\begin{aligned} P(x,y) &= \sum_z P(x,y,z) \\ P(x,z) &= \sum_y P(x,y,z) \\ P(y,z) &= \sum_x P(x,y,z) \end{aligned} .\]The joint conditional probability is given by
\[P(x,y|z) = \frac{P(x,y,z)}{P(z)} \, ,\]and the conditional probability is given by
\[P(x|y,z) = \frac{P(x,y,z)}{P(y,z)} \, .\]We have the following useful lemmas.
- Linearity: $\mathbb E\left[aX + bY \vert Z\right] = a\, \mathbb E[X\vert Z] + b\, \mathbb E[Y\vert Z]$ for constants $a$ and $b$.
-
Pull-through rule:
\[\mathbb E\left[g(X)\,h(Y)\vert Y\right] = h(Y)\, \mathbb E\left[g(X)\vert Y\right]\]Proof:
\[\begin{align*} E\left[g(X)h(Y)\vert Y\right] &= \sum_x g(x) h(y) P(x|y) \\ &= h(y) \sum_x g(x) P(x|y) \\ &= h(y)\, \mathbb E[g(X)|Y=y] \tag*{\(\square\)} \end{align*}\] -
Tower rule:
\[\mathbb E\left[ \mathbb E\left(Z\vert X,Y\right) \vert Y\right] = \mathbb E \left[ \mathbb E(Z\vert Y) \vert Y, X\right ]= \mathbb E[Z\vert Y]\]This is a generalization of the conditional expectation theorem.
Proof:
\[\begin{align*} E\left[ \mathbb E\left(Z\vert X,Y\right) \vert Y\right] &= \sum_x \mathbb E\left(Z\vert X,Y\right) P(x|y) \\ &= \sum_x \left[\sum_z z P(z|x,y)\right] P(x|y) \\ &= \sum_x \sum_z z \, \frac{P(x,y,z)}{P(x,y)} \times \frac{P(x,y)}{P(y)} \\ &= \sum_x\sum_z z\, \frac{P(x,y,z)}{P(y)} \\ &= \sum_x\sum_z z\, P(x,z|y) \\ &= \sum_z z\, P(z|y) \\ &= \mathbb E(Z|Y) \end{align*}\]
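A small numeric sketch of the tower rule on a made-up $2\times2\times2$ joint pmf (the probabilities below are arbitrary assumptions that sum to one); both sides of $\mathbb E\left[\mathbb E(Z\vert X,Y)\vert Y\right] = \mathbb E[Z\vert Y]$ come out identical:

```python
import numpy as np

# Made-up joint pmf P(x, y, z) on {0,1}^3, indexed as P[x, y, z]
P = np.array([[[0.10, 0.05], [0.15, 0.10]],
              [[0.20, 0.10], [0.05, 0.25]]])
z_vals = np.array([0.0, 1.0])

P_y  = P.sum(axis=(0, 2))                  # P(y)
P_xy = P.sum(axis=2)                       # P(x, y)
E_z_given_xy = (P * z_vals).sum(axis=2) / P_xy             # E[Z | X=x, Y=y]
E_z_given_y  = (P.sum(axis=0) * z_vals).sum(axis=1) / P_y  # E[Z | Y=y]

# LHS: E[ E(Z|X,Y) | Y=y ] = sum_x E(Z|x,y) P(x|y)
lhs = (E_z_given_xy * P_xy).sum(axis=0) / P_y
print(lhs)            # same vector ...
print(E_z_given_y)    # ... as E[Z | Y=y]
```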
Moments
We often summarize properties of distributions using their moments.
The $r^{\text{th}}$ order moment is defined by
\[\begin{aligned} \mathbb{E}(X^r) = \left\{ \begin{array}{ll} \int_{-\infty}^\infty x^r f(x)dx & \text{continuous} \\ \sum_x x^r f(x) &\text{discrete} \end{array} \right. \end{aligned}\]First order moment
The first moment is called the expected value or expectation, which is given by
\[\begin{aligned} \mathbb{E}(X) = \left\{ \begin{array}{ll} \int_{-\infty}^\infty x f(x)dx & \text{continuous} \\ \sum_x x f(x) &\text{discrete} \end{array} \right. \end{aligned}\]Second moment about the mean
Also called second central moment. The variance is obtained by setting $g(X)=\left[X-\mathbb{E}(X)\right]^2.$
\[\begin{aligned} \text{Var}(X) = \mathbb{E}[\left(X-\mathbb{E}(X)\right)^2] = \left\{ \begin{array}{ll} \int_{-\infty}^\infty \left[x-\mathbb{E}(X)\right]^2 f(x)dx & \text{continuous} \\ \sum_x \left[x-\mathbb{E}(X)\right]^2 f(x) &\text{discrete} \end{array} \right. \end{aligned}\]Unconditional expectation of functions of RVs
\[\begin{aligned} \mathbb{E}[g(x)] = \left\{ \begin{array}{ll} \int_{-\infty}^\infty g(x){\color{red}f(x)}dx & \text{continuous} \\ \sum_x g(x){\color{red}f(x)} &\text{discrete} \end{array} \right. \end{aligned}\]Conditional expectation of functions of RVs
\[\begin{aligned} \mathbb{E}(g(x) \vert Y=y) = \left\{ \begin{array}{ll} \int_{-\infty}^\infty g(x){\color{#008B45FF}f(x\vert Y=y)}dx & \text{continuous} \\ \sum_x g(x){\color{#008B45FF}f(x\vert Y=y)} &\text{discrete} \end{array} \right. \end{aligned}\]Note:
- For unconditional moments, use the appropriate unconditional density.
- For conditional moments, use the appropriate conditional density.
- Expectation or expected value is a population quantity because it requires knowledge of the density function.
- The sample analogue of the expected value is the sample mean or sample average.
Conditional and Unconditional Variance
Unconditional
\[\begin{aligned} \text{Var}(X) &= \mathbb{E} [(X-\mathbb{E}(X))^2] \\ &= \left\{ \begin{array}{ll} \int_{-\infty}^{\infty} \left(x-\mathbb{E}(X) \right)^2 f(x) dx & \text{continuous} \\ \sum_{x} \left(x-\mathbb{E}(X) \right)^2 f(x) &\text{discrete} \end{array} \right. \end{aligned}\]Conditional variance use conditional expectations
\[\begin{aligned} \text{Var}(X{\color{#008B45FF}\vert Y}) &= \mathbb{E} [(X-\mathbb{E}(X{\color{#008B45FF}\vert Y}))^2 {\color{#008B45FF}\vert Y}] \\ &= \left\{ \begin{array}{ll} \int_{-\infty}^{\infty} [x-\mathbb{E}(X{\color{#008B45FF}\vert Y=y})]^2 f(x{\color{#008B45FF}\vert y}) dx & \text{continuous} \\ \sum_{x} [x-\mathbb{E}(X{\color{#008B45FF}\vert Y=y})]^2 f(x{\color{#008B45FF}\vert y}) &\text{discrete} \end{array} \right. \end{aligned}\]Alternatively, the conditional variance can be written as
\[\text{Var}(X{\color{#008B45FF}\vert Y}) = \mathbb{E}[X^2{\color{#008B45FF}\vert Y}] - \left(\mathbb{E}[X{\color{#008B45FF}\vert Y} ]\right )^2\]Independence conditional on other variables
$X_1$ and $X_2$ are said to be independent conditional on $X_3$ if
\[f(x_1, x_2 \vert x_3) = f(x_1\vert x_3) f(x_2\vert x_3)\]The left-hand-side (LHS) represents the joint behavior of $X_1$ and $X_2$ conditional on $X_3$, the RHS represents the individual behavior conditional on $X_3$.
This is denoted as
\[X_1 \indep X_2 \vert X_3\]Note that this does not imply $X_1 \indep X_2$!
E.g., $X_1$ and $X_2$ can be returns on two equities where $X_3$ is some global macroeconomic factor affecting multiple variables at once (e.g. federal reserve interest rate).
Another example for $X_1, X_2$ would be wages and level of education, whereas $X_3$ is level of intelligence.
For three or more RVs, the joint PDF, joint PMF, and joint CDF are defined in a similar way to what we have seen for the case of two random variables.
Let $X_1, X_2, \ldots, X_n$ be $n$ discrete RVs. The joint PMF is defined as
\[P_{X_1, X_2, \ldots, X_n}(x_1, x_2, \ldots, x_n) = P(X_1=x_1, X_2=x_2, \ldots, X_n=x_n) \, .\]
For $n$ jointly continuous RVs $X_1, X_2, \cdots, X_n$, the joint PDF is defined to be the function $f_{X_1, X_2, \cdots, X_n}(x_1, x_2, \cdots, x_n)$ such that the probability of any set $A\in \mathbb{R}^n$ is given by the integral of the PDF over the set $A$. In particular, we can write
\[P\big((X_1, X_2, \cdots, X_n)\in A\big) = \underset{A}{\int \cdots \int} f_{X_1, X_2, \cdots, X_n}(x_1, x_2, \cdots, x_n)\, dx_1dx_2\cdots dx_n.\]The moment-generating function (MGF) of a random variable $X$ is
\[M_X(t) = E(e^{tX}) = \begin{cases} \displaystyle \int_{-\infty}^{\infty} e^{tx} f_X(x)\, \mathrm dx & \text{continuous } X \\ \displaystyle \sum_{i=0}^\infty e^{tx_i} P(X=x_i) & \text{discrete } X \end{cases}\]-
The MGF of X gives us all moments of X. That is why it is called the moment generating function.
We can obtain all moments of $X$:
\[\mu_n = E[X^n] = M_X^{(n)} (0) = \left. \frac{\mathrm d^n M_X}{\mathrm dt^n} \right\vert_{t=0} \,.\]That is, the $n$th moment about zero is the $n$th derivative of the moment generating function, evaluated at $t=0$ (see the sketch after this list).
-
The MGF (if it exists) uniquely determines the distribution.
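A sympy sketch (assuming sympy is available) of the moment-generating property above, using the standard normal, whose MGF $M_X(t)=e^{t^2/2}$ is known, so the answers can be checked:

```python
import sympy as sp

t = sp.symbols('t')
M = sp.exp(t**2 / 2)          # MGF of X ~ N(0, 1)

# n-th derivative of the MGF, evaluated at t = 0, gives E[X^n]
moments = [sp.diff(M, t, n).subs(t, 0) for n in range(1, 5)]
print(moments)                # [0, 1, 0, 3]: E[X]=0, E[X^2]=1, E[X^3]=0, E[X^4]=3
```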
Linear Algebra
https://www.youtube.com/watch?v=fNk_zzaMoSs&list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab
Change of Variable Formula
Given the distribution of $X$, we can obtain the distribution of a continuous function of $X$, e.g. $Y=g(X)$.
\[\begin{align*} F_Y(y)&=P(Y\le y)=P\left(g(X)\le y\right) \\ &=\left\{ \begin{array}{ll} P\left(X\le g^{-1}(y)\right) = F_X(g^{-1}(y)) & \text{when $g(x)$ is $\uparrow$} \\ P\left(X\ge g^{-1}(y)\right) = 1- F_X(g^{-1}(y)) & \text{when $g(x)$ is $\downarrow$} \end{array} \right. \\ f_Y(y) &= f_X(g^{-1}(y)) \cdot \left\vert \frac{\partial }{\partial y} g^{-1}(y) \right\vert \end{align*}\]Change of Variable for a Double Integral
Let $X$ and $Y$ be two jointly continuous random variables. Let $(Z, W) = g(X,Y) = (g_1(X,Y), g_2(X,Y))$, where $g$: $\mathbb{R}^2 \mapsto \mathbb{R}^2$ is a continuous one-to-one (invertible) function with continuous partial derivatives.
Let $h=g^{-1}$, i.e., the inverse function that takes $(Z,W)$ and returns $(X,Y)$, $(X,Y)=h(Z,W)=(h_1(Z,W),h_2(Z,W))$.
Then $Z$ and $W$ are jointly continuous and their joint PDF, $f_{ZW}(z,w)$, for $(z,w)\in R_{ZW}$ is given by
\[f_{ZW}(z,w)=f_{XY}(h_1(z,w),h_2(z,w)) \cdot \vert \boldsymbol{J}\vert,\]where $\boldsymbol{J}$ is the Jacobian of $h$, its determinant is defined by
\[\abs{\boldsymbol{J}} = \text{det} \begin{bmatrix} \frac{\partial h_1}{\partial z} & \frac{\partial h_1}{\partial w} \\ \frac{\partial h_2}{\partial z} & \frac{\partial h_2}{\partial w} \end{bmatrix} = \frac{\partial h_1}{\partial z} \cdot \frac{\partial h_2}{\partial w} - \frac{\partial h_1}{\partial w} \cdot \frac{\partial h_2}{\partial z}\]Let $X$ and $Y$ be two random variables with joint PDF $f_{XY}(x,y)$. Let $Z=X+Y$. Find $f_Z(z)$.
To apply the change of variable, we need two random variables $Z$ and $W$. Define
\[\begin{align*} \left\{ \begin{array}{ll} z = x+y & \text{i.e., }g_1(x,y)\\ w = x & \text{i.e., }g_2(x,y) \end{array} \right. \end{align*}\]Then we can find the inverse transform:
\[\begin{align*} \left\{ \begin{array}{ll} x = w & \text{i.e., }h_1(z,w)\\ y = z-w & \text{i.e., }h_2(z,w) \end{array} \right. \end{align*}\]Then, the absolute value of the Jacobian is
\[\vert \boldsymbol{J} \vert = \vert \text{det} \begin{bmatrix} 0 & 1 \\ 1 & -1 \end{bmatrix} \vert = \vert 0-1 \vert = 1\]Thus,
\[f_{ZW}(z,w)=f_{XY}(w,z-w).\]But since we are interested in the marginal PDF, $f_Z(z)$, we have
\[f_Z(z)=\int_{-\infty}^{\infty}f_{XY}(w,z-w)dw.\]Note that, if $X$ and $Y$ are independent, then $f_{XY}(x,y)=f_X(x)f_Y(y)$ and we conclude that \(f_Z(z)=\int_{-\infty}^{\infty}f_X(w)f_Y(z-w)dw.\)
The above integral is called the convolution of $f_X$ and $f_Y$, and we write
\[f_Z(z)=f_X(z)*f_Y(z) = \int_{-\infty}^{\infty}f_X(w)f_Y(z-w)dw = \int_{-\infty}^{\infty}f_Y(w)f_X(z-w)dw.\]Let $X$ and $Y$ be two independent discrete random variables. Denote their respective pmfs (probability mass functions) by $p_X(x)$ and $p_Y(y)$, and their supports by $R_X$ and $R_{Y}$. Let
\[Z=X+Y\]and denote the pmf of $Z$ by $p_Z(z)$. Then,
\[\begin{align*} p_Z(z) &= \sum_{k=-\infty}^{\infty} p_X(x=k)\cdot p_Y(y=z-k) \\ &= \sum_{k\in R_X} p_X(x=k)\cdot p_Y(y=z-k) \end{align*}\]or
\[p_Z(z) = \sum_{k\in R_Y} p_X(x=z-k)\cdot p_Y(y=k).\]Big-O Little-o Notation
Consider a sequence of random variables $X_n$ and a sequence of constants $a_n$ for $n = 1, 2, \ldots$
If $X_n/a_n \xrightarrow{p} 0$, we say $(X_n/a_n)=o_p(1)$ or $X_n=o_p(a_n)$.
Consequently:
- If $X_n\xrightarrow{p}0$, we say $X_n=o_p(1)$.
- If $n^\alpha X_n\xrightarrow{p}0$ for some $\alpha$, we say $X_n/n^{-\alpha}=o_p(1)$ or $X_n=o_p(n^{-\alpha})$.
E.g., For $X_1, X_2, \ldots, X_n$ iid with mean $\mu$ and variance $\sigma^2$, by the LLN we have $\overline{X}\xrightarrow{p}\mu$. Then, $\overline{X}-\mu\xrightarrow{p}0$ and so $\overline{X}-\mu = o_p(1)$.
Big-O notation relaxes this requirement: the scaled sequence only needs to converge to some finite limit (zero or non-zero), not necessarily to zero.
If $X_n/a_n \xrightarrow{d} X$ or $X_n/a_n \xrightarrow{p} X$, we say $(X_n/a_n)=O_p(1)$ or $X_n=O_p(a_n)$.
Consequently:
- If $X_n\xrightarrow{d}X$ or $X_n\xrightarrow{p}X$, we say $X_n=O_p(1)$.
- If $n^\alpha X_n\xrightarrow{d}X$ or $n^\alpha X_n\xrightarrow{p}X$ for some $\alpha$, we say $X_n/n^{-\alpha}=O_p(1)$ or $X_n=O_p(n^{-\alpha})$.
E.g., For $X_1, X_2, \ldots, X_n$ iid with mean $\mu$ and variance $\sigma^2$, define $Z=\frac{\overline{X}-\mu}{\sigma}$. Then, by the CLT we have $\sqrt{n}Z\xrightarrow{d}N(0,1)$ and so $\sqrt{n}Z = O_p(1)$ or equivalently $Z=O_p(n^{-1/2})$.
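A simulation sketch of the two statements above (an Exponential(1) population is an arbitrary assumption): $\overline{X}-\mu = o_p(1)$ collapses toward zero as $n$ grows, while $\sqrt{n}\,(\overline{X}-\mu)/\sigma = O_p(1)$ keeps a stable spread:

```python
import numpy as np

rng = np.random.default_rng(5)
mu = sigma = 1.0              # Exponential(1) has mean 1 and standard deviation 1
reps = 2_000

for n in (100, 1_000, 10_000):
    xbar = rng.exponential(mu, size=(reps, n)).mean(axis=1)
    dev = xbar - mu                      # o_p(1): spread shrinks as n grows
    z = np.sqrt(n) * dev / sigma         # O_p(1): roughly N(0, 1) for every n
    print(n, dev.std(), z.std())
```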
Type I and II Errors
Type I error: rejecting a true $H_0$. Corresponds to the level of significance, $\alpha$,
\[\alpha = P(\text{reject } H_0 \vert H_0 \text{ is true})\,.\]Type II error: failing to reject a false $H_0$. The probability of committing a Type II error is called $\beta\,.$
\[\beta=P(\text{fail to reject } H_0\vert H_0 \text{ is false})\,.\]$\beta$ is related to the Power of a test. $\beta = 1-\text{Power of a test} = 1-P(\text{reject } H_0\vert H_0 \text{ is false})\,.$
In hypothesis testing, the size of a test is the (maximum) probability of committing a Type I error, that is, of incorrectly rejecting the null hypothesis when it is true.
The power of a test refers to the probability of correctly rejecting $H_0$ when $H_1$ is true.
Note:
- If $\alpha$ increases, the chance of making a Type I error increases: you will be rejecting $H_0$ more often. Because you then fail to reject $H_0$ less often, the chance of making a Type II error decreases. Thus, as $\alpha$ increases, $\beta$ decreases, and vice versa. That makes them seem like complements, but they are not complements (they need not sum to one).
- For a constant sample size, $n$, if $\alpha$ increases, $\beta$ decreases.
For a constant significance level, $\alpha$ , if $n$ increases, $\beta$ decreases.
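A sketch of these trade-offs for a one-sided $z$-test of $H_0: \mu=\mu_0$ against $H_1: \mu=\mu_1>\mu_0$ (the effect size, variance, and sample sizes below are arbitrary assumptions, and scipy is assumed available), using the normal power formula $1-\beta = \Phi\big(\sqrt{n}\,(\mu_1-\mu_0)/\sigma - z_{1-\alpha}\big)$:

```python
from scipy.stats import norm

mu0, mu1, sigma = 0.0, 0.5, 1.0   # hypothetical effect size of 0.5 sd

def power(alpha, n):
    z_crit = norm.ppf(1 - alpha)
    return norm.cdf((mu1 - mu0) / (sigma / n**0.5) - z_crit)

for alpha in (0.01, 0.05, 0.10):
    print(alpha, round(1 - power(alpha, n=25), 3))   # beta falls as alpha rises
for n in (10, 25, 100):
    print(n, round(1 - power(0.05, n), 3))           # beta falls as n rises
```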
Confusion matrix
Precision is the proportion of all the model's positive classifications that are actually positive.
\[\text{Precision} = \frac{\text{correctly classified actual positives}} {\text{everything classified as positive}} = \frac{TP}{TP+FP}\]The True Positive Rate (TPR), or the proportion of all actual positives that were classified correctly as positives, is also known as recall.
A hypothetical perfect model would have zero false negatives and therefore a recall (TPR) of 1.0, which is to say, a 100% detection rate.
In an imbalanced dataset where the number of actual positives is very low, recall is a more meaningful metric than accuracy because it measures the ability of the model to correctly identify all positive instances. For applications like disease prediction, correctly identifying the positive cases is crucial. A false negative typically has more serious consequences than a false positive.
\[\text{Recall (or True Positive Rate, TPR)} = \frac{\text{correctly classified actual positives}} {\text{all actual positives}} = \frac{TP}{TP+FN}\]The False Positive Rate (FPR) is the proportion of all actual negatives that were classified incorrectly as positives, also known as the probability of false alarm.
\[\text{FPR} = \frac{\text{incorrectly classified actual negatives}} {\text{all actual negatives}} = \frac{FP}{FP+TN}\]Precision improves as false positives decrease, while recall improves when false negatives decrease.
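A minimal sketch computing these metrics from hypothetical confusion-matrix counts (the counts are made up to mimic an imbalanced dataset):

```python
# Hypothetical counts for an imbalanced problem (few actual positives)
TP, FN = 40, 10      # 50 actual positives
FP, TN = 30, 920     # 950 actual negatives

precision = TP / (TP + FP)          # 0.571
recall    = TP / (TP + FN)          # 0.8   (true positive rate)
fpr       = FP / (FP + TN)          # 0.032 (false positive rate)
accuracy  = (TP + TN) / (TP + TN + FP + FN)   # 0.96, despite missing 20% of positives

print(precision, recall, fpr, accuracy)
```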
Receiver-operating characteristic curve (ROC)
The ROC curve is a visual representation of model performance across all thresholds.
The ROC curve is drawn by calculating the true positive rate (TPR) and false positive rate (FPR) at every possible threshold (in practice, at selected intervals), then graphing TPR over FPR.
A perfect model, which at some threshold has a TPR of 1.0 and a FPR of 0.0, can be represented by either a point at (0, 1) if all other thresholds are ignored, or by the following:

Toggle thresholds and see how metrics and ROC curve change: https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc#auc_and_roc_for_choosing_model_and_threshold
The area under the ROC curve (AUC) represents the probability that the model, if given a randomly chosen positive and negative example, will rank the positive higher than the negative.

For a binary classifier, a model that does exactly as well as random guesses or coin flips has a ROC that is a diagonal line from (0,0) to (1,1). The AUC is 0.5, representing a 50% probability of correctly ranking a random positive and negative example.
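A numpy sketch of this ranking interpretation of AUC (the labels and scores are made-up toy values): compare every (positive, negative) pair of examples and count how often the positive one gets the higher score, with half credit for ties:

```python
import numpy as np

y_true  = np.array([1, 1, 1, 0, 0, 0, 0])                 # hypothetical labels
y_score = np.array([0.9, 0.7, 0.4, 0.8, 0.4, 0.3, 0.1])   # hypothetical scores

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairs = pos[:, None] - neg[None, :]        # all (positive, negative) pairs
auc = np.mean(pairs > 0) + 0.5 * np.mean(pairs == 0)
print(auc)   # P(random positive ranked above random negative); ~0.79 here
```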
The points on a ROC curve closest to (0,1) represent a range of the best-performing thresholds for the given model.

If false positives (false alarms) are highly costly, it may make sense to choose a threshold that gives a lower FPR, like the one at point A, even if TPR is reduced. Conversely, if false positives are cheap and false negatives (missed true positives) highly costly, the threshold for point C, which maximizes TPR, may be preferable. If the costs are roughly equivalent, point B may offer the best balance between TPR and FPR.
A concrete example: Imagine a situation where it's better to allow some spam to reach the inbox than to send a business-critical email to the spam folder. You've trained a spam classifier for this situation where the positive class is spam and the negative class is not-spam. In this use case, it's better to minimize false positives, even if true positives also decrease. Choose point A.
Q: When should we use one-tailed hypothesis testing?
A: Authors should explain why they are more interested in an effect in one direction not the other.
Ex1: We compare the mean strength of parts from a supplier (102) to a target value (100). We are considering a new supplier only if the mean strength of their parts is greater than our target value. There is no need for us to distinguish between whether their parts are equally strong or less strong than the target value; either way we'd just stick with our current supplier.
$H_0$: new supplier = target value
$H_1$: new supplier > target value
Ex2: We want to know if the battery life is greater than the original after a manufacturing change.
$H_0$: new battery life = original life
$H_1$: new battery life > original life
- One-tailed tests improve the power of a test, that is, the probability of correctly rejecting $H_0$ when the null hypothesis is truly false.
log-transformed Models
- Only the dependent/response variable is log-transformed. Exponentiate the coefficient. This gives the multiplicative factor for every one-unit increase in the independent variable. Example: the coefficient is 0.198. exp(0.198) = 1.218962. For every one-unit increase in the independent variable, our dependent variable increases by a factor of about 1.22, or 22%. Recall that multiplying a number by 1.22 is the same as increasing the number by 22%. Likewise, multiplying a number by, say 0.84, is the same as decreasing the number by 1 - 0.84 = 0.16, or 16%.
- Only independent/predictor variable(s) is log-transformed. Divide the coefficient by 100. This tells us that a 1% increase in the independent variable increases (or decreases) the dependent variable by (coefficient/100) units. Example: the coefficient is 0.198. 0.198/100 = 0.00198. For every 1% increase in the independent variable, our dependent variable increases by about 0.002. For x percent increase, multiply the coefficient by log(1.x). Example: For every 10% increase in the independent variable, our dependent variable increases by about 0.198 * log(1.10) = 0.02.
- Both dependent/response variable and independent/predictor variable(s) are log-transformed. Interpret the coefficient as the percent increase in the dependent variable for every 1% increase in the independent variable. Example: the coefficient is 0.198. For every 1% increase in the independent variable, our dependent variable increases by about 0.20%. For x percent increase, calculate 1.x to the power of the coefficient, subtract 1, and multiply by 100. Example: For every 20% increase in the independent variable, our dependent variable increases by about (1.20^0.198 - 1) * 100 = 3.7 percent (see the quick check after this list).
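The arithmetic behind these three interpretations, as a quick Python check (0.198 is the example coefficient used above):

```python
import math

b = 0.198

# Log outcome only: one-unit increase in x multiplies y by exp(b)
print(math.exp(b))                      # 1.219 -> about a 22% increase

# Log predictor only: a p% increase in x changes y by b * log(1 + p/100) units
print(b * math.log(1.10))               # ~0.019 units for a 10% increase

# Log-log: a p% increase in x changes y by ((1 + p/100)**b - 1) * 100 percent
print((1.20**b - 1) * 100)              # ~3.7% for a 20% increase
```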
Semi-parametric models
- Method of Moments
- quantile regression
Textbooks
-
Econometric Analysis, 5th Edition, by William H. Greene, Prentice Hall, 2003.
Time Series Analysis, by J. D. Hamilton, Princeton University Press, 1994.
Estimation and Inference in Econometrics, by R. Davidson and J. MacKinnon, Oxford University Press, 1993.
Econometric Analysis of Cross Section and Panel Data, by J. Wooldridge, MIT Press, 1999.
Microeconometrics: Methods and Applications, Cameron, A. C. and Trivedi, P. K., Cambridge University Press, 2005. -
Core Metrics textbooks
- Casella, G. and Berger, R.L. (2002) Statistical Inference. 2nd ed. Duxbury.
- Hendry, D.F. and Nielsen, B. (2007) Econometric Modeling. Princeton.
- Hoel, P.G., Port, S.C. and Stone, C.J. (1971) Introduction to Probability. Boston: Houghton-Mifflin.
-
Resources
- Fundamental Theorems for Econometrics: https://bookdown.org/ts_robinson1994/10EconometricTheorems/dm.html