
Basic Probability Properties

  • Probability of a union

    \[P(A \cup B) = P(A) + P(B) - P(A \cap B)\]

    If $A$ and $B$ are mutually exclusive, then

    \[\begin{split} P(A \cap B) &= 0 \\ P(A \cup B) &= P(A) + P(B). \end{split}\]

    This is also called the addition rule.

  • Conditional probability

    \[P(A|B) = \frac{P(A \cap B)}{P(B)}\]

If $A$ and $B$ are independent, then $P(A\vert B) = P(A)$ and $P(B\vert A) = P(B)$.

  • Multiplication rule

    \[P(A \cap B) = P(A|B) P(B) = P(B|A) P(A)\]

If $A$ and $B$ are independent, then

    \[P(A \cap B) = P(A) P(B) .\]
  • The partition rule

    If $B_1, B_2, \ldots, B_m$ form a partition of the sample space $\Omega$, then

    \[P(A) = \sum_{i=1}^m P(A \cap B_i)= \sum_{i=1}^m P(A|B_i) P(B_i)\]

    for any event $A$.

    As a special case, $B$ and $\overline{B}$ partition $\Omega,$ so

    \[\begin{split} P(A) &= P(A\cap B) + P(A\cap \overline{B}) \\ &= P(A|B) P(B) + P(A|\overline{B}) P(\overline{B}). \end{split}\]
  • Bayes’ theorem

\[P(B_j|A) = \frac{P(A|B_j) P(B_j)}{\sum_{i=1}^m P(A|B_i) P(B_i)}\]
  • Chains of events

    If $A_1, A_2, \ldots, A_n$ are events in the sample space $\Omega$, then

    \[P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1) P(A_2|A_1) P(A_3|A_1 \cap A_2) \cdots P(A_n|A_1 \cap A_2 \cap \cdots \cap A_{n-1}).\]
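
To close this list, here is a quick numeric sanity check of the partition rule and Bayes' theorem above, using a hypothetical diagnostic-test setting (the prevalence and test accuracy numbers below are made up purely for illustration).

```python
# Hypothetical numbers: B = "has the condition", A = "test is positive"
p_B = 0.01             # P(B): prevalence
p_A_given_B = 0.95     # P(A|B): sensitivity
p_A_given_notB = 0.05  # P(A|B^c): false positive rate

# Partition rule: P(A) = P(A|B)P(B) + P(A|B^c)P(B^c)
p_A = p_A_given_B * p_B + p_A_given_notB * (1 - p_B)

# Bayes' theorem: P(B|A) = P(A|B)P(B) / P(A)
p_B_given_A = p_A_given_B * p_B / p_A

print(f"P(A)   = {p_A:.4f}")          # 0.0590
print(f"P(B|A) = {p_B_given_A:.4f}")  # about 0.1610
```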

Expectation

Expectation is denoted by $\mathbb{E}(X)$, or $\mathbb{E}_X(X)$ when we want to make explicit that the expectation is taken over the RV $X.$

\[\mathbb{E}(X) = \begin{cases} \displaystyle \sum_{x}x f(x) & \text{for discrete } X \\ \displaystyle \int_{-\infty}^{\infty} x f(x)\, \mathrm dx & \text{for continuous } X \\ \end{cases}\]

Conditional Expectations

The conditional expectation for $Y$, given that $X$ has a prescribed value, is defined as follows:

\[\mathbb E(Y|X=x) = \sum_j y_j P(Y=y_j|X=x) = \sum_j y_j f_{Y|X}(y_j\vert x) \, ,\]

which is a function of $x.$

The conditional expectation for $X$ given $Y$, $\mathbb E(X\vert Y=y)$, is defined as follows:

\[\mathbb E(X\vert Y=y) = \sum_i x_i P(X=x_i\vert Y=y) = \sum_i x_i f_{X\vert Y}(x_i\vert y) \, ,\]

which is a function of $y.$

We denote the conditional expectation for $Y$ given $X$ as follows:

\[\varphi_X(x) = \mathbb E(Y|X=x) \, .\]

Technically, this is termed the regression function of $Y$ on $X$.

The expectation of the regression function is:

\[\begin{aligned} \mathbb E(\varphi_X(x)) &= \sum_i \varphi_X(x_i) f_X(x_i) \\ &= \sum_i \left\lbrace \sum_j y_j \, \frac{f_{XY}(x_i,y_j)}{f_X(x_i)} \right\rbrace f_X(x_i) \\ &= \sum_j y_j \sum_i f_{XY}(x_i,y_j) \\ &= \sum_j y_j f_Y(y_j) \\ &= \mathbb E(Y) \,. \end{aligned}\]

Law of Iterated Expectations (LIE)

For any random variables $X\in \mathcal X$ and $Y\in \mathcal Y \subset \mathbb R$,

\[\begin{align*} \mathbb{E}_Y(Y) = \mathbb{E}_X\left[\mathbb{E}_{Y\vert X}(Y\vert X=x)\right] = \sum_{x\in\mathcal X} \mathbb{E}[Y\vert X=x]\, \mathbb{P}(X=x) . \end{align*}\]

(the subscripts in $\mathbb{E}_Y$ and $\mathbb{E}_X$ indicate which variable the expectation is taken over), or more succinctly

\[\begin{align*} \mathbb{E}[Y] = \mathbb{E}\left[\mathbb{E}(Y\vert X)\right]. \end{align*}\]

It represents a transformation from conditional to unconditional expectation. The expected value (this expectation is with respect to $X$) of the conditional expectation of $Y$ given $X$ is the expected value of $Y$.

LIE is also called the law of total expectation, which can be derived from the law of total probability. We will see this in what follows. LIE is also referred to as “Adam’s Law.”


Suppose $\mathbb{E}(Y\vert X)=0$, then

\[\begin{align*} 1. \quad & \mathbb{E}[Y] = \mathbb{E}\left[\mathbb{E}(Y\vert X=x)\right] = 0 \\ 2. \quad & \mathbb{E}[g(X)\cdot Y] = \mathbb{E}\left[\mathbb{E}[g(X)\cdot Y\vert X]\right] = \mathbb{E}\left[g(X)\cdot \mathbb{E}[Y\vert X]\right] = \mathbb{E}[g(X)\cdot0]=0 \end{align*}\]

Note that we start from the conditional expectation being $0$ and conclude that the unconditional expectation, and the expectation of its product with any function of $X$, are $0$ too.
A zero conditional expectation is a stronger condition than a zero unconditional expectation.

In other words, although conditional expectations can vary arbitrarily for different values of $X$, if you know what the conditional expectations are, the overall expected value of $Y$ is fully determined.

A simple example is one where $X$ takes only two values. Suppose we are interested in mean birthweight ($Y$) for children of mothers who either drank alcohol during their pregnancy ($X=1$) or who didn't drink alcohol during their pregnancy ($X=0$).

Suppose the following numbers:

\[\begin{aligned} \mathbb{E}[Y|X=1] &= 7, \;\;\; \mathbb{P}(X=1) = 0.1\\ \mathbb{E}[Y|X=0] &= 8, \;\;\; \mathbb{P}(X=0) = 0.9 \\ \end{aligned}\]

The law of iterated expectation says that

\[\begin{aligned} \mathbb{E}[Y] &= \mathbb{E}_X \big[ \mathbb{E}[Y|X] \big] \\ &= \sum_{x \in \mathcal{X}} \mathbb{E}[Y|X=x] \cdot \mathbb{P}(X=x) \\ &= \mathbb{E}[Y|X=0] \cdot \mathbb{P}(X=0) + \mathbb{E}[Y|X=1] \cdot \mathbb{P}(X=1) \\ &= (8) \times (0.9) + (7) \times (0.1) \\ &= 7.9 \end{aligned}\]

Expressed mathematically, suppose $\{A, A^c\}$ partitions the sample space; then

\[\mathbb E (Y) = \mathbb E (Y|A)\, P(A) + \mathbb E (Y|A^c)\, P(A^c)\]
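
A minimal Python check of the birthweight example above: compute $\mathbb{E}[Y]$ from the weighted-average formula, and confirm it by Monte Carlo (the within-group normal distribution in the simulation is an illustrative assumption, not part of the example).

```python
import numpy as np

rng = np.random.default_rng(0)

# Conditional means and group probabilities from the example
cond_mean = {1: 7.0, 0: 8.0}   # E[Y | X = x]
p_x = {1: 0.1, 0: 0.9}         # P(X = x)

# LIE: E[Y] is the probability-weighted average of the conditional means
e_y = sum(cond_mean[x] * p_x[x] for x in cond_mean)
print(e_y)  # 7.9

# Monte Carlo check (normal within-group distribution is only an illustrative choice)
x = rng.binomial(1, p_x[1], size=1_000_000)
y = np.where(x == 1,
             rng.normal(cond_mean[1], 1.0, size=x.size),
             rng.normal(cond_mean[0], 1.0, size=x.size))
print(y.mean())  # close to 7.9
```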

Identities for conditional expectations

  • LIE (Adam’s Law) $\mathbb{E}[Y] = \mathbb{E}\left[\mathbb{E}(Y\mid X)\right].$
  • Generalized Adam’s Law

    \[\E\left[ \E[Y \mid g(X)] \mid f(g(X)) \right] = \E[Y\mid f(g(X))]\]

    for any $f$ and $g$ with compatible domains and ranges. We also have that

    \[\E\left[ \E[Y \mid g(X)] \mid f(g(X))=z \right] = \E[Y\mid f(g(X))=z]\]
  • Independence: $\E[Y\mid X] = \E[Y]$ if $X$ and $Y$ are independent.

  • Taking out what is known: $\E[h(X)Z\mid X]=h(X)\E[Z\mid X]$

  • Linearity: $\E[aX+bY\mid Z] = a\E[X\mid Z] + b\E[Y\mid Z],$ for any $a, b\in \R.$

  • Projection interpretation: $\E\left[\left(Y-\E[Y\mid X]\right)h(X)\right] = 0$ for any function $h: \Xcal \to \R.$

  • Keeping just what is needed: $\E[XY] = \E[X\E[Y\mid X]]$ for $X, Y\in \R.$

More about LIE: https://davidrosenberg.github.io/ttml2021fall/background/conditional-expectation-notes.pdf


Partitioning and Conditioning

In the general case, consider a partition of the sample space, $\{X_1, X_2, \ldots, X_n\}$, where each event has a corresponding probability $P(X_1), P(X_2), \ldots, P(X_n)$.

Given another event $Y$, then according to the partition rule we have:

\[\begin{aligned} P(Y) &= \sum_{i=1}^n P(Y \cap X_i) \\ &= \sum_{i=1}^n P(Y \vert X_i)\, P (X_i)\\ &= P(Y|X_1)P(X_1) + P(Y|X_2)P(X_2) + \cdots + P(Y|X_n)P(X_n) . \end{aligned}\]

This is called the “law of total probability”.

Then it follows that:

\[\begin{aligned} \mathbb{E}(Y) &= \sum_j y_j P(Y=y_j) \\ &= \sum_j y_j \left[P(Y=y_j|X_1)P(X_1) + P(Y=y_j|X_2)P(X_2) + \cdots + P(Y=y_j|X_n)P(X_n) \right] \\ &= \sum_j y_jP(Y=y_j|X_1)\,P(X_1) + \sum_j y_jP(Y=y_j|X_2)\,P(X_2) + \cdots + \sum_j y_jP(Y=y_j|X_n)\,P(X_n) \\ &= \mathbb{E}(Y|X_1)\,P(X_1) + \mathbb{E}(Y|X_2)\,P(X_2) + \cdots + \mathbb{E}(Y|X_n)\,P(X_n) \\ &= \mathbb{E}\left[\mathbb{E}(Y|X)\right] \end{aligned}\]

A reverse proof, starting from $\mathbb{E}\left[\mathbb{E}(Y\vert X)\right]$:

\[\begin{align*} \mathbb{E}\left[\mathbb{E}(Y|X)\right] &= \sum_{x\in\mathcal X} p(x)\, \mathbb{E}(Y|X=x) \\ &= \sum_{x\in\mathcal X} \left[p(x) \sum_{y\in\mathcal Y} y\, p(y|x) \right] \\ &= \sum_{y\in\mathcal Y} \left[ y \sum_{x\in\mathcal X} p(x,y) \right] \\ &= \sum_{y\in\mathcal Y} y\,p(y) \\ &= \mathbb E(Y) \tag*{\(\square\)} \end{align*}\]

In case of continuous variables, we have

\[\begin{align} \mathsf E(Y) &= \int_\Bbb R y\,f_Y(y)\,\mathrm d y && \text{by definition of expectation} \\[1ex] &= \int_\Bbb R y\int_\Bbb R f_{Y\mid X}(y\mid x)~f_X(x)\,\mathrm d x\,\mathrm d y &&\text{Law of Total Probability} \\[1ex] &= \int_\Bbb R f_X(x)\int_\Bbb R y~f_{Y\mid X}(y\mid x)\,\mathrm d y\,\mathrm d x &&\text{Fubini's Theorem } \\[1ex] &= \int_\Bbb R f_X(x)\,\mathsf E(Y\mid X{\,=\,}x)\,\mathrm d x && \text{by definition of conditional expectation} \\[1ex] &= \mathsf E\left[\mathsf E(Y\mid X)\right] && \text{by definition of expectation} \end{align}\]

Generalization of LIE in time series

\[\mathbb{E}\big[\mathbb{E}(y_{t+2}|x_{t+1}) |x_t \big] = \mathbb{E}[y_{t+2} |x_t]\]

as $x_{t} \subset x_{t+1}$.

More generally, for any random variable $x$ and any two information sets $\mathcal{J}$ and $\mathcal{I}$ with $\mathcal{J} \subset \mathcal{I}$,

\[\mathbb{E}\big[\mathbb{E}(x|\mathcal{I}) |\mathcal{J} \big] = \mathbb{E}[x |\mathcal{J}]\]

Intuition behind the LIE

Think of $\mathbf{x}$ as a discrete vector taking on possible values $\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_M$ with probabilities $p_1, p_2, \ldots, p_M$. Then LIE says:

\[\mathbb{E}(y) = p_1 \mathbb{E}(y\vert \mathbf{x}=\mathbf{c}_1) + p_2 \mathbb{E}(y\vert \mathbf{x}=\mathbf{c}_2) + \cdots + p_M \mathbb{E}(y\vert \mathbf{x}=\mathbf{c}_M).\]

That is, $\mathbb{E}(y)$ is simply a weighted average of the $\mathbb{E}(y\vert \mathbf{x}=\mathbf{c}_i)$, where the weight $p_i$ is the probability that $\mathbf{x}$ takes on the value of $\mathbf{c}_i$. In other words, a weighted average of averages.

E.g., suppose we are interested in average IQ generally, but we have measures of average IQ by gender. We could figure out the quantity of interest by weighting average IQ by the relative proportions of men and women.

Bayes’ Theorem

Partition Theorem (total expectation theorem, law of total probability) Intuition

\[\mathbb{P}(\text{eventual goal}) = \sum_{\text{options}} \mathbb{P}(\text{eventual goal}|\text{option})\, \mathbb{P}(\text{option})\]

Decomposition of variance

Or sometimes called Law of Total Variance.

\[\text{Var}(Y) = \text{Var}_X[\mathbb{E}(Y\vert X)] + \mathbb{E}_X[\text{Var}(Y\vert X)]\]

In plain language, the variance of $Y$ decomposes into the variance of the conditional mean plus the expected variance around the conditional mean.
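
A small simulation sketch of this decomposition, under an illustrative model where $X\sim N(0,1)$ and $Y\mid X \sim N(2X,\,1)$, so that $\text{Var}[\mathbb{E}(Y\vert X)]=4$ and $\mathbb{E}[\text{Var}(Y\vert X)]=1$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

x = rng.normal(0.0, 1.0, n)
y = rng.normal(2.0 * x, 1.0)          # Y | X ~ N(2X, 1)

total_var = y.var()                   # Var(Y), should be close to 5
var_of_cond_mean = (2.0 * x).var()    # Var[E(Y|X)] = Var(2X), close to 4
mean_of_cond_var = 1.0                # Var(Y|X) = 1 for every x

print(total_var, var_of_cond_mean + mean_of_cond_var)  # both close to 5
```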


Population and Sample

Population quantities require knowledge of the data-generating process (DGP); anything you observe comes from a sample.

The expectation operator is linear, meaning that

\[\begin{align*} \mathbb{E}(a+bX)=a+b\,\mathbb{E}(X) \end{align*}\]

More generally, let $a_1, \ldots, a_n$ and $b_1, \ldots, b_n$ be sequences of non-random variables and let $X_1, \ldots, X_n$ be a sequence of random variables. Then,

\[\begin{align*} \mathbb{E}\left[\sum_{i=1}^n(a_i+b_iX_i)\right] = \sum_{i=1}^n \mathbb{E}(a_i+b_iX_i) = \sum_{i=1}^n\left(a_i+b_i\mathbb{E}[X_i]\right). \end{align*}\]

Linear Transformations of a Random Vector

Let $Y=A+BX$, where $X$ is a $p\times 1$ random vector, $A$ is a $q \times 1$ non-random vector, and $B$ is a $q\times p$ non-random matrix.

The expected value of this transformation is given by

\[\mathbb{E}(Y) = A + B\, \mathbb{E}(X)\]

The variance of this transformation is given by

\[\text{Var}(Y) = B\, \text{Var}(X)\, B'\]

A $q \times q$ square matrix $\Sigma$ is called non-negative definite (or positive semi-definite) if for any non-zero $q \times 1$ vector $\boldsymbol{a}$ it holds that

\[\boldsymbol{a}'\Sigma\, \boldsymbol{a} \ge 0\]

If the square matrix $\Sigma$ is non-negative definite, we write $\Sigma \ge 0$.

Note that the covariance matrix of any random vector is positive semi-definite.
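
A sketch checking $\text{Var}(Y)=B\,\text{Var}(X)\,B'$ by simulation and confirming that the resulting covariance matrix has non-negative eigenvalues; the particular $A$, $B$, and $\text{Var}(X)$ below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Here p = 3 and q = 2
Sigma_X = np.array([[2.0, 0.5, 0.0],
                    [0.5, 1.0, 0.3],
                    [0.0, 0.3, 1.5]])
A = np.array([1.0, -1.0])        # q x 1 non-random vector
B = np.array([[1.0, 0.0, 2.0],   # q x p non-random matrix
              [0.5, 1.0, 0.0]])

# Simulate X with covariance Sigma_X and form Y = A + B X
X = rng.multivariate_normal(np.zeros(3), Sigma_X, size=500_000)
Y = A + X @ B.T

print(np.cov(Y, rowvar=False))                # empirical Var(Y)
print(B @ Sigma_X @ B.T)                      # theoretical B Var(X) B'
print(np.linalg.eigvalsh(B @ Sigma_X @ B.T))  # eigenvalues >= 0 (PSD)
```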


Expectations of Functions of RVs

If $X$ is a RV, then expected value of $g(X)$ is given by

\[\begin{align*} \mathbb{E}[g(X)] = \left\{ \begin{array}{ll} \int g(x)f(x)dx & \text{for continuous $X$} \\ \sum_xg(x)f(x) & \text{for discrete $X$} \end{array} \right. \end{align*}\]

where $f(x)$ is the probability density (mass) function of continuous (discrete) $X$.

Variance is also an expectation by setting $g(X) = \left[X-\mathbb{E}(X)\right]^2$. In other words, $\text{Var}(X) = \mathbb{E}\left[(X-\mathbb{E}(X))^2\right]$.

\[\begin{align*} \text{Var}(X) = \mathbb{E}[(X-\mathbb{E}(X))^2] = \left\{ \begin{array}{ll} \int [x-\mathbb{E}(X)]^2 f(x)dx & \text{for continuous $X$} \\ \sum_x [x-\mathbb{E}(X)]^2 f(x) & \text{for discrete $X$} \end{array} \right. \end{align*}\]

Example: Bernoulli

Let $X \sim \textrm{Bernoulli}(\theta)$, and recall that we have $\mathbb{E}(X)=\theta$. Then

\[\begin{align*} \text{Var}(X) &= \mathbb{E}\left[(X-\mathbb{E}[X])^2\right] \\ &= \sum_x \left(x-\mathbb{E}[X]\right)^2 f(x) \\ &= (0-\theta)^2 \times f(0) + (1-\theta)^2\times f(1) \\ &= \theta^2(1-\theta) + (1-\theta)^2\theta \\ &= \theta (1-\theta). \end{align*}\]

Alternative derivation: Since $0^2=0$ and $1^2=1$, we have $X^2=X$ implying that $\mathbb{E}(X^2)=\mathbb{E}(X)=\theta$. Therefore,

\[\begin{align*} \text{Var}(X)=\mathbb{E}[X^2]-(\mathbb{E}[X])^2 = \theta-\theta^2 \end{align*}\]

Sample mean

Let $X_1, \ldots, X_n$ denote $n$ observations on a variable $X$. The sample mean is

\[\begin{align*} \overline{X}=\frac{1}{n}\sum_{i=1}^nX_i \end{align*}\]

Sometimes, you add a subscript $n$ to denote the sample size, $\overline{X}_n .$

$\overline{X}$ is a random variable, as it is the average of random variables. This is in sharp contrast to $\mathbb{E}[X]$ which is non-random.

$\overline{X}$ varies with each sample. If we could repeatedly collect new samples of size $n$ from the same population and each time were able to estimate $\overline{X}$, these estimates would be different from each other. The distribution of a statistic, like $\overline{X}$, is called its sampling distribution.

One useful feature is $\mathbb{E}[\overline{X}] = \mathbb{E}[X]$. This doesn’t mean that $\overline{X}$ itself is equal to $\mathbb{E}[X]$. Rather, it means that, if we could repeatedly obtain (a huge number of times) new samples of size $n$ and compute $\overline{X}$ each time, the average of $\overline{X}$ across repeated samples would be equal to $\mathbb{E}[X].$

Proof:

\[\begin{aligned} \mathbb{E}[\overline{X}] &= \mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^n X_i \right] \\ &= \frac{1}{n} \mathbb{E}\left[ \sum_{i=1}^n X_i \right] \\ &= \frac{1}{n} \sum_{i=1}^n \mathbb{E}[X_i] \\ &= \frac{1}{n} \sum_{i=1}^n \mathbb{E}[X] \;\;\;\; (X_i\text{ are iid with mean } \mathbb{E}[X]) \\ &= \frac{1}{n}\, n\, \mathbb{E}[X] \\ &= \mathbb{E}[X] \end{aligned}\]

Two ways to compute the Sample variance

  • unadjusted sample variance, also called biased sample variance: \(\begin{align*} S_n^2 = \frac{1}{n}\sum_{i=1}^n(X_i-\overline X)^2 \end{align*}\)

  • adjusted sample variance, also called unbiased sample variance

\[\begin{align*} S_n^2 = \frac{1}{n-1}\sum_{i=1}^n(X_i-\overline X)^2 \end{align*}\]

The latter subtracts 1 from $n$ in the denominator, which is known as a degrees of freedom correction. See proof here.

Distinguish the sample variance $(S^2_n)$ from the variance of sample mean $(\text{Var}(\overline{X}))$.
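
A sketch contrasting the two estimators with NumPy's `ddof` argument (`ddof=0` divides by $n$, `ddof=1` by $n-1$) and checking, over repeated samples, that $\mathbb{E}[\overline{X}]=\mathbb{E}[X]$ and that only the adjusted variance is unbiased; the normal population below is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 10, 100_000
mu, sigma2 = 5.0, 4.0

samples = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))

xbar = samples.mean(axis=1)
s2_unadjusted = samples.var(axis=1, ddof=0)   # divides by n
s2_adjusted = samples.var(axis=1, ddof=1)     # divides by n - 1

print(xbar.mean())            # close to mu = 5, so E[Xbar] = E[X]
print(s2_unadjusted.mean())   # close to (n-1)/n * sigma2 = 3.6 (biased)
print(s2_adjusted.mean())     # close to sigma2 = 4 (unbiased)
```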


Covariance and Correlation

Covariance and correlation measure the linear association between two RVs $X$ and $Y$:

\[\begin{aligned} \gamma &\equiv \text{Cov}(X, Y) = \mathbb{E}\big\{[X-\mathbb{E}(X)][Y-\mathbb{E}(Y)]\big\} \\ \rho &\equiv \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X)\text{Var}(Y)}} \end{aligned}\]

where the expected value $\mathbb{E}[\cdot]$ is taken over the joint distribution of $(X,Y)$.

More formally

\[\begin{aligned} \textrm{Cov}(X, Y) = \begin{cases} \sum_y\sum_x[x-\mathbb{E}(X)][y-\mathbb{E}(Y)]f(x,y) & \textrm{for } X, Y \textrm{ discrete} \\ \iint[x-\mathbb{E}(X)][y-\mathbb{E}(Y)]f(x,y)\, dx\, dy & \textrm{for } X, Y \textrm{ continuous} \\ \end{cases} \end{aligned}\]

where the sum and integral are over the supports of $X$ and $Y$.

For linear transformations, we have

\[\text{Cov}(a+bX, c+dY) = bd\,\text{Cov}(X,Y)\]

for known constants $a,b,c,d$.

The additive law of covariance holds that the covariance of a random variable with a sum of random variables is just the sum of the covariances with each of the random variables.

\[\text{Cov}(X+Y, Z) = \text{Cov}(X,Z) + \text{Cov}(Y,Z)\]

More generally,

\[\color{#008B45}\text{Cov}\left(\sum_{i=1}^m a_iX_i, \sum_{j=1}^n b_jY_j\right) = \sum_{i=1}^m\sum_{j=1}^n a_ib_j\text{Cov}(X_i, Y_j).\]

One of the applications of covariance is finding the variance of a sum of several random variables. In particular, if $Z=X+Y$, then

\[\begin{aligned} \text{Var}(Z) &= \text{Cov}(Z,Z) \\ &= \text{Cov}(X+Y, X+Y) \\ &= \text{Cov}(X,X) + \text{Cov}(X,Y) + \text{Cov}(Y,X) + \text{Cov}(Y,Y) \\ &= \color{#008B45}\text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X,Y). \end{aligned}\]

More generally, for $a_i\in \mathbb{R}, i=1,\ldots,n$, we conclude:

\[\color{#008B45} \text{Var}\left(\sum_{i=1}^n a_iX_i \right) = \sum_{i=1}^n a_i^2 \text{Var}(X_i) + \sum_{i=1}^n\sum_{j\neq i} a_ia_j \text{Cov}(X_i, X_j).\]

Or equivalently,

\[\color{#008B45} \text{Var}\left(\sum_{i=1}^n a_iX_i \right) = \sum_{i=1}^n a_i^2 \text{Var}(X_i) + 2\sum_{i=2}^n\sum_{j=1}^{i-1} a_ia_j \text{Cov}(X_i, X_j).\]
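
A quick numeric check of the formula above against the equivalent quadratic form $a'\Sigma a$, where $\Sigma$ is the covariance matrix of the $X_i$; the matrix and weights below are arbitrary illustrative choices.

```python
import numpy as np

# Arbitrary covariance matrix Sigma and weight vector a
Sigma = np.array([[1.0, 0.3, -0.2],
                  [0.3, 2.0, 0.5],
                  [-0.2, 0.5, 1.5]])
a = np.array([0.5, -1.0, 2.0])

# Compact form: Var(sum_i a_i X_i) = a' Sigma a
var_quadratic = a @ Sigma @ a

# Expanded form: sum_i a_i^2 Var(X_i) + 2 * sum_{i>j} a_i a_j Cov(X_i, X_j)
var_expanded = sum(a[i] ** 2 * Sigma[i, i] for i in range(3))
var_expanded += 2 * sum(a[i] * a[j] * Sigma[i, j]
                        for i in range(1, 3) for j in range(i))

print(var_quadratic, var_expanded)  # identical values (5.55)
```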

If we have either $\mathbb{E}(X)=0$ or $\mathbb{E}(Y)=0$ or both

\[\text{Cov}(X, Y) = \mathbb{E}{(XY)} - \mathbb{E}{(X)}\mathbb{E}{(Y)} = \mathbb{E}{(XY)}\]

The sample covariance, $\hat{\gamma}$, in a sample of $n$ observations on $(X_i,Y_i)$ is

\[\hat{\gamma} = \frac{1}{n-1}\sum_{i=1}^n (X_i-\overline{X})(Y_i-\overline{Y})\]

Division by $n-1$ rather than $n$ is called a degrees of freedom correction.

Correlation is a scaled measure of covariance:

\[\textrm{Corr}(X,Y)=\frac{\textrm{Cov}(X, Y)}{\sqrt{\text{Var}(X)\text{Var}(Y)}}\]

If $\textrm{Corr}(X,Y)=0$, we say that $X$ and $Y$ are uncorrelated or orthogonal, denoted by $X {\color{#008B45FF}\perp} Y$ (perpendicular symbol).

$X {\color{#008B45FF}\indep} Y$ (double perpendicular symbol) denotes $X$ and $Y$ are independent.

$X \indep Y \Rightarrow X \perp Y$; in plain language, independence implies zero correlation (the converse does not hold in general).

Dependent but uncorrelated example

Let $X\sim U(-1,1)$ (any domain centered on zero works), $Y=f(X)=X^2$. Then $E[X]=0$.

\[\begin{aligned} \text{Cov}(X, Y) &= E[Xf(X)] - E[X]E[f(X)] \\ &= E[Xf(X)] \quad (E[X]=0) \\ &= \int_{-1}^1 x\,f(x)\,p(x)\, dx \end{aligned}\]

where $p(x)$ is the pdf of $X$.

\[p(x) = \left\{ \begin{array}{ll} \frac{1}{2} & x\in [-1,1] \\ 0 & \text{otherwise} \end{array} \right.\]

Now $x$ is an odd function and $[-1,1]$ is a symmetric domain. That means, as long as $f(x)$ is an even function, $x\,f(x)\,p(x)$ is an odd function. It then follows that

\[\int_{-a}^a x\,f(x)\,p(x)\, dx = 0,\]

and so $\text{Cov}(X, f(X))=0$.

$f(X)=X^2$ is an even function, which satisfies the condition.
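
A simulation sketch of this example: the sample covariance between $X$ and $Y=X^2$ is essentially zero, yet $Y$ is a deterministic function of $X$, so the two are clearly dependent.

```python
import numpy as np

rng = np.random.default_rng(4)

x = rng.uniform(-1.0, 1.0, size=1_000_000)
y = x ** 2                          # Y is a deterministic function of X

print(np.cov(x, y)[0, 1])           # close to 0: uncorrelated

# Dependence: the distribution of Y clearly changes with X
print(y[np.abs(x) < 0.5].mean())    # about 0.083
print(y[np.abs(x) >= 0.5].mean())   # about 0.583
```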


Properties of covariance

  1. The covariance of a variable with itself is the variance of the random variable.

    \[\cov(X, X) = \var(X)\]
  2. The covariance of a random variable, $X$, with a constant, $c$, is zero.

    \[\cov(c, X) = 0\]
  3. The covariance is commutative.

    \[\cov(X, Y) = \cov(Y,X)\]
  4. If $X$ and $Y$ are independent then

    \[\cov(X,Y)=0\]
  5. Adding a constant to either or both random variables does not change their covariances.

    \[\cov(X+c, Y+k) = \cov(X,Y)\]
  6. Multiplying a random variable by a constant multiplies the covariance by that constant.

    \[\cov(cX, kY) = ck\times\cov(X,Y)\]
  7. The covariance of a random variable with a sum of random variables is the sum of the covariances with each of the random variables.

    \[\cov(X+Y, Z) = \cov(X,Z) + \cov(Y,Z)\]
  8. More generally, covariance of sum of random variables:

    \[\cov\left(\sum_{i=1}^ma_iX_i, \sum_{j=1}^nb_jY_j\right)= \sum_{i=1}^m\sum_{j=1}^na_ib_j\cov(X_i,Y_j)\]

Properties for correlation coefficients

  1. Adding a constant to a random variable does not change their correlation coefficient.

    \[\cor(X+c, Y+k) = \cor(X,Y)\]
  2. Multiplying the random variables by constants of the same sign does not change their correlation coefficient; if the constants have opposite signs, the correlation changes sign.

    \[\cor(cX, dY) = \cor(X,Y) \quad \text{for } cd>0\]

Joint Distributions and Independence

Joint Distribution Function (Joint CDF)

\[\begin{aligned} F(x_1, x_2) &= P(X_1\le x_1, X_2\le x_2) \\ &= \left\{ \begin{array}{ll} \displaystyle \int_{-\infty}^{x_1}\int_{-\infty}^{x_2} f(a, b)\, \mathrm db\, \mathrm da & \text{continuous} \\ \displaystyle \sum_{a\le x_1} \sum_{b\le x_2} f(a,b) & \text{discrete} \end{array} \right. \end{aligned}\]

Joint Density Function (Joint PDF)

\[f\,(x_1, x_2) = \left\{ \begin{array}{ll} \displaystyle \frac{\partial^2F(x_1, x_2)}{\partial x_1\, \partial x_2} & \text{continuous} \\ \displaystyle P(X_1=x_1, X_2=x_2) & \text{discrete} \end{array} \right.\]

In other words, for the continuous case, $f(x_1,x_2)$ is the function that satisfies

\[F(x_1, x_2) = \int_{-\infty}^{x_2}\int_{-\infty}^{x_1} f(a,b)\, \mathrm da\,\mathrm db\]

Marginals

Consider a discrete random vector. When one of its entries is taken in isolation, its distribution can be characterized in terms of its probability mass function. This is called the marginal probability mass function, to distinguish it from the joint probability mass function (PMF), which characterizes the joint distribution of all the entries of the random vector considered together.

Formal Definition Let $X_1, \ldots, X_K$ be $K$ discrete random variables forming a $K\times 1$ random vector. Then, for each $i=1,\ldots,K,$ the probability mass function of the random variable $X_i$, denoted by $p_{X_i}(x_i)\, ,$ is called the marginal probability mass function.

$p_{X_i}(x_i)$ is a function: $\mathbb R \mapsto [0,1]$ such that

\[p_{X_i}(x_i) = P(X_i=x_i)\]

where $p_{X_i}(x_i)$ is the probability that $X_i$ will be equal to $x_i$.

By contrast, the joint probability mass function of the vector $X$ is a function $p_X: \mathbb R^K \mapsto [0,1]$ such that

\[p_{X}(x) = p_{X_1,\ldots,X_K}(x_1, \ldots, x_K) = P(X_1=x_1, \ldots, X_K=x_K)\]

where $P(X_1=x_1, \ldots, X_K=x_K)$ is the probability that $X_i$ will be equal to $x_i$, simultaneously for all $i=1,\ldots, K\,.$

Marginal PMFs

The marginal PMF of $X$ is given by

\[\begin{align*} P_X(x) &= P(X=x) \\ &= \sum_{y_j\in R_Y} P(X=x, Y=y_j) \quad \text{(Law of total probability)} \\ &= \sum_{y_j\in R_Y} P_{XY}(x, y_j) \end{align*}\]

where $R_Y=\{y_1,y_2,\ldots\}$ is the range of $Y$.

Similarly, the marginal PMF of $Y$ is given by

\[P_Y(y) = \sum_{x_i\in R_X} P_{XY}(x_i, y)\]

where $R_X=\{x_1,x_2,\ldots\}$ is the range of $X$.

We can define the joint range for $X$ and $Y$ as

\[R_{XY} = \{(x,y) | P_{XY}(x,y) >0 \} \, .\]

Equivalently, we can also write

\[\begin{aligned} R_{XY} &\subset R_X \times R_Y \\ &= \{(x_i,y_j) | x_i\in R_X, y_j\in R_Y \} \, . \end{aligned}\]

Note that

  • The event $X=x$ can be written as $\{(x_i,y_j): x_i=x, y_j \in R_Y \}\, .$
  • Also, the event $Y=y$ can be written as $\{(x_i,y_j): x_i\in R_X, y_j=y\}\, .$

Obtaining the marginal of $X_1$ by summing or integrating over $X_2$ is called marginalization, or integrating out $X_2$.
The marginal of one variable is obtained by integrating out the other; the same applies in $n$ dimensions.

\[f_{X_1}(k) = \left\{ \begin{array}{ll} \displaystyle \int_{-\infty}^\infty f(k, x_2) \, \mathrm dx_2 & \text{continuous} \\ \displaystyle \sum_{x_2} f(X_1=k,X_2=x_2) & \text{discrete} \end{array} \right.\]

Marginalizing the joint density w.r.t. $X_1$ gives the marginal of $X_2$:

\[f_{X_2}(k) = \left\{ \begin{array}{ll} \displaystyle \int_{-\infty}^\infty f(x_1,k) \, \mathrm dx_1 & \text{continuous} \\ \displaystyle \sum_{x_1} f(X_1=x_1,X_2=k) & \text{discrete} \end{array} \right.\]

where $\sum_{x_1}$ and $\sum_{x_2}$ mean sum over all values of $x_1$ and $x_2$, respectively.

Marginal CDFs

If we know the joint CDF of $X_1$ and $X_2$, we can find the marginal CDFs, $F_{X_1}(x_1)$ and $F_{X_2}(x_2)$. Specifically, for any $x \in \mathbb{R}$, we have

\[F_{X_1}(x_1) = P(X_1\le x_1) = P(X_1\le x_1, X_2\le \infty) = F_{X_1X_2}(x_1,\infty) = \lim_{x_2\to\infty} F_{X_1X_2}(x_1,x_2) \, .\]

Similarly,

\[F_{X_2}(x_2) = P(X_2\le x_2) = P(X_1\le \infty, X_2\le x_2) = F_{X_1X_2}(\infty,x_2) = \lim_{x_1\to\infty} F_{X_1X_2}(x_1,x_2) \, .\]

Example

Suppose we have a pair of discrete random variables $(X, Y)$, with an associated joint probability mass function, $f_{XY}(x,y)$, tabulated below.

|         | $X=1$ | $X=2$ | $X=3$ |
|---------|-------|-------|-------|
| $Y=-1$  | 0.1   | 0.1   | 0.0   |
| $Y=0$   | 0.2   | 0.0   | 0.3   |
| $Y=2$   | 0.1   | 0.2   | 0.0   |

The sum of all probabilities should be equal to 1.

\[\sum_{i}\sum_{j} f_{X,Y} (x_i, y_j) = 1\]

To calculate the marginal masses, we sum along either the rows or the columns, respectively.

Summing along the rows ($X$) gives us the marginal probability of $Y$.

\[\begin{aligned} f_Y(Y=-1) &= \sum_{i=1}^3 f_{X,Y}(x_i, -1) = 0.1+0.1+0 = 0.2 \\ f_Y(Y=0) &= \sum_{i=1}^3 f_{X,Y}(x_i, 0) = 0.2+0+0.3 = 0.5 \\ f_Y(Y=2) &= \sum_{i=1}^3 f_{X,Y}(x_i, 2) = 0.1+0.2+0 = 0.3 \end{aligned}\]

While summing along the columns ($Y$) gives us the marginal probability of $X$.

\[\begin{aligned} f_X(X=1) &= \sum_{j=1}^3 f_{X,Y}(1, y_j) = 0.1+0.2+0.1 = 0.4 \\ f_X(X=2) &= \sum_{j=1}^3 f_{X,Y}(2, y_j) = 0.1+0+0.2 = 0.3 \\ f_X(X=3) &= \sum_{j=1}^3 f_{X,Y}(3, y_j) = 0+0.3+0 = 0.3 \end{aligned}\]

Then the calculation of expectations is straightforward:

\[\begin{aligned} \mathbb E(X) &= \sum_i x_i f_X(x_i) = 1\times 0.4 + 2\times 0.3 + 3\times 0.3 = 1.9 \\ \mathbb E(Y) &= \sum_i y_i f_Y(y_i) = -1\times 0.2 + 0\times 0.5 + 2\times 0.3 = 0.4 \end{aligned}\]

so that, $\mathbb E (X)\, \mathbb E (Y)= 0.76\,.$

After some calculation, summing over the entire table:

\[\mathbb E(XY) = \sum_{i,j}x_i y_j f_{X,Y}(x_i, y_j) =0.7 \,.\]

Since $\mathbb E(XY)=0.7\neq 0.76 = \mathbb E(X)\,\mathbb E(Y)$, $X$ and $Y$ are correlated (and hence not independent).
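
The table calculations above are easy to reproduce with NumPy:

```python
import numpy as np

# Rows index Y in (-1, 0, 2); columns index X in (1, 2, 3)
f = np.array([[0.1, 0.1, 0.0],
              [0.2, 0.0, 0.3],
              [0.1, 0.2, 0.0]])
x_vals = np.array([1, 2, 3])
y_vals = np.array([-1, 0, 2])

f_X = f.sum(axis=0)    # marginal of X: [0.4, 0.3, 0.3]
f_Y = f.sum(axis=1)    # marginal of Y: [0.2, 0.5, 0.3]

E_X = (x_vals * f_X).sum()                   # 1.9
E_Y = (y_vals * f_Y).sum()                   # 0.4
E_XY = (np.outer(y_vals, x_vals) * f).sum()  # 0.7

print(E_X, E_Y, E_XY, E_XY - E_X * E_Y)      # Cov(X, Y) = 0.7 - 0.76 = -0.06
```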


Independence

Two random variables $X_1$ and $X_2$ are independent if and only if

\[F_{X_1X_2}(x_1, x_2) = F_{X_1}(x_1) F_{X_2}(x_2)\]

where $F_{X_1X_2}(x_1, x_2)$ is their joint distribution function and $F_{X_1}(x_1)$ and $F_{X_2}(x_2)$ are their marginal distribution functions.

In that case, the joint probability density function, $f_{X_1X_2}(x_1, x_2),$ is also the product of their marginal probability density functions.

\[f_{X_1X_2}(x_1, x_2) = f_{X_1}(x_1) f_{X_2}(x_2)\]

Conditioning

Conditional density:

\[\begin{aligned} f_{X_1\vert X_2}(X_1=x_1\vert X_2=x_2) &= \frac{f_{X_1 X_2}(X_1=x_1, X_2=x_2)}{f_{X_2}(X_2=x_2)} \\ f_{X_2\vert X_1}(X_2=x_2\vert X_1=x_1) &= \frac{f_{X_1 X_2}(X_1=x_1, X_2=x_2)}{f_{X_1}(X_1=x_1)} \end{aligned}\]

Note that the numerator is the joint density and the denominator is the marginal density.
The conditional pdf can be written more succinctly as

\[\begin{aligned} f_{X_1\vert X_2}(x_1\vert x_2) &= \frac{f_{X_1 X_2}(x_1, x_2)}{f_{X_2}(x_2)} \\ f_{X_2\vert X_1}(x_2\vert x_1) &= \frac{f_{X_1 X_2}(x_1, x_2)}{f_{X_1}(x_1)} , \end{aligned}\]

or we can drop the subscripts all together

\[\begin{aligned} f(x_1\vert x_2) &= \frac{f(x_1, x_2)}{f(x_2)} \\ f(x_2\vert x_1) &= \frac{f(x_1, x_2)}{f(x_1)} . \end{aligned}\]

Here, we use the lower case letters to denote the actual values the RV’s take.

Multiplication Rule

\[\begin{aligned} f(x_1, x_2) &= f(x_1\vert x_2)\, f(x_2) \\ &= f(x_2\vert x_1)\, f(x_1) \end{aligned}\]

The conditional CDF is obtained by integrating (or summing) the conditional pdf:

\[\begin{aligned} F_{X_1\vert X_2}(x_1\vert X_2=x_2) &= P(X_1\le x_1 \vert X_2=x_2) \\ &= \left\{ \begin{array}{ll} \int_{-\infty}^{x_1} f_{X_1\vert X_2}(s {\color{#008B45FF}\vert X_2=x_2}) \, \mathrm ds & \text{continuous} \\ \sum_{s\le x_1} f_{X_1\vert X_2}(X_1=s {\color{#008B45FF}\vert X_2=x_2}) &\text{discrete} \end{array} \right. \end{aligned}\]

or could be written more succinctly as

\[\begin{aligned} F(x_1\vert x_2) &= P(X_1\le x_1 \vert X_2=x_2) \\ &= \left\{ \begin{array}{ll} \int_{-\infty}^{x_1} f(s{\color{#008B45FF}\vert x_2})\, \mathrm ds & \text{continuous} \\ \sum_{s\le x_1} f(X_1=s {\color{#008B45FF}\vert X_2=x_2}) &\text{discrete} \end{array} \right. \end{aligned}\]

Expectation

\[\mathbb{E}[h(X,Y)] = \int_{-\infty}^\infty\int_{-\infty}^\infty h(x,y)\, f_{XY}(x,y) \, \mathrm dx \mathrm dy\]

Conditional expectation on $X$ is indicated using a subscript in $\mathbb{E}_X$, e.g.,

\[\mathbb{E}_X[h(X, Y)] = \mathbb{E}[h(X, Y)\vert X=x] = \int_{-\infty}^\infty h(x,y)\, f_{Y\vert X}(y\vert x)\, \mathrm dy\]

Here, we “integrate out” the $Y$ variable, and we are left with a function of $X$.

It is also possible that the subscript indicates the marginal density over which the expectation is taken.

\[\mathbb{E}_X[h(X, Y)] = \int_{-\infty}^\infty h(x,y) f_X(x) \mathrm dx\]

Here, we “average over” the $X$ variable, and we are left with a function of $Y$.


Conditional expectation for several variables

We extend to three variables.

We have the following probability mass function:

\[P(X=x, Y=y, Z=z) = P(x,y,z) \, ,\]

where we use shorthand such as $P(X=x, Y=y \vert Z=z) = P(x,y\vert z)$.

Joint marginal mass is given by

\[\begin{aligned} P(x,y) &= \sum_z P(x,y,z) \\ P(x,z) &= \sum_y P(x,y,z) \\ P(y,z) &= \sum_x P(x,y,z) \end{aligned} .\]

The joint conditional probability is given by

\[P(x,y|z) = \frac{P(x,y,z)}{P(z)} \, ,\]

and the conditional probability.

\[P(x|y,z) = \frac{P(x,y,z)}{P(y,z)} \, .\]

We have the following useful lemmas.

  • Linearity:
\[\mathbb E(aX+bY\vert Z) = a \mathbb E(X\vert Z) + b \mathbb E(Y\vert Z)\]
  • Pull-through rule:

    \[\mathbb E\left[g(X)\,h(Y)\vert Y\right] = h(Y)\, \mathbb E\left[g(X)\vert Y\right]\]

    Proof:

    \[\begin{align*} \mathbb E\left[g(X)h(Y)\vert Y=y\right] &= \sum_x g(x) h(y) P(x|y) \\ &= h(y) \sum_x g(x) P(x|y) \\ &= h(y)\, \mathbb E[g(X)|Y=y] \tag*{\(\square\)} \end{align*}\]
  • Tower rule:

    \[\mathbb E\left[ \mathbb E\left(Z\vert X,Y\right) \vert Y\right] = \mathbb E \left[ \mathbb E(Z\vert Y) \vert Y, X\right ]= \mathbb E[Z\vert Y]\]

    This is a generalization of the law of iterated expectations.

    Proof:

    \[\begin{align*} E\left[ \mathbb E\left(Z\vert X,Y\right) \vert Y\right] &= \sum_x \mathbb E\left(Z\vert X,Y\right) P(x|y) \\ &= \sum_x \left[\sum_z z P(z|x,y)\right] P(x|y) \\ &= \sum_x \sum_z z \, \frac{P(x,y,z)}{P(x,y)} \times \frac{P(x,y)}{P(y)} \\ &= \sum_x\sum_z z\, \frac{P(x,y,z)}{P(y)} \\ &= \sum_x\sum_z z\, P(x,z|y) \\ &= \sum_z z\, P(z|y) \\ &= \mathbb E(Z|Y) \end{align*}\]

Moments

We often summarize properties of distributions using their moments.

The $r^{\text{th}}$ order moment is defined by

\[\begin{aligned} \mathbb{E}(X^r) = \left\{ \begin{array}{ll} \int_{-\infty}^\infty x^r f(x)dx & \text{continuous} \\ \sum_x x^r f(x) &\text{discrete} \end{array} \right. \end{aligned}\]

First order moment

The first moment is called the expected value or expectation, which is given by

\[\begin{aligned} \mathbb{E}(X) = \left\{ \begin{array}{ll} \int_{-\infty}^\infty x f(x)dx & \text{continuous} \\ \sum_x x f(x) &\text{discrete} \end{array} \right. \end{aligned}\]

Second moment about the mean

Also called second central moment. The variance is obtained by setting $g(X)=\left[X-\mathbb{E}(X)\right]^2.$

\[\begin{aligned} \text{Var}(X) = \mathbb{E}\left[\left(X-\mathbb{E}(X)\right)^2\right] = \left\{ \begin{array}{ll} \int_{-\infty}^\infty \left[x-\mathbb{E}(X)\right]^2 f(x)dx & \text{continuous} \\ \sum_x \left[x-\mathbb{E}(X)\right]^2 f(x) &\text{discrete} \end{array} \right. \end{aligned}\]

Unconditional expectation of functions of RVs

\[\begin{aligned} \mathbb{E}[g(x)] = \left\{ \begin{array}{ll} \int_{-\infty}^\infty g(x){\color{red}f(x)}dx & \text{continuous} \\ \sum_x g(x){\color{red}f(x)} &\text{discrete} \end{array} \right. \end{aligned}\]

Conditional expectation of functions of RVs

\[\begin{aligned} \mathbb{E}(g(x) \vert Y=y) = \left\{ \begin{array}{ll} \int_{-\infty}^\infty g(x){\color{#008B45FF}f(x\vert Y=y)}dx & \text{continuous} \\ \sum_x g(x){\color{#008B45FF}f(x\vert Y=y)} &\text{discrete} \end{array} \right. \end{aligned}\]

Note:

  • For unconditional moments, use the appropriate unconditional density.
  • For conditional moments, use the appropriate conditional density.
  • Expectation or expected value is a population quantity because it requires knowledge of the density function.
  • The sample analogue of the expected value is the sample mean or sample average.

Conditional and Unconditional Variance

Unconditional

\[\begin{aligned} \text{Var}(X) &= \mathbb{E} [(X-\mathbb{E}(X))^2] \\ &= \left\{ \begin{array}{ll} \int_{-\infty}^{\infty} \left(x-\mathbb{E}(X) \right)^2 f(x) dx & \text{continuous} \\ \sum_{x} \left(x-\mathbb{E}(X) \right)^2 f(x) &\text{discrete} \end{array} \right. \end{aligned}\]

Conditional variance use conditional expectations

\[\begin{aligned} \text{Var}(X{\color{#008B45FF}\vert Y}) &= \mathbb{E} [(X-\mathbb{E}(X{\color{#008B45FF}\vert Y}))^2 {\color{#008B45FF}\vert Y}] \\ &= \left\{ \begin{array}{ll} \int_{-\infty}^{\infty} [x-\mathbb{E}(X{\color{#008B45FF}\vert Y=y})]^2 f(x{\color{#008B45FF}\vert y}) dx & \text{continuous} \\ \sum_{x} [x-\mathbb{E}(X{\color{#008B45FF}\vert Y=y})]^2 f(x{\color{#008B45FF}\vert y}) &\text{discrete} \end{array} \right. \end{aligned}\]

Alternatively, the conditional variance can be written as

\[\text{Var}(X{\color{#008B45FF}\vert Y}) = \mathbb{E}[X^2{\color{#008B45FF}\vert Y}] - \left(\mathbb{E}[X{\color{#008B45FF}\vert Y} ]\right )^2\]

Independence conditional on other variables

$X_1$ and $X_2$ are said to be independent conditional on $X_3$ if

\[f(x_1, x_2 \vert x_3) = f(x_1\vert x_3) f(x_2\vert x_3)\]

The left-hand-side (LHS) represents the joint behavior of $X_1$ and $X_2$ conditional on $X_3$, the RHS represents the individual behavior conditional on $X_3$.

This is denoted as

\[X_1 \indep X_2 \vert X_3\]

Note that this does not imply $X_1 \indep X_2$!
E.g., $X_1$ and $X_2$ can be returns on two equities where $X_3$ is some global macroeconomic factor affecting multiple variables at once (e.g. federal reserve interest rate).
Another example for $X_1, X_2$ would be wages and level of education, whereas $X_3$ is level of intelligence.

For three or more RVs, the joint PDF, joint PMF, and joint CDF are defined in a similar way to what we have seen for the case of two random variables.
Let $X_1, X_2, \ldots, X_n$ be $n$ discrete RVs. The joint PMF is defined as

\[P_{X_1, X_2, \ldots, X_n} (x_1, x_2, \ldots, x_n) = P(X_1=x_1, X_2=x_2, \ldots, X_n=x_n).\]

For $n$ jointly continuous RVs $X_1, X_2, \cdots, X_n$, the joint PDF is defined to be the function $f_{X_1, X_2, \cdots, X_n}(x_1, x_2, \cdots, x_n)$ such that the probability of any set $A\subseteq \mathbb{R}^n$ is given by the integral of the PDF over the set $A$. In particular, we can write

\[P\big((X_1, X_2, \cdots, X_n)\in A\big) = \underset{A}{\int \cdots \int} f_{X_1, X_2, \cdots, X_n}(x_1, x_2, \cdots, x_n)\, dx_1dx_2\cdots dx_n.\]
Definition $X_1, X_2, \cdots, X_n$ are said to be independent and identically distributed (iid) if they are independent, and they have the same marginal distributions: $$ F_{X_1}(x) = F_{X_2}(x) = \cdots = F_{X_n}(x), \text{for all } x\in\mathbb{R}. $$

The moment-generating function (MGF) of a random variable $X$ is

\[M_X(t) = E(e^{tX}) = \begin{cases} \displaystyle \int_{-\infty}^{\infty} e^{tx} f_X(x)\, \mathrm dx & \text{continuous } X \\ \displaystyle \sum_{i} e^{tx_i} P(X=x_i) & \text{discrete } X \end{cases}\]
  • The MGF of $X$ gives us all moments of $X$. That is why it is called the moment generating function.

    We can obtain all moments of $X$:

    \[\mu_n = E[X^n] = M_X^{(n)} (0) = \left. \frac{\mathrm d^n M_X}{\mathrm dt^n} \right\vert_{t=0} \,.\]

    That is, the $n$th moment about zero is the $n$th derivative of the moment generating function, evaluated at $t=0.$

  • The MGF (if it exists) uniquely determines the distribution.
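
A symbolic sketch with SymPy, using the exponential distribution as an example: its MGF is $M_X(t)=\lambda/(\lambda-t)$ for $t<\lambda$, and differentiating at $t=0$ recovers the moments.

```python
import sympy as sp

t, lam = sp.symbols("t lambda", positive=True)

# MGF of an Exponential(lambda) random variable, valid for t < lambda
M = lam / (lam - t)

m1 = sp.diff(M, t).subs(t, 0)      # E[X]   = 1/lambda
m2 = sp.diff(M, t, 2).subs(t, 0)   # E[X^2] = 2/lambda^2

print(sp.simplify(m1), sp.simplify(m2), sp.simplify(m2 - m1**2))  # variance = 1/lambda^2
```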

Linear Algebra
https://www.youtube.com/watch?v=fNk_zzaMoSs&list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab


Change of Variable Formula

Given the distribution of $X$, we can obtain the distribution of a continuous function of $X$, e.g. $Y=g(X)$.

\[\begin{align*} F_Y(y)&=P(Y\le y)=P\left(g(X)\le y\right) \\ &=\left\{ \begin{array}{ll} P\left(X\le g^{-1}(y)\right) = F_X(g^{-1}(y)) & \text{when $g(x)$ is $\uparrow$} \\ P\left(X\ge g^{-1}(y)\right) = 1- F_X(g^{-1}(y)) & \text{when $g(x)$ is $\downarrow$} \end{array} \right. \\ f_Y(y) &= f_X(g^{-1}(y)) \cdot \left\vert \frac{\partial }{\partial y} g^{-1}(y) \right\vert \end{align*}\]
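
A numerical sketch of the CDF part of this formula: with $X\sim N(0,1)$ and the increasing map $Y=e^X$, we should have $F_Y(y)=F_X(g^{-1}(y))=\Phi(\log y)$. The standard normal $X$ and the exponential map are illustrative choices.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)

x = rng.normal(size=1_000_000)
y = np.exp(x)          # Y = g(X) = exp(X), an increasing transformation

# For increasing g: F_Y(y) = F_X(g^{-1}(y)) = Phi(log y)
for q in (0.5, 1.0, 2.0, 5.0):
    empirical = (y <= q).mean()
    formula = norm.cdf(np.log(q))
    print(f"y = {q}: empirical F_Y = {empirical:.4f}, Phi(log y) = {formula:.4f}")
```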

Change of Variable for a Double Integral

Let $X$ and $Y$ be two jointly continuous random variables. Let $(Z, W) = g(X,Y) = (g_1(X,Y), g_2(X,Y))$, where $g$: $\mathbb{R}^2 \mapsto \mathbb{R}^2$ is a continuous one-to-one (invertible) function with continuous partial derivatives.

Let $h=g^{-1}$, i.e., the inverse function that takes $(Z,W)$ and returns $(X,Y)$, $(X,Y)=h(Z,W)=(h_1(Z,W),h_2(Z,W))$.

Then $Z$ and $W$ are jointly continuous and their joint PDF, $f_{ZW}(z,w)$, for $(z,w)\in R_{ZW}$ is given by

\[f_{ZW}(z,w)=f_{XY}(h_1(z,w),h_2(z,w)) \cdot \vert \boldsymbol{J}\vert,\]

where $\boldsymbol{J}$ is the Jacobian determinant of $h$, defined by

\[\boldsymbol{J} = \text{det} \begin{bmatrix} \frac{\partial h_1}{\partial z} & \frac{\partial h_1}{\partial w} \\ \frac{\partial h_2}{\partial z} & \frac{\partial h_2}{\partial w} \end{bmatrix} = \frac{\partial h_1}{\partial z} \cdot \frac{\partial h_2}{\partial w} - \frac{\partial h_1}{\partial w} \cdot \frac{\partial h_2}{\partial z}\]

Let $X$ and $Y$ be two random variables with joint PDF $f_{XY}(x,y)$. Let $Z=X+Y$. Find $f_Z(z)$.

To apply the change of variable, we need two random variables $Z$ and $W$. Define

\[\begin{align*} \left\{ \begin{array}{ll} z = x+y & \text{i.e., }g_1(x,y)\\ w = x & \text{i.e., }g_2(x,y) \end{array} \right. \end{align*}\]

Then we can find the inverse transform:

\[\begin{align*} \left\{ \begin{array}{ll} x = w & \text{i.e., }h_1(z,w)\\ y = z-w & \text{i.e., }h_2(z,w) \end{array} \right. \end{align*}\]

Then, we have absolute value of the Jacobian

\[\vert \boldsymbol{J} \vert = \vert \text{det} \begin{bmatrix} 0 & 1 \\ 1 & -1 \end{bmatrix} \vert = \vert 0-1 \vert = 1\]

Thus,

\[f_{ZW}(z,w)=f_{XY}(w,z−w).\]

But since we are interested in the marginal PDF, $f_Z(z)$, we have

\[f_Z(z)=\int_{-\infty}^{\infty}f_{XY}(w,z−w)dw.\]

Note that, if $X$ and $Y$ are independent, then $f_{XY}(x,y)=f_X(x)f_Y(y)$ and we conclude that \(f_Z(z)=\int_{-\infty}^{\infty}f_X(w)f_Y(z-w)dw.\)

The above integral is called the convolution of $f_X$ and $f_Y$, and we write

\[f_Z(z)=f_X(z)*f_Y(z) = \int_{-\infty}^{\infty}f_X(w)f_Y(z−w)dw = \int_{-\infty}^{\infty}f_Y(w)f_X(z−w)dw.\]
Theorem If X and Y are two jointly continuous random variables and $Z=X+Y$, then $$ f_Z(z)=\int_{-\infty}^{\infty}f_{XY}(w,z−w)dw. $$ If X and Y are also independent, then $$ f_Z(z) = \int_{-\infty}^{\infty}f_Y(w)f_X(z−w)dw. $$

Let $X$ and $Y$ be two independent discrete random variables. Denote their respective pmfs (probability mass function) by $p_X(x)$ and $p_Y(y)$, and their supports by $R_X$ and $R_{Y}$. Let

\[Z=X+Y\]

and denote the pmf of $Z$ by $p_Z(z)$. Then,

\[\begin{align*} p_Z(z) &= \sum_{k=-\infty}^{\infty} p_X(x=k)\cdot p_Y(y=z-k) \\ &= \sum_{k\in R_X} p_X(x=k)\cdot p_Y(y=z-k) \end{align*}\]

or

\[p_Z(z) = \sum_{k\in R_Y} p_X(x=z-k)\cdot p_Y(y=k).\]
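
A sketch of the discrete convolution formula with two fair dice: convolving the two PMFs gives the familiar triangular PMF of the sum on $2,\ldots,12$.

```python
import numpy as np

p_x = np.full(6, 1 / 6)        # PMF of a fair die on faces 1..6
p_y = np.full(6, 1 / 6)

p_z = np.convolve(p_x, p_y)    # PMF of Z = X + Y on values 2..12

for z, p in zip(range(2, 13), p_z):
    print(z, round(p, 4))      # e.g. P(Z = 7) = 6/36 = 0.1667
```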

Big-O Little-o Notation

Consider a sequence of random variables $X_n$ and a sequence of constants $a_n$ for $n = 1, 2, \ldots$
If $X_n/a_n \xrightarrow{p} 0$, we say $(X_n/a_n)=o_p(1)$ or $X_n=o_p(a_n)$.
Consequently:

  1. If $X_n\xrightarrow{p}0$, we say $X_n=o_p(1)$.
  2. If $n^\alpha X_n\xrightarrow{p}0$ for some $\alpha$, we say $X_n/n^{-\alpha}=o_p(1)$ or $X_n=o_p(n^{-\alpha})$.

E.g., For $X_1, X_2, \ldots, X_n$ iid with mean $\mu$ and variance $\sigma^2$, by the LLN we have $\overline{X}\xrightarrow{p}\mu$. Then, $\overline{X}-\mu\xrightarrow{p}0$ and so $\overline{X}-\mu = o_p(1)$.

Big-O notation relaxes this requirement: the scaled sequence need only converge to some finite limit (zero or non-zero), rather than to zero.
If $X_n/a_n \xrightarrow{d} X$ or $X_n/a_n \xrightarrow{p} X$, we say $(X_n/a_n)=O_p(1)$ or $X_n=O_p(a_n)$.
Consequently:

  1. If $X_n\xrightarrow{d}X$ or $X_n\xrightarrow{p}X$, we say $X_n=O_p(1)$.
  2. If $n^\alpha X_n\xrightarrow{d}X$ or $n^\alpha X_n\xrightarrow{p}X$ for some $\alpha$, we say $X_n/n^{-\alpha}=O_p(1)$ or $X_n=O_p(n^{-\alpha})$.

E.g., For $X_1, X_2, \ldots, X_n$ iid with mean $\mu$ and variance $\sigma^2$, define $Z=\frac{\overline{X}-\mu}{\sigma}$. Then, by the CLT we have $\sqrt{n}Z\xrightarrow{d}N(0,1)$ and so $\sqrt{n}Z = O_p(1)$ or equivalently $Z=O_p(n^{-1/2})$.


Type I and II Errors

Type I error: rejecting a true $H_0$. Corresponds to the level of significance, $\alpha$,

\[\alpha = P(\text{reject } H_0 \vert H_0 \text{ is true})\,.\]

Type II error: failing to reject a false $H_0$. The probability of committing a Type II error is called $\beta\,.$

\[\beta=P(\text{fail to reject } H_0\vert H_0 \text{ is false})\,.\]

$\beta$ is related to the Power of a test. $\beta = 1-\text{Power of a test} = 1-P(\text{reject } H_0\vert H_0 \text{ is false})\,.$

In hypothesis testing, the size of a test is the (maximum) probability of committing a Type I error, that is, of incorrectly rejecting the null hypothesis when it is true.

The power of a test refers to the probability of correctly rejecting $H_0$ when $H_1$ is true.

(Figure: Type I and Type II errors.)

Note:

  • If $\alpha$ increases, the chance of making a type I error increases. You are then less likely to make a type II error, simply because you will be rejecting $H_0$ more often and failing to reject it less often. Thus, as $\alpha$ increases, $\beta$ decreases, and vice versa. That makes them seem like complements, but they are not complements.
  • For a constant sample size, $n$, if $\alpha$ increases, $\beta$ decreases.
    For a constant significance level, $\alpha$ , if $n$ increases, $\beta$ decreases.

Confusion matrix

Precision is the proportion of all the model’s positive classifications that are actually positive.

\[\text{Precision} = \frac{\text{correctly classified actual positives}} {\text{everything classified as positive}} = \frac{TP}{TP+FP}\]

The True Positive Rate (TPR), or the proportion of all actual positives that were classified correctly as positives, is also known as recall.

A hypothetical perfect model would have zero false negatives and therefore a recall (TPR) of 1.0, which is to say, a 100% detection rate.

In an imbalanced dataset where the number of actual positives is very low, recall is a more meaningful metric than accuracy because it measures the ability of the model to correctly identify all positive instances. For applications like disease prediction, correctly identifying the positive cases is crucial. A false negative typically has more serious consequences than a false positive.

\[\text{Recall (or True Positive Rate, TPR)} = \frac{\text{correctly classified actual positives}} {\text{all actual positives}} = \frac{TP}{TP+FN}\]

The False Positive Rate (FPR) is the proportion of all actual negatives that were classified incorrectly as positives, also known as the probability of false alarm.

\[\text{FPR} = \frac{\text{incorrectly classified actual negatives}} {\text{all actual negatives}} = \frac{FP}{FP+TN}\]

Precision improves as false positives decrease, while recall improves when false negatives decrease.
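
A minimal sketch computing these three metrics from hypothetical confusion-matrix counts (the counts are made up for illustration).

```python
# Hypothetical confusion-matrix counts
TP, FP, FN, TN = 80, 30, 20, 870

precision = TP / (TP + FP)   # of everything flagged positive, how much is truly positive
recall = TP / (TP + FN)      # of all actual positives, how many were caught (TPR)
fpr = FP / (FP + TN)         # of all actual negatives, how many were flagged

print(f"precision = {precision:.3f}, recall = {recall:.3f}, FPR = {fpr:.3f}")
# precision = 0.727, recall = 0.800, FPR = 0.033
```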

Receiver-operating characteristic curve (ROC)

The ROC curve is a visual representation of model performance across all thresholds.

The ROC curve is drawn by calculating the true positive rate (TPR) and false positive rate (FPR) at every possible threshold (in practice, at selected intervals), then graphing TPR over FPR.

A perfect model, which at some threshold has a TPR of 1.0 and a FPR of 0.0, can be represented by either a point at (0, 1) if all other thresholds are ignored, or by the following:

(Figure: ROC and AUC of a hypothetical perfect model. Source: Machine Learning, Google for Developers.)

Toggle thresholds and see how metrics and ROC curve change: https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc#auc_and_roc_for_choosing_model_and_threshold

The area under the ROC curve (AUC) represents the probability that the model, if given a randomly chosen positive and negative example, will rank the positive higher than the negative.

(Figure: ROC and AUC of completely random guesses.)

For a binary classifier, a model that does exactly as well as random guesses or coin flips has a ROC that is a diagonal line from (0,0) to (1,1). The AUC is 0.5, representing a 50% probability of correctly ranking a random positive and negative example.

The points on a ROC curve closest to (0,1) represent a range of the best-performing thresholds for the given model.

(Figure: Three labeled points, A, B, and C, representing thresholds.)

If false positives (false alarms) are highly costly, it may make sense to choose a threshold that gives a lower FPR, like the one at point A, even if TPR is reduced. Conversely, if false positives are cheap and false negatives (missed true positives) highly costly, the threshold for point C, which maximizes TPR, may be preferable. If the costs are roughly equivalent, point B may offer the best balance between TPR and FPR.

A concrete example: Imagine a situation where it’s better to allow some spam to reach the inbox than to send a business-critical email to the spam folder. You’ve trained a spam classifier for this situation where the positive class is spam and the negative class is not-spam. In this use case, it’s better to minimize false positives, even if true positives also decrease. Choose point A.
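
A sketch of how a ROC curve and its AUC can be computed with plain NumPy: sweep thresholds over simulated scores, record (FPR, TPR) pairs, and integrate with the trapezoidal rule. The simulated scores and labels are illustrative, not from any real classifier.

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated scores: positives tend to score higher than negatives
labels = rng.binomial(1, 0.3, size=5_000)
scores = rng.normal(loc=1.5 * labels, scale=1.0)

thresholds = np.linspace(scores.max(), scores.min(), 200)   # high -> low
tpr = np.array([((scores >= t) & (labels == 1)).sum() / (labels == 1).sum()
                for t in thresholds])
fpr = np.array([((scores >= t) & (labels == 0)).sum() / (labels == 0).sum()
                for t in thresholds])

# Trapezoidal rule over the (FPR, TPR) points gives the AUC
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)
print(f"AUC = {auc:.3f}")   # well above 0.5 for these informative scores
```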


Q: When should we use one-tailed hypothesis testing?
A: Authors should explain why they are more interested in an effect in one direction rather than the other.

Ex1: we compare the mean strength of parts from a supplier (102) to a target value (100). We are considering a new supplier only if the mean strength of their parts is greater than our target value. There is no need for us to distinguish between whether their parts are equally strong or less strong than the target value — either way we’d just stick with our current supplier.

$H_0$: new supplier = target value
$H_1$: new supplier > target value

Ex2: We want to know if the battery life is greater than the original after a manufacturing change.

$H_0$: new battery life = original life
$H_1$: new battery life > original life

  • A one-tailed test improves the power of a test, that is, the probability of correctly rejecting $H_0$ when the null hypothesis is truly false.

log-transformed Models

  1. Only the dependent/response variable is log-transformed. Exponentiate the coefficient. This gives the multiplicative factor for every one-unit increase in the independent variable. Example: the coefficient is 0.198. exp(0.198) = 1.218962. For every one-unit increase in the independent variable, our dependent variable increases by a factor of about 1.22, or 22%. Recall that multiplying a number by 1.22 is the same as increasing the number by 22%. Likewise, multiplying a number by, say 0.84, is the same as decreasing the number by 1 – 0.84 = 0.16, or 16%.
  2. Only independent/predictor variable(s) is log-transformed. Divide the coefficient by 100. This tells us that a 1% increase in the independent variable increases (or decreases) the dependent variable by (coefficient/100) units. Example: the coefficient is 0.198. 0.198/100 = 0.00198. For every 1% increase in the independent variable, our dependent variable increases by about 0.002. For x percent increase, multiply the coefficient by log(1.x). Example: For every 10% increase in the independent variable, our dependent variable increases by about 0.198 * log(1.10) = 0.02.
  3. Both dependent/response variable and independent/predictor variable(s) are log-transformed. Interpret the coefficient as the percent increase in the dependent variable for every 1% increase in the independent variable. Example: the coefficient is 0.198. For every 1% increase in the independent variable, our dependent variable increases by about 0.20%. For x percent increase, calculate 1.x to the power of the coefficient, subtract 1, and multiply by 100. Example: For every 20% increase in the independent variable, our dependent variable increases by about (1.20^0.198 – 1) * 100 = 3.7 percent.
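
A small simulation sketch of case 3 (log-log): generate data with a true elasticity of 0.198, fit $\log y$ on $\log x$ with `np.polyfit`, and check the "20% increase" arithmetic. All numbers besides the 0.198 coefficient are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(8)

n = 100_000
x = rng.uniform(1.0, 10.0, n)
beta = 0.198                                            # true elasticity
log_y = 2.0 + beta * np.log(x) + rng.normal(0.0, 0.1, n)

slope, intercept = np.polyfit(np.log(x), log_y, 1)
print(slope)                          # close to 0.198

# A 20% increase in x multiplies y by 1.20**beta, i.e. about +3.7%
print((1.20 ** slope - 1) * 100)      # about 3.7
```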

Semi-parametric models

  • Method of Moments
  • quantile regression

Textbooks

  • Econometric Analysis, 5th Edition, by William H. Greene, Prentice Hall, 2003.
  • Time Series Analysis, by J. D. Hamilton, Princeton University Press, 1994.
  • Estimation and Inference in Econometrics, by R. Davidson and J. MacKinnon, Oxford University Press, 1993.
  • Econometric Analysis of Cross Section and Panel Data, by J. Wooldridge, MIT Press, 1999.
  • Microeconometrics: Methods and Applications, by A. C. Cameron and P. K. Trivedi, Cambridge University Press, 2005.

  • Core Metrics textbooks

    • Casella, G. and Berger, R.L. (2002) Statistical Inference. 2nd ed. Duxbury.
    • Hendry, D.F. and Nielsen, B. (2007) Econometric Modeling. Princeton.
    • Hoel, P.G., Port, S.C. and Stone, C.J. (1971) Introduction to Probability. Boston: Houghton-Mifflin.
  • Resources