2  Conditional Expectation

Identities for conditional expectations

2.1 Generalized Adam’s Law

\[ \E\left[ \E[Y \mid g(X)] \mid f(g(X)) \right] = \E[Y\mid f(g(X))] \]

Show that the following identity is a special case of the Generalized Adam’s Law:

\[ \E[\E[Y\mid X,Z] \mid Z] = \E[Y\mid Z] \]

Proof. Take \(g(x, z) = (x, z)\) and \(f(x, z) = z\) in the Generalized Adam’s Law. Then \(\E[Y\mid g(X,Z)] = \E[Y\mid X, Z]\) and \(f(g(X,Z)) = Z,\) which gives the result.
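As a sanity check, here is a small Monte Carlo sketch of \(\E[\E[Y\mid X,Z]\mid Z]=\E[Y\mid Z],\) assuming a toy model with binary \(X\) and \(Z\) (the model and variable names are illustrative, not from the notes).

```python
# Monte Carlo sketch of E[ E[Y | X, Z] | Z ] = E[Y | Z] on a toy model
# (binary X and Z, chosen only for illustration).
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
X = rng.integers(0, 2, size=n)
Z = rng.integers(0, 2, size=n)
Y = X + 2 * Z + rng.normal(size=n)  # Y depends on both X and Z

# Estimate the inner conditional expectation E[Y | X, Z] by group means.
inner = np.empty(n)
for x in (0, 1):
    for z in (0, 1):
        cell = (X == x) & (Z == z)
        inner[cell] = Y[cell].mean()

# Averaging the inner estimate over {Z = z} should recover E[Y | Z = z].
for z in (0, 1):
    cell = Z == z
    # both sample averages estimate E[Y | Z = z] = 0.5 + 2z
    print(z, inner[cell].mean(), Y[cell].mean(), 0.5 + 2 * z)
```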

2.2 Projection interpretation

Conditional expectation gives the best prediction

Theorem 2.1 (Conditional expectation minimizes MSE) Suppose we have a random element \(X\in \Xcal\) and a random variable \(Y\in\R.\) Let \(g(x)=\E[Y\mid X=x].\) Then

\[ g = \underset{f:\Xcal\to\R}{\arg\min}\, \E(Y-f(X))^2 \]

Proof. \[ \begin{split} \E(Y-f(X))^2 &= \E\left[(Y-\E[Y\mid X]) + (\E[Y\mid X]-f(X)) \right]^2 \quad (\text{plus and minus } \E[Y\mid X]) \\ &= \E(Y-\E[Y\mid X])^2 + \E(\E[Y\mid X]-f(X))^2 \\ & \phantom{=}\; + 2\E[(Y-\E[Y\mid X])(\underbrace{\E[Y\mid X]-f(X)}_{h(X)})] \quad \Bigl(\E\bigl[ \bigl(Y-\E[Y\vert X]\bigr) h(X) \bigr] = 0\Bigr) \\ &= \E(Y-\E[Y\mid X])^2 + \E(\E[Y\mid X]-f(X))^2 \end{split} \]

The first term does not depend on \(f\), and the second term is minimized (it equals zero) by taking \(f(x)=\E[Y\mid X=x].\)
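A quick numerical illustration of Theorem 2.1, on an assumed toy model with \(\E[Y\mid X]\) estimated by within-group sample means: the conditional-mean predictor should have the smallest mean squared error among the candidate functions tried.

```python
# Compare the mean squared error of the (estimated) conditional mean E[Y | X]
# with a few other predictors f(X); toy model chosen only for illustration.
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
X = rng.integers(0, 5, size=n)             # discrete X in {0, ..., 4}
Y = X**2 + rng.normal(scale=2.0, size=n)   # so E[Y | X = x] = x**2, noise variance 4

# Estimate g(x) = E[Y | X = x] by the sample mean of Y within each level of X.
g_hat = np.array([Y[X == x].mean() for x in range(5)])

def mse(pred):
    return np.mean((Y - pred) ** 2)

print("conditional mean :", mse(g_hat[X]))              # ~ 4, the smallest
print("linear f(x) = 4x :", mse(4 * X))                 # larger
print("constant f = E[Y]:", mse(np.full(n, Y.mean())))  # larger still
```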

If we think of \(\E[Y\mid X]\) as a prediction/projection for \(Y\) given \(X\), then \((Y-\E[Y\mid X])\) is the residual of that prediction.

It’s helpful to think of decomposing \(Y\) as

\[ Y = \underbrace{\E[Y\mid X]}_\text{best prediction for $Y$ given $X$} + \underbrace{(Y-\E[Y\mid X])}_\text{residual} \]

Note that the two terms on the RHS are uncorrelated: apply the projection interpretation (Theorem 2.3 below) with \(h(X)=\E[Y\mid X],\) and note that the residual has mean zero by Adam’s Law.

Since variance is additive for uncorrelated random variables (i.e., if \(U\) and \(V\) are uncorrelated, then \(\var(U+V)=\var(U)+\var(V)\)), we get the following theorem.

Theorem 2.2 (Variance decomposition with projection) For any random element \(X\in \Xcal\) and random variable \(Y\in \R,\) we have

\[ \var(Y) = \var(\E[Y\mid X]) + \var(Y-\E[Y\mid X]) \]
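This recovers the familiar law of total variance (Eve’s law): the residual has mean zero by Adam’s Law, so its variance is its second moment, and conditioning on \(X\) gives

\[ \var(Y-\E[Y\mid X]) = \E\bigl[(Y-\E[Y\mid X])^2\bigr] = \E\Bigl[\E\bigl[(Y-\E[Y\mid X])^2 \mid X\bigr]\Bigr] = \E[\var(Y\mid X)] \]

so Theorem 2.2 can also be written as \(\var(Y) = \var(\E[Y\mid X]) + \E[\var(Y\mid X)].\)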

Theorem 2.1 tells us that \(\E[Y\mid X]\) is the best approximation of \(Y\) we can get from \(X.\) We can also think of \(\E[Y\mid X]\) as a “less random” version of \(Y,\) since \(\var(\E[Y\vert X]) \le \var(Y)\) by Theorem 2.2.

We can say that \(\E[Y\mid X]\) only keeps the randomness in \(Y\) that is predictable from \(X.\) \(\E[Y\mid X]\) is a deterministic function of \(X,\) so there’s no other source of randomness in \(\E[Y\mid X].\)

Theorem 2.3 (Projection interpretation) For any \(h:\Xcal \to \R,\)

\[ \E[(Y-\E[Y\mid X])h(X)]=0 \]
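A one-line derivation, using Adam’s Law followed by taking out what is known:

\[ \E\bigl[(Y-\E[Y\mid X])h(X)\bigr] = \E\Bigl[\E\bigl[(Y-\E[Y\mid X])h(X) \mid X\bigr]\Bigr] = \E\bigl[h(X)\,(\E[Y\mid X]-\E[Y\mid X])\bigr] = 0 \]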

Theorem 2.3 says that the residual of \(\E[Y\mid X]\) is “orthogonal” to every random variable of the form \(h(X).\)

2.3 Keeping just what is needed

Theorem 2.4 For any random variables \(X, Y\in \R,\)

\[ \E[XY] = \E[X\E[Y\mid X]] \]

One way to think about this is that for the purposes of computing \(\E [XY],\) we only care about the randomness in \(Y\) that is predictable from \(X\).

Proof.

\[ \begin{split} \E[XY] &= \E[\E[XY\mid X]] \quad (\text{Adam's Law / LIE}) \\ &= \E[X\E[Y\mid X]] \quad (\text{Taking out what is known}) \end{split} \]

Proof (Alternative proof 1). We can show this using the projection interpretation:

\[ \begin{split} \E[XY] &= \E\left[ X \left(\E[Y\mid X] + \underbrace{Y-\E[Y\mid X]}_\text{residuals uncorrelated with $X$} \right)\right] \\[1em] &= \E[X\E[Y\mid X]] + \E[X(Y-\E[Y\mid X])] \\ &= \E[X\E[Y\mid X]] \quad (\text{Projection interpretation, } \E[X(Y-\E[Y\mid X])]=0) \end{split} \]

Proof (Alternative proof 2, written for discrete \(X\) and \(Y\); the continuous case replaces the sums with integrals).

\[ \begin{split} \E[X\E[Y\mid X]] &= \sum_x x\E[Y\mid X=x] \P(X=x) \\ &= \sum_x\sum_y xy\P(Y=y\mid X=x)\P(X=x) \\ &= \sum_x\sum_y xy \P(Y=y, X=x) \\ &= \E[XY] \end{split} \]

A more general version of \(\E[XY] = \E[X\E[Y\mid X]]\) replaces the factor \(X\) by any function \(h(X)\):

\[ \E[h(X)Y] = \E[h(X)\E[Y\mid X]] \]

The same arguments work: either take out what is known after conditioning on \(X,\) or use the projection interpretation to drop the residual term \(\E[h(X)(Y-\E[Y\mid X])]=0.\)
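As with the earlier identities, here is a quick numerical sketch (on an assumed toy model, with \(\E[Y\mid X]\) again estimated by within-group sample means and an arbitrary choice of \(h\)):

```python
# Check E[h(X) Y] = E[h(X) E[Y | X]] numerically; the toy model and the
# particular h are illustrative assumptions, not from the notes.
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
X = rng.integers(0, 4, size=n)
Y = np.exp(X / 2) + rng.normal(size=n)

cond_mean = np.array([Y[X == x].mean() for x in range(4)])[X]  # estimate of E[Y | X]
h = np.sin(X)                                                  # any function of X

print(np.mean(h * Y), np.mean(h * cond_mean))  # the two averages agree
```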
