2  Conditional Expectation

Identities for conditional expectations

2.1 Generalized Adam’s Law

\[ \E\left[ \E[Y \mid g(X)] \mid f(g(X)) \right] = \E[Y\mid f(g(X))] \]

Show that the following identity is a special case of the Generalized Adam’s Law:

\[ \E[\E[Y\mid X,Z] \mid Z] = \E[Y\mid Z] \]

Proof. Take \(g(x, z) = (x, z)\) and \(f(x, z) = z\) in the Generalized Adam’s Law. Then \(\E[Y\mid g(X,Z)] = \E[Y\mid X, Z]\) and \(f(g(X,Z)) = Z,\) which gives the result.
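As a sanity check, here is a small Monte Carlo sketch of \(\E[\E[Y\mid X,Z]\mid Z]=\E[Y\mid Z],\) assuming a toy model with binary \(X\) and \(Z\) (the model and variable names are illustrative, not from the notes).

```python
# Monte Carlo sketch of E[ E[Y | X, Z] | Z ] = E[Y | Z] on a toy model
# (binary X and Z, chosen only for illustration).
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
X = rng.integers(0, 2, size=n)
Z = rng.integers(0, 2, size=n)
Y = X + 2 * Z + rng.normal(size=n)  # Y depends on both X and Z

# Estimate the inner conditional expectation E[Y | X, Z] by group means.
inner = np.empty(n)
for x in (0, 1):
    for z in (0, 1):
        cell = (X == x) & (Z == z)
        inner[cell] = Y[cell].mean()

# Averaging the inner estimate over {Z = z} should recover E[Y | Z = z].
for z in (0, 1):
    cell = Z == z
    # both sample averages estimate E[Y | Z = z] = 0.5 + 2z
    print(z, inner[cell].mean(), Y[cell].mean(), 0.5 + 2 * z)
```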

2.2 Projection interpretation

Conditional expectation gives the best prediction

Theorem 2.1 (Conditional expectation minimizes MSE) Suppose we have a random element \(X\in \Xcal\) and a random variable \(Y\in\R.\) Let \(g(x)=\E[Y\mid X=x].\) Then

\[ g = \underset{f:\Xcal\to\R}{\arg\min}\, \E(Y-f(X))^2 \]

Proof. \[ \begin{split} \E(Y-f(X))^2 &= \E\left[(Y-\E[Y\mid X]) + (\E[Y\mid X]-f(X)) \right]^2 \quad (\text{plus and minus } \E[Y\mid X]) \\ &= \E(Y-\E[Y\mid X])^2 + \E(\E[Y\mid X]-f(X))^2 \\ & \phantom{=}\; + 2\E[(Y-\E[Y\mid X])(\underbrace{\E[Y\mid X]-f(X)}_{h(X)})] \quad \Bigl(\E\bigl[ \bigl(Y-\E[Y\vert X]\bigr) h(X) \bigr] = 0\Bigr) \\ &= \E(Y-\E[Y\mid X])^2 + \E(\E[Y\mid X]-f(X))^2 \end{split} \]

The first term does not depend on \(f\), and the second term is minimized (it equals zero) by taking \(f(x)=\E[Y\mid X=x].\)
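A quick numerical illustration of Theorem 2.1, on an assumed toy model with \(\E[Y\mid X]\) estimated by within-group sample means: the conditional-mean predictor should have the smallest mean squared error among the candidate functions tried.

```python
# Compare the mean squared error of the (estimated) conditional mean E[Y | X]
# with a few other predictors f(X); toy model chosen only for illustration.
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
X = rng.integers(0, 5, size=n)             # discrete X in {0, ..., 4}
Y = X**2 + rng.normal(scale=2.0, size=n)   # so E[Y | X = x] = x**2, noise variance 4

# Estimate g(x) = E[Y | X = x] by the sample mean of Y within each level of X.
g_hat = np.array([Y[X == x].mean() for x in range(5)])

def mse(pred):
    return np.mean((Y - pred) ** 2)

print("conditional mean :", mse(g_hat[X]))              # ~ 4, the smallest
print("linear f(x) = 4x :", mse(4 * X))                 # larger
print("constant f = E[Y]:", mse(np.full(n, Y.mean())))  # larger still
```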

If we think of \(\E[Y\mid X]\) as a prediction/projection for \(Y\) given \(X\), then \((Y-\E[Y\mid X])\) is the residual of that prediction.

It’s helpful to think of decomposing \(Y\) as

\[ Y = \underbrace{\E[Y\mid X]}_\text{best prediction for $Y$ given $X$} + \underbrace{(Y-\E[Y\mid X])}_\text{residual} \]

Note that the two terms on the RHS are uncorrelated: apply the projection interpretation (Theorem 2.3 below) with \(h(X)=\E[Y\mid X],\) and note that the residual has mean zero by Adam’s Law.

Since variance is additive for uncorrelated random variables (i.e., if \(U\) and \(V\) are uncorrelated, then \(\var(U+V)=\var(U)+\var(V)\)), we get the following theorem.

Theorem 2.2 (Variance decomposition with projection) For any random element \(X\in \Xcal\) and random variable \(Y\in \R,\) we have

\[ \var(Y) = \var(\E[Y\mid X]) + \var(Y-\E[Y\mid X]) \]
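This recovers the familiar law of total variance (Eve’s law): the residual has mean zero by Adam’s Law, so its variance is its second moment, and conditioning on \(X\) gives

\[ \var(Y-\E[Y\mid X]) = \E\bigl[(Y-\E[Y\mid X])^2\bigr] = \E\Bigl[\E\bigl[(Y-\E[Y\mid X])^2 \mid X\bigr]\Bigr] = \E[\var(Y\mid X)] \]

so Theorem 2.2 can also be written as \(\var(Y) = \var(\E[Y\mid X]) + \E[\var(Y\mid X)].\)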

Theorem 2.1 tells us that \(\E[Y\mid X]\) is the best approximation of \(Y\) we can get from \(X.\) We can also think of \(\E[Y\mid X]\) as a “less random” version of \(Y,\) since \(\var(\E[Y\vert X]) \le \var(Y)\) by Theorem 2.2.

We can say that \(\E[Y\mid X]\) only keeps the randomness in \(Y\) that is predictable from \(X.\) \(\E[Y\mid X]\) is a deterministic function of \(X,\) so there’s no other source of randomness in \(\E[Y\mid X].\)

Theorem 2.3 (Projection interpretation) For any \(h:\Xcal \to \R,\)

\[ \E[(Y-\E[Y\mid X])h(X)]=0 \]
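A one-line derivation, using Adam’s Law followed by taking out what is known:

\[ \E\bigl[(Y-\E[Y\mid X])h(X)\bigr] = \E\Bigl[\E\bigl[(Y-\E[Y\mid X])h(X) \mid X\bigr]\Bigr] = \E\bigl[h(X)\,(\E[Y\mid X]-\E[Y\mid X])\bigr] = 0 \]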

Theorem 2.3 says that the residual of \(\E[Y\mid X]\) is “orthogonal” to every random variable of the form \(h(X).\)

2.3 Keeping just what is needed

Theorem 2.4 For any random variables \(X, Y\in \R,\)

\[ \E[XY] = \E[X\E[Y\mid X]] \]

One way to think about this is that for the purposes of computing \(\E [XY],\) we only care about the randomness in \(Y\) that is predictable from \(X\).

Proof.

\[ \begin{split} \E[XY] &= \E[\E[XY\mid X]] \quad (\text{Adam's Law / LIE}) \\ &= \E[X\E[Y\mid X]] \quad (\text{Taking out what is known}) \end{split} \]

Proof (Alternative proof 1). We can show this using the projection interpretation:

\[ \begin{split} \E[XY] &= \E\left[ X \left(\E[Y\mid X] + \underbrace{Y-\E[Y\mid X]}_\text{residuals uncorrelated with $X$} \right)\right] \\[1em] &= \E[X\E[Y\mid X]] + \E[X(Y-\E[Y\mid X])] \\ &= \E[X\E[Y\mid X]] \quad (\text{Projection interpretation, } \E[X(Y-\E[Y\mid X])]=0) \end{split} \]

Proof (Alternative proof 2, written for discrete \(X\) and \(Y\); the continuous case replaces the sums with integrals).

\[ \begin{split} \E[X\E[Y\mid X]] &= \sum_x x\E[Y\mid X=x] \P(X=x) \\ &= \sum_x\sum_y xy\P(Y=y\mid X=x)\P(X=x) \\ &= \sum_x\sum_y xy \P(Y=y, X=x) \\ &= \E[XY] \end{split} \]

A more general version of \(\E[XY] = \E[X\E[Y\mid X]]\) replaces the factor \(X\) by any function \(h(X)\):

\[ \E[h(X)Y] = \E[h(X)\E[Y\mid X]] \]

The same arguments work: either take out what is known after conditioning on \(X,\) or use the projection interpretation to drop the residual term \(\E[h(X)(Y-\E[Y\mid X])]=0.\)
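As with the earlier identities, here is a quick numerical sketch (on an assumed toy model, with \(\E[Y\mid X]\) again estimated by within-group sample means and an arbitrary choice of \(h\)):

```python
# Check E[h(X) Y] = E[h(X) E[Y | X]] numerically; the toy model and the
# particular h are illustrative assumptions, not from the notes.
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
X = rng.integers(0, 4, size=n)
Y = np.exp(X / 2) + rng.normal(size=n)

cond_mean = np.array([Y[X == x].mean() for x in range(4)])[X]  # estimate of E[Y | X]
h = np.sin(X)                                                  # any function of X

print(np.mean(h * Y), np.mean(h * cond_mean))  # the two averages agree
```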
