
Proof of the Data Processing Inequality

[!abstract] Data Processing Inequality
If $X \to Y \to Z$ forms a Markov chain (i.e., $X$ and $Z$ are conditionally independent given $Y$), then:

$$I(X; Y) \geq I(X; Z)$$

with equality if and only if $X \to Z \to Y$ also forms a Markov chain.

In words: Processing data can only destroy information, never create it.
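Before the proof, a quick numerical sanity check can make the statement concrete. The following is a minimal sketch (not part of the original note), assuming NumPy and using randomly generated distributions: it builds a small Markov chain $X \to Y \to Z$ from a prior $p(x)$ and two channels, then compares $I(X;Y)$ with $I(X;Z)$.

```python
# Numerical sanity check of the DPI on a randomly generated Markov chain
# X -> Y -> Z. All distributions below are made up for illustration; the
# helper `mutual_information` is not from the original note.
import numpy as np

def mutual_information(pxy):
    """I(X;Y) in bits, computed from a joint probability table pxy[x, y]."""
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y)
    mask = pxy > 0                        # avoid log(0)
    return float(np.sum(pxy[mask] * np.log2(pxy[mask] / (px @ py)[mask])))

rng = np.random.default_rng(0)
nx, ny, nz = 4, 3, 5

px = rng.dirichlet(np.ones(nx))                    # p(x)
p_y_given_x = rng.dirichlet(np.ones(ny), size=nx)  # rows: p(y|x)
p_z_given_y = rng.dirichlet(np.ones(nz), size=ny)  # rows: p(z|y)

pxy = px[:, None] * p_y_given_x                    # p(x, y)
pxz = pxy @ p_z_given_y                            # p(x, z) = sum_y p(x,y) p(z|y)

print("I(X;Y) =", mutual_information(pxy))
print("I(X;Z) =", mutual_information(pxz))
assert mutual_information(pxy) >= mutual_information(pxz) - 1e-12
```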

We’ll use the chain rule for mutual information and properties of conditional mutual information.

Lemma 1: Chain Rule for Mutual Information


Statement: $I(X; Y, Z) = I(X; Z) + I(X; Y | Z)$

Proof:

$$I(X; Y, Z) = H(X) - H(X|Y,Z)$$
$$I(X; Z) + I(X; Y|Z) = H(X) - H(X|Z) + H(X|Z) - H(X|Y,Z) = H(X) - H(X|Y,Z) \quad \checkmark$$
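As a quick check of the identity, here is a small NumPy sketch (illustrative helper names, random joint distribution) that verifies the chain rule on an arbitrary $p(x, y, z)$; no Markov assumption is needed for Lemma 1.

```python
# Quick numerical check of Lemma 1 on an arbitrary joint distribution p(x,y,z).
import numpy as np

def mi(pab):
    """I(A;B) in bits from a joint table pab[a, b]."""
    pa = pab.sum(axis=1, keepdims=True)
    pb = pab.sum(axis=0, keepdims=True)
    mask = pab > 0
    return float(np.sum(pab[mask] * np.log2(pab[mask] / (pa @ pb)[mask])))

rng = np.random.default_rng(1)
pxyz = rng.dirichlet(np.ones(3 * 4 * 2)).reshape(3, 4, 2)  # random p(x,y,z)

# Left-hand side: I(X; (Y,Z)) -- treat the pair (Y,Z) as a single variable.
lhs = mi(pxyz.reshape(3, -1))

# Right-hand side: I(X;Z) + I(X;Y|Z).
pxz = pxyz.sum(axis=1)                      # marginal p(x, z)
pz = pxz.sum(axis=0)                        # marginal p(z)
i_xz = mi(pxz)
i_xy_given_z = sum(pz[z] * mi(pxyz[:, :, z] / pz[z]) for z in range(pxyz.shape[2]))

print(lhs, i_xz + i_xy_given_z)             # agree to numerical precision
assert abs(lhs - (i_xz + i_xy_given_z)) < 1e-10
```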

Lemma 2: Markov Chains and Conditional Mutual Information

Statement: $X \to Y \to Z$ is Markov if and only if $I(X; Z | Y) = 0$

Proof: $X \to Y \to Z$ means $p(x, z | y) = p(x|y)\,p(z|y)$ for every $y$ with $p(y) > 0$. Thus $X \perp Z \mid Y$: conditioning on $Y$ makes $X$ and $Z$ independent, hence $I(X; Z | Y) = 0$. Conversely, $I(X; Z | Y) = 0$ implies $p(x, z | y) = p(x|y)\,p(z|y)$ wherever $p(y) > 0$, because conditional mutual information vanishes only under conditional independence, so the chain is Markov.
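A short numerical illustration of the forward direction (again a sketch with made-up distributions, assuming NumPy): build the joint as $p(x)\,p(y|x)\,p(z|y)$ and compute $I(X; Z | Y)$ directly; it comes out as zero up to round-off.

```python
# Checking Lemma 2 numerically: for a joint built as p(x) p(y|x) p(z|y),
# the conditional mutual information I(X;Z|Y) is zero (up to round-off).
import numpy as np

def mi(pab):
    pa = pab.sum(axis=1, keepdims=True)
    pb = pab.sum(axis=0, keepdims=True)
    mask = pab > 0
    return float(np.sum(pab[mask] * np.log2(pab[mask] / (pa @ pb)[mask])))

rng = np.random.default_rng(2)
px = rng.dirichlet(np.ones(3))
p_y_given_x = rng.dirichlet(np.ones(4), size=3)
p_z_given_y = rng.dirichlet(np.ones(2), size=4)

# Markov joint: p(x, y, z) = p(x) p(y|x) p(z|y)
pxyz = px[:, None, None] * p_y_given_x[:, :, None] * p_z_given_y[None, :, :]

py = pxyz.sum(axis=(0, 2))                  # marginal p(y)
i_xz_given_y = sum(py[y] * mi(pxyz[:, y, :] / py[y]) for y in range(4))
print("I(X;Z|Y) =", i_xz_given_y)           # ~0, as Lemma 2 predicts
```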

Proof of the Theorem

Assume $X \to Y \to Z$ is a Markov chain.

Apply the chain rule two ways:

First way: $I(X; Y, Z) = I(X; Z) + I(X; Y | Z)$

Second way: $I(X; Y, Z) = I(X; Y) + I(X; Z | Y)$

Since $X \to Y \to Z$ is Markov, by Lemma 2: $I(X; Z | Y) = 0$.

From the second expansion: $I(X; Y, Z) = I(X; Y) + 0 = I(X; Y)$

From the first expansion: $I(X; Y) = I(X; Z) + I(X; Y | Z)$

Since conditional mutual information is non-negative, $I(X; Y | Z) \geq 0$, and therefore:

$$I(X; Y) \geq I(X; Z)$$

Equality holds if and only if $I(X; Y | Z) = 0$, which by Lemma 2 (with the roles of $Y$ and $Z$ interchanged) is exactly the condition that $X \to Z \to Y$ is also a Markov chain.

$\square$
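To see the equality case concretely, here is a small sketch (illustrative only, assuming NumPy) in which $Z$ is an invertible relabelling of $Y$, so $X \to Z \to Y$ is also Markov and no information is lost.

```python
# Illustrating the equality case: if Z is an invertible (relabelling)
# function of Y, then X -> Z -> Y is also Markov and I(X;Y) = I(X;Z).
import numpy as np

def mi(pab):
    pa = pab.sum(axis=1, keepdims=True)
    pb = pab.sum(axis=0, keepdims=True)
    mask = pab > 0
    return float(np.sum(pab[mask] * np.log2(pab[mask] / (pa @ pb)[mask])))

rng = np.random.default_rng(3)
px = rng.dirichlet(np.ones(3))
p_y_given_x = rng.dirichlet(np.ones(4), size=3)
pxy = px[:, None] * p_y_given_x              # p(x, y)

perm = np.array([2, 0, 3, 1])                # invertible map g: y -> z
pxz = pxy[:, np.argsort(perm)]               # p(x, z) with z = g(y)

print("I(X;Y) =", mi(pxy))
print("I(X;Z) =", mi(pxz))                   # identical: no information lost
assert abs(mi(pxy) - mi(pxz)) < 1e-12
```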

This theorem has profound implications:

  1. No algorithm can extract more information about $X$ from $Z$ than was present in $Y$. If $Y$ is a lossy compression of $X$, and $Z$ is computed from $Y$, then $Z$ carries even less information about $X$ (see the sketch after this list).

  2. For the dissertation: Meme transmission is a Markov chain: $\text{Source idea} \to \text{Encoding} \to \text{Received idea}$. The DPI says the receiver can never have more information about the source than was in the transmitted message.

  3. Communication bound: This is why channel capacity matters—it limits how much information can traverse the channel.
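The sketch below (illustrative, assuming NumPy) makes implication 1 concrete: deterministically coarse-graining $Y$ into fewer symbols via a hypothetical map $Z = g(Y)$ can only reduce the information available about $X$.

```python
# Sketch for implication 1: any deterministic post-processing g of Y
# (here, merging the symbols of Y into coarser bins) cannot increase the
# information available about X.
import numpy as np

def mi(pab):
    pa = pab.sum(axis=1, keepdims=True)
    pb = pab.sum(axis=0, keepdims=True)
    mask = pab > 0
    return float(np.sum(pab[mask] * np.log2(pab[mask] / (pa @ pb)[mask])))

rng = np.random.default_rng(4)
px = rng.dirichlet(np.ones(3))
p_y_given_x = rng.dirichlet(np.ones(6), size=3)
pxy = px[:, None] * p_y_given_x              # p(x, y), y in {0,...,5}

# Z = g(Y): merge the six Y-symbols into three bins {0,1}, {2,3}, {4,5}.
pxz = pxy.reshape(3, 3, 2).sum(axis=2)       # p(x, z)

print("I(X;Y)    =", mi(pxy))
print("I(X;g(Y)) =", mi(pxz))                # never larger, by the DPI
assert mi(pxy) >= mi(pxz) - 1e-12
```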

  1. Sufficient statistics: If $T(Y)$ is a sufficient statistic for $X$, then $I(X; Y) = I(X; T(Y))$; no information is lost.

  2. Repeated processing: For any chain $X \to Y_1 \to Y_2 \to \cdots \to Y_n$: $I(X; Y_1) \geq I(X; Y_2) \geq \cdots \geq I(X; Y_n)$ (a numerical sketch follows this list).
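The following sketch (illustrative random channels, assuming NumPy) demonstrates the repeated-processing property: pushing the observation through successive channels produces a non-increasing sequence of mutual informations.

```python
# Sketch of the repeated-processing property: passing Y through a
# sequence of noisy channels can only shrink I(X; Y_k).
import numpy as np

def mi(pab):
    pa = pab.sum(axis=1, keepdims=True)
    pb = pab.sum(axis=0, keepdims=True)
    mask = pab > 0
    return float(np.sum(pab[mask] * np.log2(pab[mask] / (pa @ pb)[mask])))

rng = np.random.default_rng(5)
n = 4
px = rng.dirichlet(np.ones(n))
pxy = px[:, None] * rng.dirichlet(np.ones(n), size=n)   # p(x, y_1)

values = [mi(pxy)]
for _ in range(4):                                       # Y_1 -> Y_2 -> ... -> Y_5
    channel = rng.dirichlet(np.ones(n), size=n)          # p(y_{k+1} | y_k)
    pxy = pxy @ channel                                  # p(x, y_{k+1})
    values.append(mi(pxy))

print(values)                                            # non-increasing sequence
assert all(a >= b - 1e-12 for a, b in zip(values, values[1:]))
```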

  • The inequality extends to continuous random variables.
  • There’s a strengthened version involving the contraction coefficient.
  • Related: Fano’s inequality provides a lower bound on error probability.

The data processing inequality was implicit in Shannon’s 1948 paper but was formalized later. It’s sometimes called the “no free lunch theorem of information theory.”

  • Cover & Thomas, Elements of Information Theory, Theorem 2.8.1
  • Shannon (1948), implicitly in channel coding theorem