
Proof of the Data Processing Inequality

[!abstract] Data Processing Inequality
If $X \to Y \to Z$ forms a Markov chain (i.e., $X$ and $Z$ are conditionally independent given $Y$), then:

$$I(X; Y) \geq I(X; Z)$$

with equality if and only if $X \to Z \to Y$ also forms a Markov chain.

In words: Processing data can only destroy information, never create it.
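Before the proof, a quick numerical sanity check can make the statement concrete. The following is a minimal sketch (not part of the original note), assuming NumPy and using randomly generated distributions: it builds a small Markov chain $X \to Y \to Z$ from a prior $p(x)$ and two channels, then compares $I(X;Y)$ with $I(X;Z)$.

```python
# Numerical sanity check of the DPI on a randomly generated Markov chain
# X -> Y -> Z. All distributions below are made up for illustration; the
# helper `mutual_information` is not from the original note.
import numpy as np

def mutual_information(pxy):
    """I(X;Y) in bits, computed from a joint probability table pxy[x, y]."""
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y)
    mask = pxy > 0                        # avoid log(0)
    return float(np.sum(pxy[mask] * np.log2(pxy[mask] / (px @ py)[mask])))

rng = np.random.default_rng(0)
nx, ny, nz = 4, 3, 5

px = rng.dirichlet(np.ones(nx))                    # p(x)
p_y_given_x = rng.dirichlet(np.ones(ny), size=nx)  # rows: p(y|x)
p_z_given_y = rng.dirichlet(np.ones(nz), size=ny)  # rows: p(z|y)

pxy = px[:, None] * p_y_given_x                    # p(x, y)
pxz = pxy @ p_z_given_y                            # p(x, z) = sum_y p(x,y) p(z|y)

print("I(X;Y) =", mutual_information(pxy))
print("I(X;Z) =", mutual_information(pxz))
assert mutual_information(pxy) >= mutual_information(pxz) - 1e-12
```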

We’ll use the chain rule for mutual information and properties of conditional mutual information.

Lemma 1: Chain Rule for Mutual Information


Statement: $I(X; Y, Z) = I(X; Z) + I(X; Y | Z)$

Proof:

$$I(X; Y, Z) = H(X) - H(X|Y,Z)$$
$$I(X; Z) + I(X; Y|Z) = H(X) - H(X|Z) + H(X|Z) - H(X|Y,Z) = H(X) - H(X|Y,Z) \quad \checkmark$$
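As a quick check of the identity, here is a small NumPy sketch (illustrative helper names, random joint distribution) that verifies the chain rule on an arbitrary $p(x, y, z)$; no Markov assumption is needed for Lemma 1.

```python
# Quick numerical check of Lemma 1 on an arbitrary joint distribution p(x,y,z).
import numpy as np

def mi(pab):
    """I(A;B) in bits from a joint table pab[a, b]."""
    pa = pab.sum(axis=1, keepdims=True)
    pb = pab.sum(axis=0, keepdims=True)
    mask = pab > 0
    return float(np.sum(pab[mask] * np.log2(pab[mask] / (pa @ pb)[mask])))

rng = np.random.default_rng(1)
pxyz = rng.dirichlet(np.ones(3 * 4 * 2)).reshape(3, 4, 2)  # random p(x,y,z)

# Left-hand side: I(X; (Y,Z)) -- treat the pair (Y,Z) as a single variable.
lhs = mi(pxyz.reshape(3, -1))

# Right-hand side: I(X;Z) + I(X;Y|Z).
pxz = pxyz.sum(axis=1)                      # marginal p(x, z)
pz = pxz.sum(axis=0)                        # marginal p(z)
i_xz = mi(pxz)
i_xy_given_z = sum(pz[z] * mi(pxyz[:, :, z] / pz[z]) for z in range(pxyz.shape[2]))

print(lhs, i_xz + i_xy_given_z)             # agree to numerical precision
assert abs(lhs - (i_xz + i_xy_given_z)) < 1e-10
```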

Lemma 2: Markov Chains and Conditional Mutual Information

Statement: $X \to Y \to Z$ is Markov if and only if $I(X; Z | Y) = 0$

Proof: $X \to Y \to Z$ means $p(x, z | y) = p(x|y)\,p(z|y)$ for every $y$ with $p(y) > 0$. Thus $X \perp Z \mid Y$: conditioning on $Y$ makes $X$ and $Z$ independent, hence $I(X; Z | Y) = 0$. Conversely, $I(X; Z | Y) = 0$ implies $p(x, z | y) = p(x|y)\,p(z|y)$ wherever $p(y) > 0$, because conditional mutual information vanishes only under conditional independence, so the chain is Markov.
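A short numerical illustration of the forward direction (again a sketch with made-up distributions, assuming NumPy): build the joint as $p(x)\,p(y|x)\,p(z|y)$ and compute $I(X; Z | Y)$ directly; it comes out as zero up to round-off.

```python
# Checking Lemma 2 numerically: for a joint built as p(x) p(y|x) p(z|y),
# the conditional mutual information I(X;Z|Y) is zero (up to round-off).
import numpy as np

def mi(pab):
    pa = pab.sum(axis=1, keepdims=True)
    pb = pab.sum(axis=0, keepdims=True)
    mask = pab > 0
    return float(np.sum(pab[mask] * np.log2(pab[mask] / (pa @ pb)[mask])))

rng = np.random.default_rng(2)
px = rng.dirichlet(np.ones(3))
p_y_given_x = rng.dirichlet(np.ones(4), size=3)
p_z_given_y = rng.dirichlet(np.ones(2), size=4)

# Markov joint: p(x, y, z) = p(x) p(y|x) p(z|y)
pxyz = px[:, None, None] * p_y_given_x[:, :, None] * p_z_given_y[None, :, :]

py = pxyz.sum(axis=(0, 2))                  # marginal p(y)
i_xz_given_y = sum(py[y] * mi(pxyz[:, y, :] / py[y]) for y in range(4))
print("I(X;Z|Y) =", i_xz_given_y)           # ~0, as Lemma 2 predicts
```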

Proof of the Theorem

Assume $X \to Y \to Z$ is a Markov chain.

Apply the chain rule two ways:

First way: $I(X; Y, Z) = I(X; Z) + I(X; Y | Z)$

Second way: $I(X; Y, Z) = I(X; Y) + I(X; Z | Y)$

Since $X \to Y \to Z$ is Markov, by Lemma 2: $I(X; Z | Y) = 0$.

From the second expansion: $I(X; Y, Z) = I(X; Y) + 0 = I(X; Y)$

From the first expansion: $I(X; Y) = I(X; Z) + I(X; Y | Z)$

Since conditional mutual information is non-negative, $I(X; Y | Z) \geq 0$, and therefore:

$$I(X; Y) \geq I(X; Z)$$

Equality holds if and only if $I(X; Y | Z) = 0$, which by Lemma 2 (with the roles of $Y$ and $Z$ interchanged) is exactly the condition that $X \to Z \to Y$ is also a Markov chain.

$\square$
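To see the equality case concretely, here is a small sketch (illustrative only, assuming NumPy) in which $Z$ is an invertible relabelling of $Y$, so $X \to Z \to Y$ is also Markov and no information is lost.

```python
# Illustrating the equality case: if Z is an invertible (relabelling)
# function of Y, then X -> Z -> Y is also Markov and I(X;Y) = I(X;Z).
import numpy as np

def mi(pab):
    pa = pab.sum(axis=1, keepdims=True)
    pb = pab.sum(axis=0, keepdims=True)
    mask = pab > 0
    return float(np.sum(pab[mask] * np.log2(pab[mask] / (pa @ pb)[mask])))

rng = np.random.default_rng(3)
px = rng.dirichlet(np.ones(3))
p_y_given_x = rng.dirichlet(np.ones(4), size=3)
pxy = px[:, None] * p_y_given_x              # p(x, y)

perm = np.array([2, 0, 3, 1])                # invertible map g: y -> z
pxz = pxy[:, np.argsort(perm)]               # p(x, z) with z = g(y)

print("I(X;Y) =", mi(pxy))
print("I(X;Z) =", mi(pxz))                   # identical: no information lost
assert abs(mi(pxy) - mi(pxz)) < 1e-12
```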

This theorem has profound implications:

  1. No algorithm can extract more information about $X$ from $Z$ than was present in $Y$. If $Y$ is a lossy compression of $X$, and $Z$ is computed from $Y$, then $Z$ carries even less information about $X$ (see the sketch after this list).

  2. For the dissertation: Meme transmission is a Markov chain: $\text{Source idea} \to \text{Encoding} \to \text{Received idea}$. The DPI says the receiver can never have more information about the source than was in the transmitted message.

  3. Communication bound: This is why channel capacity matters—it limits how much information can traverse the channel.
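The sketch below (illustrative, assuming NumPy) makes implication 1 concrete: deterministically coarse-graining $Y$ into fewer symbols via a hypothetical map $Z = g(Y)$ can only reduce the information available about $X$.

```python
# Sketch for implication 1: any deterministic post-processing g of Y
# (here, merging the symbols of Y into coarser bins) cannot increase the
# information available about X.
import numpy as np

def mi(pab):
    pa = pab.sum(axis=1, keepdims=True)
    pb = pab.sum(axis=0, keepdims=True)
    mask = pab > 0
    return float(np.sum(pab[mask] * np.log2(pab[mask] / (pa @ pb)[mask])))

rng = np.random.default_rng(4)
px = rng.dirichlet(np.ones(3))
p_y_given_x = rng.dirichlet(np.ones(6), size=3)
pxy = px[:, None] * p_y_given_x              # p(x, y), y in {0,...,5}

# Z = g(Y): merge the six Y-symbols into three bins {0,1}, {2,3}, {4,5}.
pxz = pxy.reshape(3, 3, 2).sum(axis=2)       # p(x, z)

print("I(X;Y)    =", mi(pxy))
print("I(X;g(Y)) =", mi(pxz))                # never larger, by the DPI
assert mi(pxy) >= mi(pxz) - 1e-12
```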

  1. Sufficient statistics: If $T(Y)$ is a sufficient statistic for $X$, then $I(X; Y) = I(X; T(Y))$; no information is lost.

  2. Repeated processing: For any chain $X \to Y_1 \to Y_2 \to \cdots \to Y_n$: $I(X; Y_1) \geq I(X; Y_2) \geq \cdots \geq I(X; Y_n)$ (a numerical sketch follows this list).
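The following sketch (illustrative random channels, assuming NumPy) demonstrates the repeated-processing property: pushing the observation through successive channels produces a non-increasing sequence of mutual informations.

```python
# Sketch of the repeated-processing property: passing Y through a
# sequence of noisy channels can only shrink I(X; Y_k).
import numpy as np

def mi(pab):
    pa = pab.sum(axis=1, keepdims=True)
    pb = pab.sum(axis=0, keepdims=True)
    mask = pab > 0
    return float(np.sum(pab[mask] * np.log2(pab[mask] / (pa @ pb)[mask])))

rng = np.random.default_rng(5)
n = 4
px = rng.dirichlet(np.ones(n))
pxy = px[:, None] * rng.dirichlet(np.ones(n), size=n)   # p(x, y_1)

values = [mi(pxy)]
for _ in range(4):                                       # Y_1 -> Y_2 -> ... -> Y_5
    channel = rng.dirichlet(np.ones(n), size=n)          # p(y_{k+1} | y_k)
    pxy = pxy @ channel                                  # p(x, y_{k+1})
    values.append(mi(pxy))

print(values)                                            # non-increasing sequence
assert all(a >= b - 1e-12 for a, b in zip(values, values[1:]))
```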

  • The inequality extends to continuous random variables.
  • There’s a strengthened version involving the contraction coefficient.
  • Related: Fano’s inequality provides a lower bound on error probability.

The data processing inequality was implicit in Shannon’s 1948 paper but was formalized later. It’s sometimes called the “no free lunch theorem of information theory.”

  • Cover & Thomas, Elements of Information Theory, Theorem 2.8.1
  • Shannon (1948), implicitly in channel coding theorem