Information Theory – Chapter 3: Important Inequality

Information Theory – Chapter 3: Important Inequality

Data Processing Inequality

Let X\rightarrow Y\rightarrow Z form a Markov chain if P_{XYZ}(x,y,z)=P_X(x)P_{Y|X}(y|x)P_{Z|Y}(z|y).

Equivalently, Z and X are conditional independent given Y.

Theorem 1(Data processing inequality): If X\rightarrow Y\rightarrow Z, then I(X;Y)\ge I(X, Z). Equality holds if and only if I(X;Y|Z)=0.

Proof: I(X;Y,Z)=I(X;Z)+I(X;Y|Z)=I(X;Y)+I(X;Z|Y)=I(X;Y)
Since I(X;Y|Z)\ge 0,I(X;Z|Y)=0,
\Rightarrow I(X;Z)\le I(X;Y).

Fano’s Inequality

Given RV Y, estimate RV X with H(X|Y)>0. For X\sim P_X(x),x\in\mathcal{X}. Observe Y, which is related X by P_{Y|X}(y|x). And then compute \hat{X}=g(Y) where g:\mathcal{Y}\rightarrow \hat{\mathcal{X}}. The goal is to bound the error probability P_e\triangleq Pr(X\neq g(Y)).

Theorem(Fano’s inequality): For any estimate \hat X such that X\rightarrow Y\rightarrow \hat{X}, the error probability P_e satisfies that

h(P_e)+P_e\log\vert\mathcal{X}\vert\ge H(X\vert\hat X)\ge H(X\vert Y).

And another weaker version is

1+P_e\log\vert\mathcal{X}\vert\ge H(X\vert Y) \Leftrightarrow P_e\ge \frac{H(X\vert Y)-1}{\log\vert\mathcal{X}\vert}

Proof: For H(X\vert\hat X)\ge H(X\vert Y), just need to observe I(X;\hat X)=H(X)-H(X\vert \hat X)\le I(X;Y)=H(X)-H(X\vert Y).

Let E=\textbf{1}(\hat X\neq X) be the RV.

H(E,X\vert \hat X)=H(X\vert \hat X)+H(E\vert \hat X,X)=H(E\vert \hat X)+H(X\vert E,\hat X).

With H(E\vert \hat X,X)=0,H(E\vert \hat X)\le H(E)=h(P_e),

H(X\vert E,\hat X)=P(E=0)H(X\vert E=0,\hat X)+P(E=1)H(X\vert E=1,\hat X)=P(E=1)H(X\vert E=1,\hat X)=P(E=1)H(X\vert X\neq\hat X)\le P_e\log\vert \mathcal{X}\vert.

\Rightarrow H(E,X\vert\hat X)=H(X\vert\hat X)\le h(P_e)+P_e\log\vert \mathcal{X}\vert

\Rightarrow H(X\vert \hat X)\le h(P_e)+P_e\log\vert \mathcal{X}\vert.

Corollary 1: For any RVs X and Y, let p=P(Y\neq X). Take \hat X=Y, then H(X\vert Y)\le h(p)+p\log\vert\mathcal{X}\vert.

Corollary 2: Assume \hat X=X, i.e., g:\mathcal{Y}\rightarrow \mathcal{X}, then H(X\vert Y)\le h(p)+p\log(\vert \mathcal{X}\vert-1).

Remark: Fano’s inequality is sharp.(i.e. the equality can be achieved).

Lemma 1: Let X,X^\prime be i.i.d with H(X). Then Pr(X=X^\prime)\ge 2^{-H(X)}.

Proof: Let X\sim P_X(x), 2^{-H(X)}=2^{\mathbb{E}\log P_X(x)}\le \mathbb{E}2^{\log P_X(x)}=\mathbb{E}P_X(x)=\sum_{x\in\mathcal{X}}P_X(x)=Pr(X=X^\prime).

For equality, when \log P_X(x) is constant \Leftrightarrow P_X(x)=2^{\text{constant}}.

Corollary 3: Let X be independent with X^\prime;P_X(x)=P(x),x\in\mathcal{X};P_{X^\prime}(x)=Q(x),x\in\mathcal{X}. Then Pr(X=X^\prime)\ge \max(2^{-H(P)-D(P\Vert Q)},2^{-H(Q)-D(Q\Vert P)}).

Proof: $2^{-H(P)-D(P\Vert Q)}=2^{\mathbb{E}X\log P(x)-\mathbb{E}_X\log\frac{P(x)}{Q(x)}}=2^{\mathbb{E}_X\log(Q(x))}\le\mathbb{E}_X2^{\log Q(X)}=\mathbb{E}_XQ(x)=\sum{x\in\mathcal{X}}P(x)Q(x)=Pr(X=X^\prime)$.