Information Theory – Chapter 2: Properties of Entropy

In this chapter, we further explore the properties of entropy.

Theorem (Jensen’s inequality): Let f be a convex function and let X be a discrete random variable taking values in \mathcal{X}=\{x_1, x_2, \cdots, x_n\} with P_X=(p_1, \cdots, p_n). Then

\sum_{i=1}^n p_if(x_i)\ge f(\sum_{i=1}^np_ix_i).

This could also be written as

\mathbb{E}[f(X)]\ge f(\mathbb{E}[X]).

If f is strictly convex, equality holds iff X is constant, i.e., X=\mathbb{E}[X] with probability 1.
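
As a quick numerical illustration, here is a minimal Python sketch; the convex function f(t)=t^2 and the PMF below are arbitrary choices of ours:

```python
import numpy as np

# Illustrative convex function and PMF (arbitrary choices for this sketch).
f = lambda t: t ** 2
x = np.array([1.0, 2.0, 3.0, 4.0])   # support of X
p = np.array([0.1, 0.2, 0.3, 0.4])   # P_X

lhs = np.sum(p * f(x))   # E[f(X)] = 10.0
rhs = f(np.sum(p * x))   # f(E[X]) = 9.0
print(lhs, rhs, lhs >= rhs)
```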

Theorem (Non-negativity of relative entropy): Let P(x), Q(x) be distributions on \mathcal{X}. Then D(P||Q)\ge 0, with equality iff P(x)=Q(x) for all x\in \mathcal{X}.
Proof: Let

A=\{x:P(x)>0\},

we may assume Q(x)>0 for all x\in A (otherwise Q(x)=0 for some x\in A, so D(P||Q)=\infty and the inequality holds trivially).
Then,

\begin{aligned}
-D(P||Q)&=-\sum_{x\in A}P(x)\log\frac{P(x)}{Q(x)}\\
&=\sum_{x\in A}P(x)\log\frac{Q(x)}{P(x)}\\
&\le \log \sum_{x\in A} P(x) \frac{Q(x)}{P(x)}\\
&=\log\sum_{x\in A}Q(x)\\
&\le\log\sum_{x\in \mathcal{X}}Q(x)\\
&=0
\end{aligned}

Since f(x)=\log x is strictly concave, equality holds iff \frac{Q(x)}{P(x)}=C for all x\in \mathcal{X}. As

1=\sum_{x\in \mathcal{X}} Q(x)=C\sum_{x\in \mathcal{X}} P(x)=C \Rightarrow P(x)=Q(x).∎
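
A small sketch in Python that checks both the inequality and the equality case (the distributions are arbitrary illustrative choices):

```python
import numpy as np

def kl_divergence(p, q):
    """D(P||Q) in bits; assumes Q(x) > 0 wherever P(x) > 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = np.array([0.5, 0.25, 0.25])
q = np.array([0.4, 0.3, 0.3])
print(kl_divergence(p, q))  # strictly positive, since P != Q
print(kl_divergence(p, p))  # 0.0, the equality case P = Q
```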

Corollary: I(X;Y)\ge 0 with equality iff X and Y are independent.
Proof:

I(X;Y)=D(P_{X, Y}||P_XP_Y)\ge 0,

with equality iff P_{X, Y}=P_XP_Y, i.e., X and Y are independent. ∎

Corollary: D(P_{Y|X}||Q_{Y|X})\ge 0 with equality iff P_{Y|X}(y|x)=Q_{Y|X}(y|x) for all (y, x)\in \mathcal{Y}\times \mathcal{X} s.t. P_X(x)>0.

Corollary: I(X;Y|Z)\ge 0 with equality iff P_{XY|Z}(x, y|z)=P_{X|Z}(x|z)P_{Y|Z}(y|z), i.e., X and Y are conditionally independent given Z.

Theorem: H(X)\le \log |\mathcal{X}| with equality iff X is uniformly distributed over \mathcal{X}.
Proof:
Let U(x)=\frac{1}{|\mathcal{X}|} for x\in \mathcal{X}.

\begin{aligned}
D(P_X||U)&=\sum_{x\in \mathcal{X}} P_X(x)\log\frac{P_X(x)}{U(x)}\\
&=\log |\mathcal{X}|+\sum_{x\in \mathcal{X}} P_X(x)\log P_X(x)\\
&=\log |\mathcal{X}|-H(X)\ge 0.∎
\end{aligned}

Theorem: H(X|Y)\le H(X).
Proof: H(X)-H(X|Y)=I(X;Y)\ge 0.∎

However, note that conditioning on a specific event may or may not decrease entropy:

H(X|Y=y)\ ?\ H(X)
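
For a concrete illustration (this particular joint PMF is an arbitrary choice, echoing the classic textbook example), the following Python sketch shows both directions: conditioning on Y=1 removes all uncertainty about X, while conditioning on Y=2 increases it; on average, though, H(X|Y)\le H(X) as the theorem guarantees.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a PMF given as an array."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Joint PMF P(X, Y): rows index x in {1, 2}, columns index y in {1, 2}.
P = np.array([[0.0, 1/8],
              [3/4, 1/8]])

p_x = P.sum(axis=1)
print(entropy(p_x))                     # H(X) ~ 0.544 bits

for j in range(2):
    p_x_given_y = P[:, j] / P[:, j].sum()
    print(j + 1, entropy(p_x_given_y))  # H(X|Y=1) = 0, H(X|Y=2) = 1 > H(X)

# On average, however, conditioning cannot increase entropy:
h_x_given_y = sum(P[:, j].sum() * entropy(P[:, j] / P[:, j].sum())
                  for j in range(2))
print(h_x_given_y)                      # H(X|Y) = 0.25 <= H(X)
```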

Theorem (Independence bound): H(X_1, \cdots, X_n)\le \sum_{i=1}^nH(X_i) with equality iff X_1, \cdots, X_n are independent.

Theorem (Han’s inequality):

H(X_1, X_2, \cdots, X_n)\le \frac{1}{n-1}\sum_{i=1}^nH(X_1, \cdots, X_{i-1}, X_{i+1}, \cdots, X_n).

Proof: For every i,

\begin{aligned}
H(X_1, X_2, \cdots, X_n)&=H(X_1, \cdots, X_{i-1}, X_{i+1}, \cdots, X_n)\\
&+H(X_i|X_1, \cdots, X_{i-1}, X_{i+1}, \cdots, X_n).
\end{aligned}

So

\begin{aligned}
\sum_{i=1}^nH(X_1, X_2, \cdots, X_n)&=\sum_{i=1}^n H(X_1, \cdots, X_{i-1}, X_{i+1}, \cdots, X_n) \\
&+ \sum_{i=1}^n H(X_i|X_1, \cdots, X_{i-1}, X_{i+1}, \cdots, X_n) \\
&\le \sum_{i=1}^nH(X_1, \cdots, X_{i-1}, X_{i+1}, \cdots, X_n) \\
& + \sum_{i=1}^nH(X_i|X_1, \cdots, X_{i-1}) \\
&= \sum_{i=1}^nH(X_1, \cdots, X_{i-1}, X_{i+1}, \cdots, X_n) + H(X_1,\cdots, X_n),
\end{aligned}

where the inequality holds because conditioning reduces entropy and the last equality is the chain rule. Rearranging,

(n-1)H(X_1, \cdots, X_n)\le \sum_{i=1}^nH(X_1, \cdots, X_{i-1}, X_{i+1}, \cdots, X_n). ∎
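
The inequality is easy to check numerically; here is a Python sketch that verifies it for a random joint PMF of three binary variables (the distribution and names are illustrative):

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def marginal_entropy(P, keep):
    """H(X_T) for T = keep: entropy of the marginal over the kept axes."""
    drop = tuple(i for i in range(P.ndim) if i not in keep)
    return entropy(np.ravel(P.sum(axis=drop)))

rng = np.random.default_rng(0)
n = 3
P = rng.random((2,) * n)   # random joint PMF of three binary variables
P /= P.sum()

h_full = marginal_entropy(P, set(range(n)))
h_loo = sum(marginal_entropy(P, set(range(n)) - {i}) for i in range(n))
print(h_full, h_loo / (n - 1), h_full <= h_loo / (n - 1))
```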

We can also prove Han’s inequality by another method, via submodularity.

Definition (Submodular functions): Let

S=\{1, 2, \cdots, n\}, f:2^S\rightarrow \mathbb{R}.

The function f is called submodular if

\forall T_1, T_2\subset S, f(T_1\cup T_2)+f(T_1\cap T_2)\le f(T_1)+f(T_2).

This definition is equivalent to: for all T_1\subset T_2\subset S and b\in S-T_2,

f(T_2\cup\{b\})-f(T_2)\le f(T_1\cup \{b\})-f(T_1),

which means adding an element brings diminishing marginal returns.

Theorem: Let X=(X_1, \cdots, X_n) be a discrete random vector. For

T\subset\{1, \cdots, n\},

write X_T=(X_i)_{i\in T}. Then the function T\mapsto H(X_T) is submodular.

With this theorem, we can derive Han’s inequality more conveniently.

Denote

\begin{aligned}
T_1=\{1, \cdots, i-1\}, T_2=\{1, \cdots, i-1, i+1, \cdots, n\}, b=i.
\end{aligned}

Using the equivalent definition,

\begin{aligned}
H(X_{T_2\cup\{i\}})-H(X_{T_2}) &\le H(X_{T_1\cup \{i\}})-H(X_{T_1})\\
\Rightarrow H(X_1, \cdots, X_n) - H(X_1,\cdots, X_{i-1}, X_{i+1}, \cdots, X_n) &\le H(X_1, \cdots, X_i)-H(X_1, \cdots, X_{i-1})\\
&=H(X_i|X_1, \cdots, X_{i-1}).
\end{aligned}

Summing over i=1, 2, \cdots, n, we obtain Han’s inequality.
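
Since the theorem guarantees submodularity, we can check it exhaustively for a small random example (a Python sketch; the distribution is an arbitrary choice):

```python
import itertools
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def h_subset(P, T):
    """H(X_T): entropy of the marginal of the joint PMF P over the axes in T."""
    drop = tuple(i for i in range(P.ndim) if i not in T)
    return entropy(np.ravel(P.sum(axis=drop)))

rng = np.random.default_rng(1)
n = 3
P = rng.random((2,) * n)
P /= P.sum()

S = range(n)
subsets = [set(c) for r in range(n + 1) for c in itertools.combinations(S, r)]
for T1 in subsets:
    for T2 in subsets:
        assert h_subset(P, T1 | T2) + h_subset(P, T1 & T2) \
               <= h_subset(P, T1) + h_subset(P, T2) + 1e-12
print("submodularity holds for all pairs of subsets")
```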

Theorem (Log-sum inequality): Let a_i\ge 0, b_i\ge 0, i=1, 2, \cdots, n, with the conventions 0\log 0=0 and a\log\frac{a}{0}=\infty for a>0. Let a=\sum_{i=1}^n a_i, b=\sum_{i=1}^n b_i. Then

\sum_{i=1}^n a_i\log\frac{a_i}{b_i}\ge a\log\frac{a}{b}.

Proof: Apply Jensen’s inequality to the convex function f(t)=t\log t with weights \frac{b_i}{b}:

\begin{aligned}
\sum_{i=1}^n a_i \log \frac{a_i}{b_i} &= \sum_{i=1}^n b_i \frac{a_i}{b_i}\log\frac{a_i}{b_i}\\
&=b\sum_{i=1}^n \frac{b_i}{b} \frac{a_i}{b_i} \log\frac{a_i}{b_i}\\
&\ge b(\sum_{i=1}^n \frac{b_i}{b}\frac{a_i}{b_i})\log(\sum_{i=1}^n\frac{b_i}{b}\frac{a_i}{b_i})\\
&=b \frac{a}{b}\log\frac{a}{b}\\
&=a\log\frac{a}{b}. ∎
\end{aligned}
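
A quick numerical sketch (the vectors a and b are random illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
a = rng.random(5)   # arbitrary nonnegative a_i
b = rng.random(5)   # arbitrary positive b_i

lhs = np.sum(a * np.log2(a / b))
rhs = a.sum() * np.log2(a.sum() / b.sum())
print(lhs, rhs, lhs >= rhs)
```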

Theorem (Convexity of D): Let (P_1, Q_1) and (P_2, Q_2) be pairs of PMFs on \mathcal{X} and \lambda\in[0, 1]. Then D(P||Q) is convex in the pair (P, Q), i.e., D(\lambda P_1+(1-\lambda)P_2||\lambda Q_1+(1-\lambda)Q_2)\le \lambda D(P_1||Q_1) + (1-\lambda) D(P_2||Q_2).

Proof: Apply the log-sum inequality termwise. Fix x\in \mathcal{X} and take a_1=\lambda P_1(x), a_2=(1-\lambda)P_2(x), b_1=\lambda Q_1(x), b_2=(1-\lambda)Q_2(x):

(\lambda P_1(x)+(1-\lambda)P_2(x))\log\frac{\lambda P_1(x)+(1-\lambda)P_2(x)}{\lambda Q_1(x)+(1-\lambda)Q_2(x)}\le \lambda P_1(x)\log\frac{P_1(x)}{Q_1(x)}+(1-\lambda) P_2(x)\log\frac{P_2(x)}{Q_2(x)}.

Summing over x\in \mathcal{X} yields the desired result. ∎
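
The following sketch checks convexity along a segment between two randomly drawn pairs of PMFs (all distributions are illustrative):

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def random_pmf(rng, k):
    p = rng.random(k)
    return p / p.sum()

rng = np.random.default_rng(3)
P1, P2 = random_pmf(rng, 4), random_pmf(rng, 4)
Q1, Q2 = random_pmf(rng, 4), random_pmf(rng, 4)

for lam in np.linspace(0, 1, 11):
    lhs = kl(lam * P1 + (1 - lam) * P2, lam * Q1 + (1 - lam) * Q2)
    rhs = lam * kl(P1, Q1) + (1 - lam) * kl(P2, Q2)
    assert lhs <= rhs + 1e-12
print("convexity of D holds along the segment")
```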

Theorem (Concavity of H): H(P) is a concave function of P.
Proof:

H(P)=\log|\mathcal{X}|-D(P||U),

where U is the uniform distribution on \mathcal{X}; indeed, D(P||U)=\sum_{x\in \mathcal{X}}P(x)\log\frac{P(x)}{1/|\mathcal{X}|}=\log|\mathcal{X}|-H(P). Since D(P||U) is convex in P (by the previous theorem, with both Q_1 and Q_2 fixed at U), H(P) is concave. ∎

Theorem: Given RVs X, Y with joint PMF P_{XY},
(1). I(X; Y) is concave in P_X for a fixed P_{Y|X}.
(2). I(X;Y) is convex in P_{Y|X} for a fixed P_X.
Proof:
(1). Note that

I(X;Y)=H(Y)-H(Y|X)=H(Y)-\sum_{x\in \mathcal{X}} P_X(x)H(Y|X=x).

For fixed P_{Y|X}, the second term is linear in P_X. Moreover,

P_Y(y)=\sum_{x\in \mathcal{X}} P_X(x)P_{Y|X}(y|x),

so P_Y is linear in P_X. Since H(Y) is concave in P_Y, it is also concave in P_X. Hence I(X;Y), a concave function minus a linear one, is concave in P_X when P_{Y|X} is fixed.

(2). Fix P_X.
Let P_1(y|x), P_2(y|x) be two possibilities for P_{Y|X}.
Denote P_i(x, y)=P_X(x)P_i(y|x), P_i(y)=\sum_{x\in \mathcal{X}}P_i(x, y).
Choose \lambda\in [0,1] and consider the mixture:

P_\lambda(y|x)=\lambda P_1(y|x)+(1-\lambda)P_2(y|x).

So

\begin{aligned}
P_\lambda(x, y)&=P_X(x)P_\lambda(y|x)=\lambda P_1(y|x)P_X(x)+(1-\lambda)P_2(y|x) P_X(x)=\lambda P_1(x, y)+(1-\lambda)P_2(x, y), \\
P_\lambda(y)&=\sum_{x\in \mathcal{X}} P_\lambda(x, y)=\lambda P_1(y) + (1-\lambda) P_2(y).
\end{aligned}

Let Q_\lambda(x, y)=P_X(x)P_\lambda(y)=\lambda P_1(y)P_X(x)+(1-\lambda)P_2(y)P_X(x)=\lambda Q_1(x,y)+(1-\lambda)Q_2(x, y).

So, by the convexity of D,

D(P_\lambda(x,y)||Q_\lambda(x, y))\le \lambda D(P_1(x, y)||P_X(x)P_1(y))+(1-\lambda)D(P_2(x, y)||P_X(x)P_2(y)).

The left side is I(X;Y) under P_\lambda, and the two terms on the right are I(X;Y) under P_1 and P_2, so I(X;Y) is convex in P_{Y|X} for fixed P_X. ∎
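
A numerical sketch of part (2), checking convexity in the channel P_{Y|X} for a fixed input distribution (the helper functions and distributions below are our illustrative choices):

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(p_x, W):
    """I(X;Y) = H(Y) - sum_x p(x) H(Y|X=x); row x of W is P_{Y|X}(.|x)."""
    p_y = p_x @ W
    return entropy(p_y) - np.sum(p_x * np.array([entropy(row) for row in W]))

def random_channel(rng, nx, ny):
    W = rng.random((nx, ny))
    return W / W.sum(axis=1, keepdims=True)

rng = np.random.default_rng(4)
p_x = rng.random(3)
p_x /= p_x.sum()
W1, W2 = random_channel(rng, 3, 4), random_channel(rng, 3, 4)

for lam in np.linspace(0, 1, 11):
    lhs = mutual_information(p_x, lam * W1 + (1 - lam) * W2)
    rhs = lam * mutual_information(p_x, W1) + (1 - lam) * mutual_information(p_x, W2)
    assert lhs <= rhs + 1e-12   # convex in P_{Y|X} for fixed P_X
print("convexity in P_{Y|X} holds along the segment")
```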
