Information Theory – Chapter X Lempel-Ziv coding
This chapter discusses LZ78 and its asymptotic optimality.
LZ78: Move along the sequence; whenever the segment observed since the last reset is not yet in the dictionary, add it to the dictionary and encode it as (\text{pointer to its prefix}, \text{last symbol}), then reset. An example is shown below:
For the 0-1 string 0000001101010000011010, first parse the string into phrases: (0, 00, 000, 1, 10, 101, 0000, 01, 1010). Then encode each phrase as a pair: the first part is the index of its prefix (0 if the phrase has no prefix), written here as a 4-bit binary number, and the second part is the last digit. The encoding result is shown below:
phrase | (prefix index, last digit) |
---|---|
0 | (0000, 0) |
00 | (0001, 0) |
000 | (0010, 0) |
1 | (0000, 1) |
10 | (0100, 0) |
101 | (0101, 1) |
0000 | (0011, 0) |
01 | (0001, 1) |
1010 | (0110, 0) |
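A minimal sketch of this encoder in Python (the function name lz78_encode and the fixed 4-bit index field are choices made for this illustration; a practical coder would let the index width grow with the dictionary):

```python
def lz78_encode(s, index_bits=4):
    """LZ78-parse s into phrases and emit, for each phrase,
    (prefix index as a binary string, last symbol). Index 0 means no prefix."""
    dictionary = {}                # phrase -> 1-based index
    phrases, codewords = [], []
    current = ""
    for symbol in s:
        current += symbol
        if current not in dictionary:                  # new phrase found
            dictionary[current] = len(dictionary) + 1
            prefix_index = dictionary.get(current[:-1], 0)
            phrases.append(current)
            codewords.append((format(prefix_index, f"0{index_bits}b"), symbol))
            current = ""                               # reset and continue
    return phrases, codewords                          # trailing partial phrase ignored

phrases, codewords = lz78_encode("0000001101010000011010")
print(phrases)    # ['0', '00', '000', '1', '10', '101', '0000', '01', '1010']
print(codewords)  # [('0000', '0'), ('0001', '0'), ('0010', '0'), ('0000', '1'), ...]
```

Running it on the example string reproduces the phrase list and the table above.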
Now we consider its performance.
The worst case: The number of phrases C(n) in a sequence of length n satisfies C(n)\le \frac{n}{(1-\epsilon_n)\log n}, where \epsilon_n\to 0 as n\to\infty.
Proof:
First, denote by n_k the sum of the lengths of all distinct binary strings of length at most k, so n_k=\sum_{j=1}^k j2^j=(k-1)2^{k+1}+2. Since the phrases are distinct, a sequence of length n_k can contain at most \sum_{j=1}^k 2^j phrases, so
\begin{aligned}
C(n_k)\le \sum_{j=1}^k2^j = 2^{k+1}-2&=\frac{(k-1)2^{k+1}-2(k-1)}{k-1} \\
&<\frac{(k-1)2^{k+1}+2}{k-1} = \frac{n_k}{k-1}.
\end{aligned}
Now let n_k\le n<n_{k+1}; more precisely, write n=n_k+m with 0\le m<n_{k+1}-n_k=(k+1)2^{k+1}. The extra m symbols form phrases of length at least k+1 (all shorter phrases have already been used), so they contribute at most \frac{m}{k+1} additional phrases. Hence
\begin{aligned}
C(n)&\le \frac{n_k}{k-1} + \frac{m}{k+1}\\
&\le\frac{n_k}{k-1} + \frac{m}{k-1}\\
&=\frac{n}{k-1}.
\end{aligned}
To bound k, note that
\begin{aligned}
&n\ge n_k\ge 2^k \Rightarrow k\le \log n, \\
&n< n_{k+1}=k2^{k+2}+2\le (k+2)2^{k+2} \le (\log n +2)2^{k+2},
\end{aligned}
\begin{aligned}
\Rightarrow k-1&\ge \log n - \log(\log n +2) -3 \\
&\ge (1-\epsilon_n)\log n,
\end{aligned}
where \epsilon_n = \frac{\log(\log n+2)+3}{\log n}\to 0 as n\to\infty.
So C(n)\le \frac{n}{k-1}\le \frac{n}{(1-\epsilon_n)\log n}. ∎
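As a sanity check of this bound (a sketch, not part of the original argument; \log is base 2 and \epsilon_n is the expression above), the snippet below parses the worst-case-style string formed by concatenating all binary strings of length at most k and compares its phrase count C(n) with n/((1-\epsilon_n)\log n):

```python
from itertools import product
from math import log2

def count_phrases(s):
    """Number of phrases in the LZ78 distinct parsing of s."""
    seen, current, count = set(), "", 0
    for symbol in s:
        current += symbol
        if current not in seen:
            seen.add(current)
            count += 1
            current = ""
    return count + (1 if current else 0)    # trailing partial phrase, if any

k = 6
# Concatenate every binary string of length 1, 2, ..., k.
s = "".join("".join(bits) for j in range(1, k + 1) for bits in product("01", repeat=j))
n = len(s)                                  # equals n_k = (k-1)*2**(k+1) + 2
eps = (log2(log2(n) + 2) + 3) / log2(n)
print(n, count_phrases(s), n / ((1 - eps) * log2(n)))   # C(n) stays below the bound
```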
For a random string, LZ78 asymptotically approaches the entropy.
Random case: Suppose X is a memoryless source with PMF P_X(i)=P_i, so that P(x^n)=\prod_{i=1}^nP_X(x_i)=\prod_{i=1}^n P_{x_i}. Parse x^n=(x_1, x_2, \cdots, x_n) into distinct phrases (x_1, x_2, \cdots, x_n)=(y_1, y_2, \cdots , y_C), so that P(x^n)=\prod_{i=1}^C P(y_i). We claim that \frac{C\log C}{n}\le H(P)+o(1).
Proof:
Let c_l be the number of phrases of length l among y_i, i=1, \cdots, C, so that \sum_l c_l=C and \sum_l l\,c_l=n.
Before going on, we first prove the following lemma.
Ziv’s Lemma: \log P(x^n)\le \sum_l c_l\log\frac{1}{c_l}=-\sum_l c_l\log c_l.
Proof of lemma:
We know that all y_i are distinct, so the c_l phrases of length l have total probability at most 1; by the AM-GM inequality their product is largest when each equals \frac{1}{c_l}. Hence
\begin{aligned}
&\sum_{y_i:|y_i|=l} P(y_i)\le 1 \\
\Rightarrow& \prod_{y_i:|y_i|=l} P(y_i)\le \left(\frac{1}{c_l}\right)^{c_l} \\
\Rightarrow& \log P(x^n)=\sum_l\sum_{y_i:|y_i|=l}\log P(y_i)\le \sum_{l}c_l\log \frac{1}{c_l}. ∎
\end{aligned}
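A quick numerical check of the lemma (a sketch assuming a Bernoulli(p) source and base-2 logarithms; the helper parse_distinct is the same distinct parsing used above, with any trailing partial phrase dropped):

```python
import random
from collections import Counter
from math import log2

def parse_distinct(s):
    """Return the LZ78 distinct-phrase parse of s (trailing partial phrase dropped)."""
    seen, current, phrases = set(), "", []
    for symbol in s:
        current += symbol
        if current not in seen:
            seen.add(current)
            phrases.append(current)
            current = ""
    return phrases

p, n = 0.3, 20000
x = "".join("1" if random.random() < p else "0" for _ in range(n))
phrases = parse_distinct(x)
parsed = "".join(phrases)                    # the prefix of x that was fully parsed

# Left-hand side: log P(parsed prefix) under the i.i.d. source.
log_p = sum(log2(p) if c == "1" else log2(1 - p) for c in parsed)

# Right-hand side: -sum_l c_l log c_l, with c_l = number of phrases of length l.
c = Counter(len(y) for y in phrases)
rhs = -sum(cl * log2(cl) for cl in c.values())

print(log_p <= rhs, log_p, rhs)              # Ziv's inequality should hold
```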
Returning to the main proof, suppose there are m_i occurrences of symbol i among x_1, \cdots, x_n. Then \log P(x^n)=\sum_i m_i\log P_i.
Invoking the LLN, m_i \approx nP_i, so \log P(x^n)\approx n\sum_i P_i\log P_i=-nH(P).
So, with Ziv’s lemma, nH(P)\approx -\log P(x^n)\ge \sum_l c_l\log c_l.
So we will have proved that LZ78 compresses down to the entropy once we show that \sum_l c_l\log c_l\ge C\log C-o(n), i.e., that \frac{C\log C}{n}\le H(P)+o(1).
We note that
\begin{aligned}
\sum_l c_l\log c_l&=\sum_l c_l\log C+\sum_l c_l\log\frac{c_l}{C}\\
&=C\log C-C\left(-\sum_l\frac{c_l}{C}\log\frac{c_l}{C}\right),
\end{aligned}
where -\sum_l\frac{c_l}{C}\log\frac{c_l}{C} is the entropy of the distribution Q=(q_l=\frac{c_l}{C},\ l=1, 2, \cdots).
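As a concrete check of this identity, the parse of the example string above has c_1=2, c_2=3, c_3=2, c_4=2, so C=9 and n=22, and indeed (values rounded)
\begin{aligned}
\sum_l c_l\log c_l &= 2\log 2+3\log 3+2\log 2+2\log 2 \approx 10.75,\\
C\log C - C\,H(Q) &= 9\log 9 - 9\,H\!\left(\tfrac{2}{9},\tfrac{3}{9},\tfrac{2}{9},\tfrac{2}{9}\right)\approx 28.53-17.77\approx 10.75.
\end{aligned}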
Consider the following maximization problem:
\begin{aligned}
\max\ &H(Q)\\
\text{s.t. } &\sum_l l\,c_l=n.
\end{aligned}
An exact solution can be found, but a very crude estimate suffices: any distribution Q on the positive integers with mean \sum_l l\,q_l=\frac{n}{C} has entropy H(Q)\le\log\frac{n}{C}+\log e (the maximizer is a geometric distribution). Combining this with the worst-case bound C\le\frac{n}{(1-\epsilon_n)\log n}, and noting that x\mapsto x\log\frac{n}{x} is increasing for x\le n/e, we get C\,H(Q)\le C\log\frac{n}{C}+C\log e=O\!\left(\frac{n\log\log n}{\log n}\right)=o(n), so
\begin{aligned}
nH(P)&\ge\sum_{l}c_l\log c_l=C\log C-C\,H(Q)\ge C\log C-o(n)\\
&\Rightarrow \frac{C\log C}{n}\le H(P) + o(1). ∎
\end{aligned}
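For instance, for the example parse above (C=9, n=22, so \frac{n}{C}\approx 2.44), H(Q)\approx 1.97, the exact maximum over distributions with this mean (a geometric Q) is about 2.39, and the crude bound gives \log\frac{n}{C}+\log e\approx 1.29+1.44=2.73, all consistent with the estimate.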
As C is the number of phrases and each phrase is encoded with roughly \log C bits for the prefix index plus one bit for the last symbol, the total code length is about C(\log C+1) bits. Since \frac{C}{n}\to 0, the per-symbol code length \frac{C(\log C+1)}{n} is asymptotically bounded by H(P), so asymptotic optimality holds.
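As an illustration of this conclusion (a sketch assuming a Bernoulli(p) source; count_phrases is the phrase counter sketched earlier), one can watch the per-symbol code length C(\log C+1)/n drift down toward H(P) as n grows; the convergence is known to be quite slow:

```python
import random
from math import log2

def count_phrases(s):
    """Number of phrases in the LZ78 distinct parsing of s."""
    seen, current, count = set(), "", 0
    for symbol in s:
        current += symbol
        if current not in seen:
            seen.add(current)
            count += 1
            current = ""
    return count + (1 if current else 0)    # trailing partial phrase, if any

p = 0.1
H = -p * log2(p) - (1 - p) * log2(1 - p)    # source entropy H(P)
for n in (10**3, 10**4, 10**5, 10**6):
    x = "".join("1" if random.random() < p else "0" for _ in range(n))
    C = count_phrases(x)
    print(n, round(C * (log2(C) + 1) / n, 3), round(H, 3))
```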