# Data Compression Algorithms

## Introduction

- data compression
  - process of converting an input data stream into an output data stream that has a smaller size
  - compression algorithm = encoding (compression) + decoding (decompression)
- compression
  - lossless – the restored and original data are identical
  - lossy – the restored data are a reasonable approximation of the original
- methods
  - static / adaptive
  - streaming / block
    - block – we compress the data by blocks
- goals – to save storage, to reduce the transmission bandwidth
- measuring the performance
  - units
    - input data size … $u$ bytes (uncompressed)
    - compressed data size … $k$ bytes, $K$ bits
    - compression ratio … $k/u$ (written as a percentage)
    - compression factor … $u:k$
    - compression gain … $(u-k)/u$ (written as a percentage)
    - average codeword length … $K/u$ (may be bpc or bpp)
      - bpc … bits per char
      - bpp … bits per pixel
    - relative compression (percent log ratio) … $100\ln(k/k')$
      - $k'$ … size of the data compressed by a standard algorithm
  - data corpora
    - Calgary Corpus – 14 files: text, graphics, binary files
    - Canterbury Corpus – 11 files + artificial corpus (4) + large corpus (3) + miscellaneous corpus (1)
    - Silesia Corpus
    - Prague Corpus
  - contests
    - Calgary Corpus Compression Challenge
    - Hutter Prize – Wikipedia compression
      - goal – encourage research in AI
- limits of lossless compression
  - encoding $f:$ {$n$-bit strings} → {strings of length $\lt n$}
    - $|\text{Dom}\,f|=2^n$
    - $|\text{Im}\,f|\leq 2^n-1$
    - such $f$ cannot be injective → does not have an inverse function
  - let $M\subseteq\text{Dom}\,f$ such that $\forall s\in M:|f(s)|\leq 0.9n$
    - $f$ injective on $M\implies|M|\leq2^{1+0.9n}-1$
    - $n=100$ … $|M|/2^n\lt 2^{-9}$
    - $n=1000$ … $|M|/2^n\lt 2^{-99}\approx 1.578\cdot 10^{-30}$
  - we can't compress all data efficiently (but that does not matter, we often focus on specific instances of data)

## Statistical methods

- basic concepts
  - source alphabet … $A$
  - coding alphabet … $A_C$
  - encoding … $f:A^*\to A^*_C$
    - $f$ injective → uniquely decodable encoding
  - typical strategy
    - the input string is factorized into a concatenation of substrings
      - $s=s_1s_2\dots s_k$
      - substrings = phrases
    - $f(s)=C(s_1)C(s_2)\dots C(s_k)$
      - $C(s_i)$ … codeword
- codes
  - source code … $K:A\to A^*_C$
    - $K(s)$ … codeword for symbol $s\in A$
  - the encoding $K^*$ generated by code $K$ is the mapping $K^*(s_1s_2\dots s_n)=K(s_1)K(s_2)\dots K(s_n)$ for every $s_1s_2\dots s_n\in A^*$
  - $K$ is uniquely decodable $\equiv$ it generates a uniquely decodable encoding $K^*$
  - example
    - $\set{0,01,11}$ … is uniquely decodable
    - $\set{0,01,10}$ … is not, counterexample: $010$
    - $K_1$ … fixed length
    - $K_2$ … no codeword is a prefix of another one
- prefix codes
  - a prefix code is a code such that no codeword is a prefix of another codeword
  - observation
    - every prefix code $K$ (over a binary alphabet) may be represented by a binary tree = prefix tree for $K$
    - the prefix tree may be used for decoding
- idea: some letters are more frequent, let's assign shorter codewords to them
- Shannon-Fano coding
  - input – source alphabet $A$, frequency $f(s)$ for every symbol $s\in A$
  - output – prefix code for $A$
  - algorithm (see the sketch below)
    - sort the source alphabet by symbol frequencies
    - divide the table into parts with similar sums of frequencies
    - append 0/1 to the prefix, repeat
  - definition: an algorithm constructs an optimal prefix / uniquely decodable code $K\equiv$ $|K'^*(a_1a_2\dots a_n)|\geq |K^*(a_1a_2\dots a_n)|$ for every prefix (uniquely decodable) code $K'$ and for every string $a_1a_2\dots a_n\in A^*$ with symbol frequencies $f$
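To make the splitting procedure concrete, here is a minimal Python sketch of the Shannon-Fano construction just described. The frequencies are invented for illustration, and the split rule (most balanced frequency sums) is one reasonable reading of the steps above, not necessarily the lecture's exact variant.

```python
def shannon_fano(symbols):
    """symbols: list of (symbol, frequency) pairs sorted by decreasing frequency.
    Returns a dict mapping each symbol to its codeword (a string of '0'/'1')."""
    if len(symbols) == 1:
        return {symbols[0][0]: ""}

    # Find the split point where the two parts have the most similar frequency sums.
    total = sum(f for _, f in symbols)
    best_i, best_diff, running = 1, float("inf"), 0
    for i in range(1, len(symbols)):
        running += symbols[i - 1][1]
        diff = abs(running - (total - running))
        if diff < best_diff:
            best_i, best_diff = i, diff

    # Recurse on both parts; prepend '0' to all codewords from the first part
    # and '1' to those from the second (i.e. append 0/1 to the current prefix).
    code = {}
    for sym, cw in shannon_fano(symbols[:best_i]).items():
        code[sym] = "0" + cw
    for sym, cw in shannon_fano(symbols[best_i:]).items():
        code[sym] = "1" + cw
    return code


freqs = [("a", 15), ("b", 7), ("c", 6), ("d", 6), ("e", 5)]
print(shannon_fano(freqs))  # {'a': '00', 'b': '01', 'c': '10', 'd': '110', 'e': '111'}
```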
- Shannon-Fano is not optimal
- Fano's student David Huffman described a construction of an optimal prefix code
- Shannon-Fano is a divide and conquer algorithm (top-down); Huffman's construction is greedy (bottom-up)

### Huffman code

- constructs the tree from the bottom (see the sketch below)
  - starts with the symbols that have the lowest frequencies
  - the sum of the frequencies is used as the frequency of the new node
- greedy algorithm
- interesting type of Huffman code: [Canonical Huffman code](https://en.wikipedia.org/wiki/Canonical_Huffman_code)
- optimality of Huffman code
  - notation
    - alphabet $A$ of symbols that occur in the input string
    - $\forall s\in A:f(s)$ … frequency of the symbol
    - let $T$ be a prefix tree for alphabet $A$ with frequencies $f$
      - leaves of $T$ … symbols of $A$
    - let $d_T(s)$ denote the depth of a leaf $s$ in the tree $T$
      - $d_T(s)=$ length of the path from the root to $s$
      - $d_T(s)=$ length of the codeword for symbol $s$
    - cost $C(T)$ of tree $T$
      - $C(T)=\sum_{z\in A}f(z)d_T(z)$
      - $C(T)$ … length of the encoded message
  - lemma 0: if a prefix tree $T$ ($|V(T)|\geq 3$) contains a vertex with exactly one child, then $T$ is not optimal
    - proof: we could contract the edge between the vertex and its child
  - lemma 1
    - let $A$ ($|A|\geq 2$) be an alphabet with frequencies $f$
    - let $x,y\in A$ be symbols with the lowest frequencies
    - then there exists an optimal prefix tree for $A$ in which $x,y$ are siblings of maximum depth
    - proof
      - let $T$ be an optimal tree where $x,y$ are not siblings of maximum depth (instead, there are siblings $a,b$ of maximum depth)
      - by switching $a$ and $x$, we get $T'$
      - $C(T)-C(T')=d_T(x)f(x)+d_T(a)f(a)-d_{T'}(a)f(a)-d_{T'}(x)f(x)=(d_T(a)-d_T(x))(f(a)-f(x))\geq 0$
        - since $d_{T'}(x)=d_T(a)$, $d_{T'}(a)=d_T(x)$, and both $d_T(a)\geq d_T(x)$ and $f(a)\geq f(x)$
      - $T'$ is also optimal
      - we repeat the steps for $y$ (switching it with $b$)
        - we get $T''$
      - $T$ is optimal $\implies T''$ is optimal
  - lemma 2
    - let $A$ be an alphabet with frequencies $f$
    - let $x,y\in A$ be symbols with the lowest frequencies
    - let $T'$ be an optimal prefix tree for the alphabet $A'=A\setminus\set{x,y}\cup\set{z}$ where $f(z)=f(x)+f(y)$
    - then the tree $T$ obtained from $T'$ by adding children $x$ and $y$ to $z$ is an optimal prefix tree for $A$
    - proof
      - $T'$ is optimal for $A'$
      - we get $T$ by adding $x,y$ below $z$ (therefore $z$ is no longer a symbol of the alphabet)
      - there exists $T''$ optimal for $A$ in which $x,y$ are siblings of maximum depth
        - using lemma 1
      - we label their parent $z$
      - we remove $x,y$ and get $T'''$ for $A'$
      - $C(T)=C(T')+f(x)+f(y)\leq C(T''')+f(x)+f(y)=C(T'')$
        - clearly $C(T')\leq C(T''')$ as $T'$ is optimal
      - $T''$ optimal $\implies T$ optimal
  - theorem: the Huffman algorithm produces an optimal prefix code
    - proof by induction
  - theorem (Kraft-McMillan inequality)
    - the codeword lengths $l_1,l_2,\dots,l_n$ of a uniquely decodable code satisfy $\sum_i 2^{-l_i}\leq 1$
    - conversely, if positive integers $l_1,l_2,\dots,l_n$ satisfy the inequality, then there is a prefix code with codeword lengths $l_1,l_2,\dots,l_n$
  - corollary
    - the Huffman coding algorithm produces an optimal uniquely decodable code
    - if there were a better uniquely decodable code, there would also be a better prefix code (by the Kraft-McMillan inequality), so Huffman coding would not be optimal
- idea: we could encode larger parts of the message, it might be better than Huffman coding
- Huffman coding is not deterministic: the order in which we take the symbols to form the tree is not specified – it may lead to optimal trees of different height
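A minimal Python sketch of the greedy bottom-up construction described in this section, using a min-heap keyed by frequency; the example frequencies are invented. Ties are broken by insertion order here, which is exactly the unspecified choice mentioned above: a different tie-breaking rule may produce a different, but equally optimal, tree.

```python
import heapq
import itertools

def huffman_code(freq):
    """freq: dict mapping symbol -> frequency. Returns dict symbol -> codeword."""
    counter = itertools.count()  # tie-breaker so heap entries are always comparable
    # Each heap entry: (frequency, tie_breaker, {symbol: codeword_so_far})
    heap = [(f, next(counter), {s: ""}) for s, f in freq.items()]
    heapq.heapify(heap)

    while len(heap) > 1:
        # Take the two nodes with the lowest frequencies ...
        f1, _, code1 = heapq.heappop(heap)
        f2, _, code2 = heapq.heappop(heap)
        # ... and merge them: prepend 0 to one subtree, 1 to the other,
        # using the sum of the frequencies as the weight of the new node.
        merged = {s: "0" + c for s, c in code1.items()}
        merged.update({s: "1" + c for s, c in code2.items()})
        heapq.heappush(heap, (f1 + f2, next(counter), merged))

    return heap[0][2]


freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
code = huffman_code(freq)
print(code)                                       # e.g. {'a': '0', 'd': '111', ...}
print(sum(freq[s] * len(code[s]) for s in freq))  # cost C(T) of the tree: 224
```

The printed cost is exactly $C(T)=\sum_{z\in A}f(z)d_T(z)$, the length of the encoded message in bits.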
- can we modify the Huffman algorithm to guarantee that the resulting code minimizes the maximum codeword length?
  - yes, when constructing the tree, we can sort the nodes not only by their frequencies but also by the depth of their subtree
- what is the maximum height of a Huffman tree for an input message of length $n$?
  - we find the minimal input size that yields a Huffman tree of height $k$
  - Huffman tree of height $k\implies n\geq f_{k+1}$
    - where $f_0=1,\;f_1=1,\;f_{k+2}=f_{k+1}+f_k$
  - $n\lt f_{k+2}\implies$ max. codeword length $\leq k$
- is there an optimal prefix code which cannot be obtained using the Huffman algorithm?
  - yes
- generalize the construction of the binary Huffman code to the case of an $m$-ary coding alphabet ($m\gt 2$)
  - the trivial approach does not produce an optimal solution
  - for the ternary coding alphabet, we can modify the first step – we will only take two leaves to form a subtree
- implementation notes
  - bitwise input and output (usually, we work with bytes)
  - overflow problem (for integers etc.)
  - code reconstruction is necessary for decoding
- adaptive compression
  - static vs. adaptive methods
  - statistical data compression consists of two parts: modeling and coding
  - what if we cannot read the data twice?
    - we will use an adaptive model
  - if we know that the code contains 2 codewords of length 2 and 4 codewords of length 3, we can construct the canonical Huffman code
    - 00, 01, 100, 101, 110, 111
  - adaptive Huffman code
    - brute force strategy – we reconstruct the whole tree
      - after each frequency change
      - after reading $k$ symbols
      - after a change of the order of symbols sorted by frequencies
  - characterization of Huffman trees
    - Huffman tree – binary tree with nonnegative vertex weights
    - two vertices of a binary tree with the same parent are called siblings
    - a binary tree with nonnegative vertex weights has the sibling property if
      - $\forall$ parent: weight(parent) = $\sum$ weight(child)
      - each vertex except the root has a sibling
      - the vertices can be listed in order of non-increasing weight with each vertex adjacent to its sibling
    - theorem (Faller 1973): a binary tree with nonnegative vertex weights is a Huffman tree iff it has the sibling property
  - FGK algorithm
    - we maintain the sibling property
    - zero-frequency problem
      - how to encode a novel symbol
      - 1st solution: initialize the Huffman tree with all symbols of the source alphabet, each with weight one
      - 2nd solution: initialize the Huffman tree with a special symbol esc
        - encode the 1st occurrence of symbol s as the Huffman code of esc followed by s
        - insert a new leaf representing s into the Huffman tree
    - average codeword length … $l_{FGK}\leq l_H+O(1)$
    - implementation problems
      - overflow
        - we can multiply the frequencies by $\frac12$
        - this may lead to tree reconstruction
      - statistical properties of the input may be changing over time – it might be useful to reduce the influence of older frequencies in time
        - by multiplying by $x\in(0,1)$
        - we'll use integer arithmetic → we need to scale the numbers

## Arithmetic coding

- associate each string $s\in A^*$ with a subinterval $I_s\subseteq[0,1]$ such that
  - $s\neq s'\implies I_s\cap I_{s'}=\emptyset$
  - length of $I_s$ is $p(s)$
- $C(s)=$ the number from $I_s$ with the shortest code possible
- we'll first assume that $p(a_1a_2\dots a_m)=p(a_1)\cdot p(a_2)\cdot \ldots \cdot p(a_m)$
- encoding (see the sketch below)
  - we count the frequencies → probabilities
  - we sort the symbols in some order
  - we assign intervals to the symbols in that order
  - to encode the message, we subdivide the intervals
    - for example, let's say that $I_B=(0.2,0.3)$ and $I_I=(0.5,0.6)$
    - then $I_{BI}=(0.25,0.26)$
  - the shorter the message, the longer the interval
  - we have to store the length of the input or encode a special symbol (end of file) to be able to distinguish between the encoded string and its prefixes
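A small Python sketch of the interval subdivision used during encoding; it uses floating point only for readability and hard-codes the intervals for B and I from the example above (the rest of the symbol model is omitted).

```python
# Intervals assigned to the symbols (made up, matching the B/I example above).
intervals = {
    "B": (0.2, 0.3),
    "I": (0.5, 0.6),
}

def encode_interval(message, intervals):
    """Return the subinterval of [0, 1] associated with the whole message."""
    low, high = 0.0, 1.0
    for symbol in message:
        s_low, s_high = intervals[symbol]
        width = high - low
        # Subdivide the current interval according to the symbol's interval.
        high = low + width * s_high
        low = low + width * s_low
    return low, high

print(encode_interval("BI", intervals))  # approximately (0.25, 0.26)
```

With floats the result is only approximate; these rounding errors are exactly what motivates the integer implementation discussed below.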
- decoding
  - we reconstruct the input string from the left
  - we invert the mapping
- problem: using floating point arithmetic leads to rounding errors → we will use integers
  - underflow may happen if we use too short integers
- another problem
  - we use bit shifting and assume that the first bits of both bounds are the same
  - what if the first bits differ?
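One common answer, sketched below in Python in the style of the Witten-Neal-Cleary integer coder (which may differ from the lecture's variant; the 16-bit register width is an arbitrary assumption), is to keep a counter of pending bits.

```python
# Sketch of the renormalization loop of an integer arithmetic coder.
BITS = 16
HALF = 1 << (BITS - 1)       # 0x8000
QUARTER = 1 << (BITS - 2)    # 0x4000

def renormalize(low, high, pending, output):
    """Shift out every bit of the code that is already determined by [low, high]."""
    while True:
        if high < HALF:
            # Both bounds start with 0: output it, then any pending opposite bits.
            output.append(0)
            output.extend([1] * pending)
            pending = 0
        elif low >= HALF:
            # Both bounds start with 1: output it, then any pending opposite bits.
            output.append(1)
            output.extend([0] * pending)
            pending = 0
            low -= HALF
            high -= HALF
        elif low >= QUARTER and high < 3 * QUARTER:
            # The first bits differ, but the interval has shrunk around the midpoint:
            # the next bit is not known yet, so only remember it as "pending"
            # (this is the underflow case from the question above).
            pending += 1
            low -= QUARTER
            high -= QUARTER
        else:
            return low, high, pending
        low = 2 * low          # shift in a 0
        high = 2 * high + 1    # shift in a 1
```

After each symbol the encoder narrows $[low, high]$ using the symbol's cumulative frequencies and then runs this loop; when the first bits of the bounds differ, the undecided bit is only counted as pending, and once the interval finally settles below or above the midpoint, the decided bit is emitted followed by the pending opposite bits.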