Data Compression Algorithms
Introduction
- data compression
- the process of converting an input data stream into an output data stream that has a smaller size
- compression algorithm = encoding (compression) + decoding (decompression)
- compression
- lossless – the restored and original data are identical
- lossy – the restored data are a reasonable approximation of the original
- methods
- static / adaptive
- streaming / block
- block – we compress the data by blocks
- goals – to save storage, to reduce the transmission bandwidth
- measuring the performance (a small computation sketch follows this list)
- units
- input data size … u bytes (uncompressed)
- compressed data size … k bytes, K bits
- compression ratio … k/u (written as percentage)
- compression factor … u:k
- compression gain … (u−k)/u (written as percentage)
- average codeword length … K/u (may be bpc or bpp)
- bpc … bits per char
- bpp … bits per pixel
- relative compression (percent log ratio) … 100·ln(k/k′)
- k′ … size of data compressed by a standard algorithm
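- a minimal sketch computing the measures above (Python; the function and argument names are my own):

```python
def compression_metrics(u, k):
    """Performance measures for original size u and compressed size k (both in bytes)."""
    K = 8 * k  # compressed size in bits
    return {
        "compression ratio": f"{100 * k / u:.1f} %",
        "compression factor": f"{u / k:.2f}:1",
        "compression gain": f"{100 * (u - k) / u:.1f} %",
        "avg codeword length (bpc)": K / u,
    }

print(compression_metrics(u=102_400, k=30_720))  # e.g. 100 KiB compressed to 30 KiB
```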
- data corpora
- Calgary Corpus – 14 files: text, graphics, binary files
- Canterbury Corpus – 11 files + artificial corpus (4) + large corpus (3) + miscellaneous corpus (1)
- Silesia Corpus
- Prague Corpus
- contests
- Calgary Corpus Compression Challenge
- Hutter Prize – Wikipedia compression
- goal – encourage research in AI
- limits of lossless compression
- encoding f: {n-bit strings} → {strings of length < n}
- |Dom f| = 2^n
- |Im f| ≤ 2^0 + 2^1 + ⋯ + 2^(n−1) = 2^n − 1
- such f cannot be injective → it does not have an inverse function
- let M ⊆ Dom f such that ∀s ∈ M: |f(s)| ≤ 0.9n
- f injective on M ⟹ |M| ≤ 2^(0.9n+1) − 1
- n = 100 … |M|/2^n < 2^(−9)
- n = 1000 … |M|/2^n < 2^(−99) ≈ 1.578·10^(−30)
- we can't compress all data efficiently (but that does not matter, we often focus on specific instances of data); a numeric check of the bound follows
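- a quick check of the bound above (a sketch; the helper name is my own):

```python
from math import floor

def compressible_fraction_bound(n, ratio=0.9):
    """Upper bound on the fraction of n-bit strings that an injective encoding
    can shrink to at most ratio*n bits: (2^(floor(ratio*n) + 1) - 1) / 2^n."""
    return (2 ** (floor(ratio * n) + 1) - 1) / 2 ** n

print(compressible_fraction_bound(100))   # ~ 0.00195, i.e. < 2^(-9)
print(compressible_fraction_bound(1000))  # ~ 1.578e-30, i.e. < 2^(-99)
```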
Statistical methods
- basic concepts
- source alphabet … A
- coding alphabet … A_C
- encoding … f: A* → A_C*
- f injective → uniquely decodable encoding
- typical strategy
- input string is factorized into concatenation of substrings
- s = s_1 s_2 … s_k
- substrings = phrases
- f(s) = C(s_1)C(s_2)…C(s_k)
- C(s_i) … codeword
- codes
- source code … K: A → A_C*
- K(s) … codeword for symbol s ∈ A
- encoding K* generated by code K is the mapping K*(s_1 s_2 … s_n) = K(s_1)K(s_2)…K(s_n) for every s_1 s_2 … s_n ∈ A*
- K is uniquely decodable ≡ generates a uniquely decodable encoding K∗
- example
- {0,01,11} … is uniquely decodable
- {0,01,10} … is not, counterexample: 010
- K1 … fixed length
- K2 … no codeword is a prefix of another one
- prefix codes
- prefix code is a code such that no codeword is a prefix of another codeword
- observation
- every prefix code K (over a binary alphabet) may be represented by a binary tree = prefix tree for K
- prefix tree may be used for decoding (see the sketch below)
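- a minimal decoding sketch using the prefix tree (the trie is stored as nested dicts; the example code is my own):

```python
def build_prefix_tree(code):
    """Build a binary trie from a prefix code given as {symbol: codeword}."""
    root = {}
    for symbol, word in code.items():
        node = root
        for bit in word:
            node = node.setdefault(bit, {})
        node["symbol"] = symbol  # mark the leaf
    return root

def decode(bits, root):
    """Walk the trie bit by bit; emit a symbol at each leaf and restart at the root."""
    out, node = [], root
    for bit in bits:
        node = node[bit]
        if "symbol" in node:
            out.append(node["symbol"])
            node = root
    return "".join(out)

tree = build_prefix_tree({"a": "0", "b": "10", "c": "11"})
print(decode("0101100", tree))  # -> "abcaa"
```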
- idea: some letters are more frequent, let's assign shorter codewords to them
- Shannon-Fano coding
- input – source alphabet A, frequency f(s) for every symbol s∈A
- output – prefix code for A
- algorithm
- sort the source alphabet by symbol frequencies
- divide the table into parts with similar sums of frequencies
- append 0/1 to prefix, repeat
- definition: the algorithm constructs an optimal prefix / uniquely decodable code K ≡ |K′*(a_1 a_2 … a_n)| ≥ |K*(a_1 a_2 … a_n)| for every prefix (uniquely decodable) code K′ and for every string a_1 a_2 … a_n ∈ A* with symbol frequencies f
- Shannon-Fano is not optimal
- Fano's student David Huffman described a construction of an optimal prefix code
- Shannon-Fano is a top-down divide and conquer algorithm (sketch below)
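- a sketch of the construction (how to break ties when splitting is not fixed by the description above; here the split minimizes the difference of the two sums):

```python
def shannon_fano(freqs):
    """Shannon-Fano: sort symbols by frequency, split the table into two parts
    with sums as close as possible, append 0/1 to the codewords, recurse."""
    code = {s: "" for s in freqs}

    def split(group):
        if len(group) <= 1:
            return
        total, acc = sum(freqs[s] for s in group), 0
        best_cut, best_diff = 1, float("inf")
        for i, s in enumerate(group[:-1], start=1):
            acc += freqs[s]
            diff = abs(total - 2 * acc)  # |left sum - right sum|
            if diff < best_diff:
                best_cut, best_diff = i, diff
        for s in group[:best_cut]:
            code[s] += "0"
        for s in group[best_cut:]:
            code[s] += "1"
        split(group[:best_cut])
        split(group[best_cut:])

    split(sorted(freqs, key=freqs.get, reverse=True))
    return code

print(shannon_fano({"a": 15, "b": 7, "c": 6, "d": 6, "e": 5}))
# -> {'a': '00', 'b': '01', 'c': '10', 'd': '110', 'e': '111'}
```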
Huffman code
- constructs the tree from the bottom
- starts with the symbols that have the lowest frequencies
- sum of the frequencies is used as the frequency of the new node
- greedy algorithm (construction sketched below)
- interesting type of Huffman code: Canonical Huffman code
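- a minimal greedy construction of the tree described above (a sketch: heapq and the nested-tuple tree encoding are my choices):

```python
import heapq
from itertools import count

def huffman(freqs):
    """Repeatedly merge the two nodes with the lowest frequencies; the sum of
    the two frequencies becomes the frequency of the new internal node."""
    tick = count()  # tie-breaker so equal frequencies never compare subtrees
    heap = [(f, next(tick), s) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tick), (left, right)))
    return heap[0][2]  # root of the prefix tree

def codewords(tree, prefix=""):
    """Read the code off the tree: left edge = 0, right edge = 1."""
    if not isinstance(tree, tuple):  # a leaf holds a source symbol
        return {tree: prefix or "0"}
    left, right = tree
    return {**codewords(left, prefix + "0"), **codewords(right, prefix + "1")}

print(codewords(huffman({"a": 15, "b": 7, "c": 6, "d": 6, "e": 5})))
# total cost 87 bits, vs 89 bits for the Shannon-Fano code above
```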
- optimality of Huffman code
- notation
- alphabet A of symbols that occur in the input string
- ∀s∈A:f(s) … frequency of the symbol
- let T be a prefix tree for alphabet A with frequencies f
- leaves of T … symbols of A
- let d_T(s) denote the depth of a leaf s in the tree T
- d_T(s) = length of the path from the root to s
- d_T(s) = length of the codeword for symbol s
- cost C(T) of tree T
- C(T) = Σ_{z∈A} f(z)·d_T(z)
- C(T) … length of the encoded message
- lemma 0: if a prefix tree T (|V(T)| ≥ 3) contains a vertex with exactly one child, then T is not optimal
- proof: contracting the edge between the vertex and its child shortens every codeword passing through it, so the cost decreases
- lemma 1
- let A (|A| ≥ 2) be an alphabet with frequencies f
- let x,y∈A be symbols with the lowest frequencies
- then there exists an optimal prefix tree for A in which x,y are siblings of maximum depth
- proof
- let T be an optimal tree in which x, y are not siblings of maximum depth (instead, let a, b be siblings of maximum depth)
- by swapping a and x we get T′, where d_{T′}(x) = d_T(a) and d_{T′}(a) = d_T(x)
- C(T) − C(T′) = d_T(x)f(x) + d_T(a)f(a) − d_{T′}(a)f(a) − d_{T′}(x)f(x) = (d_T(a) − d_T(x))(f(a) − f(x)) ≥ 0
- both factors are nonnegative: a has maximum depth and x has the lowest frequency
- T′ is also optimal
- we repeat the steps for y and b
- we get T′′
- T is optimal ⟹ T′′ is optimal
- lemma 2
- let A be an alphabet with f
- let x,y∈A be symbols with the lowest frequencies
- let T′ be an optimal prefix tree for the alphabet A′ = (A ∖ {x, y}) ∪ {z} where f(z) = f(x) + f(y)
- then tree T obtained from T′ by adding children x and y to z is an optimal prefix tree for A
- proof
- T′ optimal for A′
- we get T by adding x,y below z (therefore z is no longer a symbol of the alphabet)
- there exists T′′ optimal for A in which x,y are siblings of maximum depth
- using lemma 1
- we label their parent z
- we remove x,y and get T′′′ for A′
- C(T) = C(T′) + f(x) + f(y) ≤ C(T′′′) + f(x) + f(y) = C(T′′)
- the equalities hold because replacing the leaf z by an internal vertex with children x, y adds exactly f(x) + f(y) to the cost
- C(T′) ≤ C(T′′′) as T′ is optimal
- T′′ optimal ⟹T optimal
- theorem: algorithm Huffman produces an optimal prefix code
- theorem (Kraft-McMillan inequality)
- codeword lengths l_1, l_2, …, l_n of a uniquely decodable code satisfy Σ_i 2^(−l_i) ≤ 1
- conversely, if positive integers l_1, l_2, …, l_n satisfy the inequality, then there is a prefix code with codeword lengths l_1, l_2, …, l_n
- corollary
- Huffman coding algorithm produces an optimal uniquely decodable code
- if there were a better uniquely decodable code, there would also be a better prefix code with the same codeword lengths (by the Kraft-McMillan inequality), so Huffman coding would not be optimal – a contradiction; a small check of the inequality follows
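- a small check in exact arithmetic (a sketch; note that satisfying the inequality does not make a particular code uniquely decodable – the converse only guarantees that some prefix code with those lengths exists):

```python
from fractions import Fraction

def kraft_sum(codewords):
    """Sum of 2^(-l_i) over the codeword lengths, computed exactly."""
    return sum(Fraction(1, 2 ** len(w)) for w in codewords)

print(kraft_sum(["0", "01", "11"]))  # 1 -- uniquely decodable (example above)
print(kraft_sum(["0", "01", "10"]))  # also 1, yet not uniquely decodable
```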
- idea: we could encode larger parts of the message, it might be better than Huffman coding
- Huffman coding is not deterministic: the order in which nodes of equal frequency are taken to form the tree is not specified – it may lead to optimal trees of different heights
- can we modify Huffman algorithm to guarantee that the resulting code minimizes the maximum codeword length?
- yes, when constructing the tree, we can sort the nodes not only by their frequencies but also by the depth of their subtree
- what is the maximum height of a Huffman tree for an input message of length n?
- we find the minimal input size that admits a Huffman tree of height k
- Huffman tree of height k ⟹ n ≥ f_{k+1}
- where f_0 = 1, f_1 = 1, f_{k+2} = f_{k+1} + f_k (the Fibonacci recurrence)
- n < f_{k+2} ⟹ max. codeword length ≤ k (computed in the sketch below)
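- a sketch computing the resulting bound on the height:

```python
def max_huffman_height(n):
    """Largest k with f_{k+1} <= n, where f_0 = f_1 = 1 and f_{k+2} = f_{k+1} + f_k;
    by the bound above, no Huffman tree for an input of length n can be deeper."""
    fs = [1, 1]                  # f_0, f_1
    while fs[-1] + fs[-2] <= n:  # extend the sequence while the next f still fits
        fs.append(fs[-1] + fs[-2])
    return len(fs) - 2           # fs[-1] = f_{k+1} for the returned k

for n in (3, 5, 100):
    print(n, max_huffman_height(n))  # 3 -> 2, 5 -> 3, 100 -> 9
```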
- is there an optimal prefix code which cannot be obtained using Huffman algorithm?
- generalize the construction of binary Huffman code to the case of m-ary coding alphabet (m>2)
- the trivial approach does not produce an optimal solution
- for the ternary coding alphabet, we can modify the first step – when the number of symbols is even, we take only two leaves to form the first subtree
- implementation notes
- bitwise input and output (usually, we work with bytes)
- overflow problem (for integers etc.)
- code reconstruction necessary for decoding
- adaptive compression
- static × adaptive methods
- statistical data compression consists of two parts: modeling and coding
- what if we cannot read the data twice?
- we will use the adaptive model
- if we know that the code contains 2 codewords of length 2 and 4 codewords of length 3, we can construct the canonical Huffman code (sketched below)
- 00, 01, 100, 101, 110, 111
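- a sketch of the construction from {codeword length: count}; it reproduces the code above:

```python
def canonical_code(counts):
    """Canonical Huffman code: codewords of each length are consecutive binary
    numbers; moving to a larger length shifts the first free value left."""
    code, val, prev = [], 0, None
    for length in sorted(counts):
        if prev is not None:
            val <<= length - prev
        for _ in range(counts[length]):
            code.append(format(val, f"0{length}b"))
            val += 1
        prev = length
    return code

print(canonical_code({2: 2, 3: 4}))  # ['00', '01', '100', '101', '110', '111']
```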
- adaptive Huffman code
- brute force strategy – we reconstruct the whole tree
- after each frequency change
- after reading k symbols
- after change of order of symbols sorted by frequencies
- characterization of Huffman trees
- Huffman tree – binary tree with nonnegative vertex weights
- two vertices of a binary tree with the same parent are called siblings
- binary tree with nonnegative vertex weights has a sibling property if
- ∀ parent: weight(parent) = sum of the weights of its children
- each vertex except root has a sibling
- vertices can be listed in order of non-increasing weight with each vertex adjacent to its sibling
- theorem (Faller 1973): a binary tree with nonnegative vertex weights is a Huffman tree iff it has the sibling property (a checker is sketched below)
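- a sketch of the check (the nested-tuple encoding of weighted trees is my own assumption: leaf = (weight, symbol), internal vertex = (weight, left, right)):

```python
def has_sibling_property(tree):
    """Check Faller's sibling property for a weighted binary tree."""
    pairs = []  # (larger, smaller) child weights of every internal vertex

    def walk(node):
        if len(node) == 2:  # leaf
            return True
        w, left, right = node
        if w != left[0] + right[0]:  # parent weight = sum of children's weights
            return False
        pairs.append((max(left[0], right[0]), min(left[0], right[0])))
        return walk(left) and walk(right)

    if not walk(tree):
        return False
    # siblings adjacent in some non-increasing listing of all non-root vertices:
    # order the pairs descendingly and require min of one pair >= max of the next
    pairs.sort(reverse=True)
    return all(pairs[i][1] >= pairs[i + 1][0] for i in range(len(pairs) - 1))

print(has_sibling_property((4, (2, "c"), (2, (1, "a"), (1, "b")))))  # True
print(has_sibling_property((5, (1, "c"), (4, (2, "a"), (2, "b")))))  # False
```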
- FGK algorithm
- we maintain the sibling property
- zero-frequency problem
- how to encode a novel symbol
- 1st solution: initialize the Huffman tree with all symbols of the source alphabet, each with weight one
- 2nd solution: initialize Huffman tree with a special symbol esc
- encode the 1st occurrence of symbol s as the Huffman code of esc followed by s
- insert a new leaf representing s into the Huffman tree
- average codeword length … l_FGK ≤ l_H + O(1)
- implementation problems
- overflow
- we can halve all the frequencies (multiply them by 1/2)
- this may lead to tree reconstruction
- statistical properties of the input may be changing over time – it might be useful to reduce the influence of older frequencies in time
- by multiplying by x∈(0,1)
- we'll use integer arithmetic → we need to scale the numbers
Arithmetic coding
- associate each string s ∈ A* with a subinterval I_s ⊆ [0, 1] such that
- s ≠ s′ ⟹ I_s ∩ I_{s′} = ∅
- length of I_s = p(s)
- C(s) = the number from I_s with the shortest code possible
- we'll first assume that p(a_1 a_2 … a_m) = p(a_1)·p(a_2)·…·p(a_m)
- encoding
- we count the frequencies → probabilities
- we sort the symbols in some order
- we assign intervals to the symbols in the order
- to encode the message, we subdivide the intervals
- for example, let's say that I_B = (0.2, 0.3) and I_I = (0.5, 0.6)
- then I_BI = (0.25, 0.26)
- the shorter the message, the longer the interval
- we have to store the length of the input, or encode a special end-of-file symbol, to be able to distinguish between the encoded string and its prefixes (see the sketches below)
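- a floating-point sketch of the interval narrowing (illustration only – the rounding problem mentioned below is already visible here):

```python
def interval_for(message, intervals):
    """Narrow [low, high) symbol by symbol: a symbol with interval (l, h)
    maps the current interval to its (l, h) sub-part."""
    low, high = 0.0, 1.0
    for s in message:
        l, h = intervals[s]
        low, high = low + (high - low) * l, low + (high - low) * h
    return low, high

intervals = {"B": (0.2, 0.3), "I": (0.5, 0.6)}   # the example above
print(interval_for("BI", intervals))             # ~ (0.25, 0.26), up to rounding
```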
- decoding
- we reconstruct the input string from the left
- we invert the mapping (decoder sketched below)
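- a matching decoding sketch; the explicit length parameter corresponds to storing the input length, as noted above:

```python
def decode(x, intervals, length):
    """Find the symbol whose interval contains x, emit it, rescale x back to
    [0, 1), and repeat until `length` symbols have been produced."""
    out = []
    for _ in range(length):
        for s, (l, h) in intervals.items():
            if l <= x < h:
                out.append(s)
                x = (x - l) / (h - l)
                break
    return "".join(out)

print(decode(0.255, {"B": (0.2, 0.3), "I": (0.5, 0.6)}, length=2))  # -> 'BI'
```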
- problem: using floating point arithmetic leads to rounding errors → we will use integers
- underflow may happen if the integers we use are too short
- another problem
- we use bit shifting and assume that the first bits of both bounds are the same
- what if the first bits differ?