# Data Compression Algorithms

## Introduction

- data compression
  - process of converting an input data stream into an output data stream that has a smaller size
  - compression algorithm = encoding (compression) + decoding (decompression)
- compression
  - lossless – the restored and original data are identical
  - lossy – the restored data are a reasonable approximation of the original
- methods
  - static / adaptive
  - streaming / block
    - block – we compress the data by blocks
- goals – to save storage, to reduce the transmission bandwidth
- measuring the performance
  - units
    - input data size … $u$ bytes (uncompressed)
    - compressed data size … $k$ bytes, $K$ bits
    - compression ratio … $k/u$ (written as a percentage)
    - compression factor … $u:k$
    - compression gain … $(u-k)/u$ (written as a percentage)
    - average codeword length … $K/u$ (may be bpc or bpp)
      - bpc … bits per char
      - bpp … bits per pixel
    - relative compression (percent log ratio) … $100\ln(k/k')$
      - $k'$ … size of the data compressed by a standard algorithm
  - data corpora
    - Calgary Corpus – 14 files: text, graphics, binary files
    - Canterbury Corpus – 11 files + artificial corpus (4) + large corpus (3) + miscellaneous corpus (1)
    - Silesia Corpus
    - Prague Corpus
  - contests
    - Calgary Corpus Compression Challenge
    - Hutter Prize – Wikipedia compression
      - goal – encourage research in AI
- limits of lossless compression
  - encoding $f:$ {$n$-bit strings} → {strings of length $\lt n$}
    - $|\text{Dom}\,f|=2^n$
    - $|\text{Im}\,f|\leq 2^n-1$
    - such $f$ cannot be injective → does not have an inverse function
  - let $M\subseteq\text{Dom}\,f$ such that $\forall s\in M:|f(s)|\leq 0.9n$
    - $f$ injective on $M\implies|M|\leq2^{1+0.9n}-1$
    - $n=100$ … $|M|/2^n\lt 2^{-9}$
    - $n=1000$ … $|M|/2^n\lt 2^{-99}\approx 1.578\cdot 10^{-30}$
  - we can't compress all data efficiently (but that does not matter, we often focus on specific instances of data)

## Statistical methods

- basic concepts
  - source alphabet … $A$
  - coding alphabet … $A_C$
  - encoding … $f:A^*\to A^*_C$
    - $f$ injective → uniquely decodable encoding
  - typical strategy
    - the input string is factorized into a concatenation of substrings
      - $s=s_1s_2\dots s_k$
      - substrings = phrases
    - $f(s)=C(s_1)C(s_2)\dots C(s_k)$
      - $C(s_i)$ … codeword
- codes
  - source code … $K:A\to A^*_C$
    - $K(s)$ … codeword for symbol $s\in A$
  - the encoding $K^*$ generated by code $K$ is the mapping $K^*(s_1s_2\dots s_n)=K(s_1)K(s_2)\dots K(s_n)$ for every $s_1s_2\dots s_n\in A^*$
  - $K$ is uniquely decodable $\equiv$ it generates a uniquely decodable encoding $K^*$
  - example
    - $\set{0,01,11}$ … is uniquely decodable
    - $\set{0,01,10}$ … is not, counterexample: $010$
    - $K_1$ … fixed length
    - $K_2$ … no codeword is a prefix of another one
- prefix codes
  - a prefix code is a code such that no codeword is a prefix of another codeword
  - observation
    - every prefix code $K$ (over a binary alphabet) may be represented by a binary tree = prefix tree for $K$
    - the prefix tree may be used for decoding
- idea: some letters are more frequent, let's assign shorter codewords to them
- Shannon-Fano coding
  - input – source alphabet $A$, frequency $f(s)$ for every symbol $s\in A$
  - output – prefix code for $A$
  - algorithm (see the sketch below)
    - sort the source alphabet by symbol frequencies
    - divide the table into parts with similar sums of frequencies
    - append 0/1 to the prefix, repeat
  - definition: an algorithm constructs an optimal prefix / uniquely decodable code $K\equiv$ $|K'^*(a_1a_2\dots a_n)|\geq |K^*(a_1a_2\dots a_n)|$ for every prefix (uniquely decodable) code $K'$ and for every string $a_1a_2\dots a_n\in A^*$ with symbol frequencies $f$
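To make the splitting procedure concrete, here is a minimal Python sketch of the Shannon-Fano construction just described. The frequencies are invented for illustration, and the split rule (most balanced frequency sums) is one reasonable reading of the steps above, not necessarily the lecture's exact variant.

```python
def shannon_fano(symbols):
    """symbols: list of (symbol, frequency) pairs sorted by decreasing frequency.
    Returns a dict mapping each symbol to its codeword (a string of '0'/'1')."""
    if len(symbols) == 1:
        return {symbols[0][0]: ""}

    # Find the split point where the two parts have the most similar frequency sums.
    total = sum(f for _, f in symbols)
    best_i, best_diff, running = 1, float("inf"), 0
    for i in range(1, len(symbols)):
        running += symbols[i - 1][1]
        diff = abs(running - (total - running))
        if diff < best_diff:
            best_i, best_diff = i, diff

    # Recurse on both parts; prepend '0' to all codewords from the first part
    # and '1' to those from the second (i.e. append 0/1 to the current prefix).
    code = {}
    for sym, cw in shannon_fano(symbols[:best_i]).items():
        code[sym] = "0" + cw
    for sym, cw in shannon_fano(symbols[best_i:]).items():
        code[sym] = "1" + cw
    return code


freqs = [("a", 15), ("b", 7), ("c", 6), ("d", 6), ("e", 5)]
print(shannon_fano(freqs))  # {'a': '00', 'b': '01', 'c': '10', 'd': '110', 'e': '111'}
```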
- Shannon-Fano is not optimal
- Fano's student David Huffman described a construction of an optimal prefix code
- Shannon-Fano is a divide and conquer algorithm (top-down); Huffman's construction is greedy (bottom-up)

### Huffman code

- constructs the tree from the bottom (see the sketch below)
  - starts with the symbols that have the lowest frequencies
  - the sum of the frequencies is used as the frequency of the new node
- greedy algorithm
- interesting type of Huffman code: [Canonical Huffman code](https://en.wikipedia.org/wiki/Canonical_Huffman_code)
- optimality of Huffman code
  - notation
    - alphabet $A$ of symbols that occur in the input string
    - $\forall s\in A:f(s)$ … frequency of the symbol
    - let $T$ be a prefix tree for alphabet $A$ with frequencies $f$
      - leaves of $T$ … symbols of $A$
    - let $d_T(s)$ denote the depth of a leaf $s$ in the tree $T$
      - $d_T(s)=$ length of the path from the root to $s$
      - $d_T(s)=$ length of the codeword for symbol $s$
    - cost $C(T)$ of tree $T$
      - $C(T)=\sum_{z\in A}f(z)d_T(z)$
      - $C(T)$ … length of the encoded message
  - lemma 0: if a prefix tree $T$ ($|V(T)|\geq 3$) contains a vertex with exactly one child, then $T$ is not optimal
    - proof: we could contract the edge between the vertex and its child
  - lemma 1
    - let $A$ ($|A|\geq 2$) be an alphabet with frequencies $f$
    - let $x,y\in A$ be symbols with the lowest frequencies
    - then there exists an optimal prefix tree for $A$ in which $x,y$ are siblings of maximum depth
    - proof
      - let $T$ be an optimal tree where $x,y$ are not siblings of maximum depth (instead, there are siblings $a,b$ of maximum depth)
      - by switching $a$ and $x$, we get $T'$
      - $C(T)-C(T')=d_T(x)f(x)+d_T(a)f(a)-d_{T'}(a)f(a)-d_{T'}(x)f(x)=(d_T(a)-d_T(x))(f(a)-f(x))\geq 0$
        - since $d_{T'}(x)=d_T(a)$, $d_{T'}(a)=d_T(x)$, and both $d_T(a)\geq d_T(x)$ and $f(a)\geq f(x)$
      - $T'$ is also optimal
      - we repeat the steps for $y$ (switching it with $b$)
        - we get $T''$
      - $T$ is optimal $\implies T''$ is optimal
  - lemma 2
    - let $A$ be an alphabet with frequencies $f$
    - let $x,y\in A$ be symbols with the lowest frequencies
    - let $T'$ be an optimal prefix tree for the alphabet $A'=A\setminus\set{x,y}\cup\set{z}$ where $f(z)=f(x)+f(y)$
    - then the tree $T$ obtained from $T'$ by adding children $x$ and $y$ to $z$ is an optimal prefix tree for $A$
    - proof
      - $T'$ is optimal for $A'$
      - we get $T$ by adding $x,y$ below $z$ (therefore $z$ is no longer a symbol of the alphabet)
      - there exists $T''$ optimal for $A$ in which $x,y$ are siblings of maximum depth
        - using lemma 1
      - we label their parent $z$
      - we remove $x,y$ and get $T'''$ for $A'$
      - $C(T)=C(T')+f(x)+f(y)\leq C(T''')+f(x)+f(y)=C(T'')$
        - clearly $C(T')\leq C(T''')$ as $T'$ is optimal
      - $T''$ optimal $\implies T$ optimal
  - theorem: the Huffman algorithm produces an optimal prefix code
    - proof by induction
  - theorem (Kraft-McMillan inequality)
    - the codeword lengths $l_1,l_2,\dots,l_n$ of a uniquely decodable code satisfy $\sum_i 2^{-l_i}\leq 1$
    - conversely, if positive integers $l_1,l_2,\dots,l_n$ satisfy the inequality, then there is a prefix code with codeword lengths $l_1,l_2,\dots,l_n$
  - corollary
    - the Huffman coding algorithm produces an optimal uniquely decodable code
    - if there were a better uniquely decodable code, there would also be a better prefix code (by the Kraft-McMillan inequality), so Huffman coding would not be optimal
- idea: we could encode larger parts of the message, it might be better than Huffman coding
- Huffman coding is not deterministic: the order in which we take the symbols to form the tree is not specified – it may lead to optimal trees of different height
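A minimal Python sketch of the greedy bottom-up construction described in this section, using a min-heap keyed by frequency; the example frequencies are invented. Ties are broken by insertion order here, which is exactly the unspecified choice mentioned above: a different tie-breaking rule may produce a different, but equally optimal, tree.

```python
import heapq
import itertools

def huffman_code(freq):
    """freq: dict mapping symbol -> frequency. Returns dict symbol -> codeword."""
    counter = itertools.count()  # tie-breaker so heap entries are always comparable
    # Each heap entry: (frequency, tie_breaker, {symbol: codeword_so_far})
    heap = [(f, next(counter), {s: ""}) for s, f in freq.items()]
    heapq.heapify(heap)

    while len(heap) > 1:
        # Take the two nodes with the lowest frequencies ...
        f1, _, code1 = heapq.heappop(heap)
        f2, _, code2 = heapq.heappop(heap)
        # ... and merge them: prepend 0 to one subtree, 1 to the other,
        # using the sum of the frequencies as the weight of the new node.
        merged = {s: "0" + c for s, c in code1.items()}
        merged.update({s: "1" + c for s, c in code2.items()})
        heapq.heappush(heap, (f1 + f2, next(counter), merged))

    return heap[0][2]


freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
code = huffman_code(freq)
print(code)                                       # e.g. {'a': '0', 'd': '111', ...}
print(sum(freq[s] * len(code[s]) for s in freq))  # cost C(T) of the tree: 224
```

The printed cost is exactly $C(T)=\sum_{z\in A}f(z)d_T(z)$, the length of the encoded message in bits.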
- can we modify the Huffman algorithm to guarantee that the resulting code minimizes the maximum codeword length?
  - yes, when constructing the tree, we can sort the nodes not only by their frequencies but also by the depth of their subtree
- what is the maximum height of a Huffman tree for an input message of length $n$?
  - we find the minimal input size that yields a Huffman tree of height $k$
  - Huffman tree of height $k\implies n\geq f_{k+1}$
    - where $f_0=1,\;f_1=1,\;f_{k+2}=f_{k+1}+f_k$
  - $n\lt f_{k+2}\implies$ max. codeword length $\leq k$
- is there an optimal prefix code which cannot be obtained using the Huffman algorithm?
  - yes
- generalize the construction of the binary Huffman code to the case of an $m$-ary coding alphabet ($m\gt 2$)
  - the trivial approach does not produce an optimal solution
  - for the ternary coding alphabet, we can modify the first step – we will only take two leaves to form a subtree
- implementation notes
  - bitwise input and output (usually, we work with bytes)
  - overflow problem (for integers etc.)
  - code reconstruction is necessary for decoding
- adaptive compression
  - static vs. adaptive methods
  - statistical data compression consists of two parts: modeling and coding
  - what if we cannot read the data twice?
    - we will use an adaptive model
  - if we know that the code contains 2 codewords of length 2 and 4 codewords of length 3, we can construct the canonical Huffman code
    - 00, 01, 100, 101, 110, 111
  - adaptive Huffman code
    - brute force strategy – we reconstruct the whole tree
      - after each frequency change
      - after reading $k$ symbols
      - after a change of the order of symbols sorted by frequencies
  - characterization of Huffman trees
    - Huffman tree – binary tree with nonnegative vertex weights
    - two vertices of a binary tree with the same parent are called siblings
    - a binary tree with nonnegative vertex weights has the sibling property if
      - $\forall$ parent: weight(parent) = $\sum$ weight(child)
      - each vertex except the root has a sibling
      - the vertices can be listed in order of non-increasing weight with each vertex adjacent to its sibling
    - theorem (Faller 1973): a binary tree with nonnegative vertex weights is a Huffman tree iff it has the sibling property
  - FGK algorithm
    - we maintain the sibling property
    - zero-frequency problem
      - how to encode a novel symbol
      - 1st solution: initialize the Huffman tree with all symbols of the source alphabet, each with weight one
      - 2nd solution: initialize the Huffman tree with a special symbol esc
        - encode the 1st occurrence of symbol s as the Huffman code of esc followed by s
        - insert a new leaf representing s into the Huffman tree
    - average codeword length … $l_{FGK}\leq l_H+O(1)$
    - implementation problems
      - overflow
        - we can multiply the frequencies by $\frac12$
        - this may lead to tree reconstruction
      - statistical properties of the input may be changing over time – it might be useful to reduce the influence of older frequencies in time
        - by multiplying by $x\in(0,1)$
        - we'll use integer arithmetic → we need to scale the numbers

## Arithmetic coding

- associate each string $s\in A^*$ with a subinterval $I_s\subseteq[0,1]$ such that
  - $s\neq s'\implies I_s\cap I_{s'}=\emptyset$
  - length of $I_s$ is $p(s)$
- $C(s)=$ the number from $I_s$ with the shortest code possible
- we'll first assume that $p(a_1a_2\dots a_m)=p(a_1)\cdot p(a_2)\cdot \ldots \cdot p(a_m)$
- encoding (see the sketch below)
  - we count the frequencies → probabilities
  - we sort the symbols in some order
  - we assign intervals to the symbols in that order
  - to encode the message, we subdivide the intervals
    - for example, let's say that $I_B=(0.2,0.3)$ and $I_I=(0.5,0.6)$
    - then $I_{BI}=(0.25,0.26)$
  - the shorter the message, the longer the interval
  - we have to store the length of the input or encode a special symbol (end of file) to be able to distinguish between the encoded string and its prefixes
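A small Python sketch of the interval subdivision used during encoding; it uses floating point only for readability and hard-codes the intervals for B and I from the example above (the rest of the symbol model is omitted).

```python
# Intervals assigned to the symbols (made up, matching the B/I example above).
intervals = {
    "B": (0.2, 0.3),
    "I": (0.5, 0.6),
}

def encode_interval(message, intervals):
    """Return the subinterval of [0, 1] associated with the whole message."""
    low, high = 0.0, 1.0
    for symbol in message:
        s_low, s_high = intervals[symbol]
        width = high - low
        # Subdivide the current interval according to the symbol's interval.
        high = low + width * s_high
        low = low + width * s_low
    return low, high

print(encode_interval("BI", intervals))  # approximately (0.25, 0.26)
```

With floats the result is only approximate; these rounding errors are exactly what motivates the integer implementation discussed below.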
- decoding
  - we reconstruct the input string from the left
  - we invert the mapping
- problem: using floating point arithmetic leads to rounding errors → we will use integers
  - underflow may happen if we use too short integers
- another problem
  - we use bit shifting and assume that the first bits of both bounds are the same
  - what if the first bits differ?
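One common answer, sketched below in Python in the style of the Witten-Neal-Cleary integer coder (which may differ from the lecture's variant; the 16-bit register width is an arbitrary assumption), is to keep a counter of pending bits.

```python
# Sketch of the renormalization loop of an integer arithmetic coder.
BITS = 16
HALF = 1 << (BITS - 1)       # 0x8000
QUARTER = 1 << (BITS - 2)    # 0x4000

def renormalize(low, high, pending, output):
    """Shift out every bit of the code that is already determined by [low, high]."""
    while True:
        if high < HALF:
            # Both bounds start with 0: output it, then any pending opposite bits.
            output.append(0)
            output.extend([1] * pending)
            pending = 0
        elif low >= HALF:
            # Both bounds start with 1: output it, then any pending opposite bits.
            output.append(1)
            output.extend([0] * pending)
            pending = 0
            low -= HALF
            high -= HALF
        elif low >= QUARTER and high < 3 * QUARTER:
            # The first bits differ, but the interval has shrunk around the midpoint:
            # the next bit is not known yet, so only remember it as "pending"
            # (this is the underflow case from the question above).
            pending += 1
            low -= QUARTER
            high -= QUARTER
        else:
            return low, high, pending
        low = 2 * low          # shift in a 0
        high = 2 * high + 1    # shift in a 1
```

After each symbol the encoder narrows $[low, high]$ using the symbol's cumulative frequencies and then runs this loop; when the first bits of the bounds differ, the undecided bit is only counted as pending, and once the interval finally settles below or above the midpoint, the decided bit is emitted followed by the pending opposite bits.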