Data Compression Algorithms
Introduction
- data compression
- the process of converting an input data stream into an output data stream that has a smaller size
- compression algorithm = encoding (compression) + decoding (decompression)
- compression
- lossless – the restored and original data are identical
- lossy – the restored data are a reasonable approximation of the original
- methods
- static / adaptive
- streaming / block
- block – we compress the data by blocks
- goals – to save storage, to reduce the transmission bandwidth
- measuring the performance (a small computation sketch follows this list)
- units
- input data size … u bytes (uncompressed)
- compressed data size … k bytes, K bits
- compression ratio … k/u (written as percentage)
- compression factor … u:k
- compression gain … (u−k)/u (written as percentage)
- average codeword length … K/u (may be bpc or bpp)
- bpc … bits per char
- bpp … bits per pixel
- relative compression (percent log ratio) … 100·ln(k/k′)
- k′ … size of data compressed by a standard algorithm
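- a minimal sketch computing the measures above (Python; the function and argument names are my own):

```python
def compression_metrics(u, k):
    """Performance measures for original size u and compressed size k (both in bytes)."""
    K = 8 * k  # compressed size in bits
    return {
        "compression ratio": f"{100 * k / u:.1f} %",
        "compression factor": f"{u / k:.2f}:1",
        "compression gain": f"{100 * (u - k) / u:.1f} %",
        "avg codeword length (bpc)": K / u,
    }

print(compression_metrics(u=102_400, k=30_720))  # e.g. 100 KiB compressed to 30 KiB
```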
- data corpora
- Calgary Corpus – 14 files: text, graphics, binary files
- Canterbury Corpus – 11 files + artificial corpus (4) + large corpus (3) + miscellaneous corpus (1)
- Silesia Corpus
- Prague Corpus
- contests
- Calgary Corpus Compression Challenge
- Hutter Prize – Wikipedia compression
- goal – encourage research in AI
- limits of lossless compression
- encoding f: {n-bit strings} → {strings of length < n}
- |Dom f| = 2^n
- |Im f| ≤ 2^0 + 2^1 + ⋯ + 2^(n−1) = 2^n − 1
- such f cannot be injective → it does not have an inverse function
- let M ⊆ Dom f such that ∀s ∈ M: |f(s)| ≤ 0.9n
- f injective on M ⟹ |M| ≤ 2^(0.9n+1) − 1
- n = 100 … |M|/2^n < 2^(−9)
- n = 1000 … |M|/2^n < 2^(−99) ≈ 1.578·10^(−30)
- we can't compress all data efficiently (but that does not matter, we often focus on specific instances of data); a numeric check of the bound follows
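- a quick check of the bound above (a sketch; the helper name is my own):

```python
from math import floor

def compressible_fraction_bound(n, ratio=0.9):
    """Upper bound on the fraction of n-bit strings that an injective encoding
    can shrink to at most ratio*n bits: (2^(floor(ratio*n) + 1) - 1) / 2^n."""
    return (2 ** (floor(ratio * n) + 1) - 1) / 2 ** n

print(compressible_fraction_bound(100))   # ~ 0.00195, i.e. < 2^(-9)
print(compressible_fraction_bound(1000))  # ~ 1.578e-30, i.e. < 2^(-99)
```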
Statistical methods
- basic concepts
- source alphabet … A
- coding alphabet … A_C
- encoding … f: A* → A_C*
- f injective → uniquely decodable encoding
- typical strategy
- input string is factorized into concatenation of substrings
- s = s_1 s_2 … s_k
- substrings = phrases
- f(s) = C(s_1)C(s_2)…C(s_k)
- C(s_i) … codeword
- codes
- source code … K: A → A_C*
- K(s) … codeword for symbol s ∈ A
- encoding K* generated by code K is the mapping K*(s_1 s_2 … s_n) = K(s_1)K(s_2)…K(s_n) for every s_1 s_2 … s_n ∈ A*
- K is uniquely decodable ≡ generates a uniquely decodable encoding K∗
- example
- {0,01,11} … is uniquely decodable
- {0,01,10} … is not, counterexample: 010
- K1 … fixed length
- K2 … no codeword is a prefix of another one
- prefix codes
- prefix code is a code such that no codeword is a prefix of another codeword
- observation
- every prefix code K (over a binary alphabet) may be represented by a binary tree = prefix tree for K
- prefix tree may be used for decoding (see the sketch below)
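- a minimal decoding sketch using the prefix tree (the trie is stored as nested dicts; the example code is my own):

```python
def build_prefix_tree(code):
    """Build a binary trie from a prefix code given as {symbol: codeword}."""
    root = {}
    for symbol, word in code.items():
        node = root
        for bit in word:
            node = node.setdefault(bit, {})
        node["symbol"] = symbol  # mark the leaf
    return root

def decode(bits, root):
    """Walk the trie bit by bit; emit a symbol at each leaf and restart at the root."""
    out, node = [], root
    for bit in bits:
        node = node[bit]
        if "symbol" in node:
            out.append(node["symbol"])
            node = root
    return "".join(out)

tree = build_prefix_tree({"a": "0", "b": "10", "c": "11"})
print(decode("0101100", tree))  # -> "abcaa"
```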
- idea: some letters are more frequent, let's assign shorter codewords to them
- Shannon-Fano coding
- input – source alphabet A, frequency f(s) for every symbol s∈A
- output – prefix code for A
- algorithm
- sort the source alphabet by symbol frequencies
- divide the table into parts with similar sums of frequencies
- append 0/1 to prefix, repeat
- definition: the algorithm constructs an optimal prefix / uniquely decodable code K ≡ |K′*(a_1 a_2 … a_n)| ≥ |K*(a_1 a_2 … a_n)| for every prefix (uniquely decodable) code K′ and for every string a_1 a_2 … a_n ∈ A* with symbol frequencies f
- Shannon-Fano is not optimal
- Fano's student David Huffman described a construction of an optimal prefix code
- Shannon-Fano is a top-down divide and conquer algorithm (sketch below)
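- a sketch of the construction (how to break ties when splitting is not fixed by the description above; here the split minimizes the difference of the two sums):

```python
def shannon_fano(freqs):
    """Shannon-Fano: sort symbols by frequency, split the table into two parts
    with sums as close as possible, append 0/1 to the codewords, recurse."""
    code = {s: "" for s in freqs}

    def split(group):
        if len(group) <= 1:
            return
        total, acc = sum(freqs[s] for s in group), 0
        best_cut, best_diff = 1, float("inf")
        for i, s in enumerate(group[:-1], start=1):
            acc += freqs[s]
            diff = abs(total - 2 * acc)  # |left sum - right sum|
            if diff < best_diff:
                best_cut, best_diff = i, diff
        for s in group[:best_cut]:
            code[s] += "0"
        for s in group[best_cut:]:
            code[s] += "1"
        split(group[:best_cut])
        split(group[best_cut:])

    split(sorted(freqs, key=freqs.get, reverse=True))
    return code

print(shannon_fano({"a": 15, "b": 7, "c": 6, "d": 6, "e": 5}))
# -> {'a': '00', 'b': '01', 'c': '10', 'd': '110', 'e': '111'}
```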
Huffman code
- constructs the tree from the bottom
- starts with the symbols that have the lowest frequencies
- sum of the frequencies is used as the frequency of the new node
- greedy algorithm (construction sketched below)
- interesting type of Huffman code: Canonical Huffman code
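- a minimal greedy construction of the tree described above (a sketch: heapq and the nested-tuple tree encoding are my choices):

```python
import heapq
from itertools import count

def huffman(freqs):
    """Repeatedly merge the two nodes with the lowest frequencies; the sum of
    the two frequencies becomes the frequency of the new internal node."""
    tick = count()  # tie-breaker so equal frequencies never compare subtrees
    heap = [(f, next(tick), s) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tick), (left, right)))
    return heap[0][2]  # root of the prefix tree

def codewords(tree, prefix=""):
    """Read the code off the tree: left edge = 0, right edge = 1."""
    if not isinstance(tree, tuple):  # a leaf holds a source symbol
        return {tree: prefix or "0"}
    left, right = tree
    return {**codewords(left, prefix + "0"), **codewords(right, prefix + "1")}

print(codewords(huffman({"a": 15, "b": 7, "c": 6, "d": 6, "e": 5})))
# total cost 87 bits, vs 89 bits for the Shannon-Fano code above
```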
- optimality of Huffman code
- notation
- alphabet A of symbols that occur in the input string
- ∀s∈A:f(s) … frequency of the symbol
- let T be a prefix tree for alphabet A with frequencies f
- leaves of T … symbols of A
- let d_T(s) denote the depth of a leaf s in the tree T
- d_T(s) = length of the path from the root to s
- d_T(s) = length of the codeword for symbol s
- cost C(T) of tree T
- C(T) = Σ_{z∈A} f(z)·d_T(z)
- C(T) … length of the encoded message
- lemma 0: if a prefix tree T (|V(T)| ≥ 3) contains a vertex with exactly one child, then T is not optimal
- proof: contracting the edge between the vertex and its child shortens every codeword passing through it, so the cost decreases
- lemma 1
- let A (|A| ≥ 2) be an alphabet with frequencies f
- let x,y∈A be symbols with the lowest frequencies
- then there exists an optimal prefix tree for A in which x,y are siblings of maximum depth
- proof
- let T be an optimal tree in which x, y are not siblings of maximum depth (instead, let a, b be siblings of maximum depth)
- by swapping a and x we get T′, where d_{T′}(x) = d_T(a) and d_{T′}(a) = d_T(x)
- C(T) − C(T′) = d_T(x)f(x) + d_T(a)f(a) − d_{T′}(a)f(a) − d_{T′}(x)f(x) = (d_T(a) − d_T(x))(f(a) − f(x)) ≥ 0
- both factors are nonnegative: a has maximum depth and x has the lowest frequency
- T′ is also optimal
- we repeat the steps for y and b
- we get T′′
- T is optimal ⟹ T′′ is optimal
- lemma 2
- let A be an alphabet with f
- let x,y∈A be symbols with the lowest frequencies
- let T′ be an optimal prefix tree for the alphabet A′ = (A ∖ {x, y}) ∪ {z} where f(z) = f(x) + f(y)
- then tree T obtained from T′ by adding children x and y to z is an optimal prefix tree for A
- proof
- T′ optimal for A′
- we get T by adding x,y below z (therefore z is no longer a symbol of the alphabet)
- there exists T′′ optimal for A in which x,y are siblings of maximum depth
- using lemma 1
- we label their parent z
- we remove x,y and get T′′′ for A′
- C(T) = C(T′) + f(x) + f(y) ≤ C(T′′′) + f(x) + f(y) = C(T′′)
- the equalities hold because replacing the leaf z by an internal vertex with children x, y adds exactly f(x) + f(y) to the cost
- C(T′) ≤ C(T′′′) as T′ is optimal
- T′′ optimal ⟹T optimal
- theorem: algorithm Huffman produces an optimal prefix code
- theorem (Kraft-McMillan inequality)
- codeword lengths l_1, l_2, …, l_n of a uniquely decodable code satisfy Σ_i 2^(−l_i) ≤ 1
- conversely, if positive integers l_1, l_2, …, l_n satisfy the inequality, then there is a prefix code with codeword lengths l_1, l_2, …, l_n
- corollary
- Huffman coding algorithm produces an optimal uniquely decodable code
- if there were a better uniquely decodable code, there would also be a better prefix code with the same codeword lengths (by the Kraft-McMillan inequality), so Huffman coding would not be optimal – a contradiction; a small check of the inequality follows
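- a small check in exact arithmetic (a sketch; note that satisfying the inequality does not make a particular code uniquely decodable – the converse only guarantees that some prefix code with those lengths exists):

```python
from fractions import Fraction

def kraft_sum(codewords):
    """Sum of 2^(-l_i) over the codeword lengths, computed exactly."""
    return sum(Fraction(1, 2 ** len(w)) for w in codewords)

print(kraft_sum(["0", "01", "11"]))  # 1 -- uniquely decodable (example above)
print(kraft_sum(["0", "01", "10"]))  # also 1, yet not uniquely decodable
```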
- idea: we could encode larger parts of the message, it might be better than Huffman coding
- Huffman coding is not deterministic: the order in which nodes of equal frequency are taken to form the tree is not specified – it may lead to optimal trees of different heights
- can we modify Huffman algorithm to guarantee that the resulting code minimizes the maximum codeword length?
- yes, when constructing the tree, we can sort the nodes not only by their frequencies but also by the depth of their subtree
- what is the maximum height of a Huffman tree for an input message of length n?
- we find the minimal input size that admits a Huffman tree of height k
- Huffman tree of height k ⟹ n ≥ f_{k+1}
- where f_0 = 1, f_1 = 1, f_{k+2} = f_{k+1} + f_k (the Fibonacci recurrence)
- n < f_{k+2} ⟹ max. codeword length ≤ k (computed in the sketch below)
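- a sketch computing the resulting bound on the height:

```python
def max_huffman_height(n):
    """Largest k with f_{k+1} <= n, where f_0 = f_1 = 1 and f_{k+2} = f_{k+1} + f_k;
    by the bound above, no Huffman tree for an input of length n can be deeper."""
    fs = [1, 1]                  # f_0, f_1
    while fs[-1] + fs[-2] <= n:  # extend the sequence while the next f still fits
        fs.append(fs[-1] + fs[-2])
    return len(fs) - 2           # fs[-1] = f_{k+1} for the returned k

for n in (3, 5, 100):
    print(n, max_huffman_height(n))  # 3 -> 2, 5 -> 3, 100 -> 9
```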
- is there an optimal prefix code which cannot be obtained using Huffman algorithm?
- generalize the construction of binary Huffman code to the case of m-ary coding alphabet (m>2)
- the trivial approach does not produce an optimal solution
- for the ternary coding alphabet, we can modify the first step – when the number of symbols is even, we take only two leaves to form the first subtree
- implementation notes
- bitwise input and output (usually, we work with bytes)
- overflow problem (for integers etc.)
- code reconstruction necessary for decoding
- adaptive compression
- static × adaptive methods
- statistical data compression consists of two parts: modeling and coding
- what if we cannot read the data twice?
- we will use the adaptive model
- if we know that the code contains 2 codewords of length 2 and 4 codewords of length 3, we can construct the canonical Huffman code (sketched below)
- 00, 01, 100, 101, 110, 111
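- a sketch of the construction from {codeword length: count}; it reproduces the code above:

```python
def canonical_code(counts):
    """Canonical Huffman code: codewords of each length are consecutive binary
    numbers; moving to a larger length shifts the first free value left."""
    code, val, prev = [], 0, None
    for length in sorted(counts):
        if prev is not None:
            val <<= length - prev
        for _ in range(counts[length]):
            code.append(format(val, f"0{length}b"))
            val += 1
        prev = length
    return code

print(canonical_code({2: 2, 3: 4}))  # ['00', '01', '100', '101', '110', '111']
```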
- adaptive Huffman code
- brute force strategy – we reconstruct the whole tree
- after each frequency change
- after reading k symbols
- after change of order of symbols sorted by frequencies
- characterization of Huffman trees
- Huffman tree – binary tree with nonnegative vertex weights
- two vertices of a binary tree with the same parent are called siblings
- binary tree with nonnegative vertex weights has a sibling property if
- ∀ parent: weight(parent) = sum of the weights of its children
- each vertex except root has a sibling
- vertices can be listed in order of non-increasing weight with each vertex adjacent to its sibling
- theorem (Faller 1973): a binary tree with nonnegative vertex weights is a Huffman tree iff it has the sibling property (a checker is sketched below)
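- a sketch of the check (the nested-tuple encoding of weighted trees is my own assumption: leaf = (weight, symbol), internal vertex = (weight, left, right)):

```python
def has_sibling_property(tree):
    """Check Faller's sibling property for a weighted binary tree."""
    pairs = []  # (larger, smaller) child weights of every internal vertex

    def walk(node):
        if len(node) == 2:  # leaf
            return True
        w, left, right = node
        if w != left[0] + right[0]:  # parent weight = sum of children's weights
            return False
        pairs.append((max(left[0], right[0]), min(left[0], right[0])))
        return walk(left) and walk(right)

    if not walk(tree):
        return False
    # siblings adjacent in some non-increasing listing of all non-root vertices:
    # order the pairs descendingly and require min of one pair >= max of the next
    pairs.sort(reverse=True)
    return all(pairs[i][1] >= pairs[i + 1][0] for i in range(len(pairs) - 1))

print(has_sibling_property((4, (2, "c"), (2, (1, "a"), (1, "b")))))  # True
print(has_sibling_property((5, (1, "c"), (4, (2, "a"), (2, "b")))))  # False
```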
- FGK algorithm
- we maintain the sibling property
- zero-frequency problem
- how to encode a novel symbol
- 1st solution: initialize the Huffman tree with all symbols of the source alphabet, each with weight one
- 2nd solution: initialize Huffman tree with a special symbol esc
- encode the 1st occurrence of symbol s as the Huffman code of esc followed by s
- insert a new leaf representing s into the Huffman tree
- average codeword length … l_FGK ≤ l_H + O(1)
- implementation problems
- overflow
- we can halve all the frequencies (multiply them by 1/2)
- this may lead to tree reconstruction
- statistical properties of the input may be changing over time – it might be useful to reduce the influence of older frequencies in time
- by multiplying by x∈(0,1)
- we'll use integer arithmetic → we need to scale the numbers
Arithmetic coding
- associate each string s ∈ A* with a subinterval I_s ⊆ [0, 1] such that
- s ≠ s′ ⟹ I_s ∩ I_{s′} = ∅
- length of I_s = p(s)
- C(s) = the number from I_s with the shortest code possible
- we'll first assume that p(a_1 a_2 … a_m) = p(a_1)·p(a_2)·…·p(a_m)
- encoding
- we count the frequencies → probabilities
- we sort the symbols in some order
- we assign intervals to the symbols in the order
- to encode the message, we subdivide the intervals
- for example, let's say that I_B = (0.2, 0.3) and I_I = (0.5, 0.6)
- then I_BI = (0.25, 0.26)
- the shorter the message, the longer the interval
- we have to store the length of the input, or encode a special end-of-file symbol, to be able to distinguish between the encoded string and its prefixes (see the sketches below)
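- a floating-point sketch of the interval narrowing (illustration only – the rounding problem mentioned below is already visible here):

```python
def interval_for(message, intervals):
    """Narrow [low, high) symbol by symbol: a symbol with interval (l, h)
    maps the current interval to its (l, h) sub-part."""
    low, high = 0.0, 1.0
    for s in message:
        l, h = intervals[s]
        low, high = low + (high - low) * l, low + (high - low) * h
    return low, high

intervals = {"B": (0.2, 0.3), "I": (0.5, 0.6)}   # the example above
print(interval_for("BI", intervals))             # ~ (0.25, 0.26), up to rounding
```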
- decoding
- we reconstruct the input string from the left
- we invert the mapping (decoder sketched below)
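- a matching decoding sketch; the explicit length parameter corresponds to storing the input length, as noted above:

```python
def decode(x, intervals, length):
    """Find the symbol whose interval contains x, emit it, rescale x back to
    [0, 1), and repeat until `length` symbols have been produced."""
    out = []
    for _ in range(length):
        for s, (l, h) in intervals.items():
            if l <= x < h:
                out.append(s)
                x = (x - l) / (h - l)
                break
    return "".join(out)

print(decode(0.255, {"B": (0.2, 0.3), "I": (0.5, 0.6)}, length=2))  # -> 'BI'
```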
- problem: using floating point arithmetic leads to rounding errors → we will use integers
- underflow may happen if the integers we use are too short
- another problem
- we use bit shifting and assume that the first bits of both bounds are the same
- what if the first bits differ?