# Exam

## Introduction

- data compression
  - process of converting an input data stream into an output data stream that has a smaller size
  - compression algorithm = encoding (compression) + decoding (decompression)
- compression
  - lossless – the restored and original data are identical
  - lossy – the restored data are a reasonable approximation of the original
- methods
  - static / adaptive
  - streaming / block
    - block – we compress the data by blocks
- goals – to save storage, to reduce the transmission bandwidth
- measuring the performance
  - units
    - input data size … $u$ bytes (uncompressed)
    - compressed data size … $k$ bytes, $K$ bits
  - compression ratio … $k/u$ (written as a percentage)
  - compression factor … $u:k$
  - compression gain … $(u-k)/u$ (written as a percentage)
  - average codeword length … $K/u$ (may be bpc or bpp)
    - bpc … bits per char
    - bpp … bits per pixel
  - relative compression (percent log ratio) … $100\ln(k/k')$
    - $k'$ … size of data compressed by a standard algorithm
- data corpora
  - Calgary Corpus – 14 files: text, graphics, binary files
  - Canterbury Corpus – 11 files + artificial corpus (4) + large corpus (3) + miscellaneous corpus (1)
  - Silesia Corpus
  - Prague Corpus
- contests
  - Calgary Corpus Compression Challenge
  - Hutter Prize – Wikipedia compression
    - goal – encourage research in AI
- limits of lossless compression
  - encoding $f:$ {$n$-bit strings} → {strings of length $\lt n$}
    - $|\text{Dom}\,f|=2^n$
    - $|\text{Im}\,f|\leq 2^n-1$
    - such $f$ cannot be injective → does not have an inverse function
  - let $M\subseteq\text{Dom}\,f$ such that $\forall s\in M:|f(s)|\leq 0.9n$
    - $f$ injective on $M\implies|M|\leq2^{1+0.9n}-1$
    - $n=100$ … $|M|/2^n\lt 2^{-9}$
    - $n=1000$ … $|M|/2^n\lt 2^{-99}\approx 1.578\cdot 10^{-30}$
  - we can't compress all data efficiently (but that does not matter, we often focus on specific instances of data)

## Lossless Compression

### Statistical Methods

- basic concepts
  - source alphabet … $A$
  - coding alphabet … $A_C$
  - encoding … $f:A^*\to A_C^*$
    - $f$ injective → uniquely decodable encoding
  - typical strategy
    - the input string is factorized into a concatenation of substrings
      - $s=s_1s_2\dots s_k$
      - substrings = phrases
    - $f(s)=C(s_1)C(s_2)\dots C(s_k)$
      - $C(s_i)$ … codeword
- codes
  - (source) code … $K:A\to A_C^*$
    - $K(s_i)$ … codeword for symbol $s_i\in A$
  - the encoding $K^*$ generated by code $K$ is the mapping $K^*(s_1s_2\dots s_n)=K(s_1)K(s_2)\dots K(s_n)$ for every $s\in A^*$
  - $K$ is uniquely decodable $\equiv$ it generates a uniquely decodable encoding $K^*$
  - example
    - $\set{0,01,11}$ … is uniquely decodable
    - $\set{0,01,10}$ … is not, counterexample: $010$
    - $K_1$ … fixed length
    - $K_2$ … no codeword is a prefix of another one
- prefix codes
  - a prefix code is a code such that no codeword is a prefix of another codeword
  - observation
    - every prefix code $K$ (over a binary alphabet) may be represented by a binary tree = prefix tree for $K$
    - the prefix tree may be used for decoding
- idea: some letters are more frequent, let's assign shorter codewords to them
  - historical example – Morse code
- Shannon-Fano coding
  - input – source alphabet $A$, frequency $f(s)$ for every symbol $s\in A$
  - output – prefix code for $A$
  - algorithm (see the sketch below)
    - sort the source alphabet by symbol frequencies
    - divide the table into parts with similar sums of frequencies
    - append 0/1 to the prefix, repeat
  - divide and conquer algorithm
- definition: the algorithm constructs an optimal prefix / uniquely decodable code $K\equiv$ $|K'^*(a_1a_2\dots a_n)|\geq |K^*(a_1a_2\dots a_n)|$ for every prefix (uniquely decodable) code $K'$ and for every string $a\in A^*$ with symbol frequencies $f$
- Shannon-Fano is not optimal
- Fano's student David Huffman described a construction of an optimal prefix code
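A minimal sketch of the Shannon-Fano split described above, assuming the frequencies are given as (symbol, frequency) pairs; the function name and the toy frequencies are illustrative only.

```python
def shannon_fano(symbols):
    """symbols: list of (symbol, frequency) pairs; returns {symbol: codeword}."""
    symbols = sorted(symbols, key=lambda sf: sf[1], reverse=True)
    codes = {s: "" for s, _ in symbols}

    def split(group):
        if len(group) <= 1:
            return
        total = sum(f for _, f in group)
        acc, cut = 0, 1
        # one common way to choose the split: cut where the running sum reaches half
        for i, (_, f) in enumerate(group[:-1], start=1):
            acc += f
            if acc >= total / 2:
                cut = i
                break
        for s, _ in group[:cut]:
            codes[s] += "0"
        for s, _ in group[cut:]:
            codes[s] += "1"
        split(group[:cut])
        split(group[cut:])

    split(symbols)
    return codes

# toy alphabet with made-up frequencies
print(shannon_fano([("a", 15), ("b", 7), ("c", 6), ("d", 6), ("e", 5)]))
```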
### Huffman Code

- algorithm (a sketch appears below)
  - constructs the tree from the bottom
  - starts with the symbols that have the lowest frequencies
  - the sum of the frequencies is used as the frequency of the new node
  - greedy algorithm
- interesting type of Huffman code: [Canonical Huffman code](https://en.wikipedia.org/wiki/Canonical_Huffman_code)
- optimality of Huffman code
  - notation
    - alphabet $A$ of symbols that occur in the input string
    - $\forall s\in A:f(s)$ … frequency of the symbol
    - let $T$ be a prefix tree for alphabet $A$ with frequencies $f$
      - leaves of $T$ … symbols of $A$
    - let $d_T(s)$ denote the depth of a leaf $s$ in the tree $T$
      - $d_T(s)=$ length of the path from the root to $s$
      - $d_T(s)=$ length of the codeword for symbol $s$
    - cost $C(T)$ of tree $T$
      - $C(T)=\sum_{z\in A}f(z)d_T(z)$
      - $C(T)$ … length of the encoded message
  - lemma 0: if a prefix tree $T$ ($|V(T)|\geq 3$) contains a vertex with exactly one child, then $T$ is not optimal
    - proof: we could contract the edge between the vertex and its child
  - lemma 1
    - let $A$ ($|A|\geq 2$) be an alphabet with frequencies $f$
    - let $x,y\in A$ be symbols with the lowest frequencies
    - then there exists an optimal prefix tree for $A$ in which $x,y$ are siblings of maximum depth
    - proof
      - let $T$ be an optimal tree where $x,y$ are not siblings of maximum depth (instead, there are $a,b$ with maximum depth)
      - now by switching $a$ and $x$, we get $T'$
      - $C(T)-C(T')=d_T(x)f(x)+d_T(a)f(a)-d_{T'}(a)f(a)-d_{T'}(x)f(x)=\dots\geq 0$
      - $T'$ is also optimal
      - we repeat the steps for $y$
        - we get $T''$
      - $T$ is optimal $\implies T''$ is optimal
  - lemma 2
    - let $A$ be an alphabet with frequencies $f$
    - let $x,y\in A$ be symbols with the lowest frequencies
    - let $T'$ be an optimal prefix tree for the alphabet $A'=A\setminus\set{x,y}\cup\set{z}$ where $f(z)=f(x)+f(y)$
    - then the tree $T$ obtained from $T'$ by adding children $x$ and $y$ to $z$ is an optimal prefix tree for $A$
    - proof
      - $T'$ optimal for $A'$
      - we get $T$ by adding $x,y$ below $z$ (therefore $z$ is no longer a symbol of the alphabet)
      - there exists $T''$ optimal for $A$ in which $x,y$ are siblings of maximum depth
        - using lemma 1
      - we label their parent $z$
      - we remove $x,y$ and get $T'''$ for $A'$
      - $C(T)=C(T')+f(x)+f(y)\leq C(T''')+f(x)+f(y)=C(T'')$
        - clearly $C(T')\leq C(T''')$ as $T'$ is optimal
      - $T''$ optimal $\implies T$ optimal
  - theorem: the Huffman algorithm produces an optimal prefix code
    - proof by induction
- theorem (Kraft-McMillan inequality)
  - the codeword lengths $l_1,l_2,\dots,l_n$ of a uniquely decodable code satisfy $\sum_i 2^{-l_i}\leq 1$
  - conversely, if positive integers $l_1,l_2,\dots,l_n$ satisfy the inequality, then there is a prefix code with codeword lengths $l_1,l_2,\dots,l_n$
- corollary
  - the Huffman coding algorithm produces an optimal uniquely decodable code
  - if there were some better uniquely decodable code, there would also be a better prefix code (by the Kraft-McMillan inequality), so Huffman coding would not be optimal
- idea: we could encode larger parts of the message, it might be better than Huffman coding
- Huffman coding is not deterministic: the order in which we take the symbols to form the tree is not specified – it may lead to optimal trees of different heights
- can we modify the Huffman algorithm to guarantee that the resulting code minimizes the maximum codeword length?
  - yes, when constructing the tree, we can sort the nodes not only by their frequencies but also by the depth of their subtree
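A heap-based sketch of the bottom-up construction; the secondary sort key (subtree depth) is one way to implement the tie-breaking just mentioned, and the frequency table is only an example.

```python
import heapq

def huffman_code(freqs):
    """freqs: dict symbol -> frequency; returns dict symbol -> codeword.
    Ties in frequency are broken by subtree depth, keeping the tree shallow."""
    # heap items: (frequency, subtree depth, counter, tree);
    # a tree is either a symbol (leaf) or a (left, right) pair
    heap = [(f, 0, i, s) for i, (s, f) in enumerate(freqs.items())]
    counter = len(heap)
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, d1, _, t1 = heapq.heappop(heap)  # two lowest-frequency nodes
        f2, d2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, max(d1, d2) + 1, counter, (t1, t2)))
        counter += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"  # single-symbol alphabet edge case
    walk(heap[0][3], "")
    return codes

# toy frequencies
print(huffman_code({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}))
```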
- what is the maximum height of a Huffman tree for an input message of length $n$?
  - we find the minimal input size needed to get a Huffman tree of height $k$
  - how to make the tree as deep as possible?
    - the frequencies of the symbols will be a Fibonacci sequence
  - Huffman tree of height $k\implies n\geq f_{k+1}$
    - where $f_0=1,\;f_1=1,\;f_{k+2}=f_{k+1}+f_k$
    - $f_m$ … frequency of the $m$-th symbol
  - $n\lt f_{k+2}\implies$ max. codeword length $\leq k$
- there is an optimal prefix code which cannot be obtained using the Huffman algorithm
- generalize the construction of the binary Huffman code to the case of an $m$-ary coding alphabet ($m\gt 2$)
  - the trivial approach does not produce an optimal solution
  - for the ternary coding alphabet, we can modify the first step – we will only take two leaves to form a subtree
- implementation notes
  - bitwise input and output (usually, we work with bytes)
  - overflow problem (for integers etc.)
  - code reconstruction necessary for decoding

---

- adaptive compression
  - static × adaptive methods
  - statistical data compression consists of two parts: modeling and coding
  - what if we cannot read the data twice?
    - we will use an adaptive model
  - if we know that the code contains 2 codewords of length 2 and 4 codewords of length 3, we can construct a canonical Huffman code
    - 00, 01, 100, 101, 110, 111
- adaptive Huffman code
  - brute force strategy – we reconstruct the whole tree
    - after each frequency change
    - after reading $k$ symbols
    - after a change in the order of symbols sorted by frequencies
- characterization of Huffman trees
  - Huffman tree – binary tree with nonnegative vertex weights
  - two vertices of a binary tree with the same parent are called siblings
  - a binary tree with nonnegative vertex weights has the sibling property if
    - $\forall$ parent: weight(parent) = $\sum$ weight(child)
    - each vertex except the root has a sibling
    - the vertices can be listed in order of non-increasing weight with each vertex adjacent to its sibling
  - theorem (Faller 1973): a binary tree with nonnegative vertex weights is a Huffman tree iff it has the sibling property
- FGK algorithm
  - we maintain the sibling property
  - zero-frequency problem
    - how to encode a novel symbol
    - 1st solution: initialize the Huffman tree with all symbols of the source alphabet, each with weight one
    - 2nd solution: initialize the Huffman tree with a special symbol esc
      - encode the 1st occurrence of symbol s as the Huffman code of esc followed by s
      - insert a new leaf representing s into the Huffman tree
  - average codeword length … $l_{FGK}\leq l_H+O(1)$
  - implementation problems
    - overflow
      - we can multiply the frequencies by $\frac12$
      - this may lead to tree reconstruction
    - statistical properties of the input may be changing over time – it might be useful to reduce the influence of older frequencies in time
      - by multiplying by $x\in(0,1)$
      - we'll use integer arithmetic → we need to scale the numbers

### Arithmetic Coding

- we associate each string $s\in A^*$ with a subinterval $I_s\subseteq[0,1]$ such that
  - $s\neq s'\implies I_s\cap I_{s'}=\emptyset$
  - length of $I_s=p(s)$
  - $C(s)=$ the number from $I_s$ with the shortest code possible
- we'll first assume that $p(a_1a_2\dots a_m)=p(a_1)\cdot p(a_2)\cdot \ldots \cdot p(a_m)$
- encoding (a sketch follows this list)
  - we count the frequencies → probabilities
  - we sort the symbols in some order
  - we assign intervals to the symbols in that order
  - to encode the message, we subdivide the intervals
    - for example, let's say that $I_B=(0.2,0.3)$ and $I_I=(0.5,0.6)$
    - then $I_{BI}=(0.25,0.26)$
  - the shorter the message, the longer the interval
  - we have to store the length of the input or encode a special symbol (end of file) to be able to distinguish between the encoded string and its prefixes
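A floating-point sketch of the interval narrowing used by the encoder (real coders use scaled integer arithmetic, as noted below); the alphabet and probabilities are made up so that $I_B$ and $I_I$ match the example above.

```python
def arithmetic_interval(message, probs):
    """Return the final subinterval [low, high) for `message`.
    probs: dict symbol -> probability; symbol intervals are assigned in the
    dict's insertion order. Floating point is used only for illustration."""
    cum, ranges = 0.0, {}
    for s, p in probs.items():
        ranges[s] = (cum, cum + p)   # subinterval of [0, 1) for each symbol
        cum += p
    low, high = 0.0, 1.0
    for s in message:
        width = high - low
        s_low, s_high = ranges[s]
        low, high = low + width * s_low, low + width * s_high
    return low, high

# made-up alphabet reproducing I_B = [0.2, 0.3) and I_I = [0.5, 0.6)
probs = {"A": 0.2, "B": 0.1, "C": 0.2, "I": 0.1, "D": 0.4}
print(arithmetic_interval("BI", probs))   # ≈ (0.25, 0.26)
```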
- decoding
  - we reconstruct the input string from the left
  - we invert the mapping
- problem: using floating point arithmetic leads to rounding errors → we will use integers
  - underflow may happen that way if we use too short integers
- another problem
  - we use bit shifting and assume that the first bits of both bounds are the same
  - what if the first bits differ?
    - we don't output anything yet and increment a counter
- adaptive version
  - Fenwick tree
  - Fenwick frequency = prefix sum

### Theoretical Limits

- Hartley's formula
  - minimum and average number of yes/no questions to find the answer
- Shannon's formula
  - entropy … $H(X)=-\sum_{x\in R}p(x)\log p(x)$
  - theorem: $H(X)\geq 0$
  - lemma: Gibbs inequality (or, at least, it looks like it)
  - theorem: for a random variable $X$ with a finite range $R$, $|R|=n$, we have $H(X)\leq\log n$
    - and $H(X)=\log n\iff \forall x\in R:p(x)=\frac1n$
    - we plug it into the inequality … $q_i=\frac1n$
  - entropy axioms
- Kraft-McMillan theorem
  - the codeword lengths $l_1,l_2,\dots$ of a uniquely decodable code $C$ satisfy $\sum_i 2^{-l_i}\leq 1$
  - on the other hand, if natural numbers $l_1,l_2,\dots$ satisfy the inequality, then there is a prefix code with codewords of these lengths
- theorem: let $C$ be a uniquely decodable code for a random variable $X$, then $H(X)\leq L(C)$
- entropy of a discrete random vector
  - $H(X)=nH(X_i)$
    - for independent components with the same probability distribution
  - upper and lower bounds
- analysis of arithmetic coding
  - $\overline{F(x)}$ … the midpoint of the interval $[F(x-1),F(x)]$
- problems
  - we assume that the encoded random variables are independent
  - we are working with a probabilistic model of our data that is too simplistic
- a better probabilistic model
  - we cannot use an “unlimited” context – the probabilities would be too small
  - we will use a context of $k$ symbols
- according to an experimental estimate, the entropy of English is 1.3 bits per symbol

### Context Methods

- model of order $i$ … makes use of a context of length $i$
- methods
  - fixed length contexts
  - combined
  - complete
  - partial
  - static, adaptive
- Prediction by Partial Matching
  - uses arithmetic coding
  - we try to encode symbol $s$ in a context of length $i$
  - if the frequency of $s$ in this context is zero, we output the ESC symbol and decrease the context length
  - assumption: every symbol has a non-zero frequency in some (small) context
  - data structure … trie (context tree)
    - each node has an edge to its largest proper suffix, which simplifies adding new leaves
    - it might not fit in the memory
      - to solve that, we can freeze the model (update frequencies, but stop adding new nodes)
      - or we can rebuild the model using only the recent history (not the whole file)
    - advanced data structure: DAWG (Directed Acyclic Word Graph)
- PPMII
  - “information inheritance”
  - uses heuristics
  - became part of RAR
- PAQ
  - arithmetic coding over a binary alphabet with sophisticated context modelling
  - this algorithm has probably the best compression ratio
  - but it is very time and space consuming

### Integer Coding

- unary code $\alpha$
  - $\alpha(1)=1$
  - $\alpha(i+1)=0\,\alpha(i)$
  - clearly, $|\alpha(i)|=i$
- binary code $\beta$
  - is not a prefix code!
  - we can use a fixed length
  - or we can gradually increase the codeword length
- Elias codes (a sketch follows this list)
  - a binary code always starts with 1
  - reduced binary code $\beta'$ … without the first 1
  - $\gamma'$ … first, we encode the length of the $\beta$ code using $\alpha$, then we encode the $\beta'$
  - $\gamma$ … we interleave the bits of $\alpha$ and $\beta$
  - $\delta$ … instead of $\alpha$, we use $\gamma$ to encode the length of $\beta$
  - we could construct other codes in a similar fashion, but the $\delta$ code suffices
    - for large numbers, it is shorter than $\gamma$
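A sketch of the $\gamma'$ and $\delta$ encoders described above ($\gamma$ is the same code up to a permutation of the bits); the function names are illustrative.

```python
def elias_gamma(i):
    """The gamma' form: alpha(|beta|) followed by the reduced binary code beta'."""
    assert i >= 1
    beta = bin(i)[2:]                    # binary code, starts with 1
    alpha = "0" * (len(beta) - 1) + "1"  # unary code of len(beta)
    return alpha + beta[1:]

def elias_delta(i):
    """Delta code: the length of beta is encoded with gamma instead of alpha."""
    assert i >= 1
    beta = bin(i)[2:]
    return elias_gamma(len(beta)) + beta[1:]

for n in (1, 2, 9, 1000):
    print(n, elias_gamma(n), elias_delta(n))
```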
- observation
  - each probability distribution $p_1\geq p_2\geq\dots\geq p_n$ satisfies $\forall i\in[n]:p_i\leq \frac1i$
- …

### Dictionary Methods

- idea
  - repeated phrases are stored in a dictionary
  - phrase occurrence in the text → pointer
- problems
  - construction of the optimal dictionary is NP-hard → greedy algorithm
  - the dictionary is needed for decoding

#### LZ77

- LZ77 (a sketch of the encoder follows this list)
  - sliding window
    - search buffer
    - look-ahead buffer
  - search for the longest string $S$ such that
    - $S$ starts in the search buffer
    - $S$ matches the prefix of the string in the look-ahead buffer
  - $S$ is encoded as a triple $\braket{i,j,z}$ where
    - $i$ … distance of the beginning of $S$ from the look-ahead buffer
    - $j=|S|$
    - $z$ … the symbol following $S$ in the look-ahead buffer
  - slow compression, fast decompression
- LZSS
  - the triple $(i,j,z)$ is replaced with $(i,j)$ or a literal $z$
    - a bit indicator to distinguish one from the other
  - if a pair of pointers requires the same space as $p$ symbols, we use pairs to encode only phrases of length $\gt p$ (shorter phrases are simply copied to the output)
  - sliding window → cyclic buffer
  - possible implementation
    - codeword 16 b, $|i|=11$, $|j|=5$
    - 8 bit indicators in 1 byte
      - one byte describes the types of the 8 following items
  - other variants
    - LZB
    - LZH
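A naive quadratic sketch of the LZ77 encoder and decoder, following the convention above ($i$ = distance from the look-ahead buffer, $j$ = match length, $z$ = next symbol); the buffer sizes are arbitrary.

```python
def lz77_encode(data, search_size=4096, lookahead_size=16):
    """Encode `data` as a list of (i, j, z) triples; naive quadratic search."""
    out, pos = [], 0
    while pos < len(data):
        best_i, best_j = 0, 0
        start = max(0, pos - search_size)
        max_len = min(lookahead_size, len(data) - pos - 1)  # keep one symbol for z
        for cand in range(start, pos):
            j = 0
            while j < max_len and data[cand + j] == data[pos + j]:
                j += 1
            if j > best_j:
                best_i, best_j = pos - cand, j
        out.append((best_i, best_j, data[pos + best_j]))
        pos += best_j + 1
    return out

def lz77_decode(triples):
    data = []
    for i, j, z in triples:
        for _ in range(j):          # copies one symbol at a time,
            data.append(data[-i])   # so overlapping matches work too
        data.append(z)
    return "".join(data)

triples = lz77_encode("abracadabra abracadabra")
print(triples)
print(lz77_decode(triples) == "abracadabra abracadabra")
```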
- Deflate algorithm
  - by Phil Katz
  - originally for PKZIP 2
  - LZSS + static Huffman coding
  - zip, gzip, jar, cab, png
  - search buffer size 32 kB
  - look-ahead buffer size 258 B
    - match length 3..258 → 256 options
    - for a smaller match, it uses literals
  - input divided into blocks
    - 3-bit header
      - is it the last block? (1 b)
      - type of the block (2 b)
    - 3 block types
      - no compression
      - fixed coding tables for the Huffman code
      - Huffman code based on the input data statistics
  - the third block type
    - starts with two Huffman trees
      - the first tree for literals and match lengths
        - 0..255 for literals
        - 256 = end-of-block
        - 257..285 for match lengths (extra bits are used to distinguish between particular numbers)
      - the second tree for offsets (again with extra bits)
  - hashing is used for string searching in the search buffer
    - the hash table stores strings of length 3
    - strings in linked lists sorted by their age
    - a parameter to limit the maximum list length
  - greedy strategy extension
    - first, look for a primary match
    - slide the window one symbol to the right, find a secondary match
    - if it is better, encode as a literal + the secondary match
- LZ77 disadvantages
  - limited outlook of the search window
  - longer window → better compression, more complex search

#### LZ78

- LZ78
  - no sliding window, an explicit dictionary instead
  - efficient data structure for storing the dictionary … trie
  - dictionary reconstruction is necessary when the dictionary space is exhausted
    - we can delete the whole dictionary
    - or delete only the phrases that occurred least frequently or have not occurred recently
      - we need to preserve the prefix property
        - if the dictionary contains $S$, it has to contain all prefixes of $S$
- LZW (a sketch follows this list)
  - we initialize the dictionary with the alphabet
  - we encode the longest prefix present in the dictionary
  - then, we append the next character and add it to the dictionary
  - when decoding, it may happen that the item is not in the dictionary yet
    - because the encoder is one step ahead
    - so we use the last decoded string and append its first character
- LZC
  - compress 4.0 utility in Unix
  - when the maximum size was reached, it continued with a static dictionary
  - when the compression ratio started getting worse, it deleted the dictionary and started over with the initial setting
    - we use a special symbol for that
- LZMW
  - new phrase = concatenation of the last two phrases
  - the dictionary lacks the prefix property
- LZAP
  - instead of adding $ST$, add all strings $Sp$ where $p$ is a prefix of $T$
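A minimal LZW encoder/decoder sketch, including the special case where the decoder is one step behind; the alphabet and the input string are toy examples.

```python
def lzw_encode(data, alphabet):
    """LZW: start from the alphabet, output the index of the longest known
    prefix and add that prefix + next character to the dictionary."""
    dictionary = {ch: i for i, ch in enumerate(sorted(alphabet))}
    out, s = [], ""
    for ch in data:
        if s + ch in dictionary:
            s += ch
        else:
            out.append(dictionary[s])
            dictionary[s + ch] = len(dictionary)
            s = ch
    if s:
        out.append(dictionary[s])
    return out

def lzw_decode(codes, alphabet):
    dictionary = {i: ch for i, ch in enumerate(sorted(alphabet))}
    prev = dictionary[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:                       # the encoder is one step ahead:
            entry = prev + prev[0]  # use the last string plus its first character
        out.append(entry)
        dictionary[len(dictionary)] = prev + entry[0]
        prev = entry
    return "".join(out)

codes = lzw_encode("abababab", "ab")
print(codes, lzw_decode(codes, "ab") == "abababab")
```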
### Lossless Image Compression

- digital images
  - raster (bitmap)
  - vector
- color spaces
  - RGB
  - CMYK
  - HSV
  - YUV – for backwards compatibility of the TV broadcast
- representation of bitmap images
  - monochrome
  - shades of gray
  - color
    - pixel = $(r, g, b)$
    - alpha channel
  - color palette
    - conversion table (pixel = pointer into the table)
- GIF
  - Graphics Interchange Format
  - originally for image transmission over phone lines
  - sharp-edged line art with a limited number of colors
  - color palette, max. 256 colors
  - more images in one file
  - interlacing scheme for transmission
    - a rough image after 25–50 % of the data
    - we can stop the transmission if it's not the image we wanted
  - LZW-based compression
    - actually, it uses LZC
    - pointers … binary code with increasing length
    - blocks … up to 255 B
  - we can increase the number of possible colors by stacking multiple images on top of one another (and using transparent pixels)
- PNG
  - Portable Network Graphics
  - motivation: replace GIF with a format based on a non-patented method (LZW was patented by Sperry Corporation / Unisys)
  - color space
    - gray scale
    - true color
    - palette
    - alpha channel
    - no support for CMYK
  - compression has two phases
    - preprocessing
    - dictionary compression
  - interlacing schemes

### Burrows-Wheeler Transform

- BSTW
  - *move to front* heuristic to restructure the dictionary
- Burrows-Wheeler transform (a sketch follows this list)
  - we are looking for a permutation of the string $x$ such that similar symbols are close to each other
  - matrix of cyclic shifts of $x$
  - sort the rows lexicographically
  - the desired permutation is the last column
  - also, we need to store the number of the row where the original string lies in the matrix to be able to restore the original
  - if we used the first column, it would not be decodable
- encoding
  - MTF (move to front) heuristic
    - for each symbol, its code is the number of symbols preceding it in the dictionary
    - then, move the symbol to the front of the dictionary
    - therefore, the number we use to encode the symbol matches the time that elapsed since the last encoding of that symbol (if it was encoded previously)
  - RLE … run length encoding
    - instead of $n$ zeroes, we can just encode the zero and the length of the block
- decoding
  - how to obtain the original string from the last column and the row number?
  - we sort the last column to get the first one
  - we can determine the penultimate symbols from the pairs of first and last symbols – thanks to the cyclic shift
  - what if there are multiple rows that end with the same symbol? we use the fact that the rows were sorted lexicographically
- bzip2
- the $n\times n$ matrix is not necessary
  - we can use a suffix array
- why does the BW transform work?
  - two occurrences of the same letter often share a very similar right context
  - therefore they will be close to each other in the rightmost column
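A quadratic sketch of the forward transform and of the MTF heuristic; real implementations (e.g. bzip2) use a suffix array instead of materializing all rotations.

```python
def bwt(x):
    """Burrows-Wheeler transform: sort all cyclic shifts of x, output the last
    column and the row index of the original string."""
    n = len(x)
    rows = sorted(range(n), key=lambda i: x[i:] + x[:i])  # shifts by start index
    last = "".join(x[(i - 1) % n] for i in rows)           # last symbol of each shift
    return last, rows.index(0)

def mtf(sequence, alphabet):
    """Move-to-front: encode each symbol as its current position in the list."""
    table = sorted(alphabet)
    out = []
    for ch in sequence:
        i = table.index(ch)
        out.append(i)
        table.pop(i)
        table.insert(0, ch)
    return out

last, row = bwt("abracadabra")
print(last, row)                 # prints: rdarcaaaabb 2
print(mtf(last, set(last)))      # repeats of a symbol become 0s, ready for RLE
```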
## Lossy Compression

- no standard corpora exist
- how to measure the loss of information
  - MSE
  - signal to noise ratio
  - peak signal to noise ratio

### Digitizing Sound

- sound is a physical disturbance in a medium, it propagates as a pressure wave
- Fourier theorem
- capturing audio
  - sampling
    - Nyquist-Shannon sampling theorem
      - the sampling frequency should be at least twice the highest frequency we want to record
    - aliasing = distortion
      - example: the stagecoach wheels in a movie appear to spin backwards
  - quantization
    - we have $B$ bits available
    - we split the whole interval into $2^B$ subintervals
    - signal to noise ratio
- audio processing
  - amplifier
  - band-pass filter
  - sampler … sampling
  - A/D converter … quantization
  - RAM
- quantization
  - large set → smaller set
  - a quantizer can be scalar or vector
  - quantization consists of two mappings – encoding and decoding
  - central limit theorem
  - uniform quantizer (a sketch follows this list)
    - midrise
    - midtread
  - forward adaptive quantizer (offline)
    - split the input into blocks
    - analyze each block separately, set the parameters accordingly
    - the parameters are transmitted as side information
  - backward adaptive quantizer (online)
    - no side info necessary
    - Jayant quantizer
      - multipliers
      - intervals can shrink and expand depending on the input
  - nonuniform quantization
    - $y_j$ … centroid (see the slides)
    - Lloyd-Max algorithm
    - companding
      - compressing + expanding
      - first we stretch the high probability regions close to the origin & compress the regions with low probability
      - then apply uniform quantization
      - expand afterwards (transform quantized values back to the original scale)
      - $\mu$-law
      - A-law
      - G.711
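A sketch of the two uniform quantizers, assuming a fixed step size; the step and the sample values are made up.

```python
def quantize_uniform(x, step, midtread=False):
    """Uniform scalar quantizer with step size `step`.
    midrise: reconstruction levels at odd multiples of step/2 (no zero level);
    midtread: levels at integer multiples of step (zero is a level)."""
    if midtread:
        return round(x / step) * step
    index = int(x / step)          # truncate toward zero
    sign = 1 if x >= 0 else -1
    return (index + 0.5 * sign) * step

samples = [-0.8, -0.1, 0.0, 0.3, 0.74]
print([quantize_uniform(s, 0.25) for s in samples])                 # midrise
print([quantize_uniform(s, 0.25, midtread=True) for s in samples])  # midtread
```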
### Vector Quantization

- vector quantization
  - we split the file into vectors
  - codebook
    - the set of code-vectors chosen to represent the source vectors
    - each one has a binary index
  - encoder
    - finds the code-vector closest to the source vector
    - writes its index
  - decoder
    - based on the index, it finds the corresponding code-vector
  - how to measure the distances between code-vectors and source vectors?
    - Euclidean distance
      - but we can omit the square root as we only compare the distances
- k-means
- LBG algorithm
  - initialization
    - perturbation
    - random initialization
    - pairwise nearest neighbor
- tree-structured vector quantization
- lattice vector quantizers
  - D lattices
    - $D_2$ lattice (in two-dimensional space)
      - $a\cdot v_1+b\cdot v_2=a(0,2)+b(1,1)=(b,2a+b)$
      - the sum of the coordinates is always even
  - A lattices
  - G.719
- classified vector quantization
  - idea: divide the source vectors into classes with different properties, perform quantization separately
  - example: edge/non-edge regions (in an image)
- applications
  - audio: G.719, CELP, Vorbis, TwinVQ
  - video: QuickTime, VQA

### Differential Encoding

- differential encoding
  - idea: consecutive samples are quite similar (that holds especially for speech and images)
  - if we quantize the differences directly, the error will accumulate
  - instead of that, we will quantize the difference between the current value and the **predicted** value (see the DPCM sketch after this list)
- DPCM
  - Wiener-Hopf equations
- ADPCM
  - forward adaptive prediction … we need to transfer the coefficients
  - backward adaptive prediction … no need to transfer additional info
- …
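A DPCM sketch with the simplest predictor (the previous reconstructed sample) and a uniform quantizer, illustrating why the encoder must track the decoder's reconstruction; the step size and samples are made up.

```python
def dpcm_encode(samples, step):
    """The encoder quantizes the prediction error and updates its own copy of
    the decoder's reconstruction, so the quantization error does not accumulate."""
    reconstructed = 0.0
    codes = []
    for x in samples:
        diff = x - reconstructed          # prediction error
        q = round(diff / step)            # quantized difference (the code)
        codes.append(q)
        reconstructed += q * step         # what the decoder will reconstruct
    return codes

def dpcm_decode(codes, step):
    reconstructed, out = 0.0, []
    for q in codes:
        reconstructed += q * step
        out.append(reconstructed)
    return out

samples = [0.2, 0.4, 0.5, 0.6, 0.3, 0.1]
codes = dpcm_encode(samples, step=0.1)
print(codes)                        # small integers around zero
print(dpcm_decode(codes, step=0.1)) # approximately recovers the samples
```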
### Transform Coding

- transform coding
  - for images, we have to make a two-dimensional transform
    - it is better to transform each dimension separately
  - orthogonal transform … we use an orthogonal transform matrix
  - gain $G_T$ of a transform $T$
- Karhunen-Loève transform
- discrete cosine transform (no need to memorize the formulae)
  - DCT is the cosine component of DFT
  - redistributes the information so that we can quantize it more easily
- discrete sine transform
- Hadamard matrix
  - discrete Walsh-Hadamard transform
- quantization and coding
- JPEG
  - …
- WebP
  - color space: RGB → YCbCr
  - chroma subsampling 4:2:0
  - split into macroblocks 16×16 Y, 8×8 Cb, 8×8 Cr
    - high detail areas: 4×4
  - macroblock prediction
  - transform: DCT, DWHT
  - quantization
  - coding

### Subband Coding

- example
  - rapid sample-to-sample variation
  - but the long-term trend varies slowly
  - idea: let's average the samples using a sliding window
- …

### Video Compression

- video … time sequence of images (frames)
- can we reduce the problem to image compression?
  - no!
  - if we change the average intensity of pixels, it becomes noticeable in a video
  - on the other hand, we don't care about the reproduction of sharp edges that much
- basic types of algorithms
  - two-way communication → symmetric compression / decompression, it needs to be fast, videotelephony etc.
  - one-way → used for storage, the compression can be more complex
- video signal representation
  - CRT
  - composite signal – YCbCr
  - Gamma correction – to make colors look more natural
- MPEG-1
  - we need to follow the Nyquist-Shannon rule to prevent aliasing
    - a horizontal filter removes the frequencies that break the rule
- H.261
  - videotelephony, videoconferencing
  - why do we have to do inverse quantization and an inverse transform when encoding? to prevent loss, we need the encoder's prediction to be based on the information available to the decoder (see differential encoding)
  - motion compensation
    - which parts of the image are moving?
    - for each block of the frame, we are going to find the most similar block of the previous frame (see the sketch below)
    - if there's no similar previous block, we just encode the current block
    - we can reduce the search space
    - solution: macroblocks
  - the loop filter
    - sharp edges in the block used for prediction have some negative consequences
    - we use a smoothing filter
  - transform
    - discrete cosine transform of the blocks
    - the coder simulates the decoder
  - quantization
  - coding
  - rate control
- MPEG
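A full-search block-matching sketch using the sum of absolute differences; the block size, search radius, and frame representation are assumptions for illustration, not taken from any particular standard.

```python
def best_match(prev, cur, bx, by, block=8, radius=4):
    """Find the motion vector (dx, dy) minimizing the sum of absolute differences
    (SAD) between the block of the current frame at column bx / row by and a
    displaced block of the previous frame. Frames are 2D lists of intensities;
    the block at (bx, by) is assumed to lie inside the current frame."""
    h, w = len(prev), len(prev[0])
    best, best_sad = (0, 0), float("inf")
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            sx, sy = bx + dx, by + dy
            if sx < 0 or sy < 0 or sx + block > w or sy + block > h:
                continue  # candidate block would fall outside the previous frame
            sad = sum(
                abs(prev[sy + r][sx + c] - cur[by + r][bx + c])
                for r in range(block)
                for c in range(block)
            )
            if sad < best_sad:
                best_sad, best = sad, (dx, dy)
    return best, best_sad

# usage sketch: prev and cur are greyscale frames of the same size
# (dx, dy), sad = best_match(prev, cur, bx=16, by=8)
```

The encoder would then transmit the motion vector and encode only the (transformed and quantized) difference between the current block and the displaced previous block.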