Return to search

USING SUFFIX ARRAYS FOR LEMPEL-ZIV DATA COMPRESSION

<p>In the 1970s, Abraham Lempel and Jacob Ziv developed the first dictionary-based compression methods (LZ77 and LZ78). Their ideas have been a wellspring of inspiration to many researchers , who generalized, improved and combined them with run-length encoding (RLE) and statistical methods to form many commonly used lossless compression methods for text, image and sounds.</p> <p>The proposed methods factor a string x into substrings (factors) in such a way as to facilitate encoding the string into a compressed form (lossless text compression). This LZ factorization , as it is commonly called today, became a fundamental data structure in string processing, especially valuable for string compression. Recently, it found applications in computing various "regularities" in strings.</p> <p>The main principle of LZ methods is to use a portion of the previously seen input string as the dictionary. LZ77 and LZ78 encoders differ in two aspects. The first aspect is that LZ77 uses a sliding window unlike LZ78 which uses the entire string for building the dictionary. The use of a sliding window in LZ77 makes its decoder much simpler and faster than the LZ78 decoder. This implies that LZ77 is valuable in cases where a file is compressed once (or just a few times) and is decompressed often. A rarely used archive of compressed files is a superb example. The other aspect is the format of the codewords. LZ77 codewords consist of three parts: position, length and first non-matching symbol, while LZ78 removes the need for the length of the match in the codeword since it is implied in the dictionary.</p> <p>A whole family of algorithms has stemmed out of the original LZ algorithms (LZ77 and LZ78). This was a result of an effort to improve upon the LZ encoding algorithm in terms of speed and compression ratio. Some of these variants involved the use of sophisticated data structures (e.g. suffix trees, binary search trees, etc) to hold the dictionary in order to boost the search time. The problem with such data structures is that the amount of memory required is variable and cannot be known in advance. Furthermore, some of these data structures require a substantial amount of memory. LZ is the basis of the gzip (Unix) , winzip and pkzip compression techniques.</p> <p>In the testing for [1], we scaled up an LZSS implementation due to Haruhiko Okumura [37] so as to be useful for regularities (N = n , the length of the whole input string, and F equal to the full length of the unfactored suffix). We found that the binary tree approach becomes uncompetitive with algorithms that use the suffix array (SA) approach for the LZ factorization of the whole string. This observation triggered us to scale down the SA approach.</p> <p>The main contribution of this thesis is a novel LZ77 variant (LZAS) that uses a suffix array (SA) to perform the searches. The SA is augmented with a very simple and efficient algorithm that alternates between searching left and right in SA to find the longest match. Suffix arrays have gained the attention of researchers in recent years due to their simplicity and low memory requirements. They solve the sub-string problem as efficiently as suffix trees, using less memory. One notable advantage of using SA in an LZ encoder is that the amount of memory is independent of the text to be searched and can be defined a priori. The low and predictable memory requirement of this approach makes it suitable for memory-critical applications such as embedded systems. Moreover, our experiments show that the processing time per letter is almost stable and hence we can predict the processing time for a file given its size. Our proposed algorithm can additionally be used for forward/ backward substring search.</p> <p>In this thesis we investigate three variants of the LZAS algorithm. The first two of these variants (i.e. LZAS1 and LZAS2) use a dynamic suffix array DSA. DSA is a suffix array that can be updated whenever a letter or a factor is edited (i.e. deleted/ inserted/substituted by another letter or factor) in the original string. The suffix array of a sliding window changes whenever the window slides. Hence, we use the DSA to make sure that t he suffix array is up to date. T he DSA can be compressed using a sampling technique; therefore we decided to experiment with both sampled and non-sampled DSA. The third variant (i. e. LZAS3) re-computes the suffix array instead of updating it. We use an implementation of a suffix array construction algorithm (SACA) that requires supralinear time [28] but performs well in practice. We tested these variants against each other in terms of time and space. We further experimented with various window sizes and noticed that re-computing SA becomes better than updating it using DSA when the window size is small (i .e. hundreds of bytes compared to thousands of bytes).</p> / Master of Science (MS)

Identiferoai:union.ndltd.org:mcmaster.ca/oai:macsphere.mcmaster.ca:11375/8934
Date08 1900
CreatorsAL-HAFIDH, ANISA
ContributorsSmyth, William F., Computing and Software
Source SetsMcMaster University
Detected LanguageEnglish
Typethesis

Page generated in 0.0078 seconds