Browsing by Author "Kärkkäinen, Juha"

Now showing 1 - 3 of 3

Block Trees
(ELSEVIER, 2021-05) Belazzougui, Djamal; Cáceres, Manuel; Gagie, Travis; Gawrychowski, Paweł; Kärkkäinen, Juha; Navarro, Gonzalo; Ordóñez, Alberto; Puglisi, Simon J.; Tabei, Yasuo
Let string S[1..n] be parsed into z phrases by the Lempel-Ziv algorithm. The corresponding compression algorithm encodes S in 𝒪(z) space, but it does not support random access to S. We introduce a data structure, the block tree, that represents S in 𝒪(z log(n/z)) space and extracts any symbol of T in time 𝒪(log(n/z)), among other space-time tradeoffs. By multiplying the space by the alphabet size, we also support rank and select queries, which are useful for building compressed data structures on top of S. Further, block trees can be built in a scalable manner. Our experiments show that block trees offer relevant space-time tradeoffs compared to other compressed string representations for highly repetitive strings.
Lempel-Ziv Decoding in External Memory
(Springer International Publishing, 2016-06-01) Belazzougui, Djamal; Kärkkäinen, Juha; Kempa, Dominik; Puglisi, Simon J.
Simple and fast decoding is one of the main advantages of LZ77-type text encoding used in many popular file compressors such as gzip and 7zip. With the recent introduction of external memory algorithms for Lempel–Ziv factorization there is a need for external memory LZ77 decoding but the standard algorithm makes random accesses to the text and cannot be trivially modified for external memory computation. We describe the first external memory algorithms for LZ77 decoding, prove that their I/O complexity is optimal, and demonstrate that they are very fast in practice, only about three times slower than in-memory decoding (when reading input and writing output is included in the time).
Linear-time string indexing and analysis in small space
(Association for Computing Machinery, 2020-03-09) Belazzougui, Djamal; Cunial, Fabio; Kärkkäinen, Juha; Mäkinen, Veli
The field of succinct data structures has flourished over the last 16 years. Starting from the compressed suffix array by Grossi and Vitter (STOC 2000) and the FM-index by Ferragina and Manzini (FOCS 2000), a number of generalizations and applications of string indexes based on the Burrows-Wheeler transform (BWT) have been developed, all taking an amount of space that is close to the input size in bits. In many large-scale applications, the construction of the index and its usage need to be considered as one unit of computation. For example, one can compare two genomes by building a common index for their concatenation, and by detecting common substructures by querying the index. Efficient string indexing and analysis in small space lies also at the core of a number of primitives in the data-intensive field of high-throughput DNA sequencing. We report the following advances in string indexing and analysis. We show that the BWT of a string T ∈ {1, . . . ,σ }^n can be built in deterministic O (n) time using just O (n log σ ) bits of space, where σ ≤ n. Deterministic linear time is achieved by exploiting a new partial rank data structure that supports queries in constant time, and that might have independent interest. Within the same time and space budget, we can build an index based on the BWT that allows one to enumerate all the internal nodes of the suffix tree of T . Many fundamental string analysis problems, such as maximal repeats, maximal unique matches, and string kernels, can be mapped to such enumeration, and can thus be solved in deterministic O (n) time and in O (n log σ ) bits of space from the input string, by tailoring the enumeration algorithm to some problem-specific computations. We also show how to build many of the existing indexes based on the BWT, such as the compressed suffix array, the compressed suffix tree, and the bidirectional BWT index, in randomized O (n) time and in O (n log σ ) bits of space. The previously fastest construction algorithms for BWT, compressed suffix array and compressed suffix tree, which used O (n log σ ) bits of space, took O (n log log σ ) time for the first two structures, and O (n log^ϵ n) time for the third, where ϵ is any positive constant smaller than one. Alternatively, the BWT could be previously built in linear time if one was willing to spend O (n log σ log log_σ n) bits of space. Contrary to the state of the art, our bidirectional BWT index supports every operation in constant time per element in its output.