Browsing by Author "Puglisi, Simon J."

Now showing 1 - 6 of 6

Bidirectional Variable-Order de Bruijn Graphs
(World Scientific Publishing, 2018-12) Belazzougui, Djamal; Gagie, Travis; Mäkinen, Veli; Previtali, Marco; Puglisi, Simon J.
Compressed suffix trees and bidirectional FM-indexes can store a set of strings and support queries that let us explore the set of substrings they contain, adding and deleting characters on both the left and right, but they can use much more space than a de Bruijn graph for the strings. Bowe et al.’s BWT-based de Bruijn graph representation (Proc. Workshop on Algorithms for Bioinformatics, pp. 225–235, 2012) can be made bidirectional as well, at the cost of increasing its space usage by a small constant, but it fixes the length of the substrings. Boucher et al. (Proc. Data Compression Conference, pp. 383–392, 2015) generalized Bowe et al.’s representation to support queries about variable-length substrings, but at the cost of bidirectionality. In this paper we show how to make Boucher et al.’s variable-order implementation of de Bruijn graphs bidirectional.
Bidirectional Variable-Order de Bruijn Graphs
(Springer International Publishing, 2016-03-22) Belazzougui, Djamal; Gagie, Travis; Mäkinen, Veli; Previtali, Marco; Puglisi, Simon J.
Implementing de Bruijn graphs compactly is an important problem because of their role in genome assembly. There are currently two main approaches, one using Bloom filters and the other using a kind of Burrows-Wheeler Transform on the edge labels of the graph. The second representation is more elegant and can even handle many graph-orders at once, but it does not cleanly support traversing edges backwards or inserting new nodes or edges. In this paper we resolve the first of these issues and partially address the second.
Block Trees
(ELSEVIER, 2021-05) Belazzougui, Djamal; Cáceres, Manuel; Gagie, Travis; Gawrychowski, Paweł; Kärkkäinen, Juha; Navarro, Gonzalo; Ordóñez, Alberto; Puglisi, Simon J.; Tabei, Yasuo
Let string S[1..n] be parsed into z phrases by the Lempel-Ziv algorithm. The corresponding compression algorithm encodes S in 𝒪(z) space, but it does not support random access to S. We introduce a data structure, the block tree, that represents S in 𝒪(z log(n/z)) space and extracts any symbol of T in time 𝒪(log(n/z)), among other space-time tradeoffs. By multiplying the space by the alphabet size, we also support rank and select queries, which are useful for building compressed data structures on top of S. Further, block trees can be built in a scalable manner. Our experiments show that block trees offer relevant space-time tradeoffs compared to other compressed string representations for highly repetitive strings.
Computing the Antiperiod(s) of a String
(Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik, 2019-06-18) Alamro, Hayam; Badkobeh, Golnaz; Belazzougui, Djamal; Iliopoulos, Costas S.; Puglisi, Simon J.
A string S[1, n] is a power (or repetition or tandem repeat) of order k and period n/k, if it can be decomposed into k consecutive identical blocks of length n/k. Powers and periods are fundamental structures in the study of strings and algorithms to compute them efficiently have been widely studied. Recently, Fici et al. (Proc. ICALP 2016) introduced an antipower of order k to be a string composed of k distinct blocks of the same length, n/k, called the antiperiod. An arbitrary string will have antiperiod t if it is prefix of an antipower with antiperiod t. In this paper, we describe efficient algorithm for computing the smallest antiperiod of a string S of length n in O(n) time. We also describe an algorithm to compute all the antiperiods of S that runs in O(n log n) time.
Lempel-Ziv Decoding in External Memory
(Springer International Publishing, 2016-06-01) Belazzougui, Djamal; Kärkkäinen, Juha; Kempa, Dominik; Puglisi, Simon J.
Simple and fast decoding is one of the main advantages of LZ77-type text encoding used in many popular file compressors such as gzip and 7zip. With the recent introduction of external memory algorithms for Lempel–Ziv factorization there is a need for external memory LZ77 decoding but the standard algorithm makes random accesses to the text and cannot be trivially modified for external memory computation. We describe the first external memory algorithms for LZ77 decoding, prove that their I/O complexity is optimal, and demonstrate that they are very fast in practice, only about three times slower than in-memory decoding (when reading input and writing output is included in the time).
Weighted Ancestors in Suffix Trees Revisited
(Schloss Dagstuhl - Leibniz-Zentrum für Informatik 2021, 2021-06-30) Belazzougui, Djamal; Kosolobov, Dmitry; Puglisi, Simon J.; Raman, Rajeev
The weighted ancestor problem is a well-known generalization of the predecessor problem to trees. It is known to require Ω(log log n) time for queries provided 𝒪(n polylog n) space is available and weights are from [0..n], where n is the number of tree nodes. However, when applied to suffix trees, the problem, surprisingly, admits an 𝒪(n)-space solution with constant query time, as was shown by Gawrychowski, Lewenstein, and Nicholson (Proc. ESA 2014). This variant of the problem can be reformulated as follows: given the suffix tree of a string s, we need a data structure that can locate in the tree any substring s[p..q] of s in 𝒪(1) time (as if one descended from the root reading s[p..q] along the way). Unfortunately, the data structure of Gawrychowski et al. has no efficient construction algorithm, limiting its wider usage as an algorithmic tool. In this paper we resolve this issue, describing a data structure for weighted ancestors in suffix trees with constant query time and a linear construction algorithm. Our solution is based on a novel approach using so-called irreducible LCP values.