CERIST DIGITAL LIBRARYLe système de dépôt numérique DL collecte, emmagasine, indexe, archive, et diffuse du matériel de recherche en format numérique.http://www.cerist.dz:802017-10-05T15:07:43Z2017-10-05T15:07:43ZFast Label Extraction in the CDAWGBelazzougui, DjamalCunial, Fabiohttp://dl.cerist.dz/handle/CERIST/8982017-10-05T10:36:40Z2017-09-06T00:00:00ZFast Label Extraction in the CDAWG
Belazzougui, Djamal; Cunial, Fabio
The compact directed acyclic word graph (CDAWG) of a string T of length n takes space proportional just to the number e of right extensions of the maximal repeats of T, and it is thus an appealing index for highly repetitive datasets, like collections of genomes from similar species, in which e grows significantly more slowly than n. We reduce from O(m log log n) to O(m) the time needed to count the number of occurrences of a pattern of length m, using an existing data structure that takes an amount of space proportional to the size of the CDAWG. This implies a reduction from O(m log log n+occ) to O(m+occ) in the time needed to locate all the occocc occurrences of the pattern. We also reduce from O(k log log n) to O(k) the time needed to read the k characters of the label of an edge of the suffix tree of T, and we reduce from O(m log log n) to O(m) the time needed to compute the matching statistics between a query of length m and T, using an existing representation of the suffix tree based on the CDAWG. All such improvements derive from extracting the label of a vertex or of an arc of the CDAWG using a straight-line program induced by the reversed CDAWG.
2017-09-06T00:00:00ZA Framework for Space-Efficient String KernelsBelazzougui, DjamalCunial, Fabiohttp://dl.cerist.dz/handle/CERIST/8972017-10-05T10:36:40Z2017-02-17T00:00:00ZA Framework for Space-Efficient String Kernels
Belazzougui, Djamal; Cunial, Fabio
String kernels are typically used to compare genome-scale sequences whose length makes alignment impractical, yet their computation is based on data structures that are either space-inefficient, or incur large slowdowns. We show that a number of exact kernels on pairs of strings of total length n, like the k-mer kernel, the substrings kernels, a number of length-weighted kernels, the minimal absent words kernel, and kernels with Markovian corrections, can all be computed in O(nd) time and in o(n) bits of space in addition to the input, using just a rangeDistinct data structure on the Burrows–Wheeler transform of the input strings that takes O(d) time per element in its output. The same bounds hold for a number of measures of compositional complexity based on multiple values of k, like the k-mer profile and the k-th order empirical entropy, and for calibrating the value of k using the data. All such algorithms become O(n) using a suitable implementation of the rangeDistinct data structure, and by concatenating them to a suitable BWT construction algorithm, we can compute all the mentioned kernels and complexity measures, directly from the input strings, in O(n) time and in O(n log σ) bits of space in addition to the input, where σ is the size of the alphabet. Using similar data structures, we also show how to build a compact representation of the variable-length Markov chain of a string T of length n, that takes just 3n log σ+o(n log σ) bits of space, and that can be learnt in randomized O(n) time using O(n log σ) bits of space in addition to the input. Such model can then be used to assign a probability to a query string S of length m in O(m) time and in 2m+o(m) bits of additional space, thus providing an alternative, compositional measure of the similarity between S and T that does not require alignment.
2017-02-17T00:00:00ZFlexible Indexing of Repetitive CollectionsBelazzougui, DjamalCunial, FabioGagie, TravisPrezza, NicolaRaffinot, Mathieuhttp://dl.cerist.dz/handle/CERIST/8962017-09-27T10:14:00Z2017-06-07T00:00:00ZFlexible Indexing of Repetitive Collections
Belazzougui, Djamal; Cunial, Fabio; Gagie, Travis; Prezza, Nicola; Raffinot, Mathieu
Highly repetitive strings are increasingly being amassed by
genome sequencing experiments, and by versioned archives of source code
and webpages. We describe practical data structures that support count-
ing and locating all the exact occurrences of a pattern in a repetitive
text, by combining the run-length encoded Burrows-Wheeler transform
(RLBWT) with the boundaries of Lempel-Ziv 77 factors. One such vari-
ant uses an amount of space comparable to LZ77 indexes, but it answers
count queries between two and four orders of magnitude faster than
all LZ77 and hybrid index implementations, at the cost of slower lo-
cate queries. Combining the RLBWT with the compact directed acyclic
word graph answers locate queries for short patterns between four and
ten times faster than a version of the run-length compressed suffix ar-
ray (RLCSA) that uses comparable memory, and with very short pat-
terns our index achieves speedups even greater than ten with respect to
RLCSA.
2017-06-07T00:00:00ZRepresenting the Suffix Tree with the CDAWGBelazzougui, DjamalCunial, Fabiohttp://dl.cerist.dz/handle/CERIST/8952017-09-27T10:14:00Z2017-07-04T00:00:00ZRepresenting the Suffix Tree with the CDAWG
Belazzougui, Djamal; Cunial, Fabio
Given a string T, it is known that its suffix tree can be represented using the compact directed acyclic word graph (CDAWG) with e_T arcs, taking overall O(e_T+e_REV(T)) words of space, where REV(T) is the reverse of T, and supporting some key operations in time between O(1) and O(log(log(n))) in the worst case. This representation is especially appealing for highly repetitive strings, like collections of similar genomes or of version-controlled documents, in which e_T grows sublinearly in the length of T in practice. In this paper we augment such representation, supporting a number of additional queries in worst-case time between O(1) and O(log(n)) in the RAM model, without increasing space complexity asymptotically. Our technique, based on a heavy path decomposition of the suffix tree, enables also a representation of the suffix array, of the inverse suffix array, and of T itself, that takes O(e_T) words of space, and that supports random access in O(log(n)) time. Furthermore, we establish a connection between the reversed CDAWG of T and a context-free grammar that produces T and only T, which might have independent interest.
2017-07-04T00:00:00Z