Flexible Indexing of Repetitive Collections
Loading...
Date
2017-06-07
Journal Title
Journal ISSN
Volume Title
Publisher
Springer
Abstract
Highly repetitive strings are increasingly being amassed by
genome sequencing experiments, and by versioned archives of source code
and webpages. We describe practical data structures that support count-
ing and locating all the exact occurrences of a pattern in a repetitive
text, by combining the run-length encoded Burrows-Wheeler transform
(RLBWT) with the boundaries of Lempel-Ziv 77 factors. One such vari-
ant uses an amount of space comparable to LZ77 indexes, but it answers
count queries between two and four orders of magnitude faster than
all LZ77 and hybrid index implementations, at the cost of slower lo-
cate queries. Combining the RLBWT with the compact directed acyclic
word graph answers locate queries for short patterns between four and
ten times faster than a version of the run-length compressed suffix ar-
ray (RLCSA) that uses comparable memory, and with very short pat-
terns our index achieves speedups even greater than ten with respect to
RLCSA.
Description
Keywords
repetitive string indexing, run-length encoded Burrows-Wheeler transform, Lempel-Ziv 77 compression, CDAWG, RLCSA