A Framework for Space-Efficient String Kernels

Belazzougui, Djamal; Cunial, Fabio

dc.contributor.author	Belazzougui, Djamal
dc.contributor.author	Cunial, Fabio
dc.date.accessioned	2017-09-20T21:35:15Z
dc.date.available	2017-09-20T21:35:15Z
dc.date.issued	2017-02-17
dc.description.abstract	String kernels are typically used to compare genome-scale sequences whose length makes alignment impractical, yet their computation is based on data structures that are either space-inefficient, or incur large slowdowns. We show that a number of exact kernels on pairs of strings of total length n, like the k-mer kernel, the substrings kernels, a number of length-weighted kernels, the minimal absent words kernel, and kernels with Markovian corrections, can all be computed in O(nd) time and in o(n) bits of space in addition to the input, using just a rangeDistinct data structure on the Burrows–Wheeler transform of the input strings that takes O(d) time per element in its output. The same bounds hold for a number of measures of compositional complexity based on multiple values of k, like the k-mer profile and the k-th order empirical entropy, and for calibrating the value of k using the data. All such algorithms become O(n) using a suitable implementation of the rangeDistinct data structure, and by concatenating them to a suitable BWT construction algorithm, we can compute all the mentioned kernels and complexity measures, directly from the input strings, in O(n) time and in O(n log σ) bits of space in addition to the input, where σ is the size of the alphabet. Using similar data structures, we also show how to build a compact representation of the variable-length Markov chain of a string T of length n, that takes just 3n log σ + o(n log σ) bits of space, and that can be learnt in randomized O(n) time using O(n log σ) bits of space in addition to the input. Such a model can then be used to assign a probability to a query string S of length m in O(m) time and in 2m + o(m) bits of additional space, thus providing an alternative, compositional measure of the similarity between S and T that does not require alignment.	fr_FR
dc.identifier.issn	0178-4617	fr_FR
dc.identifier.issn	1432-0541	fr_FR
dc.identifier.uri	http://dl.cerist.dz/handle/CERIST/897
dc.publisher	Springer	fr_FR
dc.relation.ispartofseries	Algorithmica;Volume 79, Issue 3
dc.relation.pages	857–883	fr_FR
dc.structure	Calcul pervasif et mobile (Pervasive and Mobile Computing group)	fr_FR
dc.subject	Substring kernel	fr_FR
dc.subject	Substring complexity	fr_FR
dc.subject	Burrows–Wheeler transform	fr_FR
dc.subject	Maximal repeat	fr_FR
dc.subject	Minimal absent word	fr_FR
dc.subject	Suffix-link tree	fr_FR
dc.subject	Probabilistic suffix tree	fr_FR
dc.subject	Variable-length Markov chain	fr_FR
dc.subject	Matching statistics	fr_FR
dc.title	A Framework for Space-Efficient String Kernels	fr_FR
dc.type	Article
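
For readers unfamiliar with the k-mer kernel named in the abstract, the following is a minimal Python sketch of what that kernel measures. It is a naive, hash-based baseline and not the paper's method: it uses space proportional to the number of distinct k-mers, whereas the paper computes the same quantity from the Burrows–Wheeler transform in o(n) bits of extra space. The function names kmer_counts and kmer_kernel are hypothetical, and the cosine normalization is one common convention assumed here for illustration.

```python
from collections import Counter
from math import sqrt


def kmer_counts(s: str, k: int) -> Counter:
    """Count every length-k substring of s (naive, hash-based)."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def kmer_kernel(s: str, t: str, k: int) -> float:
    """Cosine-normalized k-mer kernel (assumed normalization, for illustration)."""
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    # Inner product of the two k-mer count vectors.
    dot = sum(c * ct[w] for w, c in cs.items())
    # Product of the Euclidean norms of the two count vectors.
    norm = sqrt(sum(c * c for c in cs.values())) * sqrt(sum(c * c for c in ct.values()))
    # Strings shorter than k have no k-mers; define the kernel as 0 in that case.
    return dot / norm if norm else 0.0
```

Under these assumptions, two strings with identical k-mer composition score close to 1 and two strings sharing no k-mer score 0, without any alignment being computed.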

Collections

International Journal Papers
