Browsing by Author "Belazzougui, Djamal"

Now showing 1 - 20 of 26

A Concise Forwarding Information Base for Scalable and Fast Name Lookups
(IEEE, 2017-11-30) Yu, Ye; Belazzougui, Djamal; Qian, Chen; Zhang, Qin
Forwarding information base (FIB) scalability and its lookup speed are fundamental problems of numerous net- work technologies that uses location-independent network names. In this paper we present a new network algorithm, Othello Hashing, and its application of a FIB design called Concise, which uses very little memory to support ultra-fast lookups of network names. Othello Hashing and Concise make use of minimal perfect hashing and relies on the programmable network framework to support dynamic updates. Our conceptual contribution of Concise is to optimize the memory efficiency and query speed in the data plane and move the relatively complex construction and update components to the resource- rich control plane. We implemented Concise on three platforms. Experimental results show that Concise uses significantly smaller memory to achieve much faster query speed compared to existing solutions of network name lookups.
A framework for space-efficient read clustering in metagenomic samples
(BioMed Central, 2017-03-14) Alanko, Jarno; Cunial, Fabio; Belazzougui, Djamal; Mäkinen, Veli
Background: A metagenomic sample is a set of DNA fragments, randomly extracted from multiple cells in an environment, belonging to distinct, often unknown species. Unsupervised metagenomic clustering aims at partitioning a metagenomic sample into sets that approximate taxonomic units, without using reference genomes. Since samples are large and steadily growing, space-efficient clustering algorithms are strongly needed. Results: We design and implement a space-efficient algorithmic framework that solves a number of core primitives in unsupervised metagenomic clustering using just the bidirectional Burrows-Wheeler index and a union-find data structure on the set of reads. When run on a sample of total length n, with m reads of maximum length ℓ each, on an alphabet of total size σ, our algorithms take O(n(t+logσ)) time and just 2n+o(n)+O(max{ℓ σlogn,K logm}) bits of space in addition to the index and to the union-find data structure, where K is a measure of the redundancy of the sample and t is the query time of the union-find data structure. Conclusions: Our experimental results show that our algorithms are practical, they can exploit multiple cores by a parallel traversal of the suffix-link tree, and they are competitive both in space and in time with the state of the art.
A Framework for Space-Efficient String Kernels
(Springer, 2017-02-17) Belazzougui, Djamal; Cunial, Fabio
String kernels are typically used to compare genome-scale sequences whose length makes alignment impractical, yet their computation is based on data structures that are either space-inefficient, or incur large slowdowns. We show that a number of exact kernels on pairs of strings of total length n, like the k-mer kernel, the substrings kernels, a number of length-weighted kernels, the minimal absent words kernel, and kernels with Markovian corrections, can all be computed in O(nd) time and in o(n) bits of space in addition to the input, using just a rangeDistinct data structure on the Burrows–Wheeler transform of the input strings that takes O(d) time per element in its output. The same bounds hold for a number of measures of compositional complexity based on multiple values of k, like the k-mer profile and the k-th order empirical entropy, and for calibrating the value of k using the data. All such algorithms become O(n) using a suitable implementation of the rangeDistinct data structure, and by concatenating them to a suitable BWT construction algorithm, we can compute all the mentioned kernels and complexity measures, directly from the input strings, in O(n) time and in O(n log ⁡σ) bits of space in addition to the input, where σ is the size of the alphabet. Using similar data structures, we also show how to build a compact representation of the variable-length Markov chain of a string T of length n, that takes just 3n log ⁡σ+o(n log ⁡σ) bits of space, and that can be learnt in randomized O(n) time using O(n log ⁡σ) bits of space in addition to the input. Such model can then be used to assign a probability to a query string S of length m in O(m) time and in 2m+o(m) bits of additional space, thus providing an alternative, compositional measure of the similarity between S and T that does not require alignment.
Bidirectional Variable-Order de Bruijn Graphs
(World Scientific Publishing, 2018-12) Belazzougui, Djamal; Gagie, Travis; Mäkinen, Veli; Previtali, Marco; Puglisi, Simon J.
Compressed suffix trees and bidirectional FM-indexes can store a set of strings and support queries that let us explore the set of substrings they contain, adding and deleting characters on both the left and right, but they can use much more space than a de Bruijn graph for the strings. Bowe et al.’s BWT-based de Bruijn graph representation (Proc. Workshop on Algorithms for Bioinformatics, pp. 225–235, 2012) can be made bidirectional as well, at the cost of increasing its space usage by a small constant, but it fixes the length of the substrings. Boucher et al. (Proc. Data Compression Conference, pp. 383–392, 2015) generalized Bowe et al.’s representation to support queries about variable-length substrings, but at the cost of bidirectionality. In this paper we show how to make Boucher et al.’s variable-order implementation of de Bruijn graphs bidirectional.
Bidirectional Variable-Order de Bruijn Graphs
(Springer International Publishing, 2016-03-22) Belazzougui, Djamal; Gagie, Travis; Mäkinen, Veli; Previtali, Marco; Puglisi, Simon J.
Implementing de Bruijn graphs compactly is an important problem because of their role in genome assembly. There are currently two main approaches, one using Bloom filters and the other using a kind of Burrows-Wheeler Transform on the edge labels of the graph. The second representation is more elegant and can even handle many graph-orders at once, but it does not cleanly support traversing edges backwards or inserting new nodes or edges. In this paper we resolve the first of these issues and partially address the second.
Block Trees
(ELSEVIER, 2021-05) Belazzougui, Djamal; Cáceres, Manuel; Gagie, Travis; Gawrychowski, Paweł; Kärkkäinen, Juha; Navarro, Gonzalo; Ordóñez, Alberto; Puglisi, Simon J.; Tabei, Yasuo
Let string S[1..n] be parsed into z phrases by the Lempel-Ziv algorithm. The corresponding compression algorithm encodes S in 𝒪(z) space, but it does not support random access to S. We introduce a data structure, the block tree, that represents S in 𝒪(z log(n/z)) space and extracts any symbol of T in time 𝒪(log(n/z)), among other space-time tradeoffs. By multiplying the space by the alphabet size, we also support rank and select queries, which are useful for building compressed data structures on top of S. Further, block trees can be built in a scalable manner. Our experiments show that block trees offer relevant space-time tradeoffs compared to other compressed string representations for highly repetitive strings.
Computing the Antiperiod(s) of a String
(Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik, 2019-06-18) Alamro, Hayam; Badkobeh, Golnaz; Belazzougui, Djamal; Iliopoulos, Costas S.; Puglisi, Simon J.
A string S[1, n] is a power (or repetition or tandem repeat) of order k and period n/k, if it can be decomposed into k consecutive identical blocks of length n/k. Powers and periods are fundamental structures in the study of strings and algorithms to compute them efficiently have been widely studied. Recently, Fici et al. (Proc. ICALP 2016) introduced an antipower of order k to be a string composed of k distinct blocks of the same length, n/k, called the antiperiod. An arbitrary string will have antiperiod t if it is prefix of an antipower with antiperiod t. In this paper, we describe efficient algorithm for computing the smallest antiperiod of a string S of length n in O(n) time. We also describe an algorithm to compute all the antiperiods of S that runs in O(n log n) time.
DIAG a diagnostic web application based on lung CT Scan images and deep learning
(IOS Press Ebooks, 2021-05-29) Hadj Bouzid, Amel Imene; Yahiaoui, Saïd; Lounis, Anis; Berrani, Sid-Ahmed; Belbachir, Hacène; Naili, Qaid; Abdi, Mohamed El Hafedh; Bensalah, Kawthar; Belazzougui, Djamal
Coronavirus disease is a pandemic that has infected millions of people around the world. Lung CT-scans are effective diagnostic tools, but radiologists can quickly become overwhelmed by the flow of infected patients. Therefore, automated image interpretation needs to be achieved. Deep learning (DL) can support critical medical tasks including diagnostics, and DL algorithms have successfully been applied to the classification and detection of many diseases. This work aims to use deep learning methods that can classify patients between Covid-19 positive and healthy patient. We collected 4 available datasets, and tested our convolutional neural networks (CNNs) on different distributions to investigate the generalizability of our models. In order to clearly explain the predictions, Grad-CAM and Fast-CAM visualization methods were used. Our approach reaches more than 92% accuracy on 2 different distributions. In addition, we propose a computer aided diagnosis web application for Covid-19 diagnosis. The results suggest that our proposed deep learning tool can be integrated to the Covid-19 detection process and be useful for a rapid patient management.
Edit Distance: Sketching, Streaming, and Document Exchange
(IEEE, 2016-12-19) Belazzougui, Djamal; Zhang, Qin
We show that in the document exchange problem, where Alice holds x ∈ {0, 1}^n and Bob holds y ∈ {0, 1}^n , Alice can send Bob a message of size O(K(log^2 K + log n)) bits such that Bob can recover x using the message and his input y if the edit distance between x and y is no more than K, and output “error” otherwise. Both the encoding and decoding can be done in time O~(n+poly(K)). This result significantly improves on the previous communication bounds under polynomial encoding/decoding time. We also show that in the referee model, where Alice and Bob hold x and y respectively, they can compute sketches of x and y of sizes poly(K log n) bits (the encoding), and send to the referee, who can then compute the edit distance between x and y together with all the edit operations if the edit distance is no more than K, and output “error” otherwise (the decoding). To the best of our knowledge, this is the first result for sketching edit distance using poly(K log n) bits. Moreover, the encoding phase of our sketching algorithm can be performed by scanning the input string in one pass. Thus our sketching algorithm also implies the first streaming algorithm for computing edit distance and all the edits exactly using poly(K log n) bits of space.
Efficient tree-structured categorical retrieval
(Leibniz International Proceedings in Informatics (LIPIcs), 2020-06-09) Belazzougui, Djamal; Kucherov, Gregory
We study a document retrieval problem in the new framework where D text documents are organized in a category tree with a predefined number h of categories. This situation occurs e.g. with taxomonic trees in biology or subject classification systems for scientific literature. Given a string pattern p and a category (level in the category tree), we wish to efficiently retrieve the t categorical units containing this pattern and belonging to the category. We propose several efficient solutions for this problem. One of them uses n(log σ(1+o(1))+log D + O(h)) + O(∆) bits of space and O(|p| + t) query time, where n is the total length of the documents, σ the size of the alphabet used in the documents and ∆ is the total number of nodes in the category tree. Another solution uses n(log σ(1+o(1))+O(log D))+O(∆)+O(D log n) bits of space and O(|p| + t log D) query time. We finally propose other solutions which are more space-efficient at the expense of a slight increase in query time.
Fast Label Extraction in the CDAWG
(Springer, 2017-09-06) Belazzougui, Djamal; Cunial, Fabio
The compact directed acyclic word graph (CDAWG) of a string T of length n takes space proportional just to the number e of right extensions of the maximal repeats of T, and it is thus an appealing index for highly repetitive datasets, like collections of genomes from similar species, in which e grows significantly more slowly than n. We reduce from O(m log ⁡log⁡ n) to O(m) the time needed to count the number of occurrences of a pattern of length m, using an existing data structure that takes an amount of space proportional to the size of the CDAWG. This implies a reduction from O(m log log n+occ) to O(m+occ) in the time needed to locate all the occocc occurrences of the pattern. We also reduce from O(k log ⁡log ⁡n) to O(k) the time needed to read the k characters of the label of an edge of the suffix tree of T, and we reduce from O(m log ⁡log ⁡n) to O(m) the time needed to compute the matching statistics between a query of length m and T, using an existing representation of the suffix tree based on the CDAWG. All such improvements derive from extracting the label of a vertex or of an arc of the CDAWG using a straight-line program induced by the reversed CDAWG.
Fast matching statistics in small space
(Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik, 2018-06-27) Belazzougui, Djamal; Cunial, Fabio; Denas, Olgert
Computing the matching statistics of a string S with respect to a string T on an alphabet of size sigma is a fundamental primitive for a number of large-scale string analysis applications, including the comparison of entire genomes, for which space is a pressing issue. This paper takes from theory to practice an existing algorithm that uses just O(|T|log{sigma}) bits of space, and that computes a compact encoding of the matching statistics array in O(|S|log{sigma}) time. The techniques used to speed up the algorithm are of general interest, since they optimize queries on the existence of a Weiner link from a node of the suffix tree, and parent operations after unsuccessful Weiner links. Thus, they can be applied to other matching statistics algorithms, as well as to any suffix tree traversal that relies on such calls. Some of our optimizations yield a matching statistics implementation that is up to three times faster than a plain version of the algorithm, depending on the similarity between S and T. In genomic datasets of practical significance we achieve speedups of up to 1.8, but our fastest implementations take on average twice the time of an existing code based on the LCP array. The key advantage is that our implementations need between one half and one fifth of the competitor's memory, and they approach comparable running times when S and T are very similar.
Flexible Indexing of Repetitive Collections
(Springer, 2017-06-07) Belazzougui, Djamal; Cunial, Fabio; Gagie, Travis; Prezza, Nicola; Raffinot, Mathieu
Highly repetitive strings are increasingly being amassed by genome sequencing experiments, and by versioned archives of source code and webpages. We describe practical data structures that support count- ing and locating all the exact occurrences of a pattern in a repetitive text, by combining the run-length encoded Burrows-Wheeler transform (RLBWT) with the boundaries of Lempel-Ziv 77 factors. One such vari- ant uses an amount of space comparable to LZ77 indexes, but it answers count queries between two and four orders of magnitude faster than all LZ77 and hybrid index implementations, at the cost of slower lo- cate queries. Combining the RLBWT with the compact directed acyclic word graph answers locate queries for short patterns between four and ten times faster than a version of the run-length compressed suffix ar- ray (RLCSA) that uses comparable memory, and with very short pat- terns our index achieves speedups even greater than ten with respect to RLCSA.
A framework for space-efficient variable-order Markov models
(Oxford University Press, 2019-11-15) Cunial, Fabio; Alanko, Jarno; Belazzougui, Djamal
Motivation: Markov models with contexts of variable length are widely used in bioinformatics for representing sets of sequences with similar biological properties. When models contain many long contexts, existing implementations are either unable to handle genome-scale training datasets within typical memory budgets, or they are optimized for specific model variants and are thus inflexible. Results: We provide practical, versatile representations of variable-order Markov models and of interpolated Markov models, that support a large number of context-selection criteria, scoring functions, probability smoothing methods, and interpolations, and that take up to four times less space than previous implementations based on the suffix array, regardless of the number and length of contexts, and up to ten times less space than previous trie-based representations, or more, while matching the size of related, state-of-the-art data structures from Natural Language Processing. We describe how to further compress our indexes to a quantity related to the redundancy of the training data, saving up to 90% of their space on very repetitive datasets, and making them become up to sixty times smaller than previous implementations based on the suffix array. Finally, we show how to exploit constraints on the length and frequency of contexts to further shrink our compressed indexes to half of their size or more, achieving data structures that are a hundred times smaller than previous implementations based on the suffix array, or more. This allows variable-order Markov models to be used with bigger datasets and with longer contexts on the same hardware, thus possibly enabling new applications. Availability and implementation: https://github.com/jnalanko/VOMM
Fully Dynamic de Bruijn Graphs
(Springer International Publishing, 2016-09-21) Belazzougui, Djamal; Gagie, Travis; Mäkinen, Veli; Previtali, Marco
We present a space- and time-efficient fully dynamic implementation of de Bruijn graphs, which can also support fixed-length jumbled pattern matching.
Fully-Functional Bidirectional Burrows-Wheeler Indexes and Infinite-Order De Bruijn Graphs
(Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik, 2019-06-18) Belazzougui, Djamal; Cunial, Fabio
Given a string T on an alphabet of size σ, we describe a bidirectional Burrows-Wheeler index that takes O(|T| log σ) bits of space, and that supports the addition and removal of one character, on the left or right side of any substring of T, in constant time. Previously known data structures that used the same space allowed constant-time addition to any substring of T, but they could support removal only from specific substrings of T. We also describe an index that supports bidirectional addition and removal in O(log log |T|) time, and that takes a number of words proportional to the number of left and right extensions of the maximal repeats of T. We use such fully-functional indexes to implement bidirectional, frequency-aware, variable-order de Bruijn graphs with no upper bound on their order, and supporting natural criteria for increasing and decreasing the order during traversal.
Indexing and querying color sets of images
(Springer Varlag, 2016-09-27) Belazzougui, Djamal; Kolpakov, Roman; Raffinot, Mathieu
We aim to study the set of color sets of continuous regions of an image given as a matrix of m rows over n ≥ m columns where each element in the matrix is an integer from [1,σ] named a color . The set of distinct colors in a region is called fingerprint. We aim to compute, index and query the fingerprints of all rectangular regions named rectangles. The set of all such fingerprints is denoted by F. A rectangle is maximal if it is not contained in a greater rectangle with the same fingerprint. The set of all locations of maximal rectangles is denoted by L. We first explain how to determine all the |L| maximal locations with their fingerprints in expected time O(n⋅m^2⋅σ) using a Monte Carlo algorithm (with polynomially small probability of error) or within deterministic O(n⋅m^2⋅σ⋅log⁡(|L|/(n⋅m^2)+2)) time. We then show how to build a data structure which occupies O(n⋅m⋅log ⁡n+|L|) space such that a query which asks for all the maximal locations with a given fingerprint f can be answered in time O(|f|+log⁡ log ⁡n+k), where k is the number of maximal locations with fingerprint f. If the query asks only for the presence of the fingerprint, then the space usage becomes O(n⋅m⋅log ⁡n+|F|) while the query time becomes O(|f|+log ⁡log ⁡n). We eventually consider the special case of squared regions (squares).
Lempel-Ziv Decoding in External Memory
(Springer International Publishing, 2016-06-01) Belazzougui, Djamal; Kärkkäinen, Juha; Kempa, Dominik; Puglisi, Simon J.
Simple and fast decoding is one of the main advantages of LZ77-type text encoding used in many popular file compressors such as gzip and 7zip. With the recent introduction of external memory algorithms for Lempel–Ziv factorization there is a need for external memory LZ77 decoding but the standard algorithm makes random accesses to the text and cannot be trivially modified for external memory computation. We describe the first external memory algorithms for LZ77 decoding, prove that their I/O complexity is optimal, and demonstrate that they are very fast in practice, only about three times slower than in-memory decoding (when reading input and writing output is included in the time).
Linear-time string indexing and analysis in small space
(Association for Computing Machinery, 2020-03-09) Belazzougui, Djamal; Cunial, Fabio; Kärkkäinen, Juha; Mäkinen, Veli
The field of succinct data structures has flourished over the last 16 years. Starting from the compressed suffix array by Grossi and Vitter (STOC 2000) and the FM-index by Ferragina and Manzini (FOCS 2000), a number of generalizations and applications of string indexes based on the Burrows-Wheeler transform (BWT) have been developed, all taking an amount of space that is close to the input size in bits. In many large-scale applications, the construction of the index and its usage need to be considered as one unit of computation. For example, one can compare two genomes by building a common index for their concatenation, and by detecting common substructures by querying the index. Efficient string indexing and analysis in small space lies also at the core of a number of primitives in the data-intensive field of high-throughput DNA sequencing. We report the following advances in string indexing and analysis. We show that the BWT of a string T ∈ {1, . . . ,σ }^n can be built in deterministic O (n) time using just O (n log σ ) bits of space, where σ ≤ n. Deterministic linear time is achieved by exploiting a new partial rank data structure that supports queries in constant time, and that might have independent interest. Within the same time and space budget, we can build an index based on the BWT that allows one to enumerate all the internal nodes of the suffix tree of T . Many fundamental string analysis problems, such as maximal repeats, maximal unique matches, and string kernels, can be mapped to such enumeration, and can thus be solved in deterministic O (n) time and in O (n log σ ) bits of space from the input string, by tailoring the enumeration algorithm to some problem-specific computations. We also show how to build many of the existing indexes based on the BWT, such as the compressed suffix array, the compressed suffix tree, and the bidirectional BWT index, in randomized O (n) time and in O (n log σ ) bits of space. The previously fastest construction algorithms for BWT, compressed suffix array and compressed suffix tree, which used O (n log σ ) bits of space, took O (n log log σ ) time for the first two structures, and O (n log^ϵ n) time for the third, where ϵ is any positive constant smaller than one. Alternatively, the BWT could be previously built in linear time if one was willing to spend O (n log σ log log_σ n) bits of space. Contrary to the state of the art, our bidirectional BWT index supports every operation in constant time per element in its output.
Memory-Efficient and Ultra-Fast Network Lookup and Forwarding Using Othello Hashing
(IEEE, 2018-04-11) Yu, Ye; Belazzougui, Djamal; Qian, Chen; Zhang, Qin
Abstract: Network algorithms always prefer low memory cost and fast packet processing speed. Forwarding information base (FIB), as a typical network processing component, requires a scalable and memory-efficient algorithm to support fast lookups. In this paper, we present a new network algorithm, Othello hashing, and its application of a FIB design called concise, which uses very little memory to support ultra-fast lookups of network names. Othello hashing and concise make use of minimal perfect hashing and relies on the programmable network framework to support dynamic updates. Our conceptual contribution of concise is to optimize the memory efficiency and query speed in the data plane and move the relatively complex construction and update components to the resource-rich control plane. We implemented concise on three platforms. Experimental results show that concise uses significantly smaller memory to achieve much faster query speed compared to existing solutions of network name lookups.