International Conference Papers
Permanent URI for this collectionhttp://dl.cerist.dz/handle/CERIST/4
Browse
16 results
Search Results
Item Weighted Ancestors in Suffix Trees Revisited(Schloss Dagstuhl - Leibniz-Zentrum für Informatik 2021, 2021-06-30) Belazzougui, Djamal; Kosolobov, Dmitry; Puglisi, Simon J.; Raman, RajeevThe weighted ancestor problem is a well-known generalization of the predecessor problem to trees. It is known to require Ω(log log n) time for queries provided 𝒪(n polylog n) space is available and weights are from [0..n], where n is the number of tree nodes. However, when applied to suffix trees, the problem, surprisingly, admits an 𝒪(n)-space solution with constant query time, as was shown by Gawrychowski, Lewenstein, and Nicholson (Proc. ESA 2014). This variant of the problem can be reformulated as follows: given the suffix tree of a string s, we need a data structure that can locate in the tree any substring s[p..q] of s in 𝒪(1) time (as if one descended from the root reading s[p..q] along the way). Unfortunately, the data structure of Gawrychowski et al. has no efficient construction algorithm, limiting its wider usage as an algorithmic tool. In this paper we resolve this issue, describing a data structure for weighted ancestors in suffix trees with constant query time and a linear construction algorithm. Our solution is based on a novel approach using so-called irreducible LCP values.Item DIAG a diagnostic web application based on lung CT Scan images and deep learning(IOS Press Ebooks, 2021-05-29) Hadj Bouzid, Amel Imene; Yahiaoui, Saïd; Lounis, Anis; Berrani, Sid-Ahmed; Belbachir, Hacène; Naili, Qaid; Abdi, Mohamed El Hafedh; Bensalah, Kawthar; Belazzougui, DjamalCoronavirus disease is a pandemic that has infected millions of people around the world. Lung CT-scans are effective diagnostic tools, but radiologists can quickly become overwhelmed by the flow of infected patients. Therefore, automated image interpretation needs to be achieved. Deep learning (DL) can support critical medical tasks including diagnostics, and DL algorithms have successfully been applied to the classification and detection of many diseases. This work aims to use deep learning methods that can classify patients between Covid-19 positive and healthy patient. We collected 4 available datasets, and tested our convolutional neural networks (CNNs) on different distributions to investigate the generalizability of our models. In order to clearly explain the predictions, Grad-CAM and Fast-CAM visualization methods were used. Our approach reaches more than 92% accuracy on 2 different distributions. In addition, we propose a computer aided diagnosis web application for Covid-19 diagnosis. The results suggest that our proposed deep learning tool can be integrated to the Covid-19 detection process and be useful for a rapid patient management.Item Space-Efficient Representation of Genomic k-Mer Count Tables(Schloss Dagstuhl -- Leibniz-Zentrum für Informatik, 2021-07-22) Shibuya, Yoshihiro; Belazzougui, Djamal; Kucherov, Gregoryk-mer counting is a common task in bioinformatic pipelines, with many dedicated tools available. Output formats could rely on quotienting to reduce the space of k-mers in hash tables, however counts are not usually stored in space-efficient formats. Overall, k-mer count tables for genomic data take a considerable space, easily reaching tens of GB. Furthermore, such tables do not support efficient random-access queries in general. In this work, we design an efficient representation of k-mer count tables supporting fast random-access queries. We propose to apply Compressed Static Functions (CSFs), with space proportional to the empirical zero-order entropy of the counts. For very skewed distributions, like those of k-mer counts in whole genomes, the only currently available implementation of CSFs does not provide a compact enough representation. By adding a Bloom Filter to a CSF we obtain a Bloom-enhanced CSF (BCSF) effectively overcoming this limitation. Furthermore, by combining BCSFs with minimizer-based bucketing of k-mers, we build even smaller representations breaking the empirical entropy lower bound, for large enough k. We also extend these representations to the approximate case, gaining additional space. We experimentally validate these techniques on k-mer count tables of whole genomes (E.Coli and C.Elegans) as well as on k-mer document frequency tables for 29 E.Coli genomes. In the case of exact counts, our representation takes about a half of the space of the empirical entropy, for large enough k’s.Item Efficient tree-structured categorical retrieval(Leibniz International Proceedings in Informatics (LIPIcs), 2020-06-09) Belazzougui, Djamal; Kucherov, GregoryWe study a document retrieval problem in the new framework where D text documents are organized in a category tree with a predefined number h of categories. This situation occurs e.g. with taxomonic trees in biology or subject classification systems for scientific literature. Given a string pattern p and a category (level in the category tree), we wish to efficiently retrieve the t categorical units containing this pattern and belonging to the category. We propose several efficient solutions for this problem. One of them uses n(log σ(1+o(1))+log D + O(h)) + O(∆) bits of space and O(|p| + t) query time, where n is the total length of the documents, σ the size of the alphabet used in the documents and ∆ is the total number of nodes in the category tree. Another solution uses n(log σ(1+o(1))+O(log D))+O(∆)+O(D log n) bits of space and O(|p| + t log D) query time. We finally propose other solutions which are more space-efficient at the expense of a slight increase in query time.Item Fully-Functional Bidirectional Burrows-Wheeler Indexes and Infinite-Order De Bruijn Graphs(Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik, 2019-06-18) Belazzougui, Djamal; Cunial, FabioGiven a string T on an alphabet of size σ, we describe a bidirectional Burrows-Wheeler index that takes O(|T| log σ) bits of space, and that supports the addition and removal of one character, on the left or right side of any substring of T, in constant time. Previously known data structures that used the same space allowed constant-time addition to any substring of T, but they could support removal only from specific substrings of T. We also describe an index that supports bidirectional addition and removal in O(log log |T|) time, and that takes a number of words proportional to the number of left and right extensions of the maximal repeats of T. We use such fully-functional indexes to implement bidirectional, frequency-aware, variable-order de Bruijn graphs with no upper bound on their order, and supporting natural criteria for increasing and decreasing the order during traversal.Item Computing the Antiperiod(s) of a String(Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik, 2019-06-18) Alamro, Hayam; Badkobeh, Golnaz; Belazzougui, Djamal; Iliopoulos, Costas S.; Puglisi, Simon J.A string S[1, n] is a power (or repetition or tandem repeat) of order k and period n/k, if it can be decomposed into k consecutive identical blocks of length n/k. Powers and periods are fundamental structures in the study of strings and algorithms to compute them efficiently have been widely studied. Recently, Fici et al. (Proc. ICALP 2016) introduced an antipower of order k to be a string composed of k distinct blocks of the same length, n/k, called the antiperiod. An arbitrary string will have antiperiod t if it is prefix of an antipower with antiperiod t. In this paper, we describe efficient algorithm for computing the smallest antiperiod of a string S of length n in O(n) time. We also describe an algorithm to compute all the antiperiods of S that runs in O(n log n) time.Item Random input helps searching predecessors(CEUR-WS.org, 2018-06-17) Belazzougui, Djamal; Kaporis, Alexis C.; Spirakis, Paul G.A data structure problem consists of the finite sets: D of data, Q of queries, A of query answers, associated with a function f : D ×Q → A. The data structure of file X is “static” (“dynamic”) if we “do not” (“do”) require quick updates as X changes. An important goal is to compactly encode a file X ∈ D, such that for each query y ∈ Q, function f(X, y) requires the minimum time to compute an answer in A. This goal is trivial if the size of D is large, since for each query y ∈ Q, it was shown that f(X, y) requires O(1) time for the most important queries in the literature. Hence, this goal becomes interesting to study as a trade off between the “storage space” and the “query time”, both measured as functions of the file size n = |X|. The ideal solution would be to use linear O(n) = O(|X|) space, while retaining a constant O(1) query time. However, if f(X, y) computes the static predecessor search (find largest x ∈ X : x ≤ y), then Ajtai [Ajt88] proved a negative result. By using just n O(1) = |X| O(1) data space, then it is not possible to evaluate f(X, y) in O(1) time ∀y ∈ Q. The proof exhibited a bad distribution of data D, such that ∃y ∗ ∈ Q (a “difficult” query y ∗ ), that f(X, y∗ ) requires ω(1) time. Essentially [Ajt88] is an existential result, resolving the worst case scenario. But, [Ajt88] left open the question: do we typically, that is, with high probability (w.h.p.) 1 encounter such “difficult” queries y ∈ Q, when assuming reasonable distributions with respect to (w.r.t.) queries and data? Below we make reasonable assumptions w.r.t. the distribution of the queries y ∈ Q, as well as w.r.t. the distribution of data X ∈ D. In two interesting scenarios studied in the literature, we resolve the typical (w.h.p.) query time.Item Fast matching statistics in small space(Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik, 2018-06-27) Belazzougui, Djamal; Cunial, Fabio; Denas, OlgertComputing the matching statistics of a string S with respect to a string T on an alphabet of size sigma is a fundamental primitive for a number of large-scale string analysis applications, including the comparison of entire genomes, for which space is a pressing issue. This paper takes from theory to practice an existing algorithm that uses just O(|T|log{sigma}) bits of space, and that computes a compact encoding of the matching statistics array in O(|S|log{sigma}) time. The techniques used to speed up the algorithm are of general interest, since they optimize queries on the existence of a Weiner link from a node of the suffix tree, and parent operations after unsuccessful Weiner links. Thus, they can be applied to other matching statistics algorithms, as well as to any suffix tree traversal that relies on such calls. Some of our optimizations yield a matching statistics implementation that is up to three times faster than a plain version of the algorithm, depending on the similarity between S and T. In genomic datasets of practical significance we achieve speedups of up to 1.8, but our fastest implementations take on average twice the time of an existing code based on the LCP array. The key advantage is that our implementations need between one half and one fifth of the competitor's memory, and they approach comparable running times when S and T are very similar.Item A Concise Forwarding Information Base for Scalable and Fast Name Lookups(IEEE, 2017-11-30) Yu, Ye; Belazzougui, Djamal; Qian, Chen; Zhang, QinForwarding information base (FIB) scalability and its lookup speed are fundamental problems of numerous net- work technologies that uses location-independent network names. In this paper we present a new network algorithm, Othello Hashing, and its application of a FIB design called Concise, which uses very little memory to support ultra-fast lookups of network names. Othello Hashing and Concise make use of minimal perfect hashing and relies on the programmable network framework to support dynamic updates. Our conceptual contribution of Concise is to optimize the memory efficiency and query speed in the data plane and move the relatively complex construction and update components to the resource- rich control plane. We implemented Concise on three platforms. Experimental results show that Concise uses significantly smaller memory to achieve much faster query speed compared to existing solutions of network name lookups.Item Fast Label Extraction in the CDAWG(Springer, 2017-09-06) Belazzougui, Djamal; Cunial, FabioThe compact directed acyclic word graph (CDAWG) of a string T of length n takes space proportional just to the number e of right extensions of the maximal repeats of T, and it is thus an appealing index for highly repetitive datasets, like collections of genomes from similar species, in which e grows significantly more slowly than n. We reduce from O(m log log n) to O(m) the time needed to count the number of occurrences of a pattern of length m, using an existing data structure that takes an amount of space proportional to the size of the CDAWG. This implies a reduction from O(m log log n+occ) to O(m+occ) in the time needed to locate all the occocc occurrences of the pattern. We also reduce from O(k log log n) to O(k) the time needed to read the k characters of the label of an edge of the suffix tree of T, and we reduce from O(m log log n) to O(m) the time needed to compute the matching statistics between a query of length m and T, using an existing representation of the suffix tree based on the CDAWG. All such improvements derive from extracting the label of a vertex or of an arc of the CDAWG using a straight-line program induced by the reversed CDAWG.