International Conference Papers

Recent Submissions

  • Item
    Weighted Ancestors in Suffix Trees Revisited
    (Schloss Dagstuhl - Leibniz-Zentrum für Informatik 2021, 2021-06-30) Belazzougui, Djamal; Kosolobov, Dmitry; Puglisi, Simon J.; Raman, Rajeev
    The weighted ancestor problem is a well-known generalization of the predecessor problem to trees. It is known to require Ω(log log n) time for queries provided 𝒪(n polylog n) space is available and weights are from [0..n], where n is the number of tree nodes. However, when applied to suffix trees, the problem, surprisingly, admits an 𝒪(n)-space solution with constant query time, as was shown by Gawrychowski, Lewenstein, and Nicholson (Proc. ESA 2014). This variant of the problem can be reformulated as follows: given the suffix tree of a string s, we need a data structure that can locate in the tree any substring s[p..q] of s in 𝒪(1) time (as if one descended from the root reading s[p..q] along the way). Unfortunately, the data structure of Gawrychowski et al. has no efficient construction algorithm, limiting its wider usage as an algorithmic tool. In this paper we resolve this issue, describing a data structure for weighted ancestors in suffix trees with constant query time and a linear construction algorithm. Our solution is based on a novel approach using so-called irreducible LCP values.
  • Item
    DIAG a diagnostic web application based on lung CT Scan images and deep learning
    (IOS Press Ebooks, 2021-05-29) Hadj Bouzid, Amel Imene; Yahiaoui, Saïd; Lounis, Anis; Berrani, Sid-Ahmed; Belbachir, Hacène; Naili, Qaid; Abdi, Mohamed El Hafedh; Bensalah, Kawthar; Belazzougui, Djamal
    Coronavirus disease is a pandemic that has infected millions of people around the world. Lung CT scans are effective diagnostic tools, but radiologists can quickly become overwhelmed by the flow of infected patients. Therefore, automated image interpretation is needed. Deep learning (DL) can support critical medical tasks including diagnostics, and DL algorithms have been successfully applied to the classification and detection of many diseases. This work aims to use deep learning methods to classify patients as either Covid-19 positive or healthy. We collected 4 available datasets and tested our convolutional neural networks (CNNs) on different distributions to investigate the generalizability of our models. To clearly explain the predictions, Grad-CAM and Fast-CAM visualization methods were used. Our approach reaches more than 92% accuracy on 2 different distributions. In addition, we propose a computer-aided diagnosis web application for Covid-19 diagnosis. The results suggest that our proposed deep learning tool can be integrated into the Covid-19 detection process and be useful for rapid patient management.
  • Item
    Space-Efficient Representation of Genomic k-Mer Count Tables
    (Schloss Dagstuhl -- Leibniz-Zentrum für Informatik, 2021-07-22) Shibuya, Yoshihiro; Belazzougui, Djamal; Kucherov, Gregory
    k-mer counting is a common task in bioinformatic pipelines, with many dedicated tools available. Output formats could rely on quotienting to reduce the space of k-mers in hash tables; however, counts are not usually stored in space-efficient formats. Overall, k-mer count tables for genomic data take considerable space, easily reaching tens of GB. Furthermore, such tables do not support efficient random-access queries in general. In this work, we design an efficient representation of k-mer count tables supporting fast random-access queries. We propose to apply Compressed Static Functions (CSFs), with space proportional to the empirical zero-order entropy of the counts. For very skewed distributions, like those of k-mer counts in whole genomes, the only currently available implementation of CSFs does not provide a compact enough representation. By adding a Bloom filter to a CSF we obtain a Bloom-enhanced CSF (BCSF), effectively overcoming this limitation. Furthermore, by combining BCSFs with minimizer-based bucketing of k-mers, we build even smaller representations, breaking the empirical entropy lower bound for large enough k. We also extend these representations to the approximate case, gaining additional space. We experimentally validate these techniques on k-mer count tables of whole genomes (E. coli and C. elegans) as well as on k-mer document frequency tables for 29 E. coli genomes. In the case of exact counts, our representation takes about half the space of the empirical entropy, for large enough k.
  • Item
    Smaller Fully-Functional Bidirectional BWT Indexes
    (Springer Nature, 2020-09-17) Belazzougui, Djamal; Cunial, Fabio
    Burrows-Wheeler indexes that support both extending and contracting any substring of the text $T$ of length $n$ on which they are built, in any direction, provide substantial flexibility in traversing the text and can be used to implement several algorithms. The practical appeal of such indexes is contingent on them being compact, and current designs that are sensitive to the compressibility of the input take either $O(e+\overline{e})$ words of space, where $e$ and $\overline{e}$ are the number of right and left extensions of the maximal repeats of $T$, or $O(r\log(n/r)+\overline{r}\log(n/\overline{r}))$ words, where $r$ and $\overline{r}$ are the number of runs in the Burrows-Wheeler transform of $T$ and of its reverse. In this paper we describe a fully-functional bidirectional index that takes $O(m+r+\overline{r})$ words, where $m$ is the number of maximal repeats of $T$, as well as a variant that takes $O(r+\overline{r})$ words.
  • Item
    Efficient tree-structured categorical retrieval
    (Leibniz International Proceedings in Informatics (LIPIcs), 2020-06-09) Belazzougui, Djamal; Kucherov, Gregory
    We study a document retrieval problem in a new framework where D text documents are organized in a category tree with a predefined number h of categories. This situation occurs, e.g., with taxonomic trees in biology or subject classification systems for scientific literature. Given a string pattern p and a category (level in the category tree), we wish to efficiently retrieve the t categorical units containing this pattern and belonging to the category. We propose several efficient solutions for this problem. One of them uses n(log σ(1+o(1))+log D + O(h)) + O(∆) bits of space and O(|p| + t) query time, where n is the total length of the documents, σ is the size of the alphabet used in the documents, and ∆ is the total number of nodes in the category tree. Another solution uses n(log σ(1+o(1))+O(log D))+O(∆)+O(D log n) bits of space and O(|p| + t log D) query time. We finally propose other solutions which are more space-efficient at the expense of a slight increase in query time.
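
To make the query in the categorical retrieval abstract above concrete, here is a minimal brute-force sketch in Python. It is not the indexed solution from the paper: it scans every document, so it takes time proportional to n rather than O(|p| + t). It assumes that a "categorical unit" is a node of the category tree at the requested level (root at level 0) and that each document hangs under exactly one node. All names (`parent`, `doc_node`, `categorical_retrieve`) and the toy taxonomy are hypothetical.

```python
from typing import Dict, Optional, Set


def depth_of(node: str, parent: Dict[str, Optional[str]]) -> int:
    """Number of edges from `node` up to the root (whose parent is None)."""
    d = 0
    while parent[node] is not None:
        node = parent[node]
        d += 1
    return d


def ancestor_at_level(node: str, parent: Dict[str, Optional[str]],
                      level: int) -> Optional[str]:
    """Ancestor of `node` at depth `level`, or None if `node` is shallower."""
    d = depth_of(node, parent)
    if d < level:
        return None
    for _ in range(d - level):
        node = parent[node]
    return node


def categorical_retrieve(pattern: str, level: int,
                         docs: Dict[str, str],
                         doc_node: Dict[str, str],
                         parent: Dict[str, Optional[str]]) -> Set[str]:
    """Distinct level-`level` nodes having a document that contains `pattern`."""
    units: Set[str] = set()
    for doc_id, text in docs.items():
        if pattern in text:  # naive substring scan, no index
            unit = ancestor_at_level(doc_node[doc_id], parent, level)
            if unit is not None:
                units.add(unit)
    return units


# Hypothetical toy taxonomy: life -> {bacteria, animals} -> species.
parent = {"life": None, "bacteria": "life", "animals": "life",
          "ecoli": "bacteria", "salmonella": "bacteria", "worm": "animals"}
docs = {"d1": "ACGTACGT", "d2": "TTACGA", "d3": "GGGTTT"}
doc_node = {"d1": "ecoli", "d2": "salmonella", "d3": "worm"}

level1_hits = categorical_retrieve("ACG", 1, docs, doc_node, parent)
level2_hits = categorical_retrieve("ACG", 2, docs, doc_node, parent)
```

In this toy instance two documents match "ACG", yet only one unit is reported at level 1 because both documents fall under the same node. Reporting each of the t units once, without enumerating every matching document, is the part the indexed solutions are designed to do efficiently.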