Space-Efficient Representation of Genomic k-Mer Count Tables

Shibuya, Yoshihiro; Belazzougui, Djamal; Kucherov, Gregory

Date

2021-07-22

Authors

Shibuya, Yoshihiro

Belazzougui, Djamal

Kucherov, Gregory

Publisher

Schloss Dagstuhl -- Leibniz-Zentrum für Informatik

Abstract

k-mer counting is a common task in bioinformatic pipelines, with many dedicated tools available. Output formats often rely on quotienting to reduce the space taken by k-mers in hash tables, but counts are not usually stored in space-efficient formats. Overall, k-mer count tables for genomic data take considerable space, easily reaching tens of GB. Furthermore, such tables generally do not support efficient random-access queries. In this work, we design an efficient representation of k-mer count tables supporting fast random-access queries. We propose to apply Compressed Static Functions (CSFs), whose space is proportional to the empirical zero-order entropy of the counts. For very skewed distributions, like those of k-mer counts in whole genomes, the only currently available implementation of CSFs does not provide a compact enough representation. By adding a Bloom filter to a CSF, we obtain a Bloom-enhanced CSF (BCSF) that effectively overcomes this limitation. Furthermore, by combining BCSFs with minimizer-based bucketing of k-mers, we build even smaller representations that break the empirical entropy lower bound for large enough k. We also extend these representations to the approximate case, saving additional space. We experimentally validate these techniques on k-mer count tables of whole genomes (E. coli and C. elegans) as well as on k-mer document frequency tables for 29 E. coli genomes. In the case of exact counts, our representation takes about half the space of the empirical entropy for large enough k.
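
As an illustration of the quantity the abstract refers to, the following minimal Python sketch computes the empirical zero-order entropy of the count values in a toy k-mer count table. It is not the authors' implementation: the function names `count_kmers` and `empirical_entropy` and the toy sequence are assumptions made for this example, and the counting is done naively, without canonical k-mers or any dedicated k-mer counting tool.

```python
from collections import Counter
from math import log2


def count_kmers(seq: str, k: int) -> Counter:
    """Count every k-mer of seq with a sliding window (hypothetical helper)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def empirical_entropy(counts) -> float:
    """Empirical zero-order entropy, in bits per k-mer, of the count values."""
    freq = Counter(counts)      # how many k-mers share each count value
    n = sum(freq.values())      # number of distinct k-mers in the table
    return -sum((f / n) * log2(f / n) for f in freq.values())


# Toy input (an assumption for illustration, not data from the paper).
table = count_kmers("ACGTACGA", 3)
h0 = empirical_entropy(table.values())
total_bits = len(table) * h0    # entropy-based size estimate for all counts
```

Under these assumptions, `len(table) * h0` is the entropy-based size, in bits, against which the abstract compares its representations. A skewed table, in which most k-mers share the same count, gives a small `h0`.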

Keywords

k-mer counting, Data structures, Compression, Minimizers, Compressed static function, Bloom filter, Empirical entropy

URI

https://dl.cerist.dz/handle/CERIST/994

Collections

International Conference Papers
