Academic Journal

Disk compression of k-mer sets

Bibliographic Details
Title: Disk compression of k-mer sets
Authors: Amatur Rahman, Rayan Chikhi, Paul Medvedev
Source: Algorithms for Molecular Biology, Vol 16, Iss 1, Pp 1-14 (2021)
Publisher Information: BMC, 2021.
Publication Year: 2021
Collection: LCC:Biology (General)
LCC:Genetics
Subject Terms: De Bruijn graphs, Compression, k-mer sets, Spectrum-preserving string sets, Biology (General), QH301-705.5, Genetics, QH426-470
Description: K-mer based methods have become prevalent in many areas of bioinformatics. In applications such as database search, they often work with large multi-terabyte-sized datasets. Storing such large datasets is a detriment to tool developers, tool users, and reproducibility efforts. General purpose compressors like gzip, or those designed for read data, are sub-optimal because they do not take into account the specific redundancy pattern in k-mer sets. In our earlier work (Rahman and Medvedev, RECOMB 2020), we presented an algorithm UST-Compress that uses a spectrum-preserving string set representation to compress a set of k-mers to disk. In this paper, we present two improved methods for disk compression of k-mer sets, called ESS-Compress and ESS-Tip-Compress. They use a more relaxed notion of string set representation to further remove redundancy from the representation of UST-Compress. We explore their behavior both theoretically and on real data. We show that they improve the compression sizes achieved by UST-Compress by up to 27 percent, across a breadth of datasets. We also derive lower bounds on how well this type of compression strategy can hope to do.
Document Type: article
File Description: electronic resource
Language: English
ISSN: 1748-7188
Relation: https://doaj.org/toc/1748-7188
DOI: 10.1186/s13015-021-00192-7
Access URL: https://doaj.org/article/7126c47cc903415d9ed68f00ae1e8d99
Accession Number: edsdoj.7126c47cc903415d9ed68f00ae1e8d99
Database: Directory of Open Access Journals