Movi: a fast and cache-efficient full-text pangenome index.

Bibliographic Details
Title: Movi: a fast and cache-efficient full-text pangenome index.
Authors: Zakeri M (Department of Computer Science, Johns Hopkins University); Brown NK (Department of Computer Science, Johns Hopkins University); Ahmed OY (Department of Computer Science, Johns Hopkins University); Gagie T (Faculty of Computer Science, Dalhousie University); Langmead B (Department of Computer Science, Johns Hopkins University)
Source: bioRxiv: the preprint server for biology [bioRxiv] 2024 Feb 15. Date of Electronic Publication: 2024 Feb 15.
Publication Type: Preprint
Language: English
Journal Info: Country of Publication: United States; NLM ID: 101680187; Publication Model: Electronic; Cited Medium: Internet; NLM ISO Abbreviation: bioRxiv; Subsets: PubMed not MEDLINE
Abstract: Efficient pangenome indexes are promising tools for many applications, including rapid classification of nanopore sequencing reads. Recently, a compressed-index data structure called the "move structure" was proposed as an alternative to other BWT-based indexes like the FM index and r-index. The move structure uniquely achieves both O(r) space and O(1)-time queries, where r is the number of runs in the pangenome BWT. We implemented Movi, an efficient tool for building and querying move-structure pangenome indexes. While Movi's index is larger than the r-index, it grows more slowly for pangenome references, as its size is exactly proportional to r. Movi can compute sophisticated matching queries needed for classification, such as pseudo-matching lengths and backward search, at least ten times faster than the fastest available methods, and in some cases more than 30-fold faster. Movi achieves this speed by leveraging the move structure's strong locality of reference, incurring close to the minimum possible number of cache misses for queries against large pangenomes. We achieve still further speed improvements by using memory prefetching to attain a degree of latency hiding that would be difficult with other index structures like the r-index. Movi's fast constant-time query loop makes it well suited to real-time applications like adaptive sampling for nanopore sequencing, where decisions must be made in a small and predictable time interval.
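Illustrative sketch (not part of the original record): the abstract describes an index with one row per BWT run, in which an LF-mapping step is a table lookup followed by a short scan over adjacent rows. The Python below is a minimal toy version of that idea, assuming a naive rotation-sort BWT and a text ending in a unique `$` sentinel. All names (`bwt`, `build_move_rows`, `move_lf`, `naive_lf`) are hypothetical and are not Movi's API. The sketch also omits the run splitting that bounds the scan length and yields the O(1) guarantee.

```python
from bisect import bisect_right
from itertools import groupby


def bwt(text):
    """Naive BWT via sorted rotations; text must end with a unique '$' sentinel."""
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    return "".join(rot[-1] for rot in rotations)


def build_move_rows(bwt_str):
    """Build one row per BWT run: [char, start, length, LF(start), run holding LF(start)]."""
    # Collect the runs as (char, start offset, length).
    runs, pos = [], 0
    for ch, grp in groupby(bwt_str):
        length = len(list(grp))
        runs.append((ch, pos, length))
        pos += length

    # c_start[ch] = number of characters in the BWT that are smaller than ch.
    counts = {}
    for ch in bwt_str:
        counts[ch] = counts.get(ch, 0) + 1
    c_start, total = {}, 0
    for ch in sorted(counts):
        c_start[ch] = total
        total += counts[ch]

    # LF(start) = c_start[ch] + (occurrences of ch before this run).
    seen = {ch: 0 for ch in counts}
    rows = []
    for ch, start, length in runs:
        rows.append([ch, start, length, c_start[ch] + seen[ch], None])
        seen[ch] += length

    # Precompute which run contains each LF(start), so queries need no search.
    starts = [row[1] for row in rows]
    for row in rows:
        row[4] = bisect_right(starts, row[3]) - 1
    return rows


def move_lf(rows, run, offset):
    """One LF step from (run, offset) using only the table and a forward scan."""
    _, _, _, lf_head, dest = rows[run]
    pos = lf_head + offset  # LF is contiguous within a run
    # Fast-forward over adjacent rows until the row containing pos is reached.
    while pos >= rows[dest][1] + rows[dest][2]:
        dest += 1
    return dest, pos - rows[dest][1]


def naive_lf(bwt_str, i):
    """Reference LF mapping computed directly from the BWT string."""
    ch = bwt_str[i]
    return sum(1 for c in bwt_str if c < ch) + bwt_str[:i].count(ch)


if __name__ == "__main__":
    b = bwt("GATTACAGATTACA$")
    rows = build_move_rows(b)
    print("BWT length:", len(b), "runs r:", len(rows))
```

The table has exactly r rows, which mirrors the O(r) space claim, and each step touches a row and its immediate neighbours, which is the locality of reference the abstract refers to.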
Grant Information: R01 HG011392 United States HG NHGRI NIH HHS
Entry Dates: Date Created: 20231114 Latest Revision: 20240228
Update Code: 20240228
PubMed Central ID: PMC10635132
DOI: 10.1101/2023.11.04.565615
PMID: 37961660
Database: MEDLINE