Computing all-vs-all MEMs in grammar-compressed text

Bibliographic Details
Title: Computing all-vs-all MEMs in grammar-compressed text
Authors: Diaz-Dominguez, Diego; Salmela, Leena
Publication Year: 2023
Collection: Computer Science
Subject Terms: Computer Science - Information Retrieval, Computer Science - Data Structures and Algorithms
Description: We describe a compression-aware method to compute all-vs-all maximal exact matches (MEMs) among strings of a repetitive collection $\mathcal{T}$. The key concept in our work is the construction of a fully-balanced grammar $\mathcal{G}$ from $\mathcal{T}$ that meets a property that we call \emph{fix-free}: the expansions of the nonterminals that have the same height in the parse tree form a fix-free set (i.e., prefix-free and suffix-free). The fix-free property allows us to compute the MEMs of $\mathcal{T}$ incrementally over $\mathcal{G}$ using a standard suffix-tree-based MEM algorithm, which runs on a subset of grammar rules at a time and does not decompress nonterminals. By modifying the locally-consistent grammar of Christiansen et al. (2020), we show how we can build $\mathcal{G}$ from $\mathcal{T}$ in linear time and space. We also demonstrate that our MEM algorithm runs on top of $\mathcal{G}$ in $O(G + occ)$ time and uses $O(\log G(G+occ))$ bits, where $G$ is the grammar size and $occ$ is the number of MEMs in $\mathcal{T}$. In the conclusions, we discuss how our idea can be modified to implement approximate pattern matching in compressed space.
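The two notions the description relies on, a fix-free set and a maximal exact match, can be illustrated with a minimal Python sketch. This is not the grammar-based algorithm of the paper: it is a brute-force baseline on plain, uncompressed strings, quadratic per pair of strings, and the function names (`is_fix_free`, `naive_mems`, `all_vs_all_mems`) and the `min_len` parameter are hypothetical names chosen for illustration.

```python
from itertools import combinations, permutations


def is_fix_free(strings):
    """Return True if no string is a prefix or a suffix of a different string."""
    words = set(strings)  # a set, so duplicates collapse to one element
    for a, b in permutations(words, 2):
        if b.startswith(a) or b.endswith(a):
            return False
    return True


def naive_mems(x, y, min_len=1):
    """Brute-force MEMs between x and y (illustrative baseline only).

    Returns (i, j, length) triples such that x[i:i+length] == y[j:j+length]
    and the match cannot be extended to the left or to the right.
    """
    mems = []
    for i in range(len(x)):
        for j in range(len(y)):
            # Skip starts that are extendable to the left (not left-maximal).
            if i > 0 and j > 0 and x[i - 1] == y[j - 1]:
                continue
            # Extend to the right as far as possible (right-maximal).
            length = 0
            while (i + length < len(x) and j + length < len(y)
                   and x[i + length] == y[j + length]):
                length += 1
            if length >= min_len:
                mems.append((i, j, length))
    return mems


def all_vs_all_mems(collection, min_len=1):
    """Apply naive_mems to every unordered pair of strings in the collection."""
    return {
        (a, b): naive_mems(collection[a], collection[b], min_len)
        for a, b in combinations(range(len(collection)), 2)
    }
```

For instance, `naive_mems("abcab", "cabc", min_len=2)` reports the two length-3 matches `abc` and `cab`, and `is_fix_free(["ab", "abc"])` is false because `ab` is a prefix of `abc`. The method in the description avoids this per-pair scan by running over grammar rules without decompressing nonterminals.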
Document Type: Working Paper
Access URL: http://arxiv.org/abs/2306.16815
Accession Number: edsarx.2306.16815
Database: arXiv