Report
BarcodeBERT: Transformers for Biodiversity Analysis
Title: | BarcodeBERT: Transformers for Biodiversity Analysis |
---|---|
Authors: | Arias, Pablo Millan; Sadjadi, Niousha; Safari, Monireh; Gong, ZeMing; Wang, Austin T.; Lowe, Scott C.; Haurum, Joakim Bruslund; Zarubiieva, Iuliia; Steinke, Dirk; Kari, Lila; Chang, Angel X.; Taylor, Graham W. |
Publication year: | 2023 |
Collection: | Computer Science |
Subject terms: | Computer Science - Machine Learning |
Description: | Understanding biodiversity is a global challenge, in which DNA barcodes - short snippets of DNA that cluster by species - play a pivotal role. In particular, invertebrates, a highly diverse and under-explored group, pose unique taxonomic complexities. We explore machine learning approaches, comparing supervised CNNs, fine-tuned foundation models, and a DNA barcode-specific masking strategy across datasets of varying complexity. While simpler datasets and tasks favor supervised CNNs or fine-tuned transformers, challenging species-level identification demands a paradigm shift towards self-supervised pretraining. We propose BarcodeBERT, the first self-supervised method for general biodiversity analysis, leveraging a 1.5 M invertebrate DNA barcode reference library. This work highlights how dataset specifics and coverage impact model selection, and underscores the role of self-supervised pretraining in achieving high-accuracy DNA barcode-based identification at the species and genus level. Indeed, without the fine-tuning step, BarcodeBERT pretrained on a large DNA barcode dataset outperforms DNABERT and DNABERT-2 on multiple downstream classification tasks. The code repository is available at https://github.com/Kari-Genomics-Lab/BarcodeBERT |
Comment: | Main text: 5 pages, Total: 9 pages, 2 figures, accepted at the 4th Workshop on Self-Supervised Learning: Theory and Practice (NeurIPS 2023) |
Document type: | Working Paper |
Access URL: | http://arxiv.org/abs/2311.02401 |
Accession number: | edsarx.2311.02401 |
Database: | arXiv |
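
The description mentions a DNA barcode-specific masking strategy used for self-supervised pretraining but gives no details. As a rough illustration of what masked pretraining inputs for DNA sequences can look like, the sketch below splits a sequence into k-mer tokens and hides a random subset of them. It is not taken from the BarcodeBERT repository: the function names `kmer_tokenize` and `mask_tokens`, the non-overlapping tokenization, the k-mer length of 4, and the 50% mask rate are all assumptions chosen for the example.

```python
import random

MASK = "[MASK]"  # placeholder mask token (name is an assumption)


def kmer_tokenize(seq, k=4):
    """Split a DNA sequence into non-overlapping k-mers.

    Any trailing bases that do not fill a whole k-mer are dropped.
    The choice of non-overlapping k-mers and k=4 is illustrative only.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


def mask_tokens(tokens, mask_rate=0.5, seed=0):
    """Replace a random subset of tokens with MASK.

    Returns the masked token list and a dict mapping each masked
    position to its original token (the targets a model would predict).
    The mask rate is a hypothetical setting, not a value from the paper.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_rate)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else t for i, t in enumerate(tokens)]
    labels = {i: tokens[i] for i in positions}
    return masked, labels


if __name__ == "__main__":
    tokens = kmer_tokenize("AACTTTATATTTTATTTTTGGAATTTGATCAGG", k=4)
    masked, labels = mask_tokens(tokens, mask_rate=0.5, seed=0)
    print(tokens)
    print(masked)
    print(labels)
```

During pretraining, a model would receive the masked token list and be trained to recover the tokens stored in `labels`; the actual tokenizer, vocabulary, and masking schedule should be taken from the paper and its code repository.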