GlossLM: Multilingual Pretraining for Low-Resource Interlinear Glossing

Bibliographic Details
Title: GlossLM: Multilingual Pretraining for Low-Resource Interlinear Glossing
Authors: Ginn, Michael; Tjuatja, Lindia; He, Taiqi; Rice, Enora; Neubig, Graham; Palmer, Alexis; Levin, Lori
Publication Year: 2024
Collection: Computer Science
Subject Terms: Computer Science - Computation and Language
Description: Language documentation projects often involve the creation of annotated text in a format such as interlinear glossed text (IGT), which captures fine-grained morphosyntactic analyses in a morpheme-by-morpheme format. However, there are few existing resources providing large amounts of standardized, easily accessible IGT data, limiting their applicability to linguistic research, and making it difficult to use such data in NLP modeling. We compile the largest existing corpus of IGT data from a variety of sources, covering over 450k examples across 1.8k languages, to enable research on crosslingual transfer and IGT generation. We normalize much of our data to follow a standard set of labels across languages. Furthermore, we explore the task of automatically generating IGT in order to aid documentation projects. As many languages lack sufficient monolingual data, we pretrain a large multilingual model on our corpus. We demonstrate the utility of this model by finetuning it on monolingual corpora, outperforming SOTA models by up to 6.6%. We will make our pretrained model and dataset available through Hugging Face, as well as provide access through a web interface for use in language documentation efforts.
Comment: 19 pages, 7 figures. Submitted to ACL ARR, June 2024. The first two authors contributed equally.
Document Type: Working Paper
Access URL: http://arxiv.org/abs/2403.06399
Accession Number: edsarx.2403.06399
Database: arXiv