Report
Integrating Pre-Trained Speech and Language Models for End-to-End Speech Recognition
Title: | Integrating Pre-Trained Speech and Language Models for End-to-End Speech Recognition |
---|---|
Authors: | Hono, Yukiya, Mitsuda, Koh, Zhao, Tianyu, Mitsui, Kentaro, Wakatsuki, Toshiaki, Sawada, Kei |
Publication year: | 2023 |
Collection: | Computer Science |
Subject terms: | Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning |
Description: | Advances in machine learning have made it possible to perform various text and speech processing tasks, such as automatic speech recognition (ASR), in an end-to-end (E2E) manner. E2E approaches utilizing pre-trained models are gaining attention for conserving training data and resources. However, most of their applications in ASR involve only one of either a pre-trained speech or a language model. This paper proposes integrating a pre-trained speech representation model and a large language model (LLM) for E2E ASR. The proposed model enables the optimization of the entire ASR process, including acoustic feature extraction and acoustic and language modeling, by combining pre-trained models with a bridge network and also enables the application of remarkable developments in LLM utilization, such as parameter-efficient domain adaptation and inference optimization. Experimental results demonstrate that the proposed model achieves a performance comparable to that of modern E2E ASR models by utilizing powerful pre-training models with the proposed integrated approach. Comment: 17 pages, 4 figures, 9 tables, accepted for Findings of ACL 2024. The model is available at https://huggingface.co/rinna/nue-asr |
Document type: | Working Paper |
Access URL: | http://arxiv.org/abs/2312.03668 |
Accession number: | edsarx.2312.03668 |
Database: | arXiv |
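
The description above mentions combining a pre-trained speech representation model and an LLM through a bridge network. The following is a minimal sketch of that idea in plain Python, assuming the bridge shortens the frame sequence by stacking adjacent frames and then applies a linear projection to the LLM embedding width. The stacking factor, the dimensions, the weights, and the function names (`stack_frames`, `linear`, `bridge`) are illustrative assumptions; they are not taken from the paper or from the rinna/nue-asr code, whose actual bridge design may differ.

```python
from typing import List

Vector = List[float]


def stack_frames(frames: List[Vector], factor: int) -> List[Vector]:
    """Concatenate every `factor` consecutive frames; leftover frames are dropped."""
    usable = len(frames) - len(frames) % factor
    stacked: List[Vector] = []
    for start in range(0, usable, factor):
        merged: Vector = []
        for frame in frames[start:start + factor]:
            merged.extend(frame)
        stacked.append(merged)
    return stacked


def linear(x: Vector, weight: List[Vector], bias: Vector) -> Vector:
    """Compute y = W x + b, with `weight` stored as one row per output dimension."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def bridge(frames: List[Vector], factor: int,
           weight: List[Vector], bias: Vector) -> List[Vector]:
    """Hypothetical bridge: shorten the sequence, then project to the LLM width."""
    return [linear(x, weight, bias) for x in stack_frames(frames, factor)]


# Toy stand-in for speech-encoder output: 5 frames of dimension 2 (assumed values).
speech_features = [[float(t), float(t) + 0.5] for t in range(5)]
factor = 2    # assumed downsampling factor
llm_dim = 3   # assumed LLM embedding width
weight = [[0.1] * (2 * factor) for _ in range(llm_dim)]
bias = [0.0] * llm_dim

llm_inputs = bridge(speech_features, factor, weight, bias)
print(len(llm_inputs), len(llm_inputs[0]))
```

In a full system, the vectors in `llm_inputs` would be fed to the LLM in place of ordinary token embeddings, and all three components could then be optimized jointly, which is the end-to-end property the description emphasizes.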