Report
End-to-End Single-Channel Speaker-Turn Aware Conversational Speech Translation
Title: | End-to-End Single-Channel Speaker-Turn Aware Conversational Speech Translation |
---|---|
Authors: | Zuluaga-Gomez, Juan; Huang, Zhaocheng; Niu, Xing; Paturi, Rohit; Srinivasan, Sundararajan; Mathur, Prashant; Thompson, Brian; Federico, Marcello |
Publication Year: | 2023 |
Collection: | Computer Science |
Subject Terms: | Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing |
Description: | Conventional speech-to-text translation (ST) systems are trained on single-speaker utterances, and they may not generalize to real-life scenarios where the audio contains conversations by multiple speakers. In this paper, we tackle single-channel multi-speaker conversational ST with an end-to-end and multi-task training model, named Speaker-Turn Aware Conversational Speech Translation, that combines automatic speech recognition, speech translation and speaker turn detection using special tokens in a serialized labeling format. We run experiments on the Fisher-CALLHOME corpus, which we adapted by merging the two single-speaker channels into one multi-speaker channel, thus representing the more realistic and challenging scenario with multi-speaker turns and cross-talk. Experimental results across single- and multi-speaker conditions and against conventional ST systems show that our model outperforms the reference systems on the multi-speaker condition, while attaining comparable performance on the single-speaker condition. We release scripts for data processing and model training. Comment: Accepted at EMNLP 2023. Code: https://github.com/amazon-science/stac-speech-translation |
Document Type: | Working Paper |
Access URL: | http://arxiv.org/abs/2311.00697 |
Accession Number: | edsarx.2311.00697 |
Database: | arXiv |