Real-time multilingual speech recognition and speaker diarization system based on Whisper segmentation.

Bibliographic Details
Title:	Real-time multilingual speech recognition and speaker diarization system based on Whisper segmentation.
Authors:	Ke-Ming Lyu, Ren-yuan Lyu, Hsien-Tsung Chang
Source:	PeerJ Computer Science; Mar 2024, p1-19, 19p
Subject Terms:	AUTOMATIC speech recognition, SPEECH perception, SPEECH, LINGUISTIC context, TELEVISION talk programs, ERROR rates
Company/Entity:	OPENAI Inc.
Abstract:	This research presents the development of a cutting-edge real-time multilingual speech recognition and speaker diarization system that leverages OpenAI's Whisper model. The system specifically addresses the challenges of automatic speech recognition (ASR) and speaker diarization (SD) in dynamic, multispeaker environments, with a focus on accurately processing Mandarin speech with Taiwanese accents and managing frequent speaker switches. Traditional speech recognition systems often fall short in such complex multilingual and multispeaker contexts, particularly in SD. This study, therefore, integrates advanced speech recognition with speaker diarization techniques optimized for real-time applications. These optimizations include handling model outputs efficiently and incorporating speaker embedding technology. The system was evaluated using data from Taiwanese talk shows and political commentary programs, featuring 46 diverse speakers. The results showed a promising word diarization error rate (WDER) of 2.68% in two-speaker scenarios and 11.65% in three-speaker scenarios, with an overall WDER of 6.96%. This performance is comparable to that of non-real-time baseline models, highlighting the system's ability to adapt to various complex conversational dynamics, a significant advancement in the field of real-time multilingual speech processing. [ABSTRACT FROM AUTHOR]
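The WDER figures quoted in the abstract can be made concrete with a small sketch. The record does not state the exact formula the authors used, so the code below assumes a commonly used definition: among words that the ASR alignment marks as correct or substituted, WDER is the fraction assigned to the wrong speaker. The `AlignedWord` tuple layout, the `wder` function name, and the sample data are hypothetical, and the hypothesis speaker labels are assumed to be already mapped onto the reference labels.

```python
from typing import List, Tuple

# Hypothetical layout: (reference word, reference speaker,
#                       hypothesis word, hypothesis speaker).
# Only aligned pairs (correct or substituted words) are included;
# insertions and deletions are assumed to be excluded upstream.
AlignedWord = Tuple[str, str, str, str]


def wder(aligned: List[AlignedWord]) -> float:
    """Fraction of aligned words attributed to the wrong speaker (assumed definition)."""
    if not aligned:
        raise ValueError("need at least one aligned word")
    wrong_speaker = sum(
        1 for _, ref_spk, _, hyp_spk in aligned if ref_spk != hyp_spk
    )
    return wrong_speaker / len(aligned)


# Made-up example: four aligned words, one with the wrong speaker label.
example = [
    ("hello", "A", "hello", "A"),
    ("there", "A", "their", "A"),  # ASR substitution, speaker still correct
    ("yes", "B", "yes", "A"),      # correct word, wrong speaker
    ("indeed", "B", "indeed", "B"),
]
score = wder(example)
```

Under this assumed definition a word-level recognition error does not count against WDER by itself; only a wrong speaker attribution does.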
Database:	Complementary Index

Description
ISSN:	2376-5992
DOI:	10.7717/peerj-cs.1973