Academic Journal

Creating a ground truth multilingual dataset of news and talk show transcriptions through crowdsourcing.

Bibliographic Details
Title: Creating a ground truth multilingual dataset of news and talk show transcriptions through crowdsourcing.
Authors: Sprugnoli, Rachele; Moretti, Giovanni; Bentivogli, Luisa; Giuliani, Diego
Source: Language Resources & Evaluation; Jun 2017, Vol. 51 Issue 2, p283-317, 35p
Subject Terms: MULTILINGUAL communication, CROWDSOURCING, TRANSCRIPTION, ORTHOGRAPHIC projection, ORAL communication
Abstract: This paper describes the development of a multilingual and multigenre manually annotated speech dataset, freely available to the research community as ground truth for the evaluation of automatic transcription systems and spoken language translation systems. The dataset includes two video genres (television broadcast news and talk shows) and covers Flemish, English, German, and Italian, for a total of about 35 hours of television speech. Besides segmentation and orthographic transcription, we added a very rich annotation on the audio signal, both at the linguistic level (e.g. filled pauses, pronunciation errors, disfluencies, speech in a foreign language) and at the acoustic level (e.g. background noise and different types of non-speech events). Furthermore, a subset of the transcriptions is translated in four directions, namely Flemish to English, German to English, German to Italian, and English to Italian. The development of this dataset was organized in several phases, relying on expert transcribers as well as involving non-expert contributors through crowdsourcing. We first conducted a feasibility study to test and compare two methods for crowdsourcing speech transcription on broadcast news data. These methods are based on different transcription processes (i.e. parallel vs. iterative) and incorporate two different quality control mechanisms. With both methods, we achieved near-expert transcription quality, in terms of word error rate, for English, German, and Italian data. For Flemish data, however, we were not able to get a sufficient response from the crowd to complete the offered transcription tasks. The results obtained demonstrate that the viability of methods for crowdsourcing speech transcription significantly depends on the target language.
This paper provides a detailed comparison of the results obtained with the two crowdsourcing methods tested, describes the main characteristics of the final ground truth resource created as well as the methodology adopted, and the guidelines prepared for its development. [ABSTRACT FROM AUTHOR]
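The abstract reports transcription quality in terms of word error rate (WER). As a minimal sketch of how that metric is commonly computed, the Python function below takes the word-level edit distance between a reference and a hypothesis transcript and divides it by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the record does not specify the authors' scoring tool or text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative sketch only: tokenization is plain whitespace splitting,
    with no case folding or punctuation handling (an assumption).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.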
Database: Complementary Index
Description
ISSN: 1574-020X
DOI:10.1007/s10579-016-9372-5