Dialect Corpora from YouTube

Bibliographic Details
Title: Dialect Corpora from YouTube
Authors: Coats, S. (Steven)
Source: Language and Linguistics in a Complex World ISBN: 9783111017433
Publication Info: De Gruyter, 2023.
Publication Year: 2023
Description: This paper introduces two new large corpora comprised of YouTube Automatic Speech Recognition (ASR) transcripts of the speech of videos from geographically localized channels in the United States, Canada, and the British Isles, a promising resource for more in-depth study of regional language variation in spoken English. The procedure used to create the corpora bypasses the web API for YouTube, instead relying on web scraping and open-source scripts or software for the automatic identification and downloading of suitable channel content as well as dealing with the rate-limiting issues that arise thereby. In order to assess the accuracy of downloaded transcripts, word frequency statistics are compared for ASR and manual transcripts of city council meetings of Philadelphia, Pennsylvania, USA, and a transcript classification task is undertaken using vector-based distributed representations of transcript content. Despite errors, corpora of ASR transcripts may prove useful for the characterization and study of regional language variation, particularly when analytical techniques are employed that are relatively robust to low-frequency phenomena.
File Description: application/pdf
ISBN: 978-3-11-101743-3
Access URL: https://explore.openaire.eu/search/publication?articleId=doi_dedup___::f3a87b7544e5cb46d3116443ba31ea62
https://doi.org/10.1515/9783111017433-005
Rights: OPEN
Accession Number: edsair.doi.dedup.....f3a87b7544e5cb46d3116443ba31ea62
Database: OpenAIRE