Academic Journal

TweetLID: a benchmark for tweet language identification.

Bibliographic Details
Title: TweetLID: a benchmark for tweet language identification.
Authors: Zubiaga, Arkaitz; Vicente, Iñaki; Gamallo, Pablo; Pichel, José; Alegria, Iñaki; Aranberri, Nora; Ezeiza, Aitzol; Fresno, Víctor
Source: Language Resources & Evaluation; Dec 2016, Vol. 50, Issue 4, pp. 729-766 (38 pages)
Subject Terms: LANGUAGE identification (Computational linguistics), NATURAL language processing, MULTILINGUALISM, LANGUAGE & languages, MICROBLOGS
Abstract: Language identification, the task of determining the language a given text is written in, has progressed substantially in recent decades. However, three main issues remain unresolved: (1) distinguishing similar languages, (2) detecting multilingualism in a single document, and (3) identifying the language of short texts. In this paper, we describe our work on the development of a benchmark to encourage further research in these three directions, set forth an evaluation framework suitable for the task, and make a dataset of annotated tweets publicly available for research purposes. We also describe the shared task we organized to validate and assess the evaluation framework and dataset with systems submitted by seven different participants, and analyze the performance of these systems. The evaluation of the results submitted by the participants of the shared task helped us shed some light on the shortcomings of state-of-the-art language identification systems, and gave insight into the extent to which the brevity, multilingualism, and language similarity found in texts degrade the performance of language identifiers. Our dataset of nearly 35,000 tweets and the evaluation framework provide researchers and practitioners with suitable resources to further study the aforementioned issues in language identification within a common setting that enables results to be compared with one another. [ABSTRACT FROM AUTHOR]
Database: Complementary Index
Description
ISSN: 1574-020X
DOI: 10.1007/s10579-015-9317-4