Machine Learning based Language Modelling of Code Switched Data

Bibliographic details
Title: Machine Learning based Language Modelling of Code Switched Data
Authors: Vallabh Patil, Shubham Pasari, Vaibhav Kumar, Sumedha Seniaray
Source: 2020 International Conference on Electronics and Sustainable Communication Systems (ICESC).
Publication info: IEEE, 2020.
Publication year: 2020
Subject terms: Hindi, Computer science, Feature extraction, Universal language, Ambiguity, Variation (linguistics), Task analysis, Social media, Language model, Artificial intelligence, Natural language processing
Description: With the rapid growth in internet users worldwide, social media platforms have expanded at a tremendous pace. Code-switched languages, in which the speaker alternates between two or more languages (e.g., Hinglish, Hindi words written in English), are a popular medium of communication on social media. They are characterized by a lack of grammatical structure and by variation in spelling. These linguistic characteristics, combined with a lack of data, cause ambiguity and make text classification on code-switched data difficult. In this paper, we propose a Language Modelling (LM) based approach to text classification of Hinglish text. We approach this problem by building a language model with Universal Language Model Fine-tuning (ULMFiT), using the AWD-LSTM architecture, on a Hindi-English code-switched (Hinglish) corpus collected from various blogging sites. The language model encodes important information about the code-switched data and can be quickly fine-tuned on a given Hinglish dataset to achieve good results. We evaluated our model on the code-switched aggression detection TRAC-1 dataset, the Hinglish Offensive Tweet (HOT) dataset, and a humour-classification dataset. On these datasets, our proposed method surpassed previously reported results.
Access URL: https://explore.openaire.eu/search/publication?articleId=doi_________::ea3005392f0d85be87fc40f87d872d3e
https://doi.org/10.1109/icesc48915.2020.9155695
Rights: CLOSED
Accession number: edsair.doi...........ea3005392f0d85be87fc40f87d872d3e
Database: OpenAIRE