Alexa Teacher Model

Bibliographic Details
Title: Alexa Teacher Model
Authors: FitzGerald, Jack; Ananthakrishnan, Shankar; Arkoudas, Konstantine; Bernardi, Davide; Bhagia, Abhishek; Bovi, Claudio Delli; Cao, Jin; Chada, Rakesh; Chauhan, Amit; Chen, Luoxin; Dwarakanath, Anurag; Dwivedi, Satyam; Gojayev, Turan; Gopalakrishnan, Karthik; Gueudre, Thomas; Hakkani-Tur, Dilek; Hamza, Wael; Hueser, Jonathan; Jose, Kevin Martin; Khan, Haidar; Liu, Beiye; Lu, Jianhua; Manzotti, Alessandro; Natarajan, Pradeep; Owczarzak, Karolina; Oz, Gokmen; Palumbo, Enrico; Peris, Charith; Prakash, Chandana Satya; Rawls, Stephen; Rosenbaum, Andy; Shenoy, Anjali; Soltan, Saleh; Sridhar, Mukund Harakere; Tan, Liz; Triefenbach, Fabian; Wei, Pan; Yu, Haiyang; Zheng, Shuai; Tur, Gokhan; Natarajan, Prem
Source: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
Publication Information: ACM, 2022.
Publication Year: 2022
Subject Terms: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, I.2.7, Computation and Language (cs.CL), Machine Learning (cs.LG)
Description: We present results from a large-scale experiment on pretraining encoders with non-embedding parameter counts ranging from 700M to 9.3B, their subsequent distillation into smaller models ranging from 17M to 170M parameters, and their application to the Natural Language Understanding (NLU) component of a virtual assistant system. Though we train using 70% spoken-form data, our teacher models perform comparably to XLM-R and mT5 when evaluated on the written-form Cross-lingual Natural Language Inference (XNLI) corpus. We perform a second stage of pretraining on our teacher models using in-domain data from our system, improving error rates by 3.86% relative for intent classification and 7.01% relative for slot filling. We find that even a 170M-parameter model distilled from our Stage 2 teacher model has 2.88% lower intent classification and 7.69% lower slot filling error rates when compared to the 2.3B-parameter teacher trained only on public data (Stage 1), emphasizing the importance of in-domain data for pretraining. When evaluated offline using labeled NLU data, our 17M-parameter Stage 2 distilled model outperforms both XLM-R Base (85M params) and DistilBERT (42M params) by 4.23% and 6.14%, respectively. Finally, we present results from a full virtual assistant experimentation platform, where we find that models trained using our pretraining and distillation pipeline outperform models distilled from 85M-parameter teachers by 3.74%-4.91% on an automatic measurement of full-system user dissatisfaction.
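Note: The distillation step summarized above is, at its core, a teacher-student training objective. The minimal Python (PyTorch) sketch below shows one generic form of such an objective, blending soft-target KL divergence against the teacher's softened output distribution with cross-entropy on gold labels such as intent classes. The function name, temperature, and mixing weight alpha are illustrative assumptions, not the authors' exact recipe.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Soften both distributions with the temperature; higher T exposes more
    # of the teacher's "dark knowledge" about non-target classes.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL term, scaled by T^2 so its gradient magnitude stays comparable
    # to the hard-label cross-entropy term.
    kl = F.kl_div(log_soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2
    # Standard cross-entropy on the gold labels (e.g., intent classes).
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kl + (1.0 - alpha) * ce

# Illustrative usage: a batch of 4 utterances over 10 hypothetical intent classes.
student_logits = torch.randn(4, 10)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)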
Comment: KDD 2022
Access URL: https://explore.openaire.eu/search/publication?articleId=doi_dedup___::3be35e4212a3f3d4bb494e7d071b86ca
https://doi.org/10.1145/3534678.3539173
Rights: OPEN
Accession Number: edsair.doi.dedup.....3be35e4212a3f3d4bb494e7d071b86ca
Database: OpenAIRE