MaskMoE: Boosting Token-Level Learning via Routing Mask in Mixture-of-Experts

Bibliographic Details
Title: MaskMoE: Boosting Token-Level Learning via Routing Mask in Mixture-of-Experts
Authors: Su, Zhenpeng; Lin, Zijia; Bai, Xue; Wu, Xing; Xiong, Yizhe; Lian, Haoran; Ma, Guangyuan; Chen, Hui; Ding, Guiguang; Zhou, Wei; Hu, Songlin
Publication Year: 2024
Collection: Computer Science
Subject Terms: Computer Science - Computation and Language
Description: Scaling up a model enhances its capabilities but significantly increases computational complexity. Mixture-of-Experts (MoE) models address this issue by allowing model size to scale up without substantially increasing training or inference costs. An MoE model contains an important module, the router, which distributes each token to the experts. Currently, the mainstream routing methods are dynamic routing and fixed routing. Despite their promising results, MoE models face several challenges. Primarily, for dynamic routing methods, the dispersion of training tokens across multiple experts can lead to underfitting, particularly for infrequent tokens. Fixed routing methods can mitigate that issue, but they compromise the diversity of representations. In this paper, we propose MaskMoE, a method designed to enhance token-level learning by employing a routing masking technique within the Mixture-of-Experts model. MaskMoE maintains representation diversity while achieving more comprehensive training. Experimental results demonstrate that our method outperforms previous dominant Mixture-of-Experts models in terms of both perplexity (PPL) and downstream task performance.
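The abstract does not spell out the masking mechanism, so the following is only a minimal sketch of the general idea it describes: a top-k MoE router whose logits are masked so that each token is routed only within a restricted set of visible experts. All names and settings here (`masked_topk_routing`, the toy mask that confines an "infrequent" token to a single expert, k = 2) are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch (not the paper's implementation): a top-k MoE router
# whose logits are masked so each token can only be routed to a restricted
# subset of experts. Frequent tokens keep many visible experts (diverse
# representations); infrequent tokens are confined to fewer experts so their
# gradient updates are concentrated rather than dispersed.
import torch
import torch.nn.functional as F


def masked_topk_routing(router_logits: torch.Tensor,
                        routing_mask: torch.Tensor,
                        k: int = 2):
    """router_logits: [num_tokens, num_experts] raw router scores.
    routing_mask:  [num_tokens, num_experts], 1 = expert visible, 0 = masked.
    Returns top-k expert indices and their normalized gate weights per token.
    """
    # Hide masked experts before softmax / top-k selection.
    masked_logits = router_logits.masked_fill(routing_mask == 0, float("-inf"))
    gates = F.softmax(masked_logits, dim=-1)
    topk_gates, topk_experts = gates.topk(k, dim=-1)
    # Renormalize so the selected gates sum to 1 for each token.
    topk_gates = topk_gates / topk_gates.sum(dim=-1, keepdim=True)
    return topk_experts, topk_gates


if __name__ == "__main__":
    torch.manual_seed(0)
    num_tokens, num_experts = 4, 8
    logits = torch.randn(num_tokens, num_experts)
    # Hypothetical mask: token 0 (an "infrequent" token) sees only expert 3;
    # the remaining tokens see all experts.
    mask = torch.ones(num_tokens, num_experts, dtype=torch.long)
    mask[0] = 0
    mask[0, 3] = 1
    experts, gates = masked_topk_routing(logits, mask, k=2)
    print(experts)  # token 0 is forced onto expert 3 (its second slot gets gate 0)
    print(gates)
```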
Comment: Work in progress
Document Type: Working Paper
Access URL: http://arxiv.org/abs/2407.09816
Accession Number: edsarx.2407.09816
Database: arXiv