Towards Pareto Optimal Throughput in Small Language Model Serving

Bibliographic Details
Title: Towards Pareto Optimal Throughput in Small Language Model Serving
Authors: Recasens, Pol G., Zhu, Yue, Wang, Chen, Lee, Eun Kyung, Tardieu, Olivier, Youssef, Alaa, Torres, Jordi, Berral, Josep Ll.
Publication Year: 2024
Collection: Computer Science
Subject Terms: Computer Science - Computation and Language
Description: Large language models (LLMs) have revolutionized the state of the art in many natural language processing tasks. Although serving LLMs is computationally and memory demanding, the rise of Small Language Models (SLMs) offers new opportunities for resource-constrained users, who are now able to serve small models with cutting-edge performance. In this paper, we present a set of experiments designed to benchmark SLM inference in terms of performance and energy. Our analysis provides a new perspective on serving, highlighting that the small memory footprint of SLMs allows Pareto-optimal throughput to be reached within the resource capacity of a single accelerator. In this regard, we present an initial set of findings demonstrating how model replication can effectively improve resource utilization for serving SLMs.
Comment: To be published at EuroMLSys'24
Document Type: Working Paper
DOI: 10.1145/3642970.3655832
Access URL: http://arxiv.org/abs/2404.03353
Accession Number: edsarx.2404.03353
Database: arXiv