Towards Pareto Optimal Throughput in Small Language Model Serving

Bibliographic Details
Title: Towards Pareto Optimal Throughput in Small Language Model Serving
Authors: Recasens, Pol G., Zhu, Yue, Wang, Chen, Lee, Eun Kyung, Tardieu, Olivier, Youssef, Alaa, Torres, Jordi, Berral, Josep Ll.
Publication Year: 2024
Collection: Computer Science
Subject Terms: Computer Science - Computation and Language
Description: Large language models (LLMs) have revolutionized the state of the art in many natural language processing tasks. Although serving LLMs is computationally and memory demanding, the rise of Small Language Models (SLMs) offers new opportunities for resource-constrained users, who are now able to serve small models with cutting-edge performance. In this paper, we present a set of experiments designed to benchmark SLM inference in terms of performance and energy. Our analysis provides a new perspective on serving, highlighting that the small memory footprint of SLMs allows Pareto-optimal throughput to be reached within the resource capacity of a single accelerator. In this regard, we present an initial set of findings demonstrating how model replication can effectively improve resource utilization for serving SLMs.
Comment: To be published at EuroMLSys'24
Document Type: Working Paper
DOI: 10.1145/3642970.3655832
Access URL: http://arxiv.org/abs/2404.03353
Accession Number: edsarx.2404.03353
Database: arXiv