Report
Towards Pareto Optimal Throughput in Small Language Model Serving
| Title | Towards Pareto Optimal Throughput in Small Language Model Serving |
|---|---|
| Authors | Recasens, Pol G., Zhu, Yue, Wang, Chen, Lee, Eun Kyung, Tardieu, Olivier, Youssef, Alaa, Torres, Jordi, Berral, Josep Ll. |
| Publication Year | 2024 |
| Collection | Computer Science |
| Subject Terms | Computer Science - Computation and Language |
| Description | Large language models (LLMs) have revolutionized the state of the art across many natural language processing tasks. Although serving LLMs is computationally and memory demanding, the rise of Small Language Models (SLMs) offers new opportunities for resource-constrained users, who are now able to serve small models with cutting-edge performance. In this paper, we present a set of experiments designed to benchmark SLM inference in terms of performance and energy. Our analysis provides a new perspective on serving, highlighting that the small memory footprint of SLMs allows the Pareto-optimal throughput to be reached within the resource capacity of a single accelerator. In this regard, we present an initial set of findings demonstrating how model replication can effectively improve resource utilization for serving SLMs. Comment: To be published at EuroMLSys'24 |
| Document Type | Working Paper |
| DOI | 10.1145/3642970.3655832 |
| Access URL | http://arxiv.org/abs/2404.03353 |
| Accession Number | edsarx.2404.03353 |
| Database | arXiv |