SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity

Bibliographic Details
Title: SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity
Authors: Aynetdinov, Ansar; Akbik, Alan
Publication Year: 2024
Collection: Computer Science
Subject Terms: Computer Science - Computation and Language
Description: Instruction-tuned Large Language Models (LLMs) have recently showcased remarkable advancements in their ability to generate fitting responses to natural language instructions. However, many current works rely on manual evaluation to judge the quality of generated responses. Since such manual evaluation is time-consuming, it does not easily scale to the evaluation of multiple models and model variants. In this short paper, we propose a straightforward but remarkably effective evaluation metric called SemScore, in which we directly compare model outputs to gold target responses using semantic textual similarity (STS). We conduct a comparative evaluation of the model outputs of 12 prominent instruction-tuned LLMs using 8 widely-used evaluation metrics for text generation. We find that our proposed SemScore metric outperforms all other, in many cases more complex, evaluation metrics in terms of correlation to human evaluation. These findings indicate the utility of our proposed metric for the evaluation of instruction-tuned LLMs.
Document Type: Working Paper
Access URL: http://arxiv.org/abs/2401.17072
Accession Number: edsarx.2401.17072
Database: arXiv
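
Illustrative sketch: The description above says the metric compares model outputs to gold target responses using semantic textual similarity. The Python below is a minimal sketch of that general idea, not the authors' implementation. This record does not state which embedding model or similarity function the paper uses, so the cosine similarity, the averaging over pairs, and the names `toy_embed`, `cosine`, and `similarity_score` are all assumptions made for illustration. In particular, `toy_embed` is a word-count placeholder that captures lexical overlap only; it is not a semantic encoder.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Hypothetical placeholder embedding: lowercase word counts.

    This is NOT semantic; it only stands in for a real sentence encoder
    so that the sketch runs without external dependencies.
    """
    return dict(Counter(re.findall(r"[a-z0-9]+", text.lower())))


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_score(
    outputs: List[str],
    targets: List[str],
    embed: Callable[[str], Vector] = toy_embed,
) -> float:
    """Average similarity between each model output and its gold target.

    Assumption: scores are averaged over all (output, target) pairs.
    """
    if not outputs or len(outputs) != len(targets):
        raise ValueError("outputs and targets must be non-empty and equal in length")
    sims = [cosine(embed(out), embed(gold)) for out, gold in zip(outputs, targets)]
    return sum(sims) / len(sims)
```

To approximate the semantic comparison the description refers to, `embed` would be replaced by a function that wraps a real sentence encoder; the scoring loop itself would stay the same.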