NLPBench: Evaluating Large Language Models on Solving NLP Problems

Bibliographic Details
Title: NLPBench: Evaluating Large Language Models on Solving NLP Problems
Authors: Song, Linxin; Zhang, Jieyu; Cheng, Lechao; Zhou, Pengyuan; Zhou, Tianyi; Li, Irene
Publication Year: 2023
Collection: Computer Science
Subject Terms: Computer Science - Computation and Language
Description: Recent developments in large language models (LLMs) have shown promise in enhancing the capabilities of natural language processing (NLP). Despite these successes, there remains a dearth of research dedicated to the NLP problem-solving abilities of LLMs. To fill the gap in this area, we present a unique benchmarking dataset, NLPBench, comprising 378 college-level NLP questions spanning various NLP topics sourced from Yale University's prior final exams. NLPBench includes questions with context, in which multiple sub-questions share the same public information, and diverse question types, including multiple choice, short answer, and math. Our evaluation, centered on LLMs such as GPT-3.5/4, PaLM-2, and LLAMA-2, incorporates advanced prompting strategies like the chain-of-thought (CoT) and tree-of-thought (ToT). Our study reveals that the effectiveness of the advanced prompting strategies can be inconsistent, occasionally damaging LLM performance, especially in smaller models like the LLAMA-2 (13b). Furthermore, our manual assessment illuminated specific shortcomings in LLMs' scientific problem-solving skills, with weaknesses in logical decomposition and reasoning notably affecting results.
Document Type: Working Paper
Access URL: http://arxiv.org/abs/2309.15630
Accession Number: edsarx.2309.15630
Database: arXiv
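
Note: the abstract above mentions zero-shot chain-of-thought (CoT) prompting among the evaluated strategies. The sketch below is a minimal illustration of how such a prompt might be assembled for an NLPBench-style multiple-choice question with shared context; it is not the paper's evaluation code, and the sample question, the MultipleChoiceQuestion class, and the query_model stub are hypothetical placeholders.

    # Illustrative sketch only: zero-shot CoT prompt construction for an
    # NLPBench-style multiple-choice question (not the authors' code).
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class MultipleChoiceQuestion:
        context: str          # shared public information for a set of sub-questions
        question: str
        options: List[str]

    def build_cot_prompt(q: MultipleChoiceQuestion) -> str:
        """Append the common zero-shot CoT cue ("Let's think step by step")."""
        lettered = "\n".join(
            f"({chr(ord('A') + i)}) {opt}" for i, opt in enumerate(q.options)
        )
        return (
            f"Context:\n{q.context}\n\n"
            f"Question: {q.question}\n{lettered}\n\n"
            "Let's think step by step, then give the final answer as a single letter."
        )

    def query_model(prompt: str) -> str:
        """Hypothetical stand-in for an LLM API call (e.g. GPT-3.5/4, PaLM-2, LLaMA-2)."""
        raise NotImplementedError("Replace with your model-serving client of choice.")

    if __name__ == "__main__":
        q = MultipleChoiceQuestion(
            context="A bigram language model estimates P(w_i | w_{i-1}) from counts.",
            question="Which method reserves probability mass for unseen bigrams?",
            options=["Maximum likelihood estimation", "Add-one (Laplace) smoothing",
                     "Greedy decoding", "Beam search"],
        )
        print(build_cot_prompt(q))  # pass the result to query_model() in a real run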