GameBench: Evaluating Strategic Reasoning Abilities of LLM Agents

Bibliographic Details
Title: GameBench: Evaluating Strategic Reasoning Abilities of LLM Agents
Authors: Costarelli, Anthony; Allen, Mat; Hauksson, Roman; Sodunke, Grace; Hariharan, Suhas; Cheng, Carlson; Li, Wenjie; Clymer, Joshua; Yadav, Arjun
Publication Year: 2024
Collection: Computer Science
Subject Terms: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Description: Large language models have demonstrated remarkable few-shot performance on many natural language understanding tasks. Despite several demonstrations of large language models in complex, strategic scenarios, there is no comprehensive framework for evaluating agents' performance across the various types of reasoning found in games. To address this gap, we introduce GameBench, a cross-domain benchmark for evaluating the strategic reasoning abilities of LLM agents. We focus on 9 different game environments, each of which covers at least one axis of key reasoning skills identified in strategy games, and select games for which strategy explanations are unlikely to form a significant portion of models' pretraining corpora. Our evaluations use GPT-3.5 and GPT-4 in their base form along with two scaffolding frameworks designed to enhance strategic reasoning ability: Chain-of-Thought (CoT) prompting and Reasoning Via Planning (RAP). Our results show that none of the tested models match human performance, and at worst GPT-4 performs worse than random action. CoT and RAP both improve scores, but neither reaches human-level performance.
Document Type: Working Paper
Access URL: http://arxiv.org/abs/2406.06613
Accession Number: edsarx.2406.06613
Database: arXiv