$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

Bibliographic Details
Title: $\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
Authors: Yao, Shunyu; Shinn, Noah; Razavi, Pedram; Narasimhan, Karthik
Publication Year: 2024
Collection: Computer Science
Subject Terms: Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Description: Existing benchmarks do not test language agents on their interaction with human users or ability to follow domain-specific rules, both of which are vital for deploying them in real-world applications. We propose $\tau$-bench, a benchmark emulating dynamic conversations between a user (simulated by language models) and a language agent provided with domain-specific API tools and policy guidelines. We employ an efficient and faithful evaluation process that compares the database state at the end of a conversation with the annotated goal state. We also propose a new metric (pass^k) to evaluate the reliability of agent behavior over multiple trials. Our experiments show that even state-of-the-art function calling agents (like gpt-4o) succeed on <50% of the tasks, and are quite inconsistent (pass^8 <25% in retail). Our findings point to the need for methods that can improve the ability of agents to act consistently and follow rules reliably. (See the pass^k sketch following this record.)
Document Type: Working Paper
Access URL: http://arxiv.org/abs/2406.12045
Accession Number: edsarx.2406.12045
Database: arXiv
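
The pass^k metric named in the abstract scores an agent on whether it solves the same task in every one of k independent trials, not just at least once, which is why pass^8 can fall well below pass^1. The abstract does not spell out the estimator, so the sketch below assumes the standard combinatorial form C(c, k)/C(n, k) per task (an all-trials analogue of the familiar pass@k estimator), averaged over tasks; the function names and example data are illustrative, not taken from the paper.

```python
from math import comb

def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate, for one task, the probability that k i.i.d. trials all succeed,
    given num_successes out of num_trials observed runs.
    Assumed estimator: C(c, k) / C(n, k); comb() returns 0 when c < k."""
    if k > num_trials:
        raise ValueError("k must not exceed the number of trials")
    return comb(num_successes, k) / comb(num_trials, k)

def benchmark_pass_hat_k(results_per_task: list[list[bool]], k: int) -> float:
    """Average the per-task estimate over all tasks in the benchmark."""
    return sum(
        pass_hat_k(len(trials), sum(trials), k) for trials in results_per_task
    ) / len(results_per_task)

# Example: 3 tasks, 8 trials each. pass^1 is the usual average success rate,
# while pass^8 only credits tasks solved in every single trial.
results = [
    [True] * 8,         # solved every time
    [True, False] * 4,  # solved half the time
    [False] * 8,        # never solved
]
print(benchmark_pass_hat_k(results, k=1))  # 0.5
print(benchmark_pass_hat_k(results, k=8))  # ~0.33 (only the first task counts)
```

The gap between the two printed values mirrors the reliability gap the abstract reports: averaging over single attempts hides inconsistency that an all-k-trials criterion exposes.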