Experiments with Detecting and Mitigating AI Deception

Bibliographic Details
Title: Experiments with Detecting and Mitigating AI Deception
Authors: Sahbane, Ismail; Ward, Francis Rhys; Åslund, C Henrik
Publication Year: 2023
Collection: Computer Science
Subject Terms: Computer Science - Artificial Intelligence
Description: How to detect and mitigate deceptive AI systems is an open problem for the field of safe and trustworthy AI. We analyse two algorithms for mitigating deception. The first is based on the path-specific objectives framework, where paths in the game that incentivise deception are removed. The second is based on shielding, i.e., monitoring for unsafe policies and replacing them with a safe reference policy. We construct two simple games and evaluate our algorithms empirically. We find that both methods ensure that our agent is not deceptive; however, shielding tends to achieve higher reward.
Comment: 4 pages, 2 figures, 3 algorithms, 1 table
Document Type: Working Paper
Access URL: http://arxiv.org/abs/2306.14816
Accession Number: edsarx.2306.14816
Database: arXiv
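
Illustrative note: the description above characterises shielding as monitoring for unsafe policies and replacing them with a safe reference policy. The following is a minimal sketch of that general idea only, not the paper's algorithm. The function `shield`, the toy policies `honest` and `greedy`, the monitor `flags_lie`, and the state and action labels are all hypothetical names introduced for illustration.

```python
from typing import Callable, Iterable

# Hypothetical types: a policy maps a state label to an action label,
# and a monitor decides whether a (state, action) pair is unsafe.
Policy = Callable[[str], str]
Monitor = Callable[[str, str], bool]


def shield(policy: Policy, reference_policy: Policy,
           is_unsafe: Monitor, states: Iterable[str]) -> Policy:
    """Return `policy` if the monitor flags none of its actions on `states`;
    otherwise fall back to `reference_policy` (assumed safe by construction)."""
    for state in states:
        if is_unsafe(state, policy(state)):
            return reference_policy
    return policy


# Toy setup (assumed, not taken from the paper's games).
def honest(state: str) -> str:
    return "tell_truth"


def greedy(state: str) -> str:
    # Lies only when the truth would be unfavourable.
    return "lie" if state == "low_quality" else "tell_truth"


def flags_lie(state: str, action: str) -> bool:
    return action == "lie"


STATES = ["high_quality", "low_quality"]
deployed = shield(greedy, honest, flags_lie, STATES)
```

In this toy setup `greedy` lies in the `low_quality` state, so the shield deploys `honest` instead. A policy that the monitor never flags on the checked states is deployed unchanged, which is one way to read the description's remark that shielding tends to achieve higher reward: the agent is only overridden when the monitor finds a problem.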