Optimistic Model Rollouts for Pessimistic Offline Policy Optimization

Bibliographic Details
Title: Optimistic Model Rollouts for Pessimistic Offline Policy Optimization
Authors: Zhai, Yuanzhao, Li, Yiying, Gao, Zijian, Gong, Xudong, Xu, Kele, Feng, Dawei, Bo, Ding, Wang, Huaimin
Publication Year: 2024
Collection: Computer Science
Subject Terms: Computer Science - Machine Learning
Description: Model-based offline reinforcement learning (RL) has made remarkable progress, offering a promising avenue for improving generalization with synthetic model rollouts. Existing works primarily focus on incorporating pessimism into policy optimization, usually by constructing a Pessimistic Markov Decision Process (P-MDP). However, the P-MDP discourages policies from learning in out-of-distribution (OOD) regions beyond the support of the offline dataset, which can under-utilize the generalization ability of dynamics models. In contrast, we propose constructing an Optimistic MDP (O-MDP). We first observe the potential benefits of optimism brought by encouraging more OOD rollouts. Motivated by this observation, we present ORPO, a simple yet effective model-based offline RL framework. ORPO generates Optimistic model Rollouts for Pessimistic offline policy Optimization. Specifically, we train an optimistic rollout policy in the O-MDP to sample more OOD model rollouts. We then relabel the sampled state-action pairs with penalized rewards and optimize the output policy in the P-MDP. Theoretically, we demonstrate that the performance of policies trained with ORPO can be lower-bounded in linear MDPs. Experimental results show that our framework significantly outperforms P-MDP baselines by a margin of 30%, achieving state-of-the-art performance on the widely used benchmark. Moreover, ORPO exhibits notable advantages in problems that require generalization.
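The rollout-and-relabel step summarized in the description can be illustrated with a minimal sketch. The code below assumes a MOPO-style uncertainty penalty/bonus (reward minus or plus a scaled model-uncertainty term) for the P-MDP and O-MDP rewards; all function names, parameters, and the penalty form are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

# Sketch of ORPO's relabeling idea (assumed penalty form, not the paper's code):
# rollouts are collected under an optimistic reward, then relabeled with a
# pessimistic reward before optimizing the output policy.

def optimistic_reward(reward, uncertainty, beta=1.0):
    """O-MDP reward: an uncertainty bonus encourages more OOD rollouts (assumed form)."""
    return reward + beta * uncertainty

def pessimistic_reward(reward, uncertainty, lam=1.0):
    """P-MDP reward: an uncertainty penalty keeps the output policy conservative (assumed form)."""
    return reward - lam * uncertainty

def relabel_rollouts(rollouts, lam=1.0):
    """Relabel state-action pairs sampled by the optimistic rollout policy
    with penalized rewards for pessimistic policy optimization."""
    return [(s, a, pessimistic_reward(r, u, lam), s_next)
            for (s, a, r, u, s_next) in rollouts]

# Toy usage: each rollout transition is (state, action, model_reward,
# model_uncertainty, next_state), here filled with random placeholders.
rng = np.random.default_rng(0)
rollouts = [
    (rng.normal(size=3), rng.normal(size=1), float(rng.normal()),
     float(abs(rng.normal())), rng.normal(size=3))
    for _ in range(5)
]
buffer_for_output_policy = relabel_rollouts(rollouts, lam=2.0)
print(len(buffer_for_output_policy), "relabeled transitions")
```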
Document Type: Working Paper
Access URL: http://arxiv.org/abs/2401.05899
Accession Number: edsarx.2401.05899
Database: arXiv