Risk-sensitive Markov Decision Process and Learning under General Utility Functions

Bibliographic Details
Title: Risk-sensitive Markov Decision Process and Learning under General Utility Functions
Authors: Wu, Zhengqi; Xu, Renyuan
Publication Year: 2023
Collection: Computer Science, Mathematics
Subject Terms: Computer Science - Machine Learning, Mathematics - Optimization and Control
Description: Reinforcement Learning (RL) has gained substantial attention across diverse application domains and theoretical investigations. Existing literature on RL theory largely focuses on risk-neutral settings where the decision-maker learns to maximize the expected cumulative reward. However, in practical scenarios such as portfolio management and e-commerce recommendations, decision-makers often persist in heterogeneous risk preferences subject to outcome uncertainties, which cannot be well-captured by the risk-neutral framework. Incorporating these preferences can be approached through utility theory, yet the development of risk-sensitive RL under general utility functions remains an open question for theoretical exploration. In this paper, we consider a scenario where the decision-maker seeks to optimize a general utility function of the cumulative reward in the framework of a Markov decision process (MDP). To facilitate the Dynamic Programming Principle and Bellman equation, we enlarge the state space with an additional dimension that accounts for the cumulative reward. We propose a discretized approximation scheme to the MDP under enlarged state space, which is tractable and key for algorithmic design. We then propose a modified value iteration algorithm that employs an epsilon-covering over the space of cumulative reward. When a simulator is accessible, our algorithm efficiently learns a near-optimal policy with guaranteed sample complexity. In the absence of a simulator, our algorithm, designed with an upper-confidence-bound exploration approach, identifies a near-optimal policy while ensuring a guaranteed regret bound. For both algorithms, we match the theoretical lower bounds for the risk-neutral setting.
Comment: 36 pages
Document Type: Working Paper
Access URL: http://arxiv.org/abs/2311.13589
Accession Number: edsarx.2311.13589
Database: arXiv
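
Illustrative note: the description mentions enlarging the state space with the cumulative reward and running value iteration over a discretized reward axis. The sketch below is one possible reading of that idea, not the authors' algorithm. It assumes a known finite-horizon model (planning only, no learning), deterministic per-step rewards in [0, 1], and a uniform grid with nearest-point rounding standing in for the epsilon-covering. The function name `utility_value_iteration` and all parameter names are hypothetical.

```python
import numpy as np

def utility_value_iteration(P, r, H, utility, eps):
    """Backward induction on the enlarged state (s, y), y = cumulative reward.

    Assumptions (not from the paper): P has shape (S, A, S), r has shape
    (S, A) with entries in [0, 1], and y is rounded to a uniform grid of
    spacing eps on [0, H].
    """
    S, A, _ = P.shape
    K = int(np.ceil(H / eps))
    grid = eps * np.arange(K + 1)           # discretized cumulative reward
    V = np.tile(utility(grid), (S, 1))      # terminal value: V_H(s, y) = U(y)
    policy = np.zeros((H, S, K + 1), dtype=int)
    for h in reversed(range(H)):
        Q = np.empty((S, K + 1, A))
        for s in range(S):
            for a in range(A):
                # Next cumulative reward, rounded to the nearest grid index.
                nxt = np.clip(np.rint((grid + r[s, a]) / eps).astype(int), 0, K)
                # Expectation over next states of V_{h+1}(s', y + r(s, a)).
                Q[s, :, a] = P[s, a] @ V[:, nxt]
        policy[h] = Q.argmax(axis=2)
        V = Q.max(axis=2)
    return V, policy, grid

# Toy problem: one state, action 0 pays 0, action 1 pays 1, horizon 3.
P = np.ones((1, 2, 1))
r = np.array([[0.0, 1.0]])
V_lin, pi_lin, grid = utility_value_iteration(P, r, H=3, utility=lambda y: y, eps=0.5)
V_sqrt, _, _ = utility_value_iteration(P, r, H=3, utility=np.sqrt, eps=0.5)
```

With the identity utility this reduces to ordinary risk-neutral value iteration; swapping in a concave utility such as `np.sqrt` changes only the terminal condition, which is the point of carrying the cumulative reward in the state.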