Risk-sensitive Markov Decision Process and Learning under General Utility Functions

Bibliographic Details
Title:	Risk-sensitive Markov Decision Process and Learning under General Utility Functions
Authors:	Wu, Zhengqi; Xu, Renyuan
Publication Year:	2023
Collection:	Computer Science; Mathematics
Subject Terms:	Computer Science - Machine Learning, Mathematics - Optimization and Control
Description:	Reinforcement Learning (RL) has gained substantial attention across diverse application domains and theoretical investigations. Existing literature on RL theory largely focuses on risk-neutral settings where the decision-maker learns to maximize the expected cumulative reward. However, in practical scenarios such as portfolio management and e-commerce recommendations, decision-makers often hold heterogeneous risk preferences under outcome uncertainty, which the risk-neutral framework cannot capture well. Incorporating these preferences can be approached through utility theory, yet the development of risk-sensitive RL under general utility functions remains an open question for theoretical exploration. In this paper, we consider a scenario where the decision-maker seeks to optimize a general utility function of the cumulative reward in the framework of a Markov decision process (MDP). To facilitate the Dynamic Programming Principle and Bellman equation, we enlarge the state space with an additional dimension that accounts for the cumulative reward. We propose a discretized approximation scheme to the MDP under the enlarged state space, which is tractable and key for algorithmic design. We then propose a modified value iteration algorithm that employs an epsilon-covering over the space of cumulative reward. When a simulator is accessible, our algorithm efficiently learns a near-optimal policy with guaranteed sample complexity. In the absence of a simulator, our algorithm, designed with an upper-confidence-bound exploration approach, identifies a near-optimal policy while ensuring a guaranteed regret bound. For both algorithms, we match the theoretical lower bounds for the risk-neutral setting. Comment: 36 pages
Document Type:	Working Paper
Access URL:	http://arxiv.org/abs/2311.13589
Accession Number:	edsarx.2311.13589
Database:	arXiv
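
The description in this record outlines value iteration on a state space enlarged with the cumulative reward and discretized by an epsilon-covering. The following is a minimal planning sketch of that idea only, not the paper's algorithm: it assumes a known transition model, deterministic rewards in [0, 1], a finite horizon, and nearest-grid-point projection. The function name `utility_value_iteration` and all parameter names are hypothetical, and the learning components (simulator sampling, upper-confidence-bound exploration) are omitted.

```python
import numpy as np

def utility_value_iteration(P, R, H, U, eps):
    """Backward induction on the augmented state (s, y), y = cumulative reward.

    P:   array (S, A, S) of transition probabilities (assumed known)
    R:   array (S, A) of deterministic rewards in [0, 1] (assumption)
    H:   horizon length
    U:   vectorized utility function applied to the cumulative reward
    eps: maximum spacing of the grid covering [0, H]
    """
    S, A, _ = P.shape
    n = int(np.ceil(H / eps)) + 1
    grid = np.linspace(0.0, H, n)          # covering of the cumulative-reward range
    step = grid[1] - grid[0]

    # Terminal value: utility of the accumulated reward, identical for every state.
    V = np.tile(U(grid), (S, 1))           # shape (S, n)
    policy = np.zeros((H, S, n), dtype=int)

    for h in reversed(range(H)):
        Q = np.empty((S, A, n))
        for s in range(S):
            for a in range(A):
                # Project y + r(s, a) onto the nearest grid index.
                nxt = np.rint((grid + R[s, a]) / step).astype(int)
                nxt = np.clip(nxt, 0, n - 1)
                # Expectation over next states at the updated cumulative reward.
                Q[s, a] = P[s, a] @ V[:, nxt]
        policy[h] = Q.argmax(axis=1)       # greedy action per (s, y)
        V = Q.max(axis=1)
    return V, policy, grid

# Toy example: one state, action 0 pays 0, action 1 pays 1, identity utility.
P_toy = np.ones((1, 2, 1))
R_toy = np.array([[0.0, 1.0]])
V_toy, policy_toy, grid_toy = utility_value_iteration(
    P_toy, R_toy, H=2, U=lambda y: y, eps=0.5
)
```

Passing a concave `U` (for example `np.sqrt`) in place of the identity would model a risk-averse decision-maker. The resulting policy depends on the accumulated reward as well as the state, which is the reason for enlarging the state space.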
