Lakshmanan, K and Bhatnagar, Shalabh (2012) A novel Q-learning algorithm with function approximation for constrained Markov decision processes. In: 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 1-5 October 2012, Monticello, IL, USA.
PDF: Ann_All_Con_Com_Con_Com_400_2012.pdf (Published Version, 362kB; restricted to registered users)
Abstract
We present a novel multi-timescale Q-learning algorithm for average cost control in a Markov decision process subject to multiple inequality constraints. We formulate a relaxed version of this problem through the Lagrange multiplier method. Our algorithm differs from standard Q-learning in that it updates two parameters: a Q-value parameter and a policy parameter. The Q-value parameter is updated on a slower timescale than the policy parameter. Whereas Q-learning with function approximation can diverge in some cases, our algorithm is observed to converge as a result of this timescale separation. We report experiments on a problem of constrained routing in a multistage queueing network; the algorithm exhibits good performance, and the various inequality constraints are observed to be satisfied upon convergence.
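For reference, the standard Lagrangian relaxation of a constrained average-cost MDP takes the following form (notation assumed here, not taken from the paper):

```latex
L(\pi, \lambda) \;=\; J(\pi) \;+\; \sum_{i=1}^{k} \lambda_i \bigl( G_i(\pi) - c_i \bigr),
\qquad \lambda_i \ge 0,
```

where \(J(\pi)\) is the long-run average cost of policy \(\pi\), \(G_i(\pi)\) is the \(i\)-th long-run average constraint cost, and \(c_i\) its bound. A minimal simulation sketch of the two-timescale structure described in the abstract, on a randomly generated toy MDP, might look as follows; all names, costs, features, and step-size schedules are illustrative assumptions, not the paper's actual algorithm:

```python
import numpy as np

# Illustrative sketch only: a two-timescale, Lagrangian-relaxed
# Q-learning loop with linear function approximation on a random toy
# MDP. The MDP, features, costs, and step sizes are all assumptions.

rng = np.random.default_rng(0)
nS, nA, d = 5, 2, 8
phi = rng.standard_normal((nS, nA, d)) / np.sqrt(d)  # features phi(s, a)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))        # transition kernel
cost = rng.uniform(size=(nS, nA))                    # single-stage cost
dcost = rng.uniform(size=(nS, nA))                   # constraint cost
c_bound = 0.5                                        # want E[dcost] <= c_bound

theta = np.zeros(d)   # Q-value parameter (slower timescale)
w = np.zeros(d)       # policy parameter (faster timescale)
lam = 0.0             # Lagrange multiplier (slowest timescale)
rho = 0.0             # average-cost estimate
s = 0

for t in range(1, 100_001):
    b_t = 1.0 / t**0.6    # fast step size: policy parameter
    a_t = 1.0 / t**0.85   # slow step size: Q-value parameter
    g_t = 1.0 / t         # slowest step size: Lagrange multiplier

    prefs = phi[s] @ w                       # Boltzmann policy in w
    pi = np.exp(prefs - prefs.max())
    pi /= pi.sum()
    a = rng.choice(nA, p=pi)
    s2 = rng.choice(nS, p=P[s, a])

    # Lagrangian single-stage cost and average-cost TD error,
    # with linear Q-values Q(s, a) ~= phi(s, a)^T theta
    c_lag = cost[s, a] + lam * (dcost[s, a] - c_bound)
    delta = c_lag - rho + np.min(phi[s2] @ theta) - phi[s, a] @ theta

    rho += a_t * delta                       # track average Lagrangian cost
    theta += a_t * delta * phi[s, a]         # slower Q-value update
    grad_logpi = phi[s, a] - pi @ phi[s]     # softmax score function
    w -= b_t * delta * grad_logpi            # faster policy update (minimize)
    lam = max(0.0, lam + g_t * (dcost[s, a] - c_bound))  # project lam >= 0

    s = s2
```

The key design point mirrored here is the step-size separation: the multiplier and Q-value updates move on slower schedules than the policy update, so each slower iterate sees the faster ones as quasi-equilibrated, which is the usual mechanism behind convergence arguments for multi-timescale stochastic approximation.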
| Item Type: | Conference Paper |
|---|---|
| Publisher: | IEEE |
| Additional Information: | Copyright of this article belongs to IEEE. |
| Keywords: | Q-Learning with Linear Function Approximation; Constrained MDP; Lagrange Multiplier Method; Reinforcement Learning; Multi-Stage Stochastic Shortest Path Problem |
| Department/Centre: | Division of Electrical Sciences > Computer Science & Automation |
| Date Deposited: | 02 Jul 2013 08:37 |
| Last Modified: | 02 Jul 2013 08:37 |
| URI: | http://eprints.iisc.ac.in/id/eprint/46678 |