Natural actor-critic algorithms

Bhatnagar, Shalabh and Sutton, Richard S and Ghavamzadeh, Mohammad and Lee, Mark (2009) Natural actor-critic algorithms. In: Automatica, 45 (11). pp. 2471-2482.

Official URL: http://www.sciencedirect.com/science?_ob=ArticleUR...

Abstract

We present four new reinforcement learning algorithms based on actor-critic, natural-gradient and function-approximation ideas, and we provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function-approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of special interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients. Our results extend prior empirical studies of natural actor-critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms.
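
The actor-critic structure described in the abstract can be illustrated with a minimal sketch. It is not one of the paper's four algorithms: it assumes a tabular softmax policy, a tabular critic, an average-reward formulation, and the plain (non-natural) policy gradient. The names `softmax_policy` and `actor_critic_step` and the step sizes `alpha`, `beta` and `xi` are hypothetical choices made for illustration only.

```python
import numpy as np


def softmax_policy(theta, s):
    """Action probabilities in state s for a tabular softmax (Gibbs) policy."""
    prefs = theta[s] - np.max(theta[s])  # shift for numerical stability
    expd = np.exp(prefs)
    return expd / expd.sum()


def actor_critic_step(theta, v, avg_r, s, a, r, s_next,
                      alpha=0.1, beta=0.01, xi=0.05):
    """One incremental update from a single transition (s, a, r, s_next).

    theta: (n_states, n_actions) policy preferences (the actor).
    v:     (n_states,) state-value estimates (the critic).
    avg_r: running estimate of the average reward.
    alpha > beta is an assumed stand-in for the two-timescale idea:
    the critic moves faster than the actor.
    """
    theta, v = theta.copy(), v.copy()  # leave the caller's arrays untouched

    # Track the average reward, then form the temporal difference error.
    avg_r = avg_r + xi * (r - avg_r)
    delta = r - avg_r + v[s_next] - v[s]

    # Critic: TD(0) update of the visited state's value.
    v[s] += alpha * delta

    # Actor: stochastic gradient step on the policy parameters, using
    # delta * grad log pi(a|s). For a tabular softmax policy the gradient
    # with respect to theta[s] is onehot(a) - pi(.|s).
    grad_log_pi = -softmax_policy(theta, s)
    grad_log_pi[a] += 1.0
    theta[s] += beta * delta * grad_log_pi

    return theta, v, avg_r


# Usage: one rewarded transition from state 0 to state 1 under action 1.
theta, v, avg_r = np.zeros((2, 2)), np.zeros(2), 0.0
theta, v, avg_r = actor_critic_step(theta, v, avg_r, s=0, a=1, r=1.0, s_next=1)
```

A natural-gradient variant would replace the plain gradient direction in the actor step with one premultiplied by the inverse Fisher information matrix of the policy; that step is left out here to keep the sketch short.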

Item Type:	Journal Article
Publication:	Automatica
Publisher:	Elsevier Science
Additional Information:	Copyright of this article belongs to Elsevier Science.
Keywords:	Actor-critic reinforcement learning algorithms; Policy-gradient methods; Approximate dynamic programming; Function approximation; Two-timescale stochastic approximation; Temporal difference learning; Natural gradient
Department/Centre:	Division of Electrical Sciences > Computer Science & Automation
Date Deposited:	12 Jan 2010 10:45
Last Modified:	19 Sep 2010 05:53
URI:	http://eprints.iisc.ac.in/id/eprint/25173


