Off-Policy Average Reward Actor-Critic with Deterministic Policy Search

Saxena, N and Khastagir, S and Kolathaya, S and Bhatnagar, S (2023) Off-Policy Average Reward Actor-Critic with Deterministic Policy Search. In: Proceedings of Machine Learning Research, 23 - 29 July 2023, Honolulu, pp. 30130-30203.


Official URL: https://arxiv.org/abs/2305.12239

Abstract

The average reward criterion is relatively less studied as most existing works in the Reinforcement Learning literature consider the discounted reward criterion. There are few recent works that present on-policy average reward actor-critic algorithms, but average reward off-policy actor-critic is relatively less explored. In this work, we present both on-policy and off-policy deterministic policy gradient theorems for the average reward performance criterion. Using these theorems, we also present an Average Reward Off-Policy Deep Deterministic Policy Gradient (ARO-DDPG) Algorithm. We first show asymptotic convergence analysis using the ODE-based method. Subsequently, we provide a finite time analysis of the resulting stochastic approximation scheme with linear function approximator and obtain an ϵ-optimal stationary policy with a sample complexity of Ω(ϵ^{-2.5}). We compare the average reward performance of our proposed ARO-DDPG algorithm and observe better empirical performance compared to state-of-the-art on-policy average reward actor-critic algorithms over MuJoCo-based environments. © 2023 Proceedings of Machine Learning Research. All rights reserved.
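For readers unfamiliar with the average reward setting mentioned in the abstract, the following is a minimal sketch of a generic differential (average reward) TD update for a linear critic. It is not the ARO-DDPG algorithm from the paper. The function name, the step sizes, and the running-average rule for the reward-rate estimate `rho` are all assumptions chosen for illustration.

```python
import numpy as np

def differential_td_step(w, rho, phi_sa, phi_next, reward,
                         alpha_w=0.1, alpha_rho=0.01):
    """One illustrative critic update under the average reward criterion.

    w        : weights of a linear critic, Q(s, a) ~= phi(s, a) @ w
    rho      : current estimate of the average reward per step
    phi_sa   : features of the current state-action pair
    phi_next : features of the next state paired with the policy's action
    (All names are hypothetical; they do not come from the paper.)
    """
    # Differential TD error: subtracting rho takes the place of discounting.
    delta = reward - rho + phi_next @ w - phi_sa @ w
    # Semi-gradient update of the linear critic.
    w_new = w + alpha_w * delta * phi_sa
    # Assumed rule: track the average reward with a simple running average.
    rho_new = rho + alpha_rho * (reward - rho)
    return w_new, rho_new, delta

# Example call with toy two-dimensional features.
w, rho, delta = differential_td_step(
    w=np.zeros(2), rho=0.0,
    phi_sa=np.array([1.0, 0.0]), phi_next=np.array([0.0, 1.0]),
    reward=1.0,
)
```

The only structural difference from a discounted TD update is that the discount factor is dropped and the reward is centred by `rho`; other reward-rate update rules are possible and may be what the paper uses.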

Item Type:	Conference Paper
Publication:	Proceedings of Machine Learning Research
Publisher:	ML Research Press
Additional Information:	The copyright for this article belongs to the ML Research Press
Keywords:	Approximation theory; Reinforcement learning; Actor critic; Actor-critic algorithm; Average reward; Average reward criteria; Deterministics; Discounted reward; Gradient algorithm; Policy gradient; Policy search; Reinforcement learnings; Stochastic systems
Department/Centre:	Division of Electrical Sciences > Computer Science & Automation; Division of Interdisciplinary Sciences > Robert Bosch Centre for Cyber Physical Systems
Date Deposited:	17 Dec 2023 10:03
Last Modified:	17 Dec 2023 10:03
URI:	https://eprints.iisc.ac.in/id/eprint/83465
