
A convergent off-policy temporal difference algorithm

Bharadwaj Diddigi, R and Kamanchi, C and Bhatnagar, S (2020) A convergent off-policy temporal difference algorithm. In: Frontiers in Artificial Intelligence and Applications, 29 August-8 September 2020, Online; Spain, pp. 1103-1110.

Official URL: https://dx.doi.org/10.3233/FAIA200207

Abstract

Learning the value function of a given policy (target policy) from the data samples obtained from a different policy (behavior policy) is an important problem in Reinforcement Learning (RL). This problem is studied under the setting of off-policy prediction. Temporal Difference (TD) learning algorithms are a popular class of algorithms for solving the prediction problem. TD algorithms with linear function approximation are shown to be convergent when the samples are generated from the target policy (known as on-policy prediction). However, it has been well established in the literature that off-policy TD algorithms under linear function approximation may diverge. In this work, we propose a convergent on-line off-policy TD algorithm under linear function approximation. The main idea is to penalize the updates of the algorithm in such a way as to ensure convergence of the iterates. We provide a convergence analysis of our algorithm. Through numerical evaluations, we further demonstrate the effectiveness of our algorithm. © 2020 The authors and IOS Press.
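
The divergence mentioned in the abstract, and the stabilizing effect of a penalty on the update, can be illustrated with a minimal sketch. It uses the well-known two-state example in which a single transition with zero reward is sampled repeatedly and the two states have scalar features 1 and 2. The penalty shown here, a term `-eta * w` subtracted inside the update, is an assumption chosen for illustration; the abstract does not specify the form of the penalty, so this should not be read as the authors' algorithm. The names `run_td`, `eta`, `alpha`, and `rho` are likewise hypothetical.

```python
def run_td(eta, gamma=0.9, alpha=0.1, steps=200, w0=1.0):
    """Off-policy TD(0) with one scalar weight on a single sampled transition.

    eta = 0 gives the plain off-policy TD(0) update; eta > 0 adds an
    illustrative penalty term (an assumption, not the paper's stated method).
    """
    phi_s, phi_next = 1.0, 2.0  # features of the current and next state
    reward = 0.0                # true value of both states is therefore 0
    rho = 1.0                   # importance-sampling ratio for this transition
    w = w0
    for _ in range(steps):
        # TD error under linear function approximation
        delta = reward + gamma * phi_next * w - phi_s * w
        # Importance-weighted TD step, minus the assumed penalty on the weight
        w += alpha * (rho * delta * phi_s - eta * w)
    return w


w_plain = run_td(eta=0.0)      # grows without bound: factor 1 + alpha*(2*gamma - 1) > 1
w_penalized = run_td(eta=1.0)  # shrinks toward 0: factor 1 + alpha*(2*gamma - 1 - eta) < 1
print(w_plain, w_penalized)
```

In this toy setting each step multiplies the weight by `1 + alpha * (2 * gamma - 1 - eta)`, so the iterates contract only when `eta` exceeds `2 * gamma - 1`. In general a penalty of this kind can also bias the limit point; here the true value is zero, so the penalized iterates happen to approach the correct answer.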

Item Type: Conference Paper
Publication: Frontiers in Artificial Intelligence and Applications
Publisher: IOS Press BV
Additional Information: Conference: 24th European Conference on Artificial Intelligence, ECAI 2020, including 10th Conference on Prestigious Applications of Artificial Intelligence, PAIS 2020; Conference Date: 29 August 2020 through 8 September 2020; Conference Code: 162625
Keywords: Approximation algorithms; Forecasting; Reinforcement learning; Behavior policy; Convergence analysis; Data sample; Linear functions; Prediction problem; Temporal difference learning; Temporal-difference algorithm; Value functions; Learning algorithms
Department/Centre: Division of Electrical Sciences > Computer Science & Automation
Date Deposited: 02 Dec 2020 10:13
Last Modified: 02 Dec 2020 10:13
URI: http://eprints.iisc.ac.in/id/eprint/66832
