ePrints@IISc

Multi-armed bandits with dependent arms

Singh, R and Liu, F and Sun, Y and Shroff, N (2023) Multi-armed bandits with dependent arms. In: Machine Learning.

PDF (Published Version): Mac_lea_113_1_2024.pdf (8 MB)
Official URL: https://doi.org/10.1007/s10994-023-06457-z

Abstract

We study a variant of the multi-armed bandit problem (MABP) which we call MABs with dependent arms. Multiple arms are grouped together to form a cluster, and the reward distributions of arms in the same cluster are known functions of an unknown parameter that is a characteristic of the cluster. Thus, pulling an arm i not only reveals information about its own reward distribution, but also about all arms belonging to the same cluster. This "correlation" among the arms complicates the exploration-exploitation trade-off encountered in the MABP, because the observation dependencies allow us to test multiple hypotheses about the optimality of an arm simultaneously. We develop learning algorithms based on the principle of optimism in the face of uncertainty (Lattimore and Szepesvári in Bandit algorithms, Cambridge University Press, 2020), which know the cluster structure and hence utilize these additional side observations appropriately while performing the exploration-exploitation trade-off. We show that the regret of our algorithms grows as O(K log T), where K is the number of clusters. In contrast, for an algorithm such as vanilla UCB that does not utilize these dependencies, the regret scales as O(M log T), where M is the number of arms. When K ≪ M, i.e., when there are many dependencies among the arms, our proposed algorithm drastically reduces the dependence of the regret on the number of arms. © 2023, The Author(s), under exclusive licence to Springer Science+Business Media LLC, part of Springer Nature.
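The record contains no code, but the clustered-bandit idea in the abstract is easy to illustrate. The sketch below is a minimal, hypothetical cluster-aware UCB in Python, not the authors' algorithm: it assumes Bernoulli rewards where arm i in cluster k has the known mean function f_i(theta_k) = c_i * theta_k for an unknown cluster parameter theta_k in [0, 1], so every pull of any arm in a cluster tightens one shared confidence interval. All names here (cluster_ucb, c, cluster_of, theta) are illustrative assumptions.

import math
import random

# Illustrative cluster-aware UCB; a sketch, not the paper's algorithm.
# Assumed model: arm i in cluster k yields Bernoulli rewards with mean
# f_i(theta_k) = c[i] * theta_k, where the coefficients c[i] in (0, 1]
# are known and the cluster parameter theta_k in [0, 1] is unknown.

def cluster_ucb(c, cluster_of, theta, horizon, seed=0):
    """Return the cumulative reward of one run.

    c          -- known per-arm coefficients c[i]
    cluster_of -- cluster index of each arm
    theta      -- true cluster parameters (hidden; used only to draw rewards)
    """
    rng = random.Random(seed)
    n_clusters = max(cluster_of) + 1
    pulls = [0] * n_clusters      # total pulls of *any* arm in the cluster
    est_sum = [0.0] * n_clusters  # running sum of unbiased theta estimates
    total = 0.0
    for t in range(1, horizon + 1):
        ucbs = []
        for i, ci in enumerate(c):
            k = cluster_of[i]
            if pulls[k] == 0:
                ucbs.append(float("inf"))  # pull each cluster at least once
            else:
                theta_hat = est_sum[k] / pulls[k]
                # Simplified bonus: optimism over the cluster parameter,
                # then mapped through the known mean function f_i.
                bonus = math.sqrt(2.0 * math.log(t) / pulls[k])
                ucbs.append(ci * min(1.0, theta_hat + bonus))
        i = max(range(len(c)), key=lambda j: ucbs[j])
        k = cluster_of[i]
        r = 1.0 if rng.random() < c[i] * theta[k] else 0.0
        pulls[k] += 1
        est_sum[k] += r / c[i]    # E[r / c_i] = theta_k, so this is unbiased
        total += r
    return total

if __name__ == "__main__":
    # 4 arms in 2 clusters; a hypothetical configuration for demonstration.
    print(cluster_ucb(c=[1.0, 0.8, 0.5, 0.9], cluster_of=[0, 0, 1, 1],
                      theta=[0.7, 0.4], horizon=5000))

Because the confidence interval is indexed by the cluster, its width shrinks with the total number of pulls across the whole cluster, which is the mechanism behind the O(K log T) versus O(M log T) contrast in the abstract. The Hoeffding-style bonus above is simplified: a rigorous bound would account for the range 1/c_i of the estimate r/c_i.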

Item Type: Journal Article
Publication: Machine Learning
Publisher: Springer
Additional Information: The copyright for this article belongs to the authors.
Keywords: Clustering algorithms; Decision making; E-learning; Economic and social effects; Clustered bandit; Exploration/exploitation; Multi-armed bandit problems (MABP); Multi-armed bandits (MABs); Multiple arms; Number of arms; Online learning; Sequential decision making; Structured bandit; Trade-off; Learning algorithms
Department/Centre: Division of Electrical Sciences > Electrical Communication Engineering
Date Deposited: 04 Mar 2024 09:27
Last Modified: 04 Mar 2024 09:27
URI: https://eprints.iisc.ac.in/id/eprint/84304
