Thompson Sampling for Non-Stationary Bandit Problems

Bibliographic Details
Main Authors: Han Qi, Fei Guo, Li Zhu
Format: Article
Language: English
Published: MDPI AG 2025-01-01
Series: Entropy
Online Access: https://www.mdpi.com/1099-4300/27/1/51
Description
Summary: Non-stationary multi-armed bandit (MAB) problems have recently attracted extensive attention. We focus on the abruptly changing scenario where reward distributions remain constant for a certain period and change at unknown time steps. Although Thompson sampling (TS) has shown success in non-stationary settings, there is currently no regret bound analysis for TS with uninformative priors. To address this, we propose two algorithms, discounted TS and sliding-window TS, designed for sub-Gaussian reward distributions. For these algorithms, we establish an upper bound for the expected regret by bounding the expected number of times a suboptimal arm is played. We show that the regret upper bounds of both algorithms are $\tilde{O}(\sqrt{T B_T})$, where $T$ is the time horizon and $B_T$ is the number of breakpoints. This upper bound matches the lower bound for abruptly changing problems up to a logarithmic factor. Empirical comparisons with other non-stationary bandit algorithms highlight the competitive performance of our proposed methods.
ISSN: 1099-4300
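
As a rough illustration of the discounted TS idea described in the abstract (a minimal sketch, not the authors' exact algorithm), the Python snippet below maintains exponentially discounted reward sums and play counts per arm and samples arm means from Gaussian posteriors. The discount factor gamma, the noise scale sigma, and the prior handling are all assumptions made for this sketch.

```python
import numpy as np

class DiscountedGaussianTS:
    """Illustrative discounted Thompson sampling for sub-Gaussian rewards.

    Maintains exponentially discounted reward sums and play counts per
    arm, so observations from before a breakpoint fade away and the
    Gaussian posteriors can re-adapt. gamma = 1 recovers stationary TS.
    """

    def __init__(self, n_arms, gamma=0.95, sigma=0.5, rng=None):
        self.gamma = gamma          # discount factor in (0, 1] (assumed)
        self.sigma = sigma          # assumed sub-Gaussian reward scale
        self.rng = rng or np.random.default_rng()
        self.s = np.zeros(n_arms)   # discounted reward sums
        self.n = np.zeros(n_arms)   # discounted play counts

    def select_arm(self):
        # Play every arm once before sampling from posteriors.
        unplayed = np.flatnonzero(self.n == 0)
        if unplayed.size > 0:
            return int(unplayed[0])
        # Sample one mean estimate per arm from N(s/n, sigma^2 / n);
        # rarely played (or heavily discounted) arms get wide posteriors,
        # which drives re-exploration after a change.
        theta = self.rng.normal(self.s / self.n, self.sigma / np.sqrt(self.n))
        return int(np.argmax(theta))

    def update(self, arm, reward):
        # Discount all statistics, then record the new observation.
        self.s *= self.gamma
        self.n *= self.gamma
        self.s[arm] += reward
        self.n[arm] += 1.0


# Toy run with a single breakpoint halfway through the horizon.
rng = np.random.default_rng(0)
means = np.array([0.2, 0.8])
agent = DiscountedGaussianTS(n_arms=2, gamma=0.95, sigma=0.5, rng=rng)
for t in range(2000):
    if t == 1000:
        means = means[::-1]  # abrupt change: the optimal arm switches
    arm = agent.select_arm()
    agent.update(arm, rng.normal(means[arm], 0.5))
```

A sliding-window variant would instead keep only the most recent observations per arm and compute the same statistics over that window; in both cases, forgetting old observations lets the posterior track the new reward distribution after a breakpoint.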