Thompson Sampling for Non-Stationary Bandit Problems

Bibliographic Details
Main Authors: Han Qi, Fei Guo, Li Zhu
Format: Article
Language: English
Published: MDPI AG, 2025-01-01
Series: Entropy
Subjects: multi-armed bandits; Thompson sampling; non-stationary
Online Access: https://www.mdpi.com/1099-4300/27/1/51
author Han Qi
Fei Guo
Li Zhu
collection DOAJ
description Non-stationary multi-armed bandit (MAB) problems have recently attracted extensive attention. We focus on the abruptly changing scenario where reward distributions remain constant for a certain period and change at unknown time steps. Although Thompson sampling (TS) has shown success in non-stationary settings, there is currently no regret bound analysis for TS with uninformative priors. To address this, we propose two algorithms, discounted TS and sliding-window TS, designed for sub-Gaussian reward distributions. For these algorithms, we establish an upper bound for the expected regret by bounding the expected number of times a suboptimal arm is played. We show that the regret upper bounds of both algorithms are $\tilde{O}(\sqrt{T B_T})$, where $T$ is the time horizon and $B_T$ is the number of breakpoints. This upper bound matches the lower bound for abruptly changing problems up to a logarithmic factor. Empirical comparisons with other non-stationary bandit algorithms highlight the competitive performance of our proposed methods.
format Article
id doaj-art-fbe34e49f6404c758bb38029f04a19d3
institution Kabale University
issn 1099-4300
language English
publishDate 2025-01-01
publisher MDPI AG
record_format Article
series Entropy
doi 10.3390/e27010051
citation Entropy, Vol. 27, Iss. 1, Article 51 (2025)
affiliation School of Software Engineering, Xi’an Jiaotong University, Xi’an 710049, China (Han Qi, Fei Guo, Li Zhu)
title Thompson Sampling for Non-Stationary Bandit Problems
topic multi-armed bandits
Thompson sampling
non-stationary
url https://www.mdpi.com/1099-4300/27/1/51
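
Editor's note: the description above mentions discounted TS and sliding-window TS only at a high level, and the record contains no pseudocode. The following is a minimal illustrative Python sketch of a discounted Thompson sampler with Gaussian posterior-style sampling for the abruptly changing setting. The discount factor `gamma`, the noise scale `sigma`, the flat treatment of unplayed arms, and the tabular environment interface are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def discounted_ts(reward_table, gamma=0.99, sigma=0.5, seed=0):
    """Illustrative sketch of discounted Thompson sampling (not the paper's exact algorithm).

    reward_table: (T, K) array; reward_table[t, k] is the reward arm k would
    pay at step t, so rows can change abruptly at unknown breakpoints.
    gamma: discount factor that geometrically down-weights old observations.
    sigma: assumed sub-Gaussian scale of the reward noise.
    """
    rng = np.random.default_rng(seed)
    T, K = reward_table.shape
    S = np.zeros(K)  # discounted sums of observed rewards, one per arm
    N = np.zeros(K)  # discounted counts of plays, one per arm
    choices = np.empty(T, dtype=int)
    for t in range(T):
        # Posterior-style sample per arm: mean is the discounted empirical mean,
        # the standard deviation shrinks with the discounted count; unplayed arms
        # get a very large std so they are explored first (uninformative prior).
        safe_N = np.maximum(N, 1e-12)
        means = np.where(N > 0, S / safe_N, 0.0)
        stds = np.where(N > 0, sigma / np.sqrt(safe_N), 1e6)
        theta = rng.normal(means, stds)
        arm = int(np.argmax(theta))
        # Discount every arm's statistics, then record the new observation,
        # so information older than roughly 1/(1 - gamma) steps fades out.
        S *= gamma
        N *= gamma
        S[arm] += reward_table[t, arm]
        N[arm] += 1.0
        choices[t] = arm
    return choices
```

A sliding-window variant would replace the geometric discounting with statistics computed only over the most recent pulls; in both cases it is the tuning of the discount factor or window length against the number of breakpoints $B_T$ that drives the $\tilde{O}(\sqrt{T B_T})$ regret bound reported in the abstract.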