Thompson Sampling for Non-Stationary Bandit Problems
Non-stationary multi-armed bandit (MAB) problems have recently attracted extensive attention. We focus on the abruptly changing scenario where reward distributions remain constant for a certain period and change at unknown time steps. Although Thompson sampling (TS) has shown success in non-stationary settings, there is currently no regret bound analysis for TS with uninformative priors. To address this, we propose two algorithms, discounted TS and sliding-window TS, designed for sub-Gaussian reward distributions. For these algorithms, we establish an upper bound on the expected regret by bounding the expected number of times a suboptimal arm is played. We show that the regret upper bounds of both algorithms are $\tilde{O}(\sqrt{T B_T})$, where $T$ is the time horizon and $B_T$ is the number of breakpoints. This upper bound matches the lower bound for abruptly changing problems up to a logarithmic factor. Empirical comparisons with other non-stationary bandit algorithms highlight the competitive performance of our proposed methods.
Main Authors: Han Qi, Fei Guo, Li Zhu (School of Software Engineering, Xi'an Jiaotong University, Xi'an 710049, China)
Format: Article
Language: English
Published: MDPI AG, 2025-01-01
Series: Entropy
ISSN: 1099-4300
DOI: 10.3390/e27010051
Subjects: multi-armed bandits; Thompson sampling; non-stationary
Online Access: https://www.mdpi.com/1099-4300/27/1/51
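The record does not include the algorithms themselves, only the abstract's description of discounted TS and sliding-window TS. The following is a minimal illustrative sketch of discounted Gaussian Thompson sampling on a toy two-armed problem with one breakpoint; the Gaussian posterior form, the discount factor, the forced initial exploration, and the toy `reward_fn` are all assumptions made for the sketch, not the paper's actual construction or analysis.

```python
import numpy as np

def discounted_gaussian_ts(K, T, reward_fn, gamma=0.98, sigma=1.0, seed=0):
    """Illustrative discounted Thompson sampling for a K-armed Gaussian bandit.

    Discounted reward sums and play counts (weight gamma per step) gradually
    forget old observations, so the posterior can track abrupt changes.
    All parameter choices here are assumptions for the sketch.
    """
    rng = np.random.default_rng(seed)
    s = np.zeros(K)   # discounted sum of observed rewards per arm
    n = np.zeros(K)   # discounted number of plays per arm
    chosen = np.empty(T, dtype=int)
    for t in range(T):
        # Posterior sample per arm: mean s/n, standard deviation sigma/sqrt(n).
        # Unplayed arms get +inf so each arm is tried at least once.
        safe_n = np.maximum(n, 1e-12)
        theta = np.where(
            n > 0,
            rng.normal(s / safe_n, sigma / np.sqrt(safe_n)),
            np.inf,
        )
        a = int(np.argmax(theta))
        r = reward_fn(a, t)
        s *= gamma
        n *= gamma
        s[a] += r
        n[a] += 1.0
        chosen[t] = a
    return chosen

# Toy abruptly changing environment: the arm means swap at t = 500 (one breakpoint).
def reward_fn(arm, t, rng=np.random.default_rng(1)):
    means = (0.0, 1.0) if t < 500 else (1.0, 0.0)
    return rng.normal(means[arm], 1.0)

plays = discounted_gaussian_ts(K=2, T=1000, reward_fn=reward_fn)
print("plays of arm 1 before / after the breakpoint:",
      int(plays[:500].sum()), int(plays[500:].sum()))
```

A sliding-window variant would instead keep only the last w observations per arm when forming the posterior; both mechanisms serve the same purpose of discarding stale data after a breakpoint.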