Thompson Sampling for Non-Stationary Bandit Problems
Non-stationary multi-armed bandit (MAB) problems have recently attracted extensive attention. We focus on the abruptly changing scenario where reward distributions remain constant for a certain period and change at unknown time steps. Although Thompson sampling (TS) has shown success in non-stationary settings, there is currently no regret bound analysis for TS with uninformative priors. To address this, we propose two algorithms, discounted TS and sliding-window TS, designed for sub-Gaussian reward distributions. For these algorithms, we establish an upper bound on the expected regret by bounding the expected number of times a suboptimal arm is played. We show that the regret upper bounds of both algorithms are $\tilde{O}(\sqrt{T B_T})$, where $T$ is the time horizon and $B_T$ is the number of breakpoints. This upper bound matches the lower bound for abruptly changing problems up to a logarithmic factor. Empirical comparisons with other non-stationary bandit algorithms highlight the competitive performance of our proposed methods.
Main Authors: Han Qi, Fei Guo, Li Zhu (School of Software Engineering, Xi'an Jiaotong University, Xi'an 710049, China)
Format: Article
Language: English
Published: MDPI AG, 2025-01-01
Series: Entropy
ISSN: 1099-4300
DOI: 10.3390/e27010051
Subjects: multi-armed bandits; Thompson sampling; non-stationary
Online Access: https://www.mdpi.com/1099-4300/27/1/51
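The record does not include the algorithms themselves, only the abstract's description of discounted TS and sliding-window TS. The following is a minimal illustrative sketch of discounted Gaussian Thompson sampling on a toy two-armed problem with one breakpoint; the Gaussian posterior form, the discount factor, the forced initial exploration, and the toy `reward_fn` are all assumptions made for the sketch, not the paper's actual construction or analysis.

```python
import numpy as np

def discounted_gaussian_ts(K, T, reward_fn, gamma=0.98, sigma=1.0, seed=0):
    """Illustrative discounted Thompson sampling for a K-armed Gaussian bandit.

    Discounted reward sums and play counts (weight gamma per step) gradually
    forget old observations, so the posterior can track abrupt changes.
    All parameter choices here are assumptions for the sketch.
    """
    rng = np.random.default_rng(seed)
    s = np.zeros(K)   # discounted sum of observed rewards per arm
    n = np.zeros(K)   # discounted number of plays per arm
    chosen = np.empty(T, dtype=int)
    for t in range(T):
        # Posterior sample per arm: mean s/n, standard deviation sigma/sqrt(n).
        # Unplayed arms get +inf so each arm is tried at least once.
        safe_n = np.maximum(n, 1e-12)
        theta = np.where(
            n > 0,
            rng.normal(s / safe_n, sigma / np.sqrt(safe_n)),
            np.inf,
        )
        a = int(np.argmax(theta))
        r = reward_fn(a, t)
        s *= gamma
        n *= gamma
        s[a] += r
        n[a] += 1.0
        chosen[t] = a
    return chosen

# Toy abruptly changing environment: the arm means swap at t = 500 (one breakpoint).
def reward_fn(arm, t, rng=np.random.default_rng(1)):
    means = (0.0, 1.0) if t < 500 else (1.0, 0.0)
    return rng.normal(means[arm], 1.0)

plays = discounted_gaussian_ts(K=2, T=1000, reward_fn=reward_fn)
print("plays of arm 1 before / after the breakpoint:",
      int(plays[:500].sum()), int(plays[500:].sum()))
```

A sliding-window variant would instead keep only the last w observations per arm when forming the posterior; both mechanisms serve the same purpose of discarding stale data after a breakpoint.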