The multi-armed bandit (MAB) problem is a classic example of the exploration-exploitation dilemma. It is concerned with maximising the total reward of a gambler who sequentially pulls arms of a multi-armed slot machine, where each arm is associated with a reward distribution. In static MABs, the reward distributions do not change over time, while in dynamic MABs each arm's reward distribution can change and the optimal arm can switch over time. Motivated by many real applications in which rewards are binary, we focus on dynamic Bernoulli bandits. Standard methods such as $ \epsilon $-Greedy and Upper Confidence Bound (UCB), which rely on the sample-mean estimator, often fail to track changes in the underlying reward in dynamic problems. In this paper, we overcome this slow response to change by deploying adaptive estimation within the standard methods, and we propose a new family of algorithms: adaptive versions of $ \epsilon $-Greedy, UCB, and Thompson sampling. These new methods are simple and easy to implement. Moreover, they do not require any prior knowledge about the dynamic reward process, which is important for real applications. We examine the new algorithms numerically in different scenarios, and the results show that our algorithms deliver solid improvements in dynamic environments.
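As an illustration of the idea of replacing the sample-mean estimator with an adaptive one, the sketch below plugs a fixed exponential-forgetting mean into $ \epsilon $-Greedy for a dynamic Bernoulli bandit. This is a minimal sketch only: the fixed forgetting factor `lam`, the class `ForgettingMean`, and the helper `epsilon_greedy_dynamic` are illustrative assumptions, not the paper's AFF algorithms, which tune the forgetting factor adaptively.

```python
import numpy as np

class ForgettingMean:
    """Exponentially weighted mean: recent rewards count more than old ones,
    so the estimate can track a drifting or abruptly changing reward."""

    def __init__(self, lam=0.99):
        self.lam = lam   # forgetting factor in (0, 1]; lam = 1 recovers the sample mean
        self.w = 0.0     # discounted count of observations
        self.s = 0.0     # discounted sum of rewards

    def update(self, reward):
        self.w = self.lam * self.w + 1.0
        self.s = self.lam * self.s + reward

    @property
    def mean(self):
        return self.s / self.w if self.w > 0 else 0.5  # neutral value before any data


def epsilon_greedy_dynamic(mu, epsilon=0.1, lam=0.99, seed=0):
    """Run epsilon-Greedy with forgetting-factor estimates on a T x K matrix mu of
    time-varying Bernoulli success probabilities; return the cumulative regret."""
    rng = np.random.default_rng(seed)
    T, K = mu.shape
    est = [ForgettingMean(lam) for _ in range(K)]
    regret = np.zeros(T)
    for t in range(T):
        if rng.random() < epsilon:
            arm = int(rng.integers(K))                    # explore
        else:
            arm = int(np.argmax([e.mean for e in est]))   # exploit current estimates
        est[arm].update(float(rng.random() < mu[t, arm])) # Bernoulli reward
        regret[t] = mu[t].max() - mu[t, arm]              # per-step expected regret
    return np.cumsum(regret)
```

With `lam = 1` the estimator reduces to the ordinary sample mean, which is exactly the slow-to-respond behaviour the adaptive variants are designed to avoid.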
Figure 3. Abruptly changing scenario (Case 1): examples of $ \mu_{t} $ sampled from the model in (28) with parameters of Case 1 displayed in Table 1
Figure 4. Abruptly changing scenario (Case 2): examples of $ \mu_{t} $ sampled from the model in (28) with parameters of Case 2 displayed in Table 1
Figure 5. Results for the two-armed Bernoulli bandit with abruptly changing expected rewards. The top row displays the cumulative regret over time; results are averaged over 100 replications. The bottom row shows boxplots of total regret at time $ t = 10,000 $. Trajectories are sampled from (28) with the parameters displayed in Table 1
Figure 8. Results for the two-armed Bernoulli bandit with drifting expected rewards. The top row displays the cumulative regret over time; results are averaged over 100 independent replications. The bottom row shows boxplots of total regret at time $ t = 10,000 $. Trajectories for Case 3 are sampled from (29) with $ \sigma^{2}_{\mu} = 0.0001 $, and trajectories for Case 4 are sampled from (30) with $ \sigma^{2}_{\mu} = 0.001 $
Figure 17. Boxplots of total regret for the DTS, AFF-DTS1, and AFF-DTS2 algorithms. An acronym such as DTS-C5 denotes the DTS algorithm with parameter $ C = 5 $; similarly, AFF-DTS1-C5 denotes the AFF-DTS1 algorithm with initial value $ C_{0} = 5 $. The result of AFF-OTS is plotted as a benchmark
Table 1. Parameters used in the exponential clock model shown in (28)
|       | Case 1     |           |           | Case 2     |           |           |
|-------|------------|-----------|-----------|------------|-----------|-----------|
|       | $ \theta $ | $ r_{l} $ | $ r_{u} $ | $ \theta $ | $ r_{l} $ | $ r_{u} $ |
| Arm 1 | 0.001      | 0.0       | 1.0       | 0.001      | 0.3       | 1.0       |
| Arm 2 | 0.010      | 0.0       | 1.0       | 0.010      | 0.0       | 0.7       |
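Model (28) itself is not reproduced on this page, so the following simulator is only an assumed reading of the exponential clock mechanism: each arm holds its expected reward until a clock with rate $ \theta $ fires (approximately a per-step change probability of $ \theta $), at which point a new value is drawn uniformly from $ [r_{l}, r_{u}] $. The function name `sample_abrupt_rewards`, the uniform redraw, and the uniform initialisation are assumptions made for illustration.

```python
import numpy as np

def sample_abrupt_rewards(T, theta, r_l, r_u, seed=0):
    """Sample one arm's expected-reward path mu_t over T steps: hold the current
    value until the exponential clock fires (probability roughly theta per step),
    then redraw the expected reward uniformly from [r_l, r_u]."""
    rng = np.random.default_rng(seed)
    mu = np.empty(T)
    current = rng.uniform(r_l, r_u)      # assumed initialisation: uniform on [r_l, r_u]
    for t in range(T):
        if rng.random() < theta:         # change point triggered by the clock
            current = rng.uniform(r_l, r_u)
        mu[t] = current
    return mu

# Case 1 parameters from Table 1
mu_arm1 = sample_abrupt_rewards(10_000, theta=0.001, r_l=0.0, r_u=1.0, seed=1)
mu_arm2 = sample_abrupt_rewards(10_000, theta=0.010, r_l=0.0, r_u=1.0, seed=2)
```

Under the Case 1 parameters, Arm 2 changes roughly ten times as often as Arm 1 over the $ t = 10,000 $ horizon shown in Figures 3 and 5.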