Projects for IE617
Contextual Bandits
In many learning settings, the optimal action in a round can depend on the current context or observed features. For example,
if either an ad for sports shoes or an ad for retirement plans has to be displayed on a webpage, and we know that the current
user is a young male (context), we are better off displaying the sports-shoe ad. Naturally, such settings have found applications in
ad placement and recommendation systems.
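As an illustration of the interaction protocol, here is a minimal sketch (in Python, with hypothetical contexts and click probabilities chosen purely for illustration) of an epsilon-greedy learner that keeps separate reward estimates per (context, ad) pair; it is meant only to make the protocol concrete, not to reproduce any algorithm from the papers below.

    import random

    contexts = ["young_male", "older_user"]          # toy contexts
    ads = ["sports_shoe", "retirement_plan"]
    # Hypothetical click probabilities, unknown to the learner.
    click_prob = {("young_male", "sports_shoe"): 0.8, ("young_male", "retirement_plan"): 0.1,
                  ("older_user", "sports_shoe"): 0.2, ("older_user", "retirement_plan"): 0.7}

    counts = {(c, a): 0 for c in contexts for a in ads}
    means = {(c, a): 0.0 for c in contexts for a in ads}
    eps = 0.1

    for t in range(10000):
        c = random.choice(contexts)                      # context is revealed first
        if random.random() < eps:
            a = random.choice(ads)                       # explore
        else:
            a = max(ads, key=lambda ad: means[(c, ad)])  # exploit within this context
        r = 1.0 if random.random() < click_prob[(c, a)] else 0.0
        counts[(c, a)] += 1
        means[(c, a)] += (r - means[(c, a)]) / counts[(c, a)]  # running average

    print(means)  # the estimates should approach click_prob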
A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, R. Schapire: ``Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits,'' ICML 2014 [Paper]
L. Zhou: ``A Survey on Contextual Multi-armed Bandits,'' 2015 [Paper]
Stochastic linearly parametrized bandits
The standard stochastic multi-armed bandit setting assumes that the arms are independent. The regret bounds in this
setting grow linearly in the number of arms and are not useful when the number of arms is large. However, the assumption of independent arms is
too strong in many applications. For example, in revenue management, advertising, etc., playing an arm reveals reward information about
other 'similar' arms. The stochastic linear bandit setting assumes that each arm is described by a vector of features and its mean reward is an
unknown linear function of those features.
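A minimal sketch of this reward model with a LinUCB/OFUL-style optimistic index is given below; it assumes Gaussian noise and uses an untuned confidence-width constant, so it is a toy illustration rather than the algorithms analyzed in the papers that follow.

    import numpy as np

    rng = np.random.default_rng(0)
    d, K, T = 5, 50, 5000
    theta_star = rng.normal(size=d)              # unknown parameter vector
    X = rng.normal(size=(K, d))                  # one feature vector per arm

    V = np.eye(d)                                # ridge-regularized Gram matrix
    b = np.zeros(d)
    alpha = 1.0                                  # confidence-width constant (placeholder)

    for t in range(T):
        theta_hat = np.linalg.solve(V, b)        # regularized least-squares estimate
        V_inv = np.linalg.inv(V)
        # Optimistic index: estimated mean plus exploration bonus ||x||_{V^{-1}}.
        ucb = X @ theta_hat + alpha * np.sqrt(np.einsum('ij,jk,ik->i', X, V_inv, X))
        a = int(np.argmax(ucb))
        r = X[a] @ theta_star + 0.1 * rng.normal()   # noisy linear reward
        V += np.outer(X[a], X[a])
        b += r * X[a]

    theta_hat = np.linalg.solve(V, b)
    print("parameter estimation error:", np.linalg.norm(theta_hat - theta_star))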
V. Dani, T. P. Hayes, S. M. Kakade: ``Stochastic Linear Optimization under Bandit Feedback,'' COLT 2008 [Paper]
Y. Abbasi-Yadkori, D. Pál, C. Szepesvári: ``Improved Algorithms for Linear Stochastic Bandits,'' NIPS 2011 [Paper]
S. Filippi, O. Cappé, A. Garivier, C. Szepesvári: ``Parametric Bandits: The Generalized Linear Case,'' NIPS 2010 [Paper]
Dueling bandits
In many cases it is hard to quantify the reward from actions. However, it is easier to compare two actions and learn which one is relatively better. For example,
it is easier to compare the tastes of two food items than to quantify their absolute tastes. The dueling bandit setting considers that in each round two actions can be compared, yielding
noisy binary feedback on which of the two is better. The goal is to identify the action with the highest reward.
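The sketch below makes the feedback model concrete: a hypothetical preference matrix generates noisy duel outcomes, and a naive uniform-sampling baseline recommends the arm with the highest Borda-style score. This is only a baseline for intuition; the papers below develop far better strategies.

    import random

    # Hypothetical preference matrix: P[i][j] = probability that arm i beats arm j.
    P = [[0.5, 0.6, 0.7],
         [0.4, 0.5, 0.6],
         [0.3, 0.4, 0.5]]
    K = len(P)

    wins = [[0] * K for _ in range(K)]
    plays = [[0] * K for _ in range(K)]

    def duel(i, j):
        """One round of noisy binary feedback: True iff arm i beats arm j."""
        return random.random() < P[i][j]

    for t in range(20000):
        i, j = random.randrange(K), random.randrange(K)   # naive uniform exploration
        if duel(i, j):
            wins[i][j] += 1
        plays[i][j] += 1

    # Borda-style score: average empirical probability of beating the other arms.
    borda = [sum(wins[i][j] / max(plays[i][j], 1) for j in range(K) if j != i) / (K - 1)
             for i in range(K)]
    print("recommended arm:", borda.index(max(borda)))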
Y. Yue, J. Broder, R. Kleinberg, T. Joachims: ``The K-armed Dueling Bandits Problem,'' COLT 2009 [Paper]
N. Ailon, Z. Karnin, T. Joachims: ``Reducing Dueling Bandits to Cardinal Bandits,'' ICML 2014 [Paper]
M. Zoghi, S. Whiteson, R. Munos, M. de Rijke: ``Relative Upper Confidence Bound for the K-Armed Dueling Bandit Problem,'' ICML 2014 [Paper]
Online learning with feedback graphs
In class we studied online learning problems with a given feedback graph that remains fixed. However, the feedback graph may change in each round. How should
learning be adapted to this scenario, and what performance guarantee can be achieved? Also, we studied feedback graphs with edge weights taking value either one or zero, i.e., the losses/rewards
of the neighbors are observed perfectly. However, it may happen that we only get to observe noisy versions of the neighbors' losses (fractional edge weights).
Then, again, how do we adapt the algorithms?
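To fix the feedback model, the toy sketch below (with an arbitrary random graph that is redrawn each round; everything here is illustrative) shows what the learner observes in one round: the loss of the played arm plus the losses of its out-neighbors in that round's graph. Handling noisy, fractional-weight observations is precisely the project question.

    import random

    K = 5
    T = 100

    def random_feedback_graph(K, p=0.3):
        """Directed graph as adjacency sets; edge (i, j) means playing i reveals j's loss."""
        return {i: {j for j in range(K) if j != i and random.random() < p}
                for i in range(K)}

    losses = [[random.random() for _ in range(K)] for _ in range(T)]  # toy loss sequence

    for t in range(T):
        G = random_feedback_graph(K)        # the graph may change every round
        arm = random.randrange(K)           # placeholder policy: play uniformly at random
        observed = {arm: losses[t][arm]}    # own loss is always observed
        for j in G[arm]:                    # plus the losses of out-neighbors
            observed[j] = losses[t][j]
        # A learning algorithm would now update its loss estimates from `observed`,
        # e.g. with importance weighting by the probability of observing each arm.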
N. Alon, N. Cesa-Bianchi, O. Dekel, T. Koren: ``Online Learning with Feedback Graphs: Beyond Bandits,'' COLT 2015 [Paper]
N. Alon, N. Cesa-Bianchi, C. Gentile, Y. Mansour: ``From Bandits to Experts: A Tale of Domination and Independence,'' NIPS 2013 [Paper]
T. Kocák, G. Neu, M. Valko, R. Munos: ``Efficient Learning by Implicit Exploration in Bandit Problems with Side Observations,'' NIPS 2014 [Paper]
T. Kocák, G. Neu, M. Valko: ``Online learning with noisy side observations,'' AISTATS 2016 [Paper]
Decoupling exploration and exploitation
Standard multi-armed bandit algorithms trade off optimally between exploration and exploitation to obtain good regret bounds. The tradeoff is necessary because only one arm can be played
in each round, and the reward is revealed only for that arm. However, if one can play multiple arms in each round, where the reward of only one arm counts towards the regret
but not the others, then the fine tradeoff between exploration and exploitation can be relaxed. Arms whose rewards we have low confidence about can be
explored aggressively without worrying about the possibility of worsening the regret.
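A minimal sketch of this decoupled protocol, under the assumption that each round allows one 'exploit' pull that counts towards regret and one free 'explore' pull that does not (arm means are hypothetical):

    import random

    mu = [0.2, 0.5, 0.8]                  # hypothetical Bernoulli means, unknown to learner
    K = len(mu)
    counts, means = [0] * K, [0.0] * K

    def pull(a):
        return 1.0 if random.random() < mu[a] else 0.0

    def update(a, r):
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]

    regret = 0.0
    for t in range(5000):
        exploit = max(range(K), key=lambda a: means[a])   # this arm's reward counts
        explore = min(range(K), key=lambda a: counts[a])  # least-sampled arm, a free probe
        update(exploit, pull(exploit))
        update(explore, pull(explore))                    # observation only; no regret cost
        regret += max(mu) - mu[exploit]

    print("cumulative regret:", regret)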
O. Avner, S. Mannor: ``Decoupling Exploration and Exploitation in Multi-Armed Bandits,'' 2012 [Paper]
J. Yu, S. Mannor: ``Piecewise-stationary bandit problems with side observations,'' ICML 2009 [Paper]
Thompson sampling
In the study of stochastic multi-armed bandits, we considered that the parameters of the arms are fixed but unknown (frequentist approach), and ignored any
prior knowledge of their values. However, in many applications we may have information about the likely values of the parameters,
which we can use as a prior distribution. Thompson sampling takes a Bayesian approach: a prior distribution is assigned to the arm parameters, in
each round the arm parameters are sampled from the posterior distribution, and the arm with the highest sampled value is played.
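For Bernoulli rewards with Beta priors, the posterior updates are conjugate and the whole algorithm fits in a few lines; the arm means below are hypothetical, chosen only for illustration.

    import random

    mu = [0.3, 0.5, 0.7]                     # hypothetical Bernoulli arm means
    K = len(mu)
    alpha, beta = [1] * K, [1] * K           # Beta(1, 1) uniform priors

    for t in range(10000):
        # Sample a plausible mean for each arm from its posterior ...
        samples = [random.betavariate(alpha[a], beta[a]) for a in range(K)]
        a = samples.index(max(samples))      # ... and play the arm with the highest sample.
        r = 1 if random.random() < mu[a] else 0
        alpha[a] += r                        # conjugate posterior update
        beta[a] += 1 - r

    print("posterior means:", [alpha[a] / (alpha[a] + beta[a]) for a in range(K)])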
S. Agrawal, N. Goyal: ``Analysis of Thompson Sampling for the Multi-armed Bandit Problem,'' JMLR 2012 [Paper]
S. Agrawal, N. Goyal: ``Further optimal regret bounds for Thompson Sampling,'' AISTATS 2013 [Paper]
E. Kaufmann, N. Korda, R. Munos: ``Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis,'' ALT 2012 [Paper]
Best arm identification: fixed budget and fixed confidence
In many applications (like launching a new product), one has to fix an action after an initial phase of trials and then play it repeatedly. Naturally, we would
like the fixed action to be the best one, without worrying about the losses incurred during the trial phase. Here it is important to explore well
before deciding to fix an action, so that we have good information about all the arms. If the goal is to identify the best action with high confidence, we would like
to do it with as few rounds of exploration as possible (fixed confidence); if the goal is to identify the best arm within a given number of rounds,
we would like to do it with as high a confidence as possible (fixed budget).
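As an illustration of the fixed-confidence side, here is a successive-elimination sketch with a Hoeffding-style confidence radius; the arm means and the constants inside the radius are indicative only, not the tuned quantities from the papers below.

    import math, random

    mu = [0.4, 0.5, 0.8]                          # hypothetical Bernoulli means
    K = len(mu)
    delta = 0.05                                  # target failure probability

    active = list(range(K))
    counts, means = [0] * K, [0.0] * K
    t = 0
    while len(active) > 1:
        for a in active:                          # sample every surviving arm once
            r = 1.0 if random.random() < mu[a] else 0.0
            counts[a] += 1
            means[a] += (r - means[a]) / counts[a]
        t += 1
        # Hoeffding-style confidence radius (constants indicative, not tuned).
        rad = math.sqrt(math.log(4 * K * t * t / delta) / (2 * t))
        best = max(means[a] for a in active)
        active = [a for a in active if means[a] + 2 * rad >= best]  # drop clearly worse arms

    print("identified best arm:", active[0], "after", t, "samples per arm")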
J-Y. Audibert, S. Bubeck, and R. Munos: ``Best arm identification in multi-armed bandits,'' COLT 2010 [Paper]
S. Bubeck, T. Wang, and N. Viswanathan: ``Multiple identifications in multi-armed bandits,'' ICML 2013 [Paper]
V. Gabillon, M. Ghavamzadeh, and A. Lazaric: ``Best arm identification: A unified approach to fixed budget and fixed confidence,'' NIPS 2012 [Paper]
M. Soare, A. Lazaric, and R. Munos: ``Best-Arm Identification in Linear Bandits,'' NIPS 2014 [Paper]
Pure exploration: cumulative regret vs simple regret
In pure exploration problems, we aim to identify an arm that gives the smallest instantaneous regret once the exploration phase is over (the simple regret).
How do strategies that aim to minimize cumulative regret differ from strategies that aim to minimize simple regret? Is minimizing cumulative regret
equivalent to minimizing simple regret?
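The toy computation below contrasts the two notions under uniform (pure) exploration, whose cumulative regret grows linearly with the horizon while the simple regret, the gap of the finally recommended arm, shrinks; the arm means are hypothetical.

    import random

    mu = [0.3, 0.5, 0.7]                           # hypothetical Bernoulli means
    K, T = len(mu), 3000
    counts, means = [0] * K, [0.0] * K
    cum_regret = 0.0

    for t in range(T):
        a = t % K                                  # uniform (pure) exploration
        r = 1.0 if random.random() < mu[a] else 0.0
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]
        cum_regret += max(mu) - mu[a]              # grows linearly here

    recommended = means.index(max(means))
    simple_regret = max(mu) - mu[recommended]      # gap of the recommended arm only
    print("cumulative:", cum_regret, "simple:", simple_regret)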
S. Bubeck, R. Munos, and G. Stoltz: ``Pure Exploration in Multi-armed Bandits Problems,'' ALT 2009 [Paper]
Stochastic Multiplayer Multiarmed Bandits
The standard stochastic multi-armed bandit setting deals with one player who aims to learn the best action from a set of available actions.
In many applications, there can be more than one player playing the same set of arms, and two
players selecting the same arm is often undesirable. For example, in cognitive radio networks, if two players select the same
arm, there will be a collision and neither player will receive any reward. In such scenarios, how do we develop distributed algorithms
that maximize the overall reward of all the players? The main challenge is that the players cannot communicate
with each other and have to make decisions in a distributed fashion.
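The collision model can be summarized in a short sketch; the arm means, the number of players, and the independent-random placeholder policy are all illustrative (a real algorithm, e.g. a musical-chairs scheme, must coordinate implicitly to avoid collisions).

    import random

    mu = [0.9, 0.7, 0.5, 0.3]                 # hypothetical channel/arm means
    K, N = len(mu), 2                         # K arms, N non-communicating players

    def round_rewards(choices):
        """Collision model: an arm chosen by more than one player pays nobody."""
        rewards = []
        for p, a in enumerate(choices):
            collided = any(q != p and choices[q] == a for q in range(N))
            base = 1.0 if random.random() < mu[a] else 0.0
            rewards.append(0.0 if collided else base)
        return rewards

    total = 0.0
    for t in range(10000):
        choices = [random.randrange(K) for _ in range(N)]   # placeholder policy
        total += sum(round_rewards(choices))
    print("total reward:", total)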
J. Rosenski, O. Shamir, and L. Szlak: ``Multi-player bandits – a musical chairs approach,'' Proceedings of the International Conference on Machine Learning (ICML), New York, USA, 2016
L. Besson and E. Kaufmann: ``Multi-player Bandits Revisited,'' Proceedings of Algorithmic Learning Theory (ALT), 2018
M. K. Hanawal and S. J. Darak: ``Distributed Learning in Ad-Hoc Networks with Unknown Number of Players,'' Workshop on AI in Networks and Distributed Systems, Toulouse, France, 2018
A. Verma, M. K. Hanawal, and R. Vaze: ``Distributed Algorithms for Efficient Learning and Coordination in Ad Hoc Networks,'' Proceedings of the Intl. Symposium on Modeling and Optimization in Mobile, Ad Hoc, and Wireless Networks (WiOpt), Avignon, France, June 2019
E. Boursier and V. Perchet: ``SIC-MMAB: Synchronisation Involves Communication in Multiplayer Multi-Armed Bandits,'' Advances in Neural Information Processing Systems 32 (NeurIPS 2019)
Stochastic Multiplayer Multiarmed Bandits without Collision Sensing
Same as the previous setting, except that when an arm is selected by more than one player, the players do not know about it, i.e., collision
information is not available.
G. Lugosi and A. Mehrabian: ``Multiplayer bandits without observing collision information,'' arXiv preprint
S. Bubeck, T. Budzinski, and M. Sellke: ``Cooperative and Stochastic Multi-Player Multi-Armed Bandit: Optimal Regret With Neither Communication Nor Collisions,'' arXiv preprint
Adversarial Multiplayer Multiarmed Bandits
Same as the previous setting, but with adversarial rewards.
S. Bubeck, Y. Li, Y. Peres, and M. Sellke: ``Non-Stochastic Multi-Player Multi-Armed Bandits: Optimal Rate With Collision Information, Sublinear Without,'' Proceedings of the Thirty-Third Conference on Learning Theory (COLT), 2020
P. Alatur, K. Y. Levy, and A. Krause: ``Multi-Player Bandits: The Adversarial Case,'' Journal of Machine Learning Research 21 (2020) 1-23
Federated Multi Armed Bandits
Federated learning (FL) is an emerging distributed machine learning paradigm with many
attractive properties. In particular, FL is motivated by the growing trend that massive amounts
of real-world data are exogenously generated at edge devices, data that is non-independent and
identically distributed (non-IID) and highly imbalanced (Bonawitz et al., 2019). FL focuses on
many clients collaboratively training a machine learning model under the coordination of a central
server while keeping the local data private at each client (McMahan et al., 2017).
C. Shi, C. Shen, and J. Yang: ``Federated Multi-armed Bandits with Personalization,'' arXiv preprint arXiv:2102.13101, 2021
Batched Neural Bandits
Various reward models have been considered in the literature, such as linear models and kernel-based
models. Recently, neural network models that allow for a more powerful approximation of the underlying reward functions have been proposed.
For example, the NeuralUCB algorithm can achieve near-optimal regret bounds while only requiring a very
mild boundedness assumption on the rewards. However, a major shortcoming of NeuralUCB is that
it requires updating the neural network parameters in every round, as well as optimizing the loss
function over observed rewards and contexts. This is computationally expensive, making NeuralUCB considerably slow for large-scale applications and inadequate for practical situations where data arrives at a fast rate.
Q. Gu, A. Karbasi, K. Khosravi, V. Mirrokni, and D. Zhou: ``Batched Neural Bandits,'' arXiv preprint arXiv:2102.13028, 2021
Multi Armed Bandits for Matching Markets
In the standard matching problem, users and providers are matched to ensure incentive compatibility via the notion of stability. However,
contrary to the core assumption of the matching problem, users and providers do not know their true
preferences a priori and must learn them. To relax this assumption, recent works propose to blend
the matching and multi-armed bandit problems. They establish that it is possible to assign matchings
that are stable (i.e., incentive-compatible) at every time step while also allowing agents to learn enough
so that the system converges to matchings that are stable under the agents’ true preferences. However,
while some agents may incur low regret under these matchings, others can incur high regret.
S. H. Cen and D. Shah: ``Regret, stability, and fairness in matching markets with bandit learners,'' arXiv preprint arXiv:2102.06246, 2021
Multi Armed Bandits in Queueing Systems
In multi-server queueing systems where there is no central queue holding all incoming jobs, job dispatching policies are used to assign
incoming jobs to the queue at one of the servers. Classic job dispatching policies such as join-the-shortest-queue and shortest
expected delay assume that the service rates and queue lengths of the servers are known to the dispatcher. When the service rates are unknown, the dispatcher must learn them from observations, which presents a novel
exploration-exploitation trade-off between sending jobs to all the servers to estimate their service rates, and exploiting the currently known fastest servers to minimize the expected
queueing delay. The works referenced below propose bandit-based exploration policies that learn the service rates from observed job departures. Unlike the
standard multi-armed bandit problem, where only one out of a finite set of actions is optimal, here the optimal policy requires identifying the optimal fraction of incoming jobs to be sent to each server.
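A minimal sketch of the learning side (estimating service rates from observed service completions with an epsilon-greedy dispatcher) is given below; it deliberately ignores queue-length dynamics, which is exactly where the queueing-specific structure discussed above enters, and all rates are hypothetical.

    import random

    rates = [0.5, 1.0, 1.5]                       # hypothetical unknown service rates
    K = len(rates)
    counts, mean_service = [0] * K, [0.0] * K
    eps = 0.1

    for job in range(5000):
        if random.random() < eps or 0 in counts:
            s = random.randrange(K)               # probe a server to refine its estimate
        else:
            # Exploit: dispatch to the server with the smallest estimated service time.
            s = min(range(K), key=lambda i: mean_service[i])
        service_time = random.expovariate(rates[s])   # observed at job departure
        counts[s] += 1
        mean_service[s] += (service_time - mean_service[s]) / counts[s]

    print("estimated rates:", [1 / m for m in mean_service])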
T. Choudhury, G. Joshi, W. Wang, and S. Shakkottai: ``Job Dispatching Policies for Queueing Systems with Unknown Service Rates,'' Link: https://www.andrew.cmu.edu/user/gaurij/queueing_bandits.pdf
S. Krishnasamy, R. Sen, R. Johari, and S. Shakkottai: ``Regret of Queueing Bandits,'' NIPS 2016