Projects for IE617
Contextual Bandits
In many learning settings, the optimal action in a round can depend on the current context or observed features. For example,
if either an ad for sports shoes or an ad for retirement plans has to be displayed on a webpage, and we know that the current
user is a young male (context), we are better off displaying the sports-shoe ad. Naturally, such settings have found applications in
ad placement and recommendation systems.
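As an illustration of the interaction protocol, here is a minimal sketch (in Python, with hypothetical contexts and click probabilities chosen purely for illustration) of an epsilon-greedy learner that keeps separate reward estimates per (context, ad) pair; it is meant only to make the protocol concrete, not to reproduce any algorithm from the papers below.

    import random

    contexts = ["young_male", "older_user"]          # toy contexts
    ads = ["sports_shoe", "retirement_plan"]
    # Hypothetical click probabilities, unknown to the learner.
    click_prob = {("young_male", "sports_shoe"): 0.8, ("young_male", "retirement_plan"): 0.1,
                  ("older_user", "sports_shoe"): 0.2, ("older_user", "retirement_plan"): 0.7}

    counts = {(c, a): 0 for c in contexts for a in ads}
    means = {(c, a): 0.0 for c in contexts for a in ads}
    eps = 0.1

    for t in range(10000):
        c = random.choice(contexts)                      # context is revealed first
        if random.random() < eps:
            a = random.choice(ads)                       # explore
        else:
            a = max(ads, key=lambda ad: means[(c, ad)])  # exploit within this context
        r = 1.0 if random.random() < click_prob[(c, a)] else 0.0
        counts[(c, a)] += 1
        means[(c, a)] += (r - means[(c, a)]) / counts[(c, a)]  # running average

    print(means)  # the estimates should approach click_prob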
A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, R. Schapire: ``Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits,'' ICML 2014 [Paper]
L. Zhou: ``A Survey on Contextual Multi-armed Bandits,'' 2015 [Paper]
Stochastic linearly parametrized bandits
The standard stochastic multi-armed bandit setting assumes that the arms are independent. The regret bounds in this
setting grow linearly in the number of arms and are not useful when the number of arms is large. However, the assumption of independent arms is
too strong in many applications. For example, in revenue management, advertising, etc., playing an arm reveals reward information about
other 'similar' arms. The stochastic linear bandit setting assumes that each arm is described by a vector of features and its mean reward is an
unknown linear function of those features.
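A minimal sketch of this reward model with a LinUCB/OFUL-style optimistic index is given below; it assumes Gaussian noise and uses an untuned confidence-width constant, so it is a toy illustration rather than the algorithms analyzed in the papers that follow.

    import numpy as np

    rng = np.random.default_rng(0)
    d, K, T = 5, 50, 5000
    theta_star = rng.normal(size=d)              # unknown parameter vector
    X = rng.normal(size=(K, d))                  # one feature vector per arm

    V = np.eye(d)                                # ridge-regularized Gram matrix
    b = np.zeros(d)
    alpha = 1.0                                  # confidence-width constant (placeholder)

    for t in range(T):
        theta_hat = np.linalg.solve(V, b)        # regularized least-squares estimate
        V_inv = np.linalg.inv(V)
        # Optimistic index: estimated mean plus exploration bonus ||x||_{V^{-1}}.
        ucb = X @ theta_hat + alpha * np.sqrt(np.einsum('ij,jk,ik->i', X, V_inv, X))
        a = int(np.argmax(ucb))
        r = X[a] @ theta_star + 0.1 * rng.normal()   # noisy linear reward
        V += np.outer(X[a], X[a])
        b += r * X[a]

    theta_hat = np.linalg.solve(V, b)
    print("parameter estimation error:", np.linalg.norm(theta_hat - theta_star))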
V. Dani, T. P. Hayes, S. M. Kakade: ``Stochastic Linear Optimization under Bandit Feedback,'' COLT 2008 [Paper]
Y. Abbasi-Yadkori, D. Pál, C. Szepesvári: ``Improved Algorithms for Linear Stochastic Bandits,'' NIPS 2011 [Paper]
S. Filippi, O. Cappé, A. Garivier, C. Szepesvári: ``Parametric Bandits: The Generalized Linear Case,'' NIPS 2010 [Paper]
Dueling bandits
In many cases it is hard to quantify the reward from actions. However, it is easier to compare two actions and learn which one is relatively better. For example,
it is easier to compare the tastes of two food items than to quantify their absolute tastes. The dueling bandit setting considers that in each round two actions can be compared, yielding
noisy binary feedback on which of the two is better. The goal is to identify the action with the highest reward.
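The sketch below makes the feedback model concrete: a hypothetical preference matrix generates noisy duel outcomes, and a naive uniform-sampling baseline recommends the arm with the highest Borda-style score. This is only a baseline for intuition; the papers below develop far better strategies.

    import random

    # Hypothetical preference matrix: P[i][j] = probability that arm i beats arm j.
    P = [[0.5, 0.6, 0.7],
         [0.4, 0.5, 0.6],
         [0.3, 0.4, 0.5]]
    K = len(P)

    wins = [[0] * K for _ in range(K)]
    plays = [[0] * K for _ in range(K)]

    def duel(i, j):
        """One round of noisy binary feedback: True iff arm i beats arm j."""
        return random.random() < P[i][j]

    for t in range(20000):
        i, j = random.randrange(K), random.randrange(K)   # naive uniform exploration
        if duel(i, j):
            wins[i][j] += 1
        plays[i][j] += 1

    # Borda-style score: average empirical probability of beating the other arms.
    borda = [sum(wins[i][j] / max(plays[i][j], 1) for j in range(K) if j != i) / (K - 1)
             for i in range(K)]
    print("recommended arm:", borda.index(max(borda)))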
Y. Yue, J. Broder, R. Kleinberg, T. Joachims: ``The K-armed Dueling Bandits Problem,'' COLT 2009 [Paper]
N. Ailon, Z. Karnin, T. Joachims: ``Reducing Dueling Bandits to Cardinal Bandits,'' ICML 2014 [Paper]
M. Zoghi, S. Whiteson, R. Munos, M. de Rijke: ``Relative Upper Confidence Bound for the K-Armed Dueling Bandit Problem,'' ICML 2014 [Paper]
Online learning with feedback graphs
In class we studied online learning problems with a given feedback graph that remains fixed. However, the feedback graph may change in each round. How should
learning be adapted to this scenario, and what performance guarantee can be achieved? Also, we studied feedback graphs with edge weights taking value either one or zero, i.e., the losses/rewards
of the neighbors are observed perfectly. However, it may happen that we only get to observe noisy versions of the neighbors' losses (fractional edge weights).
Then, again, how do we adapt the algorithms?
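To fix the feedback model, the toy sketch below (with an arbitrary random graph that is redrawn each round; everything here is illustrative) shows what the learner observes in one round: the loss of the played arm plus the losses of its out-neighbors in that round's graph. Handling noisy, fractional-weight observations is precisely the project question.

    import random

    K = 5
    T = 100

    def random_feedback_graph(K, p=0.3):
        """Directed graph as adjacency sets; edge (i, j) means playing i reveals j's loss."""
        return {i: {j for j in range(K) if j != i and random.random() < p}
                for i in range(K)}

    losses = [[random.random() for _ in range(K)] for _ in range(T)]  # toy loss sequence

    for t in range(T):
        G = random_feedback_graph(K)        # the graph may change every round
        arm = random.randrange(K)           # placeholder policy: play uniformly at random
        observed = {arm: losses[t][arm]}    # own loss is always observed
        for j in G[arm]:                    # plus the losses of out-neighbors
            observed[j] = losses[t][j]
        # A learning algorithm would now update its loss estimates from `observed`,
        # e.g. with importance weighting by the probability of observing each arm.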
N. Alon, N. Cesa-Bianchi, O. Dekel, T. Koren: ``Online Learning with Feedback Graphs: Beyond Bandits,'' COLT 2015 [Paper]
N. Alon, N. Cesa-Bianchi, C. Gentile, Y. Mansour: ``From Bandits to Experts: A Tale of Domination and Independence,'' NIPS 2013 [Paper]
T. Kocák, G. Neu, M. Valko, R. Munos: ``Efficient Learning by Implicit Exploration in Bandit Problems with Side Observations,'' NIPS 2014 [Paper]
T. Kocák, G. Neu, M. Valko: ``Online learning with noisy side observations,'' AISTATS 2016 [Paper]
Decoupling exploration and exploitation
Standard multi-armed bandit algorithms trade off optimally between exploration and exploitation to obtain good regret bounds. The tradeoff is necessary because only one arm can be played
in each round, and the reward is revealed only for that arm. However, if one can play multiple arms in each round, where the reward of only one arm counts towards the regret
but not the others, then the fine tradeoff between exploration and exploitation can be relaxed. Arms whose rewards we have low confidence about can be
explored aggressively without worrying about the possibility of worsening the regret.
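A minimal sketch of this decoupled protocol, under the assumption that each round allows one 'exploit' pull that counts towards regret and one free 'explore' pull that does not (arm means are hypothetical):

    import random

    mu = [0.2, 0.5, 0.8]                  # hypothetical Bernoulli means, unknown to learner
    K = len(mu)
    counts, means = [0] * K, [0.0] * K

    def pull(a):
        return 1.0 if random.random() < mu[a] else 0.0

    def update(a, r):
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]

    regret = 0.0
    for t in range(5000):
        exploit = max(range(K), key=lambda a: means[a])   # this arm's reward counts
        explore = min(range(K), key=lambda a: counts[a])  # least-sampled arm, a free probe
        update(exploit, pull(exploit))
        update(explore, pull(explore))                    # observation only; no regret cost
        regret += max(mu) - mu[exploit]

    print("cumulative regret:", regret)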
O. Avner, S. Mannor: ``Decoupling Exploration and Exploitation in Multi-Armed Bandits,'' 2012 [Paper]
J. Yu, S. Mannor: ``Piecewise-stationary bandit problems with side observations,'' ICML 2009 [Paper]
Thompson sampling
In the study of stochastic multi-armed bandits, we considered that the parameters of the arms are fixed but unknown (frequentist approach), and ignored any
prior knowledge of their values. However, in many applications we may have information about the likely values of the parameters,
which we can use as a prior distribution. Thompson sampling takes a Bayesian approach: a prior distribution is assigned to the arm parameters, in
each round the arm parameters are sampled from the posterior distribution, and the arm with the highest sampled value is played.
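For Bernoulli rewards with Beta priors, the posterior updates are conjugate and the whole algorithm fits in a few lines; the arm means below are hypothetical, chosen only for illustration.

    import random

    mu = [0.3, 0.5, 0.7]                     # hypothetical Bernoulli arm means
    K = len(mu)
    alpha, beta = [1] * K, [1] * K           # Beta(1, 1) uniform priors

    for t in range(10000):
        # Sample a plausible mean for each arm from its posterior ...
        samples = [random.betavariate(alpha[a], beta[a]) for a in range(K)]
        a = samples.index(max(samples))      # ... and play the arm with the highest sample.
        r = 1 if random.random() < mu[a] else 0
        alpha[a] += r                        # conjugate posterior update
        beta[a] += 1 - r

    print("posterior means:", [alpha[a] / (alpha[a] + beta[a]) for a in range(K)])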
S. Agrawal, N. Goyal: ``Analysis of Thompson Sampling for the Multi-armed Bandit Problem,'' JMLR 2012 [Paper]
S. Agrawal, N. Goyal: ``Further optimal regret bounds for Thompson Sampling,'' AISTATS 2013 [Paper]
E. Kaufmann, N. Korda, R. Munos: ``Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis,'' ALT 2012 [Paper]
Best arm identification: fixed budget and fixed confidence
In many applications (like launching a new product), one has to fix an action after an initial phase of trials and then play it repeatedly. Naturally, we would
like the fixed action to be the best one, without worrying about the losses incurred during the trial phase. Here it is important to explore well
before deciding to fix an action, so that we have good information about all the arms. If the goal is to identify the best action with high confidence, we would like
to do it with as few rounds of exploration as possible (fixed confidence); if the goal is to identify the best arm within a given number of rounds,
we would like to do it with as high a confidence as possible (fixed budget).
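As an illustration of the fixed-confidence side, here is a successive-elimination sketch with a Hoeffding-style confidence radius; the arm means and the constants inside the radius are indicative only, not the tuned quantities from the papers below.

    import math, random

    mu = [0.4, 0.5, 0.8]                          # hypothetical Bernoulli means
    K = len(mu)
    delta = 0.05                                  # target failure probability

    active = list(range(K))
    counts, means = [0] * K, [0.0] * K
    t = 0
    while len(active) > 1:
        for a in active:                          # sample every surviving arm once
            r = 1.0 if random.random() < mu[a] else 0.0
            counts[a] += 1
            means[a] += (r - means[a]) / counts[a]
        t += 1
        # Hoeffding-style confidence radius (constants indicative, not tuned).
        rad = math.sqrt(math.log(4 * K * t * t / delta) / (2 * t))
        best = max(means[a] for a in active)
        active = [a for a in active if means[a] + 2 * rad >= best]  # drop clearly worse arms

    print("identified best arm:", active[0], "after", t, "samples per arm")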
J-Y. Audibert, S. Bubeck, and R. Munos: ``Best arm identification in multi-armed bandits,'' COLT 2010 [Paper]
S. Bubeck, T. Wang, and N. Viswanathan: ``Multiple identifications in multi-armed bandits,'' ICML 2013 [Paper]
V. Gabillon, M. Ghavamzadeh, and A. Lazaric: ``Best arm identification: A unified approach to fixed budget and fixed confidence,'' NIPS 2012 [Paper]
M. Soare, A. Lazaric, and R. Munos: ``Best-Arm Identification in Linear Bandits,'' NIPS 2014 [Paper]
Pure exploration: cumulative regret vs simple regret
In pure exploration problems, we aim to identify an arm that gives the smallest instantaneous regret once the exploration phase is over (the simple regret).
How do strategies that aim to minimize cumulative regret differ from strategies that aim to minimize simple regret? Is minimizing cumulative regret
equivalent to minimizing simple regret?
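The toy computation below contrasts the two notions under uniform (pure) exploration, whose cumulative regret grows linearly with the horizon while the simple regret, the gap of the finally recommended arm, shrinks; the arm means are hypothetical.

    import random

    mu = [0.3, 0.5, 0.7]                           # hypothetical Bernoulli means
    K, T = len(mu), 3000
    counts, means = [0] * K, [0.0] * K
    cum_regret = 0.0

    for t in range(T):
        a = t % K                                  # uniform (pure) exploration
        r = 1.0 if random.random() < mu[a] else 0.0
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]
        cum_regret += max(mu) - mu[a]              # grows linearly here

    recommended = means.index(max(means))
    simple_regret = max(mu) - mu[recommended]      # gap of the recommended arm only
    print("cumulative:", cum_regret, "simple:", simple_regret)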
S. Bubeck, R. Munos, and G. Stoltz: ``Pure Exploration in Multi-armed Bandits Problems,'' ALT 2009 [Paper]
Stochastic Multiplayer Multiarmed Bandits
The standard stochastic multi-armed bandit setting deals with one player who aims to learn the best action from a set of available actions.
In many applications, there can be more than one player playing the same set of arms, and two
players selecting the same arm is often undesirable. For example, in cognitive radio networks, if two players select the same
arm, there will be a collision and neither player will receive any reward. In such scenarios, how do we develop distributed algorithms
that maximize the overall reward of all the players? The main challenge is that the players cannot communicate
with each other and have to make decisions in a distributed fashion.
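The collision model can be summarized in a short sketch; the arm means, the number of players, and the independent-random placeholder policy are all illustrative (a real algorithm, e.g. a musical-chairs scheme, must coordinate implicitly to avoid collisions).

    import random

    mu = [0.9, 0.7, 0.5, 0.3]                 # hypothetical channel/arm means
    K, N = len(mu), 2                         # K arms, N non-communicating players

    def round_rewards(choices):
        """Collision model: an arm chosen by more than one player pays nobody."""
        rewards = []
        for p, a in enumerate(choices):
            collided = any(q != p and choices[q] == a for q in range(N))
            base = 1.0 if random.random() < mu[a] else 0.0
            rewards.append(0.0 if collided else base)
        return rewards

    total = 0.0
    for t in range(10000):
        choices = [random.randrange(K) for _ in range(N)]   # placeholder policy
        total += sum(round_rewards(choices))
    print("total reward:", total)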
J. Rosenski, O. Shamir, and L. Szlak: ``Multi-player bandits – a musical chairs approach,'' Proceedings of the International Conference on Machine Learning (ICML), New York, USA, 2016
L. Besson and E. Kaufmann: ``Multi-player Bandits Revisited,'' Proceedings of Algorithmic Learning Theory (ALT), 2018
M. K. Hanawal and S. J. Darak: ``Distributed Learning in Ad-Hoc Networks with Unknown Number of Players,'' Workshop on AI in Networks and Distributed Systems, Toulouse, France, 2018
A. Verma, M. K. Hanawal, and R. Vaze: ``Distributed Algorithms for Efficient Learning and Coordination in Ad Hoc Networks,'' Proceedings of the Intl. Symposium on Modeling and Optimization in Mobile, Ad Hoc, and Wireless Networks (WiOpt), Avignon, France, June 2019
E. Boursier and V. Perchet: ``SIC-MMAB: Synchronisation Involves Communication in Multiplayer Multi-Armed Bandits,'' Advances in Neural Information Processing Systems 32 (NeurIPS 2019)
Stochastic Multiplayer Multiarmed Bandits without Collision Sensing
Same as the previous setting, except that when an arm is selected by more than one player, the players do not know about it, i.e., collision
information is not available.
G. Lugosi and A. Mehrabian: ``Multiplayer bandits without observing collision information,'' arXiv preprint
S. Bubeck, T. Budzinski, and M. Sellke: ``Cooperative and Stochastic Multi-Player Multi-Armed Bandit: Optimal Regret With Neither Communication Nor Collisions,'' arXiv preprint
Adversarial Multiplayer Multiarmed Bandits
Same as the previous setting, but with adversarial rewards.
S. Bubeck, Y. Li, Y. Peres, and M. Sellke: ``Non-Stochastic Multi-Player Multi-Armed Bandits: Optimal Rate With Collision Information, Sublinear Without,'' Proceedings of the Thirty-Third Conference on Learning Theory (COLT), 2020
P. Alatur, K. Y. Levy, and A. Krause: ``Multi-Player Bandits: The Adversarial Case,'' Journal of Machine Learning Research 21 (2020) 1-23
Federated Multi Armed Bandits
Federated learning (FL) is an emerging distributed machine learning paradigm with many
attractive properties. In particular, FL is motivated by the growing trend that massive amounts
of real-world data are exogenously generated at edge devices, data that is non-independent and
identically distributed (non-IID) and highly imbalanced (Bonawitz et al., 2019). FL focuses on
many clients collaboratively training a machine learning model under the coordination of a central
server while keeping the local data private at each client (McMahan et al., 2017).
C. Shi, C. Shen, and J. Yang: ``Federated Multi-armed Bandits with Personalization,'' arXiv preprint arXiv:2102.13101, 2021
Batched Neural Bandits
Various reward models have been considered in the literature, such as linear models and kernel-based
models. Recently, neural network models that allow for a more powerful approximation of the underlying reward functions have been proposed.
For example, the NeuralUCB algorithm can achieve near-optimal regret bounds while only requiring a very
mild boundedness assumption on the rewards. However, a major shortcoming of NeuralUCB is that
it requires updating the neural network parameters in every round, as well as optimizing the loss
function over observed rewards and contexts. This is computationally expensive, making NeuralUCB considerably slow for large-scale applications and inadequate for practical situations where data arrives at a fast rate.
Q. Gu, A. Karbasi, K. Khosravi, V. Mirrokni, and D. Zhou: ``Batched Neural Bandits,'' arXiv preprint arXiv:2102.13028, 2021
Multi Armed Bandits for Matching Markets
In the standard matching problem, users and providers are matched to ensure incentive compatibility via the notion of stability. However,
contrary to the core assumption of the matching problem, users and providers do not know their true
preferences a priori and must learn them. To relax this assumption, recent works propose to blend
the matching and multi-armed bandit problems. They establish that it is possible to assign matchings
that are stable (i.e., incentive-compatible) at every time step while also allowing agents to learn enough
so that the system converges to matchings that are stable under the agents’ true preferences. However,
while some agents may incur low regret under these matchings, others can incur high regret.
S. H. Cen and D. Shah: ``Regret, stability, and fairness in matching markets with bandit learners,'' arXiv preprint arXiv:2102.06246, 2021
Multi Armed Bandits in Queueing Systems
In multi-server queueing systems where there is no central queue holding all incoming jobs, job dispatching policies are used to assign
incoming jobs to the queue at one of the servers. Classic job dispatching policies such as join-the-shortest-queue and shortest
expected delay assume that the service rates and queue lengths of the servers are known to the dispatcher. When the service rates are unknown, the dispatcher must learn them from observations, which presents a novel
exploration-exploitation trade-off between sending jobs to all the servers to estimate their service rates, and exploiting the currently known fastest servers to minimize the expected
queueing delay. The works referenced below propose bandit-based exploration policies that learn the service rates from observed job departures. Unlike the
standard multi-armed bandit problem, where only one out of a finite set of actions is optimal, here the optimal policy requires identifying the optimal fraction of incoming jobs to be sent to each server.
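A minimal sketch of the learning side (estimating service rates from observed service completions with an epsilon-greedy dispatcher) is given below; it deliberately ignores queue-length dynamics, which is exactly where the queueing-specific structure discussed above enters, and all rates are hypothetical.

    import random

    rates = [0.5, 1.0, 1.5]                       # hypothetical unknown service rates
    K = len(rates)
    counts, mean_service = [0] * K, [0.0] * K
    eps = 0.1

    for job in range(5000):
        if random.random() < eps or 0 in counts:
            s = random.randrange(K)               # probe a server to refine its estimate
        else:
            # Exploit: dispatch to the server with the smallest estimated service time.
            s = min(range(K), key=lambda i: mean_service[i])
        service_time = random.expovariate(rates[s])   # observed at job departure
        counts[s] += 1
        mean_service[s] += (service_time - mean_service[s]) / counts[s]

    print("estimated rates:", [1 / m for m in mean_service])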
T. Choudhury, G. Joshi, W. Wang, and S. Shakkottai: ``Job Dispatching Policies for Queueing Systems with Unknown Service Rates,'' Link: https://www.andrew.cmu.edu/user/gaurij/queueing_bandits.pdf
S. Krishnasamy, R. Sen, R. Johari, and S. Shakkottai: ``Regret of Queueing Bandits,'' NIPS 2016