
IEOR e-Seminar by Dr. Prasenjit Karmakar

Title: Risk-sensitive reinforcement learning, adaptive algorithms, off-policy learning using stochastic approximation and diffusion limit for shortest job first.

Speaker: Dr. Prasenjit Karmakar, Andrew and Erna Viterbi Postdoctoral Fellow, Department of Electrical Engineering, Technion, Israel.

Day, Date and Time: Monday, 14th September 2020, at 10 AM.

Abstract: In this talk, I will mainly focus on two of my counterintuitive results.
In the first work, we provide several informative and tight error bounds on function approximation for the learning algorithm with linear function approximation for the policy evaluation problem under a risk-sensitive cost criterion represented using exponential utility. The underlying mathematical problem is to bound the gap between the Perron-Frobenius eigenvalues of two matrices $A$ and $B$. Our main result is a bound on $\lambda - \mu$ in terms of the operator norm of the perturbation matrix $A - B$, where $\lambda$ and $\mu$ are the Perron-Frobenius eigenvalues of $A$ and $B$ respectively.
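As a rough numerical illustration of the quantities involved (not the bound from the talk), the sketch below compares the gap between the Perron-Frobenius eigenvalues of two nearby positive matrices with the operator norm of their difference; the matrices, dimension, and perturbation size are arbitrary placeholders.

```python
import numpy as np

# Hypothetical illustration: compare |lambda - mu|, the gap between the
# Perron-Frobenius eigenvalues of two positive matrices A and B, with the
# operator (spectral) norm of the perturbation A - B. The matrices are
# arbitrary stand-ins; the talk's actual bound is specific to the
# risk-sensitive policy-evaluation setting.

rng = np.random.default_rng(0)
A = rng.uniform(0.1, 1.0, size=(5, 5))           # positive matrix, so Perron-Frobenius applies
B = A + 1e-2 * rng.uniform(-1.0, 1.0, (5, 5))    # small perturbation of A

lam = max(np.linalg.eigvals(A).real)             # Perron-Frobenius eigenvalue of A
mu = max(np.linalg.eigvals(B).real)              # Perron-Frobenius eigenvalue of B
pert = np.linalg.norm(A - B, 2)                  # operator norm of the perturbation

print(f"|lambda - mu| = {abs(lam - mu):.4e},  ||A - B||_2 = {pert:.4e}")
```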

In the second work, we extend the result of Gromoll, Kruk and Puha on the diffusion limit for shortest job first with a fixed job-size distribution to the time-varying case. More specifically, in their work the scaled state process $\hat{\xi}_t^n = n^{-1/2} \xi_t^n$ converges in law to a process of the form $R_t \delta_E$, where $\{R_t\}$ is a reflected Brownian motion representing the scaling limit of the total workload, and $\delta_x$ denotes the unit atom at $x$. Based on the fixed-distribution result, it is natural to guess that if the time-varying job-size distribution $E_t$ varies sufficiently smoothly or slowly, the limit of $\hat{\xi}_t^n$ takes the form $R_t \delta_{E_t}$. This intuition turns out to be wrong. Our results characterizing the diffusion limit provide a formula of the form $X_t - \inf_{s \in [\tau(t,x), t]} X_s$, where $\tau(t,x) = \sup\{s \in [0,t] : x < E_s\} \vee 0$ and $X_t$ is a certain stochastic process.
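To make the formula concrete, the following sketch evaluates $X_t - \inf_{s \in [\tau(t,x), t]} X_s$ with $\tau(t,x) = \sup\{s \in [0,t] : x < E_s\} \vee 0$ on a discretized time grid; the path $X$, the time-varying quantity $E_t$, and the grid are placeholder stand-ins, not the processes appearing in the actual result.

```python
import numpy as np

# Hypothetical sketch: evaluate
#   X_t - inf_{s in [tau(t,x), t]} X_s,   tau(t,x) = sup{ s in [0,t] : x < E_s } v 0
# on a time grid. X and E below are stand-ins (a random walk and a slowly
# varying function), chosen only to make the formula computable.

def limit_formula(X, E, grid, t, x):
    idx_t = np.searchsorted(grid, t)                   # index of time t on the grid
    before_t = grid[: idx_t + 1]
    mask = x < E[: idx_t + 1]                          # times s <= t with x < E_s
    tau = before_t[mask].max() if mask.any() else 0.0  # sup{...} v 0
    window = (grid >= tau) & (grid <= t)               # s in [tau, t]
    return X[idx_t] - X[window].min()

grid = np.linspace(0.0, 1.0, 1001)
rng = np.random.default_rng(1)
X = np.cumsum(rng.normal(0.0, np.sqrt(grid[1]), grid.size))  # placeholder path
E = 1.0 + 0.5 * np.sin(2 * np.pi * grid)                     # placeholder slowly varying E_t

print(limit_formula(X, E, grid, t=0.8, x=1.2))
```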

Stochastic approximation algorithms are sequential non-parametric methods for finding a zero or minimum of a function when only noisy observations of the function values are available. Using the lock-in probability framework of Borkar, we provide an analysis of the tracking ability of general (non-linear) adaptive algorithms when the iterates are not known to be stable beforehand. This work extends the work of Konda and Tsitsiklis on the tracking ability of 'linear' adaptive algorithms.
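For context, a bare-bones Robbins-Monro iteration (a textbook sketch with a made-up target function, not the non-linear adaptive algorithm analyzed in this work) looks as follows.

```python
import numpy as np

# Textbook Robbins-Monro sketch: find the zero of f(x) = x - 2 using only
# noisy observations f(x) + noise. Generic background only; the talk studies
# tracking for general non-linear adaptive algorithms without prior stability.

rng = np.random.default_rng(2)
x = 0.0
for n in range(1, 10_001):
    noisy_obs = (x - 2.0) + rng.normal(0.0, 1.0)  # only noisy function values are available
    step = 1.0 / n                                # diminishing step size
    x = x - step * noisy_obs                      # stochastic approximation update
print(x)  # should be close to the zero at x = 2
```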

We present, for the first time, an asymptotic convergence analysis of two time-scale stochastic approximation driven by 'controlled' Markov noise. In particular, the faster and slower recursions have non-additive controlled Markov noise components in addition to martingale difference noise. We analyze the asymptotic behavior of our framework by relating it to limiting differential inclusions in both time scales, defined in terms of the ergodic occupation measures associated with the controlled Markov processes. Using a special case of our results, we present a solution to the off-policy convergence problem for temporal-difference learning with linear function approximation.
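A minimal two time-scale sketch, with martingale-difference noise only (no Markov noise) and a made-up coupled pair of recursions, illustrates the faster/slower step-size structure.

```python
import numpy as np

# Minimal two time-scale sketch: the fast iterate y equilibrates to the
# current x between slow updates of x, because b_n decays more slowly than
# a_n. The toy coupled system (y tracks x, x is driven toward y/2) and its
# noise are illustrative only, not the recursions analyzed in the paper.

rng = np.random.default_rng(3)
x, y = 1.0, -1.0
for n in range(1, 50_001):
    a_n = 1.0 / n             # slow step size
    b_n = 1.0 / n ** 0.6      # fast step size: a_n / b_n -> 0
    y += b_n * ((x - y) + rng.normal(0.0, 0.1))        # fast recursion tracks x
    x += a_n * ((y / 2.0 - x) + rng.normal(0.0, 0.1))  # slow recursion drives x toward 0
print(x, y)  # both iterates should be near the coupled fixed point (0, 0)
```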

Speaker Bio: Dr. Prasenjit Karmakar received his B.E. from the Computer Science Department of Jadavpur University in 2008. He then moved to the Indian Institute of Science, from where he received an MSc (Engg.) and a PhD in Computer Science in 2012 and 2018 respectively. From 2012 to 2013 he was a Software Engineer 1 at Citrix R&D India Pvt. Limited. Since December 2018 he has been an Andrew and Erna Viterbi Postdoctoral Fellow in the Department of Electrical Engineering, Technion, Israel. His research interests include stochastic approximation and stochastic control, reinforcement learning, machine learning, optimization, applied probability, and stochastic network theory.

 
