FORMAL PROTOCOL — Online learning in a fixed MDP: for each round t = 1, 2, ..., the learner observes a state X_t ∈ X ... (a rough sketch of the full interaction loop follows the outline)
TEMPORAL DEPENDENCIES
REGRET DECOMPOSITION
THE DRIFT TERMS
LOCAL-TO-GLOBAL
THE MDP-EXPERT ALGORITHM
GUARANTEES FOR MDP-E
BANDIT FEEDBACK
ONLINE LINEAR OPTIMIZATION
ONLINE MIRROR DESCENT
THE ONLINE REPS ALGORITHM (O-REPS)
GUARANTEES FOR O-REPS
COMPARISON OF GUARANTEES
MDP-E WITH FUNCTION APPROXIMATION — MDP-E only needs a good approximation of the action-value functions
O-REPS WITH UNCERTAIN MODELS
OUTLOOK
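The first outline entry survives only in truncated form, so here is a minimal Python sketch of the kind of interaction loop it describes: a fixed transition kernel, reward functions chosen by a possibly adversarial environment each round, and a learner that observes the state, acts, and updates its policy. All names (online_mdp_protocol, policy_update, reward_fns) and the full-information feedback model shown here are illustrative assumptions, not the lecture's own notation.

```python
import numpy as np

def online_mdp_protocol(P, reward_fns, policy_update, pi_init, x0, rng=None):
    """Sketch of the online-learning-in-a-fixed-MDP interaction loop.

    P             -- fixed transition kernel, shape (S, A, S); each P[x, a] sums to 1
    reward_fns    -- iterable of reward tables r_t of shape (S, A), one per round,
                     chosen by the (possibly adversarial) environment
    policy_update -- learner's update rule: (pi, r_t, x_t, a_t) -> new policy
    pi_init       -- initial policy, shape (S, A); each row is a distribution over actions
    x0            -- index of the initial state X_1
    """
    rng = rng if rng is not None else np.random.default_rng()
    num_states, num_actions, _ = P.shape
    pi, x = pi_init, x0
    total_reward = 0.0
    for r_t in reward_fns:                      # round t = 1, 2, ...
        a = rng.choice(num_actions, p=pi[x])    # learner draws A_t ~ pi_t(. | X_t)
        total_reward += r_t[x, a]               # receives reward r_t(X_t, A_t)
        pi = policy_update(pi, r_t, x, a)       # learner updates its policy
        x = rng.choice(num_states, p=P[x, a])   # next state X_{t+1} ~ P(. | X_t, A_t)
    return total_reward
```

For instance, passing policy_update=lambda pi, r, x, a: pi keeps the policy fixed and returns the total reward a fixed comparator policy would collect, which is the quantity the regret definitions in the lecture compare against.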
Description:
Explore the intricacies of online learning in Markov Decision Processes (MDPs) in this comprehensive lecture from the Theory of Reinforcement Learning Boot Camp. Delve into topics such as adversarial scenarios, performance measures, oblivious and non-oblivious adversaries, and the challenges of learning with changing transitions. Examine the formal protocol for online learning in fixed MDPs, temporal dependencies, and regret decomposition. Discover the MDP-Expert algorithm, its guarantees, and applications in bandit feedback scenarios. Investigate online linear optimization, mirror descent, and the Online Relative Entropy Policy Search (O-REPS) algorithm. Compare various guarantees and explore function approximation in MDP-E. Gain insights into O-REPS with uncertain models and consider future directions in this field of study.
Online Learning in Markov Decision Processes - Part 2