In this talk, I will discuss principled ways of solving a classical reinforcement learning (RL) problem and introduce its robust variant.
In particular, we rethink the exploration-exploitation trade-off in RL as an instance of a distribution sampling problem in infinite dimensions. Using the powerful Stochastic Gradient Langevin Dynamics (SGLD), we propose a new RL algorithm that amounts to a sampling variant of the Twin Delayed Deep Deterministic Policy Gradient (TD3) method. Our algorithm consistently outperforms existing exploration strategies for TD3 based on heuristic noise injection in several MuJoCo environments.
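To make the sampling idea concrete, below is a minimal sketch of a single SGLD update as it could be used for exploration: rather than adding heuristic Gaussian noise to the deterministic TD3 action, one samples perturbed actor parameters from an (approximate) posterior via Langevin dynamics. The function and variable names, and the use of the critic's policy-gradient direction as the log-posterior gradient, are illustrative assumptions, not the talk's exact implementation.

```python
import torch


def sgld_step(theta, grad_log_post, step_size):
    """One Stochastic Gradient Langevin Dynamics (SGLD) update.

    theta         -- flattened actor parameters (torch tensor)
    grad_log_post -- stochastic gradient of the log-posterior at theta;
                     in this exploration setting it can be taken as a
                     policy-gradient direction that increases the critic's Q-value
    step_size     -- SGLD step size (epsilon)

    Returns a sample theta' = theta + eps * grad + sqrt(2 * eps) * N(0, I).
    """
    noise = torch.randn_like(theta) * (2.0 * step_size) ** 0.5
    return theta + step_size * grad_log_post + noise


# Illustrative use: sample perturbed actor parameters at the start of a rollout
# and act greedily with the sampled actor, instead of injecting action noise.
theta = torch.zeros(10)        # placeholder actor parameters
grad = torch.ones(10)          # placeholder stochastic gradient estimate
theta_sample = sgld_step(theta, grad, step_size=1e-3)
```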
The sampling perspective enables us to introduce an action-robust variant of the RL objective, which is a particular case of a zero-sum two-player Markov game. In this setting, at each step of the game, both players simultaneously choose an action. The reward each player receives after one step depends on the state and a convex combination of the two players' actions. Building on our earlier work on SGLD for min-max (GAN) problems, we propose a new robust RL algorithm with convergence guarantees and provide numerical evidence of its effectiveness. Finally, I will also discuss future directions on applying the framework to self-play in games.
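As a schematic illustration of this action-robust setting, the objective can be written as a max-min problem over the agent's policy and an adversary's policy, where the executed action is a convex mixture of the two; here $\alpha$ denotes the adversary's mixing weight and the notation is an assumption for exposition rather than the talk's exact formulation:

$$
\max_{\pi}\;\min_{\bar\pi}\;
\mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\,
r\big(s_t,\,(1-\alpha)\,a_t + \alpha\,\bar a_t\big)\right],
\qquad
a_t \sim \pi(\cdot \mid s_t),\;\;
\bar a_t \sim \bar\pi(\cdot \mid s_t).
$$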