RL Weekly 37: Observational Overfitting, Hindsight Credit Assignment, and Procedurally Generated Environment Suite

Dear readers,

Happy NeurIPS! This week, I have made my summaries more concise to improve the reading experience. I hope that this change makes the newsletter easier to digest.

I look forward to your feedback, either by email or through the feedback form. Your input is always appreciated.

- Ryan

Observational Overfitting in Reinforcement Learning

Xingyou Song¹, Yiding Jiang¹, Yilun Du², Behnam Neyshabur¹

¹Google ²MIT

What it says

Observational overfitting is a phenomenon “where an agent overfits due to properties of the observation which are irrelevant to the latent dynamics of the MDP family.” For example, in the saliency map above, the score and the background objects are highlighted in red because they are strongly correlated with progress. This can hinder generalization: the authors report that simply covering the scoreboard with a black rectangle during training resulted in a 10% increase in test performance. The authors use a Linear Quadratic Regulator (LQR) to validate the phenomenon, and find that overparametrization potentially helps as a form of “implicit regularization.” The authors also try ImageNet networks (AlexNet, Inception, ResNet, etc.) on CoinRun environments, and show that overparametrization improves generalization to the test set.
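The scoreboard-masking idea mentioned above can be illustrated with a few lines of NumPy. This is only a minimal sketch, not the authors' code: the `mask_region` helper, the 64x64 frame size, and the scoreboard position are all assumptions made for illustration.

```python
import numpy as np


def mask_region(frame, top, left, height, width):
    """Return a copy of `frame` with the given rectangle set to black (0)."""
    masked = frame.copy()  # leave the original observation untouched
    masked[top:top + height, left:left + width] = 0
    return masked


# Hypothetical 64x64 RGB observation, all white for easy inspection.
frame = np.full((64, 64, 3), 255, dtype=np.uint8)

# Assumed scoreboard location: a strip 8 pixels tall across the top.
masked = mask_region(frame, top=0, left=0, height=8, width=64)
```

Applying such a mask to every training observation removes a feature that correlates with progress but carries no information about the underlying dynamics.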

Read more

Hindsight Credit Assignment

Anna Harutyunyan¹, Will Dabney¹, Thomas Mesnard¹, Nicolas Heess¹, Mohammad G. Azar¹, Bilal Piot¹, Hado van Hasselt¹, Satinder Singh¹, Greg Wayne¹, Doina Precup¹, Rémi Munos¹

¹DeepMind

What it says

Estimating the value function is a critical part of RL, as it quantifies how choosing an action in a state affects future return. The reverse of this is the credit assignment question: “given an outcome, how relevant were past decisions?” The authors define the “hindsight distribution” of an action as the conditional probability that the first action of a trajectory was that action, given some outcome of the trajectory (either state-conditional or return-conditional). This learned hindsight distribution can be used to better estimate value functions or policy gradients. The authors validate new algorithms that use Hindsight Credit Assignment in a few diagnostic tasks.
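To make the definition concrete, the sketch below estimates a hindsight distribution by simple counting. It is a toy illustration under stated assumptions, not the paper's algorithm: all trajectories are assumed to start in the same state, each trajectory is reduced to a hypothetical (first action, outcome) pair, and a tabular count stands in for whatever learned estimator is actually used. The `estimate_hindsight` name and the toy data are invented for this example.

```python
from collections import Counter, defaultdict


def estimate_hindsight(trajectories):
    """Estimate h(a | outcome): the probability that the first action was `a`,
    given the observed outcome, from (first_action, outcome) pairs."""
    counts_by_outcome = defaultdict(Counter)
    for first_action, outcome in trajectories:
        counts_by_outcome[outcome][first_action] += 1
    return {
        outcome: {a: n / sum(counts.values()) for a, n in counts.items()}
        for outcome, counts in counts_by_outcome.items()
    }


# Toy data: each pair is (first action taken, outcome reached later).
trajectories = [
    ("left", "goal"), ("left", "goal"), ("left", "goal"), ("right", "goal"),
    ("right", "pit"), ("right", "pit"), ("left", "pit"), ("right", "pit"),
]
hindsight = estimate_hindsight(trajectories)
```

In this toy data, three of the four trajectories that reached "goal" began with "left", so the estimate assigns most of the credit for that outcome to "left".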

Read more

Leveraging Procedural Generation to Benchmark Reinforcement Learning

Karl Cobbe¹, Christopher Hesse¹, Jacob Hilton¹, John Schulman¹

¹OpenAI

What it says

OpenAI (Cobbe et al.) released a set of 15 new environments similar to the CoinRun environment released last year, where the environments are “procedurally generated.” Having content procedurally generated in many aspects (level layout, game assets, entity spawn location and timing, etc.) encourages the agent to learn a policy robust to such variations. Procedurally generated environments also allow for a natural division into training and test sets, since each set can be built from different generated levels.
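The train/test division by generation can be sketched with a seeded toy generator. This is a minimal illustration and does not use the released environments' API: `generate_level`, the tile alphabet, and the seed ranges are all hypothetical.

```python
import random


def generate_level(seed, width=32):
    """Hypothetical level generator: a row of floor ('.') and obstacle ('#')
    tiles that is fully determined by the seed."""
    rng = random.Random(seed)
    return "".join(rng.choice(".#") for _ in range(width))


# Disjoint seed ranges (assumed sizes) give separate training and test levels.
train_seeds = range(0, 200)
test_seeds = range(200, 300)

train_levels = [generate_level(s) for s in train_seeds]
test_levels = [generate_level(s) for s in test_seeds]
```

Because each level depends only on its seed, the same split can be regenerated at any time, and the agent is evaluated on levels built from seeds it never trained on.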

Read more

External resources

Quantifying Generalization in Reinforcement Learning (arXiv Preprint): The original paper for CoinRun
Obstacle Tower: A Generalization Challenge in Vision, Control, and Planning: Obstacle Tower, a 3D procedurally generated environment by Unity

Here is some more exciting news in RL:

Reinforcement Learning: Past, Present, and Future Perspectives
The recording of a NeurIPS 2019 presentation on RL by Katja Hofmann (Microsoft Research) is available online.

Stable Baselines - Reinforcement Learning Tips and Tricks
Stable Baselines, a major well-maintained fork of OpenAI Baselines, released a set of tips and tricks for RL.

Winner Announced for NeurIPS 2019: Learn to Move - Walk Around
The winners for each track of the NeurIPS 2019: Learn to Move - Walk Around challenge were announced.

New State-of-the-art for Hanabi
Facebook AI wrote a blog post on how they built a new bot that achieves state-of-the-art performance in Hanabi, a collaborative card game.
