RL Weekly 42: Special Issue on NeurIPS 2020 Competitions
In this special issue, we look at the four RL competitions that are part of NeurIPS 2020.
RL Weekly 41: Adversarial Policies, Image Augmentation, and Self-Supervised Exploration with World Models
In this issue, we look at adversarial policy learning, image augmentation in RL, and self-supervised exploration through world models.
RL Weekly 40: Catastrophic Interference and Policy Evaluation Networks
In this issue, we look at two papers combating catastrophic interference. Memento combats interference by training two independent agents, where the second agent takes over from where the first agent finishes. D-NN and TC-NN reduce interference by mapping the input space to a higher-dimensional space. We also look at Policy Evaluation Network, a network that predicts the expected return given a policy.
RL Weekly 39: Intrinsic Motivation for Cooperation and Amortized Q-Learning
In this issue, we look at using intrinsic rewards to encourage cooperation in a two-agent MDP. We also look at replacing the maximization over all actions in Q-learning with a neural network, so that Q-learning can be used in environments with large action spaces.
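As a rough illustration of the amortization idea (not the paper's exact algorithm; every name below is hypothetical), the expensive max over all actions can be replaced by a max over a handful of candidate actions produced by a learned proposal network:

```python
import torch
import torch.nn as nn

class ProposalNetwork(nn.Module):
    """Suggests K candidate actions per state so that max_a Q(s, a) can be
    approximated by a max over K candidates instead of the full action space."""
    def __init__(self, state_dim, action_dim, num_proposals=16):
        super().__init__()
        self.num_proposals = num_proposals
        self.action_dim = action_dim
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, num_proposals * action_dim),
        )

    def forward(self, state):
        # (batch, num_proposals, action_dim) candidate actions
        return self.net(state).view(-1, self.num_proposals, self.action_dim)


def amortized_max_q(q_net, proposal_net, state):
    """Approximate max_a Q(s, a) using only the proposed candidate actions.
    `q_net` is assumed to map concatenated (state, action) vectors to a scalar."""
    candidates = proposal_net(state)                           # (B, K, A)
    batch, k, _ = candidates.shape
    states = state.unsqueeze(1).expand(batch, k, state.shape[-1])
    q_values = q_net(torch.cat([states, candidates], dim=-1))  # (B, K, 1)
    return q_values.squeeze(-1).max(dim=1).values              # (B,)
```

In the paper's approach the proposal distribution itself is trained to concentrate on high-value actions; the sketch above only shows how the maximization step gets amortized.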
Reinforcement Learning Papers Accepted to ICLR 2020
I have compiled a list of 106 reinforcement learning papers accepted to ICLR 2020.
RL Weekly 38: Clipped objective is not why PPO works, and the Trap of Saliency maps
In this issue, we look at the effect of PPO's code-level optimizations and the study of saliency maps in RL.
RL Weekly 37: Observational Overfitting, Hindsight Credit Assignment, and Procedurally Generated Environment Suite
In this issue, we look at Google and MIT's study on the observational overfitting phenomenon and how overparametrization helps generalization, a new family of algorithms using hindsight credit assignment by DeepMind, and a new environment suite by OpenAI consisting of procedurally generated environments.
RL Weekly 36: AlphaZero with a Learned Model achieves SotA in Atari
In this issue, we look at MuZero, DeepMind's new algorithm that learns a model and achieves AlphaZero performance in Chess, Shogi, and Go and achieves state-of-the-art performance on Atari. We also look at Safety Gym, OpenAI's new environment suite for safe RL.
RL Weekly 35: Escaping Local Optima in Distance-based Rewards and Choosing the Best Teacher
In this issue, we look at an algorithm that uses sibling trajectories to escape local optima in distance-based shaped rewards, and an algorithm that dynamically chooses the best demonstrator teacher to train the student.
RL Weekly 34: Dexterous Manipulation of the Rubik's Cube and Human-Agent Collaboration in Overcooked
In this issue, we look at a robot hand manipulating and "solving" the Rubik's Cube. We also look at comparative performances of human-agnostic and human-aware RL agents in Overcooked when paired with human players.
RL Weekly 33: Action Grammar, the Squashing Exploration Problem, and Task-relevant GAIL
In this issue, we look at Action Grammar RL, a hierarchical RL framework that adds new macro-actions, improving the performance of DDQN and SAC in Atari environments. We then look at a new algorithm that brings just the benefits of SAC's bounded actions to TD3 to achieve better performance. Finally, we look at an improvement to GAIL on raw pixel observations that focuses on task-relevant details.
RL Weekly 32: New SotA Sample Efficiency on Atari and an Analysis of the Benefits of Hierarchical RL
In this issue, we look at LASER, DeepMind's improvement to V-trace that achieves state-of-the-art sample efficiency in Atari environments. We also look at Google AI and UC Berkeley's study on hierarchical RL, analyzing and isolating the reason behind the benefits of hierarchical RL.
RL Weekly 31: How Agents Play Hide and Seek, Attraction-Repulsion Actor Critic, and Efficient Learning from Demonstrations
In this issue, we look at OpenAI's work on multi-agent hide and seek and the behaviors that emerge. We also look at Mila's population-based exploration that exceeds the performance of various TD3 and SAC baselines. Finally, we look at DeepMind's R2D3, a new algorithm to learn from demonstrations.
Reinforcement Learning Papers Accepted to NeurIPS 2019
I have compiled a list of 184 reinforcement learning papers accepted to NeurIPS 2019.
RL Weekly 30: Learning State and Action Embeddings, a New Framework for RL in Games, and an Interactive Variant of Question Answering
In this issue, we look at a representation learning method to train state and action embeddings paired with TD3. We also look at a new framework with 20+ environments and algorithms from multiple fields of RL. Finally, we look at a new take on using RL for Question Answering (QA).
GSoC TensorFlow Part 7: Retrospective
As of August 27th, the Google Summer of Code coding phase is officially over. In this post, I look back at the summer, reviewing my accomplishments and shortcomings. Because I will continue contributing to TensorFlow and TF-Agents, I also outline my plans for the next fall.
RL Weekly 29: The Behaviors and Superstitions of RL, and How Deep RL Compares with the Best Humans in Atari
In this issue, we look at reinforcement learning from a wider perspective. We look at new environments and experiments designed to test and challenge agents' capabilities. We also compare existing RL agents against the playthroughs of the best human Atari players.
GSoC TensorFlow Part 6: Evaluating RND on Mountain Car
Having finished implementing Random Network Distillation by Burda et al., it is now time to evaluate the algorithm in various hard-exploration environments. I start the evaluation with Mountain Car, a simple environment that requires extensive exploration to reach the goal state.
RL Weekly 28: Free-Lunch Saliency and Hierarchical RL with Behavior Cloning
This week, we first look at Free-Lunch Saliency, a built-in interpretability module that does not deteriorate performance. Then, we look at HRL-BC, a combination of high-level RL policy with low-level skills trained through behavior cloning.
RL Weekly 27: Diverse Trajectory-conditioned Self Imitation Learning and Environment Probing Interaction Policies
This week, we look at a self-imitation learning method that imitates diverse past experiences for better exploration. We also summarize an environment-probing policy that helps an agent adapt to different environments.
RL Weekly 26: Transfer RL with Credit Assignment and Convolutional Reservoir Computing for World Models
This week, we summarize a new transfer learning method using the Transformer reward model, and a world model controller that does not require training the feature extractor.
RL Weekly 25: Replacing Bias with Adaptive Methods, Batch Off-policy Learning, and Learning Shared Model for Multi-task RL
In this issue, we focus on replacing inductive bias with adaptive solutions (DeepMind), learning off-policy from expert experience (Google Brain), and learning a shared model for multitask RL (Stanford).
GSoC TensorFlow Part 5: Implementing the Core of RND
This week, I give a brief summary of Random Network Distillation (RND), complemented with the code I have written for TF-Agents. I then list further work needed to finish implementing RND, and plans for evaluating the algorithm once finished.
RL Weekly 24: Benchmarks for Model-based RL and Bonus-based Exploration Methods
This week, we summarize two benchmark papers. The first paper benchmarks 11 model-based RL algorithms in 18 continuous control environments, and the second paper benchmarks 4 bonus-based exploration methods in 9 Atari environments. Both papers agree that a standardized benchmark is needed for an objective analysis of new algorithms.
RL Weekly 23: Decentralized Hierarchical RL, Deep Conservative Policy Iteration, and Optimistic PPO
This week, we first introduce an ensemble of primitives that can make decentralized decisions without a high-level meta-policy. We then look at a deep learning extension of Conservative Policy Iteration that borrows ideas from DQN. Finally, we look at Optimistic PPO, an extension of PPO that encourages exploration through the uncertainty Bellman equation.
GSoC TensorFlow Part 4: First Evaluation
This week, I look back at the first coding phase of GSoC, summarizing my work and setting goals for the next phase.
RL Weekly 22: Unsupervised Learning for Atari, Model-based Policy Optimization, and Adaptive-TD
This week, we first look at ST-DIM, an unsupervised state representation learning method from MILA and Microsoft Research. We also check UC Berkeley's new policy optimization method that uses model-based branch rollouts. Finally, we look at Adaptive-TD, a new method of mixing MC and TD from DeepMind, Google Research, and Universitat Pompeu Fabra.
RL Weekly 21: The interplay between Experience Replay and Model-based RL
This week, we introduce three papers on replay-based RL and model-based RL. The first paper introduces SoRB, a way to combine experience replay and planning. The second paper introduces a consistency loss to ensure that a model is consistent with the real environment. The final paper compares model-based agents with replay-based agents.
RL Weekly 20: Minecraft Competition, Off-policy Policy Evaluation via Classification, and Soft-attention Agent for Interpretability
This week, we introduce MineRL, a new RL competition using human priors to solve Minecraft. We also introduce OPE, a method of off-policy evaluation through classification, and a soft-attention agent for greater interpretability.
GSoC TensorFlow Part 3: Simple Environment Wrapper with gin-config
This week, I implemented a simple environment wrapper to prepare myself for implementing curiosity modules.
RL Weekly 19: Curious Object-Based Search Agent, Multiplicative Compositional Policies, and AutoRL
This week, we introduce combining unsupervised learning, exploration, and model-based RL; learning composable motor skills; and evolving rewards.
GSoC TensorFlow Part 2: Improving Documentation
A great way to learn the material is to make modifications. This week, I summarize my experience of creating a pull request to TF-Agents to improve its documentation.
RL Weekly 18: Survey of Domain Randomization Techniques for Sim-to-Real Transfer, and Evaluating Deep RL with ToyBox
This week, we introduce a survey of Domain Randomization Techniques for Sim-to-Real Transfer and ToyBox, a suite of redesigned Atari Environments for experimental evaluation of deep RL.
GSoC TensorFlow Part 1: Setting Up TF-Agents
I have been accepted to the Google Summer of Code program to work on TensorFlow for three months. I will be working on TensorFlow's reinforcement learning library, TF-Agents. In this post, I briefly summarize the steps I took to set up the TF-Agents environment for future reference.
RL Weekly 17: Information Asymmetry in KL-regularized Objective, Real-world Challenges to RL, and Fast and Slow RL
In this issue, we summarize the use of information asymmetry in the KL-regularized objective to regularize the policy, the challenges of deploying deep RL in real-world systems, and possible insights into psychology and neuroscience from deep RL.
RL Weekly 16: Why Performance Plateaus May Occur, and Compressing DQNs
In this issue, we introduce 'ray interference,' a possible cause of performance plateaus in deep reinforcement learning conjectured by Google DeepMind. We also introduce a network distillation method proposed by researchers at Carnegie Mellon University.
RL Weekly 15: Learning without Rewards: from Active Queries or Suboptimal Demonstrations
In this issue, we introduce VICE-RAQ by UC Berkeley and T-REX by UT Austin and Preferred Networks. VICE-RAQ trains a classifier to infer rewards from goal examples and active querying. T-REX learns reward functions from suboptimal demonstrations ranked by humans.
RL Weekly 14: OpenAI Five and Berkeley Blue
In this week's issue, we summarize the Dota 2 match between OpenAI Five and OG eSports and introduce Blue, a new low-cost robot developed by the Robot Learning Lab at UC Berkeley.
RL Weekly 13: Learning to Toss, Learning to Paint, and How to Explain RL
In this week's issue, we summarize results from Princeton, Google, Columbia, and MIT on training a robot arm to throw objects. We also look at a model-based DDPG developed by Peking University and Megvii that can reproduce pictures through paint strokes. Finally, we look at an empirical study by Oregon State University about explaining RL to laypeople.
RL Weekly 12: Atari Demos with Human Gaze Labels, New SOTA in Meta-RL, and a Hierarchical Take on Intrinsic Rewards
This week, we look at a new demonstration dataset of Atari games that includes trajectories and human gaze data. We also look at PEARL, a new meta-RL method that boasts sample efficiency and performance superior to previous state-of-the-art algorithms. Finally, we look at a novel method of incorporating intrinsic rewards.
RL Weekly 11: The Bitter Lesson by Richard Sutton, the Promise of Hierarchical RL, and Exploration with Human Feedback
In this issue, we first look at an essay by Richard S. Sutton (DeepMind, UAlberta) on Compute versus Clever. Then, we look at a post summarizing Hierarchical RL by Yannis Flet-Berliac (INRIA SequeL). Finally, we summarize a paper incorporating human feedback for exploration from Delft University of Technology.
RL Weekly 10: Learning from Playing, Understanding Multi-agent Intelligence, and Navigating in Google Street View
In this issue, we look at Google Brain's algorithm of learning by playing, DeepMind's thoughts on multi-agent intelligence, and DeepMind's new navigation environment using Google Street View data.
RL Weekly 9: Sample-efficient Near-SOTA Model-based RL, Neural MMO, and Bottlenecks in Deep Q-Learning
In this issue, we look at SimPLe, a model-based RL algorithm that achieves near-state-of-the-art results on the Arcade Learning Environment (ALE). We also look at Neural MMO, a new multi-agent environment by OpenAI, and an empirical analysis of possible sources of error in deep Q-learning by BAIR.
RL Weekly 8: World Discovery Models, MuJoCo Soccer Environment, and Deep Planning Network
In this issue, we introduce World Discovery Models and MuJoCo Soccer Environment from Google DeepMind, and PlaNet from Google.
Obstacle Tower 6: Submitting a Random Agent
We submit a random agent to the Obstacle Tower Challenge that just began.
RL Weekly 7: Obstacle Tower Challenge, Hanabi Learning Environment, and Spinning Up Workshop
This week, we introduce the Obstacle Tower Challenge, a new RL competition by Unity, Hanabi Learning Environment, a multi-agent environment by DeepMind, and Spinning Up Workshop, a workshop hosted by OpenAI.
Obstacle Tower 5: Possible Improvements to the Baselines
We explore possible improvements to the baseline agents tested on Obstacle Tower.
Obstacle Tower 4: Understanding the Baselines
We briefly introduce Rainbow and PPO, the two baselines that were tested on Obstacle Tower.
Slow Papers: The Obstacle Tower: A Generalization Challenge in Vision, Control, and Planning (Juliani et al., 2019)
The rapid pace of research development in Deep Reinforcement Learning has been driven by the presence of fast and challenging simulation environments. These environments often take the form of video games, such as the Atari games provided in the Arcade Learning Environment (ALE). In the past year, however, significant progress has been made in achieving superhuman performance on even the most difficult and heavily studied game in the ALE: Montezuma's Revenge. We propose a new benchmark environment, Obstacle Tower: a high visual fidelity, 3D, third-person, procedurally generated environment. An agent in the Obstacle Tower must learn to solve both low-level control and high-level planning problems in tandem, learning from pixels and a sparse reward signal in order to make it as high as possible up the tower. In this paper we outline the environment and provide a set of initial baseline results using current state-of-the-art Deep RL methods as well as human players. In all cases these algorithms fail to produce agents capable of performing anywhere near human level on a set of evaluations designed to test both memorization and generalization ability. As such, we believe that the Obstacle Tower has the potential to serve as a helpful Deep RL benchmark now and into the future.
Obstacle Tower 3: Observation Space and Action Space
We analyze the observation space and the action space provided by the Obstacle Tower environment.
Obstacle Tower 2: Playing the Game
We play the Obstacle Tower game to understand the qualities of a successful agent.
Obstacle Tower 1: Installing the Environment
Unity introduced the Obstacle Tower Challenge, a new reinforcement learning contest with a difficult environment. In this post, we guide the readers on installing the environment on Linux using conda.
RL Weekly 6: AlphaStar, Rectified Nash Response, and Causal Reasoning with Meta RL
This week, we look at AlphaStar, a StarCraft II AI, PSRO_rN, an evaluation algorithm encouraging a diverse population of well-trained agents, and a novel Meta-RL approach for causal reasoning. All three results are from DeepMind.
Deep RL Seminar Week 2: Deep Q-Networks
This week, we reviewed various improvements made to the Deep Q-Network algorithm.
RL Weekly 5: Robust Control of Legged Robots, Compiler Phase-Ordering, and Go Explore on Sonic the Hedgehog
This week, we look at impressive robust control of legged robots by ETH Zurich and Intel, compiler phase-ordering by UC Berkeley and MIT, and a partial implementation of Uber's Go Explore.
RL Weekly 4: Generating Problems with Solutions, Optical Flow with RL, and Model-free Planning
In this issue, we introduce a new curriculum learning algorithm by Uber AI Labs, a model-free planning algorithm by DeepMind, and an optical-flow-based control algorithm by Intel Labs and the University of Freiburg.
RL Weekly 3: Learning to Drive through Dense Traffic, Learning to Walk, and Summarizing Progress in Sim-to-Real
In this issue, we introduce the DeepTraffic competition from Lex Fridman's MIT Deep Learning for Self-Driving Cars course. We also review a new paper on using SAC to control a four-legged robot, and introduce a website summarizing progress in sim-to-real algorithms.
PyTorch Implementations of Policy Gradient Methods
A well-written baseline is crucial to research. We compare and recommend popular open source implementations of reinforcement learning algorithms in PyTorch.
RL Weekly 2: Tuning AlphaGo, Macro-strategy for MOBA, Sim-to-Real with conditional GANs
In this issue, we discuss hyperparameter tuning for AlphaGo from DeepMind, Hierarchical RL model for a MOBA game from Tencent, and GAN-based Sim-to-Real algorithm from X, Google Brain, and DeepMind.
RL Weekly 1: Soft Actor-Critic Code Release; Text-based RL Competition; Learning with Training Wheels
In this inaugural issue of the RL Weekly newsletter, we discuss Soft Actor-Critic (SAC) from BAIR, the new TextWorld competition by Microsoft Research, and AsDDPG from University of Oxford and Heriot-Watt University.
Slow Papers: Exploration by Random Network Distillation (Burda et al., 2018)
We introduce an exploration bonus for deep reinforcement learning methods that is easy to implement and adds minimal overhead to the computation performed. The bonus is the error of a neural network predicting features of the observations given by a fixed randomly initialized neural network. We also introduce a method to flexibly combine intrinsic and extrinsic rewards. We find that the random network distillation (RND) bonus combined with this increased flexibility enables significant progress on several hard exploration Atari games. In particular we establish state of the art performance on Montezuma's Revenge, a game famously difficult for deep reinforcement learning methods. To the best of our knowledge, this is the first method that achieves better than average human performance on this game without using demonstrations or having access to the underlying state of the game, and occasionally completes the first level. This suggests that relatively simple methods that scale well can be sufficient to tackle challenging exploration problems.
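A minimal sketch of the bonus described in the abstract, assuming small vector observations and hypothetical network sizes (illustrative only, not the authors' implementation):

```python
import torch
import torch.nn as nn

def make_net(obs_dim, feat_dim=128):
    return nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim))

obs_dim = 64                              # hypothetical observation size
target = make_net(obs_dim)                # fixed, randomly initialized target network
for param in target.parameters():
    param.requires_grad_(False)
predictor = make_net(obs_dim)             # trained to predict the target's features
optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def rnd_bonus(obs):
    """Prediction error of the predictor network, used as the intrinsic reward
    (detach before adding it to the extrinsic reward)."""
    with torch.no_grad():
        target_features = target(obs)
    return (predictor(obs) - target_features).pow(2).mean(dim=-1)

def update_predictor(obs):
    """Train the predictor on visited observations; novel states keep a high error."""
    loss = rnd_bonus(obs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```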
Slow Papers: A Deeper Look at Experience Replay (Zhang and Sutton, 2017)
Experience replay is widely used in various deep reinforcement learning (RL) algorithms; in this paper, we rethink its utility. Experience replay introduces a new hyperparameter, the memory buffer size, which needs careful tuning. Unfortunately, the importance of this hyperparameter has long been underestimated in the community. In this paper, we conduct a systematic empirical study of experience replay under various function representations. We show that a large replay buffer can significantly hurt performance. Moreover, we propose a simple O(1) method to remedy the negative influence of a large replay buffer. We showcase its utility in both a simple grid world and challenging domains like Atari games.
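As I understand it, the paper's O(1) remedy (combined experience replay) always includes the most recent transition in the sampled mini-batch; the sketch below is an illustration under that assumption, not the authors' code:

```python
import random
from collections import deque

class CombinedReplayBuffer:
    """Uniform replay buffer that always adds the newest transition to the
    sampled batch, so fresh data is not starved by a very large buffer."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Sample batch_size - 1 transitions uniformly, then append the newest one.
        batch = random.sample(self.buffer, min(batch_size - 1, len(self.buffer)))
        batch.append(self.buffer[-1])
        return batch
```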
Slow Papers: Neural Fitted Q Iteration (Riedmiller, 2005)
This paper introduces NFQ, an algorithm for efficient and effective training of a Q-value function represented by a multi-layer perceptron. Based on the principle of storing and reusing transition experiences, a model-free, neural network-based RL algorithm is proposed. The method is evaluated on three benchmark problems. It is shown empirically that reasonably few interactions with the plant are needed to generate control policies of high quality.
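A minimal sketch of the fitted Q iteration pattern behind NFQ, assuming a hypothetical regressor interface with predict and fit (illustrative, not the paper's exact setup):

```python
import numpy as np

def nfq_iteration(q_model, transitions, actions, gamma=0.99):
    """One NFQ-style fitted Q iteration over the stored transitions.
    `q_model` is a hypothetical regressor exposing predict(X) and fit(X, y),
    where each row of X is a concatenated (state, action) pair."""
    inputs, targets = [], []
    for state, action, reward, next_state, done in transitions:
        if done:
            target = reward
        else:
            # Bootstrapped target: r + gamma * max_a' Q(s', a')
            next_q = [
                q_model.predict(np.concatenate([next_state, a])[None, :])[0]
                for a in actions
            ]
            target = reward + gamma * max(next_q)
        inputs.append(np.concatenate([state, action]))
        targets.append(target)
    # Supervised regression on the whole pattern set, as in NFQ
    q_model.fit(np.array(inputs), np.array(targets))
    return q_model
```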
AI for Prosthetics Week 9 - 10: Unorthodox Approaches
We end the series by exploring possible unorthodox approaches for the competition. These are approaches that deviate from the popular policy gradient methods such as DDPG or PPO.
Notes from the ai.x 2018 Conference: Faster Reinforcement Learning via Transfer
SK T-Brain hosted the ai.x Conference on September 6th in Seoul, South Korea. At this conference, John Schulman (OpenAI) spoke about faster reinforcement learning via transfer.
Pommerman 1: Understanding the Competition
Pommerman is one of the NIPS 2018 Competition tracks, where participants seek to build agents to compete against other agents in a game of Bomberman. In this post, we simply explain the basics of Pommerman, leaving reinforcement learning to later posts.
AI for Prosthetics Week 6: General Techniques of RL
This week, we take a step back from the competition and study common techniques used in Reinforcement Learning.
AI for Prosthetics Week 5: Understanding the Reward
The goal of reinforcement learning is defined by the reward signal: to maximize the cumulative reward throughout an episode. In some ways, the reward is the most important aspect of the environment for the agent: even if the agent does not learn the values of states or actions (as in Evolution Strategies), it is a great agent as long as it can consistently obtain a high return (cumulative reward).
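As a reminder of the standard definition (textbook notation, not quoted from the post), the return from time step t is the discounted sum of future rewards:

```latex
G_t \;=\; R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
    \;=\; \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},
\qquad 0 \le \gamma \le 1
```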
AI for Prosthetics Week 3-4: Understanding the Observation Space
The observation can be roughly divided into five components: the body parts, the joints, the muscles, the forces, and the center of mass. For each body part component, the agent observes its position, velocity, acceleration, rotation, rotational velocity, and rotational acceleration.
AI for Prosthetics Week 2: Understanding the Action Space
Last week, we saw that a valid action consists of 19 numbers, each between 0 and 1. These 19 numbers represent the amount of force to apply to each muscle. I know barely anything about muscles, so I decided to manually go through all the muscles to understand the effect of each one...
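A hypothetical sketch of that probing loop, assuming the osim-rl ProstheticsEnv API with a 19-dimensional action (adjust the import and constructor to the installed version):

```python
# Hypothetical sketch: activate one muscle at a time and watch the result.
import numpy as np
from osim.env import ProstheticsEnv

env = ProstheticsEnv(visualize=True)
for muscle_index in range(19):
    observation = env.reset()
    done = False
    while not done:
        action = np.zeros(19)
        action[muscle_index] = 1.0          # fully activate a single muscle
        observation, reward, done, info = env.step(action.tolist())
```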
AI for Prosthetics Week 1: Understanding the Challenge
The AI for Prosthetics challenge is one of the NIPS 2018 Competition tracks. In this challenge, participants seek to build an agent that can make a 3D model of a human with a prosthesis run. This challenge is a continuation of the Learning to Run challenge (shown below) that was part of the NIPS 2017 Competition Track. The challenge was enhanced in three ways...
Bias-variance Tradeoff in Reinforcement Learning
The bias-variance tradeoff is a familiar term to most people who have learned machine learning. In the context of Machine Learning, bias and variance refer to the model: a model that underfits the data has high bias, whereas a model that overfits the data has high variance. In Reinforcement Learning, we consider another bias-variance tradeoff.
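For example (standard textbook framing, not quoted from the post), the Monte Carlo return is an unbiased but high-variance target for the value of a state, while the one-step TD target trades bias for lower variance by bootstrapping from the current value estimate:

```latex
\underbrace{R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots}_{\text{Monte Carlo target: unbiased, high variance}}
\qquad \text{vs.} \qquad
\underbrace{R_{t+1} + \gamma \hat{V}(S_{t+1})}_{\text{TD(0) target: biased, lower variance}}
```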
I learned DQNs with OpenAI competition
In April, OpenAI held a two-month-long competition called the Retro Contest, where participants had to develop an agent that could perform well on unseen custom-made stages of Sonic the Hedgehog. The agents were limited to 100 million steps per stage and 12 hours of time on a VM with 6 E5-2690v3 cores, 56GB of RAM, and a single K80 GPU.