RL Weekly 24: Benchmarks for Model-based RL and Bonus-based Exploration Methods

Benchmarking Model-based Reinforcement Learning

Tingwu Wang¹², Xuchan Bao¹², Ignasi Clavera³, Jerrick Hoang¹², Yeming Wen¹², Eric Langlois¹², Shunshi Zhang¹², Guodong Zhang¹², Pieter Abbeel³, Jimmy Ba¹²

¹University of Toronto, ²Vector Institute, ³UC Berkeley

What it says

Reinforcement learning algorithms are often partitioned into two categories: model-free RL and model-based RL. Model-free algorithms such as Rainbow or Soft Actor Critic (SAC) can achieve great performance in diverse tasks, but only at the expense of high sample complexity. Model-based algorithms use a model of the environment to lower sample complexity, but they suffer in performance due to “model bias”, a phenomenon where policies are trained to exploit the inaccuracies of the environment model. Various methods have been proposed to mitigate model bias, and recent model-based algorithms have yielded results competitive with their model-free counterparts. Yet it is difficult to measure progress in the field of model-based RL due to the lack of standardization of results. Different papers use different environments, different preprocessing techniques, and different rewards, making it impossible to compare them directly.

Thus, the authors benchmark 11 model-based algorithms and 4 model-free algorithms across 18 environments. They group the model-based algorithms into three categories: Dyna-style algorithms (Section 3.1), Policy Search with backpropagation through time (Section 3.2), and Shooting algorithms (Section 3.3). For each algorithm, the authors analyze its performance (Section 4.3) and its robustness to noisy observations or actions (Section 4.4). They conclude that there is no clear winner yet in the field of model-based RL (Table 5).
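
To make the third category concrete, here is a minimal sketch of random shooting, one simple shooting-style planner: sample candidate action sequences, roll each one through the model, and execute the first action of the best sequence. The `toy_model` dynamics, its reward, the action range, and all parameter values are hypothetical choices for illustration and are not taken from the paper's benchmark code.

```python
import random


def toy_model(state, action):
    """Hypothetical 1-D dynamics model: the reward is higher closer to 0."""
    next_state = state + action
    reward = -abs(next_state)
    return next_state, reward


def random_shooting(model, state, horizon=5, n_candidates=200, rng=None):
    """Return the first action of the best randomly sampled action sequence."""
    rng = rng or random.Random(0)
    best_return, best_first_action = float("-inf"), 0.0
    for _ in range(n_candidates):
        # Sample one candidate plan: `horizon` actions in [-1, 1].
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        s, total = state, 0.0
        # Evaluate the plan by rolling it out in the model, not the real environment.
        for a in actions:
            s, r = model(s, a)
            total += r
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action
```

Calling `random_shooting(toy_model, 3.0)` returns a single action in [-1, 1]. In a model predictive control loop, the agent would execute that action and replan from the next state.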

The authors also hypothesize three main causes of performance degradation: the dynamics bottleneck, the planning horizon dilemma, and the early termination dilemma. The dynamics bottleneck is the problem that model-based RL algorithms are more prone to plateauing at local minima than their model-free counterparts (Section 4.5). The planning horizon dilemma is that increasing the horizon on one hand allows for better reward estimation, but on the other hand decreases performance due to the curse of dimensionality (Section 4.6, Appendix F). Finally, the early termination dilemma is the problem that most early termination schemes cannot be used successfully with model-based RL methods (Section 4.7, Appendix G).

Read more

External resources

Dyna-Style Algorithms
Policy Search with Backpropagation through Time
Shooting Algorithms
Model-free Baselines

Benchmarking Bonus-Based Exploration Methods on the Arcade Learning Environment

Adrien Ali Taïga¹², William Fedus¹², Marlos C. Machado², Aaron Courville¹³, Marc G. Bellemare²³

¹MILA, Université de Montréal, ²Google Research, Brain Team, ³CIFAR Fellow

What it says

Recently, various exploration methods have been proposed to help reinforcement learning agents perform well in “hard exploration games” such as Montezuma’s Revenge. Among them are bonus-based methods, where the “extrinsic” reward given by the environment is augmented by an “intrinsic” reward computed by the agent. The proposed intrinsic rewards include pseudo-counts through density models (CTS, Bellemare et al., 2016; PixelCNN, Van den Oord et al., 2016), Intrinsic Curiosity Module (ICM, Pathak et al., 2017), and Random Network Distillation (RND, Burda et al., 2019).
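
As a rough illustration of how a bonus-based method augments the reward, the sketch below uses a plain tabular visit count as a simplified stand-in for pseudo-counts. The `CountBonus` class, the `beta` scale, and the 1/sqrt(N) form are assumptions made for illustration. The benchmarked methods instead derive their bonuses from density models (CTS, PixelCNN) or from prediction errors (ICM, RND).

```python
import math
from collections import Counter


class CountBonus:
    """Tabular visit-count bonus; a simplified stand-in for pseudo-counts."""

    def __init__(self, beta=0.1):
        self.beta = beta          # hypothetical scale of the intrinsic reward
        self.counts = Counter()   # number of visits per (hashable) state

    def augmented_reward(self, state, extrinsic_reward):
        """Return the extrinsic reward plus a bonus that decays with visits."""
        self.counts[state] += 1
        intrinsic = self.beta / math.sqrt(self.counts[state])
        return extrinsic_reward + intrinsic
```

Rarely visited states receive a larger bonus, so an agent trained on the augmented reward is nudged toward novel states even when the extrinsic reward is zero.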

Although these papers report their performances, it is difficult to compare them due to several discrepancies. Pseudo-count-based intrinsic rewards were tested with Deep Q-Networks (DQNs), whereas ICM and RND were tested with Proximal Policy Optimization (PPO). They also use different hyperparameters such as the number of frames, the number of random seeds, or discount factors.

Thus, to allow for direct comparison, the authors removed these discrepancies and trained all of these methods under a common setup. The authors used Dopamine-style Rainbow, which is an upgraded version of DQN with prioritized experience replay, n-step learning, and distributional RL. The authors also recorded the performance of an epsilon-greedy agent and a NoisyNet agent to serve as baseline methods.
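
For reference, the epsilon-greedy baseline amounts to the following action-selection rule. This is a generic sketch of the textbook rule, not the Dopamine implementation, and the function name and arguments are hypothetical.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy action: index of the largest Q-value (first one on ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Unlike the bonus-based methods, this rule explores without any notion of novelty, which is what makes it a useful baseline.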

The authors report that on Montezuma’s Revenge (top figure), pseudo-counts (CTS, PixelCNN) outperformed newer algorithms (ICM, RND). Surprisingly, the authors also report that for other Atari games, no bonus-based method outperforms the baseline methods: epsilon-greedy and NoisyNet. This is the case both for games labelled as “hard exploration games” (next six figures) and for “easy exploration games” (last six figures).

Read more

Benchmarking Bonus-Based Exploration Methods on the Arcade Learning Environment (PDF)

External resources

One-line introductions to some more exciting news in RL this week:

SLAC: By learning a compact latent representation space and learning a critic model in this latent space, this model-free algorithm can achieve sample efficiency competitive with model-based algorithms!
DADS: An agent can discover diverse skills without any extrinsic reward, and model predictive control can compose these skills to solve downstream tasks without additional training!
MULEX: Using multiple Q-networks for each extrinsic or intrinsic reward to disentangle exploration and exploitation gives better performance than using a single Q-network in Montezuminha, a grid world version of Montezuma’s Revenge!
MoPPO: Combining Modified Policy Iteration (MPI) with the soft greediness of PPO gives performance competitive with the state of the art (Soft Actor Critic)!
QRDRL: Quantile Regression can be used in environments with continuous action spaces too, and is competitive with PPO and N-TRPO on MuJoCo environments!
