As of August 27th, the Google Summer of Code (GSoC) coding phase is officially over. In this post, I look back at the summer, reviewing my accomplishments and shortcomings. Because I will continue contributing to TensorFlow and TF-Agents, I also outline my plans for the coming fall.

Accomplishments

TL;DR

  • Implemented Random Network Distillation (RND).
  • Verified RND in MountainCar-v0, a hard-exploration environment.
  • Enabled training PPO on Atari environments.
  • Other minor documentation fixes.

I implemented Random Network Distillation (RND), a simple prediction-based exploration algorithm that achieved state-of-the-art performance on Atari Montezuma’s Revenge. Exploration algorithms specialize in “hard exploration” environments, where the reward signal is too sparse for the agent to learn from. I therefore tested RND on Mountain Car, a classic hard-exploration environment, and verified that it solved the task. In contrast, Proximal Policy Optimization (PPO), the baseline algorithm, struggled to succeed even once in the same setting.
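
For concreteness, here is a minimal sketch of the RND bonus, assuming TensorFlow 2.x Keras; the network sizes are illustrative and this is not the TF-Agents implementation. A fixed, randomly initialized target network embeds each observation, a predictor network is trained to match that embedding, and the prediction error is used as the intrinsic reward.

```python
import tensorflow as tf


def build_rnd_networks(obs_dim, embed_dim=64):
    """Builds a fixed random target network and a trainable predictor network."""
    target = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu', input_shape=(obs_dim,)),
        tf.keras.layers.Dense(embed_dim),
    ])
    target.trainable = False  # The target stays random and is never updated.
    predictor = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu', input_shape=(obs_dim,)),
        tf.keras.layers.Dense(embed_dim),
    ])
    return target, predictor


def intrinsic_reward(target, predictor, observations):
    """Per-observation prediction error; novel states receive a larger bonus."""
    error = tf.square(predictor(observations) - target(observations))
    return tf.reduce_mean(error, axis=-1)


def predictor_loss(target, predictor, observations):
    """The predictor minimizes the same error, so the bonus fades for familiar states."""
    return tf.reduce_mean(intrinsic_reward(target, predictor, observations))
```

Because the predictor only improves on states it actually visits, the bonus decays for familiar states while staying high for novel ones, which is exactly the signal a sparse-reward task like Mountain Car needs.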

Also, the example scripts provided by TF-Agents did not include training PPO on Atari environments. I wrote a new script that trains PPO on Atari, mostly following the preprocessing techniques and hyperparameters of the original PPO paper. This took a surprisingly long time due to structural differences: for example, the TF-Agents code collected episode rollouts, whereas the original PPO implementation used fixed-length rollouts.
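
To illustrate that structural difference, here is a minimal sketch of the two collection schemes against a gym-style environment and policy; the function names are hypothetical stand-ins, not TF-Agents code.

```python
def collect_episode(env, policy):
    """Episode rollout: run until the environment signals termination."""
    trajectory, obs, done = [], env.reset(), False
    while not done:
        action = policy(obs)
        next_obs, reward, done, _ = env.step(action)
        trajectory.append((obs, action, reward, next_obs, done))
        obs = next_obs
    return trajectory


def collect_fixed_length(env, policy, obs, num_steps=128):
    """Fixed-length rollout (as in the PPO paper): gather exactly num_steps
    transitions, resetting the environment whenever an episode ends."""
    trajectory = []
    for _ in range(num_steps):
        action = policy(obs)
        next_obs, reward, done, _ = env.step(action)
        trajectory.append((obs, action, reward, next_obs, done))
        obs = env.reset() if done else next_obs
    return trajectory, obs  # Carry the last observation into the next rollout.
```

In TF-Agents, these two schemes roughly correspond to the DynamicEpisodeDriver and DynamicStepDriver collection drivers.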

Throughout the summer, I also found a few places where the docstrings and comments were not updated to match the code. I made a small pull request correcting these discrepancies.

Shortcomings

TL;DR

  • Did not implement ICM (Intrinsic Curiosity Module) or DQN+CTS (Context Tree Switching).
  • Could not reproduce RND results in the original paper.

Unfortunately, contrary to my initial expectation, I was unable to implement all the algorithms listed in my proposal. DQN+CTS was dropped at the beginning of the summer after discussing the timeline with my mentors, but I thought I would still be able to implement ICM, since it is quite similar to RND. In retrospect, it might have been better to implement ICM first and RND afterwards, since RND had many specific details that needed to be implemented before it could be validated.
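
To show why the two algorithms felt so similar, here is a minimal sketch of the ICM bonus, again assuming TensorFlow 2.x Keras with illustrative layer sizes rather than the original paper’s architecture. Like RND, the intrinsic reward is a prediction error; the difference is that the target is a learned encoding of the next state instead of a fixed random embedding of the current one.

```python
import tensorflow as tf


class ICM(tf.keras.Model):
    """Intrinsic Curiosity Module: the bonus is the forward model's error in
    predicting the next state's features."""

    def __init__(self, obs_dim, num_actions, feature_dim=64):
        super().__init__()
        self.encoder = tf.keras.Sequential([
            tf.keras.layers.Dense(128, activation='relu', input_shape=(obs_dim,)),
            tf.keras.layers.Dense(feature_dim),
        ])
        # Forward model: (features, one-hot action) -> predicted next features.
        self.forward_model = tf.keras.Sequential([
            tf.keras.layers.Dense(128, activation='relu',
                                  input_shape=(feature_dim + num_actions,)),
            tf.keras.layers.Dense(feature_dim),
        ])
        # Inverse model: (features, next features) -> action logits. Training it
        # keeps the encoder focused on parts of the state the agent can affect.
        self.inverse_model = tf.keras.Sequential([
            tf.keras.layers.Dense(128, activation='relu',
                                  input_shape=(2 * feature_dim,)),
            tf.keras.layers.Dense(num_actions),
        ])

    def intrinsic_reward(self, obs, actions_one_hot, next_obs):
        features, next_features = self.encoder(obs), self.encoder(next_obs)
        predicted = self.forward_model(
            tf.concat([features, actions_one_hot], axis=-1))
        return tf.reduce_mean(tf.square(predicted - next_features), axis=-1)

    def inverse_loss(self, obs, action_labels, next_obs):
        features, next_features = self.encoder(obs), self.encoder(next_obs)
        logits = self.inverse_model(
            tf.concat([features, next_features], axis=-1))
        return tf.reduce_mean(tf.keras.losses.sparse_categorical_crossentropy(
            action_labels, logits, from_logits=True))
```

The extra encoder and inverse model are what make ICM more involved than RND, which only needs the predictor loss.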

For RND, although I was able to verify the algorithm on the Mountain Car environment, it did not work on some Atari environments, which prevented it from being merged into TF-Agents master. For both PPO and RND, running on Atari Venture or Atari Montezuma’s Revenge resulted in an exploding value estimation loss. The problem did not appear on Atari Pong. Since it only occurred in environments with long episodes, I suspect it may be caused by the different rollout lengths that TF-Agents uses.

Retrospective

Implementing state-of-the-art reinforcement learning algorithms in TensorFlow was definitely exciting, but another exciting aspect of GSoC was learning what it takes to design, maintain, and contribute to a large-scale project that is developed and used by others. In my personal projects, I was the sole developer, so it was easy to understand the entire codebase and make drastic changes. In GSoC, however, I had to ask my mentors many questions just to understand how some parts of the code functioned, and I had to develop around the existing codebase without breaking anything. This resulted in a lot more downtime than I had expected.

For those interested in applying to GSoC next summer, I highly recommend building a timeline that leaves plenty of room for unexpected events. I believe I estimated the unexpected events on my own part fairly accurately, but I did not account for blockers caused by other contributors to TF-Agents. If you keep in mind that your project will be developed in parallel with others, you can build a more realistic timeline.

One thing I regret is not asking questions early enough during the first half of the program. Many of the problems I struggled with for days were resolved after a short talk with my mentors during the weekly meeting. Had I asked for help earlier, I might have finished implementing RND by now! I hope future GSoC participants do not make the same mistake.

What’s next?

Overall, I had an amazing experience this summer meeting awesome people and working on exciting projects. For this great summer, I must thank my mentors Oscar Ramirez and Mandar Sanjay Deshpande. As a central developer of TF-Agents, Oscar resolved many of the issues I had faced and offered guidance throughout the summer. As a former GSoC participant, Mandar gave valuable insights about having a successful summer project. I also thank Paige Bailey, our great GSoC TensorFlow organizer. I first learned about the program through Paige’s tweet, and she also provided computing resources via GCP.

Google Summer of Code (GSoC) is over, but I consider it an “introduction” to open source contribution rather than an isolated project. I plan to continue contributing to TF-Agents, especially by wrapping up ICM and RND and merging them into master. I expect RND to be production-ready in another month (September) and ICM in two or three months (November). Afterwards, I will keep contributing to TF-Agents by responding to issues and making the necessary fixes. I am excited for the future of TF-Agents!