Learning Compositional Neural Programs for Continuous Control

The area of machine learning called deep reinforcement learning has found many successful applications in modern industry and science, especially in such areas as dexterous object manipulation, agile locomotion, autonomous navigation.

However, some fundamental challenges remain: in order to reach human-level AI, algorithms must exhibit ability to plan and manage their activity in hierarchical structure, with varying degrees of abstraction. Also, model-free deep reinforcement learning agents require large number of interactions with their environment to optimize their policies.

In a new research paper appearing on arxiv.org, researchers propose using learned internal model of the world to reduce the number of necessary interactions with the environment. Such approach is based on designing low-level policies by decomposing complex tasks into constituting hierarchical structures, and then recomposing and re-purposing them in order to improve learning sample efficiency and decreasing the need to interact with the real environment:

We propose a novel solution to challenging sparse-reward, continuous control problems that require hierarchical planning at multiple levels of abstraction. Our solution, dubbed AlphaNPI-X, involves three separate stages of learning. First, we use off-policy reinforcement learning algorithms with experience replay to learn a set of atomic goal-conditioned policies, which can be easily repurposed for many tasks. Second, we learn self-models describing the effect of the atomic policies on the environment. Third, the self-models are harnessed to learn recursive compositional programs with multiple levels of abstraction. The key insight is that the self-models enable planning by imagination, obviating the need for interaction with the world when learning higher-level compositional programs. To accomplish the third stage of learning, we extend the AlphaNPI algorithm, which applies AlphaZero to learn recursive neural programmer-interpreters. We empirically show that AlphaNPI-X can effectively learn to tackle challenging sparse manipulation tasks, such as stacking multiple blocks, where powerful model-free baselines fail.

Link to research article: https://arxiv.org/abs/2007.13363