Video2Skill: Adapting Events in Demonstration Videos to Skills in an Environment using Cyclic MDP Homomorphisms

Offline reinforcement learning has been successfully used to teach agents to perform tasks simply from a corpus of expert demonstration data.

However, agents trained this way learn only the task shown in the demonstration dataset, and the state and action spaces of the agent must coincide with those of the demonstrations. This makes the technique hard to apply in real-life robotics scenarios. A recent paper tries to overcome these limitations.

Image credit: Pxfuel, free licence

The researchers use a large corpus of human video tutorials of complex tasks to teach a robotic agent meaningful behaviors in its own environment. First, real-world human demonstration data is used to learn environment-agnostic event representations. Then, a small amount of demonstration data collected in the environment of a real robotic agent is used to adapt these representations into skills. The experiments show that the generated skills lead to effective manipulation of objects.

Humans excel at learning long-horizon tasks from demonstrations augmented with textual commentary, as evidenced by the burgeoning popularity of tutorial videos online. Intuitively, this capability can be separated into two distinct subtasks: first, dividing a long-horizon demonstration sequence into semantically meaningful events; second, adapting such events into meaningful behaviors in one’s own environment. Here, we present Video2Skill (V2S), which attempts to extend this capability to artificial agents by allowing a robot arm to learn from human cooking videos. We first use sequence-to-sequence Auto-Encoder style architectures to learn a temporal latent space for events in long-horizon demonstrations. We then transfer these representations to the robotic target domain, using a small amount of offline and unrelated interaction data (sequences of state-action pairs of the robot arm controlled by an expert) to adapt these events into actionable representations, i.e., skills. Through experiments, we demonstrate that our approach results in self-supervised analogy learning, where the agent learns to draw analogies between motions in human demonstration data and behaviors in the robotic environment. We also demonstrate the efficacy of our approach on model learning, showing how Video2Skill utilizes prior knowledge from human demonstration to outperform traditional model learning of long-horizon dynamics. Finally, we demonstrate the utility of our approach for non-tabula rasa decision-making, i.e., utilizing video demonstration for zero-shot skill generation.
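
To make the two subtasks in the abstract more concrete, here is a minimal toy sketch in Python. It is not the paper's implementation: the fixed-length windowing, the mean-pooling `embed` function (a stand-in for the learned sequence-to-sequence auto-encoder), the cosine-similarity matching, and all names and numbers are assumptions made purely for illustration. It also assumes that video events and robot skills already live in one shared embedding space, which in the paper is what the adaptation step is meant to achieve.

```python
import math


def split_into_events(frames, window):
    """Cut a long sequence of per-frame feature vectors into consecutive windows.

    Hypothetical stand-in for event segmentation; the last window may be shorter.
    """
    return [frames[i:i + window] for i in range(0, len(frames), window)]


def embed(segment):
    """Toy encoder: average the feature vectors in a segment.

    Assumption: this replaces the learned sequence-to-sequence auto-encoder.
    """
    dim = len(segment[0])
    return [sum(vec[d] for vec in segment) / len(segment) for d in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_skill(event_embedding, skill_embeddings):
    """Return the name of the skill whose embedding is most similar to the event."""
    return max(
        skill_embeddings,
        key=lambda name: cosine(event_embedding, skill_embeddings[name]),
    )


# Made-up per-frame features for a four-frame "video" and two made-up robot skills.
video_frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
robot_skills = {"reach": [1.0, 0.0], "lift": [0.0, 1.0]}

events = split_into_events(video_frames, window=2)
matches = [nearest_skill(embed(event), robot_skills) for event in events]
print(matches)
```

In this toy setting each video event is paired with the closest robot skill, which loosely mirrors the "analogy" between human motions and robot behaviors described above. A real system would learn both the encoder and the mapping from data rather than rely on hand-written vectors.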

Research paper: Sontakke, S. A., et al., “Video2Skill: Adapting Events in Demonstration Videos to Skills in an Environment using Cyclic MDP Homomorphisms”, 2021.

