Learning policies with neural networks typically requires either writing a reward function by hand or learning one from human feedback. A recent paper on arXiv.org suggests simplifying the process by extracting information that is already present in the environment.
Artificial intelligence – artistic concept. Image credit: geralt via Pixabay (free licence)
The key insight is that the current state of the environment has already been optimized toward the user's preferences. The agent can therefore infer which actions the user must have taken to produce the observed state, which requires simulating the environment backwards in time. To perform this backward simulation, the model learns an inverse policy and an inverse dynamics model using supervised learning. From these, it derives a reward representation that can be meaningfully updated from a single state observation.
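The backward-simulation step can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: the function and model names (`simulate_backwards`, `inverse_policy`, `inverse_dynamics`) are hypothetical stand-ins for the learned inverse models described above.

```python
def simulate_backwards(state, inverse_policy, inverse_dynamics, horizon):
    """Roll a trajectory backwards in time from an observed state.

    inverse_policy(state) guesses the action that led to `state`;
    inverse_dynamics(state, action) guesses the state that preceded it.
    Both are assumed to be models trained with supervised learning.
    """
    trajectory = [state]
    for _ in range(horizon):
        action = inverse_policy(state)           # which action led here?
        state = inverse_dynamics(state, action)  # which state came before?
        trajectory.append(state)
    # Reverse so the trajectory reads forward in time (past -> present).
    return list(reversed(trajectory))
```

For example, with a toy one-dimensional environment where each action moves the state by its value, `simulate_backwards(5, lambda s: 1, lambda s, a: s - a, 3)` reconstructs the states `[2, 3, 4, 5]` leading up to the observation.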
The results show that this approach can reduce the amount of human input required for learning. The model successfully imitates policies given access to just a few states sampled from those policies.
Since reward functions are hard to specify, recent work has focused on learning policies from human feedback. However, such approaches are impeded by the expense of acquiring such feedback. Recent work proposed that agents have access to a source of information that is effectively free: in any environment that humans have acted in, the state will already be optimized for human preferences, and thus an agent can extract information about what humans want from the state. Such learning is possible in principle, but requires simulating all possible past trajectories that could have led to the observed state. This is feasible in gridworlds, but how do we scale it to complex tasks? In this work, we show that by combining a learned feature encoder with learned inverse models, we can enable agents to simulate human actions backwards in time to infer what they must have done. The resulting algorithm is able to reproduce a specific skill in MuJoCo environments given a single state sampled from the optimal policy for that skill.
Research paper: Lindner, D., Shah, R., Abbeel, P., and Dragan, A., “Learning What To Do by Simulating the Past”, 2021. Link: https://arxiv.org/abs/2104.03946