Contemporary computer vision models can classify, detect, and segment objects in Internet images reasonably well. A recent paper on arXiv.org investigates how an active agent embodied in a 3D environment could accomplish the same tasks.
A robot with a laser scanner, a PTZ camera, and a (closed) gripper. Image credit: Tim3672 via Wikimedia, CC BY-SA 3.0
The suggested framework consists of two phases. First, the agent learns a self-supervised exploration policy that seeks out objects with at least one highly confident viewpoint; its observations are fused into an episodic 3D semantic map, and this map is used to compute the intrinsic reward for training the policy. Then, the learned policy gathers a single episode of observations in an environment, and the semantic map labels those observations for improving the perception model.
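The map-derived intrinsic reward in the first phase can be sketched as follows. This is a toy illustration, not the authors' code: the voxel cells, the 0.9 confidence threshold, and the reward rule (one point each time a cell first crosses the threshold) are assumptions made for the example.

```python
CONF_THRESHOLD = 0.9  # assumed cutoff for a "highly confident" viewpoint


def update_map_and_reward(semantic_map, observations):
    """Fuse one step's detections into the episodic map.

    semantic_map: dict mapping a 3D cell to the best confidence seen so far
    observations: iterable of (cell, confidence) detections from this step
    Returns the intrinsic reward: the number of cells that crossed the
    confidence threshold for the first time this step.
    """
    reward = 0
    for cell, conf in observations:
        prev = semantic_map.get(cell, 0.0)
        if conf > prev:
            # Reward only the first time a cell becomes "highly confident",
            # i.e. the agent found a new object with a good viewpoint.
            if prev < CONF_THRESHOLD <= conf:
                reward += 1
            semantic_map[cell] = conf
    return reward


# Usage: two steps of a toy episode.
m = {}
r1 = update_map_and_reward(m, [((0, 1, 0), 0.95), ((0, 2, 0), 0.40)])
r2 = update_map_and_reward(m, [((0, 1, 0), 0.97), ((0, 2, 0), 0.92)])
print(r1, r2)  # → 1 1
```

Step one is rewarded for the first confident cell; step two gains nothing for re-seeing it but is rewarded when the second cell finally crosses the threshold, which is what pushes the policy toward unseen objects rather than familiar ones.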
The experiments demonstrate that perception performance can be improved through self-supervised interaction: data collected by the agent in the real world improve the perception models, and the improved perception models, in turn, improve the agent's policy for interacting with the world.
The authors summarize their approach as follows: "In this paper, we explore how we can build upon the data and models of Internet images and use them to adapt to robot vision without requiring any extra labels. We present a framework called Self-supervised Embodied Active Learning (SEAL). It utilizes perception models trained on internet images to learn an active exploration policy. The observations gathered by this exploration policy are labelled using 3D consistency and used to improve the perception model. We build and utilize 3D semantic maps to learn both action and perception in a completely self-supervised manner. The semantic map is used to compute an intrinsic motivation reward for training the exploration policy and for labelling the agent observations using spatio-temporal 3D consistency and label propagation. We demonstrate that the SEAL framework can be used to close the action-perception loop: it improves object detection and instance segmentation performance of a pretrained perception model by just moving around in training environments and the improved perception model can be used to improve Object Goal Navigation."
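The labelling-by-3D-consistency step mentioned in the abstract can be sketched roughly like this. The cell names, confidences, and the max-confidence aggregation rule below are illustrative assumptions, not the paper's exact procedure (the paper also propagates labels spatially over the map):

```python
def propagate_labels(frame_detections):
    """Pseudo-label frames by enforcing 3D consistency.

    frame_detections: list of per-frame dicts {cell: (label, confidence)},
    where a cell is any hashable 3D-location key shared across frames.
    Each cell keeps its single most confident label across all frames,
    and every frame observing that cell inherits the aggregated label.
    """
    best = {}  # cell -> (label, confidence) with the highest confidence
    for frame in frame_detections:
        for cell, (label, conf) in frame.items():
            if conf > best.get(cell, (None, -1.0))[1]:
                best[cell] = (label, conf)
    # Propagate: relabel every frame with its cells' aggregated labels.
    return [{cell: best[cell][0] for cell in frame} for frame in frame_detections]


# Usage: "cellB" is seen twice; the confident "sofa" view (0.85) overrides
# the uncertain "table" view (0.30) in every frame that observed the cell.
frames = [
    {"cellA": ("chair", 0.95), "cellB": ("table", 0.30)},
    {"cellB": ("sofa", 0.85)},
]
print(propagate_labels(frames))
```

The effect is that one confident viewpoint supplies training labels for many low-confidence viewpoints of the same object, which is how the agent improves its detector without any extra human annotation.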
Research paper: Singh Chaplot, D., Dalal, M., Gupta, S., Malik, J., and Salakhutdinov, R., “SEAL: Self-supervised Embodied Active Learning using Exploration and 3D Consistency”, 2021. Link: https://arxiv.org/abs/2112.01001