Vision-based reinforcement learning could enable robotic manipulation in unstructured environments through flexible policies that operate directly on visual feedback. However, current methods often rely on precise state information that is difficult to obtain in the real world.
Anthropomorphic robot for liquid handling. Image credit: Pzucchel via Wikimedia, CC-BY-SA-3.0
A recent paper published on arXiv.org proposes a new setting for vision-based robotic manipulation.
The agent receives visual feedback from both a third-person view and an egocentric view. While the third-person camera is static, the egocentric camera is mounted on the robot's wrist and can be actively positioned to gather additional information about regions of interest. The feedback from the two views is fused through a network architecture with soft attention mechanisms.
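The fusion step can be illustrated with a minimal NumPy sketch of cross-view attention: spatial tokens from one camera view form the queries, while tokens from the other view supply the keys and values, so each view can attend to the other. The function names, dimensions, and random projection weights below are illustrative assumptions, not the paper's actual learned implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_view_attention(q_feats, kv_feats, d_k=32, seed=0):
    """One cross-view attention pass: spatial tokens from one camera view
    attend to tokens from the other view. Projection weights are random
    here for illustration; in practice they are learned end-to-end."""
    rng = np.random.default_rng(seed)
    d = q_feats.shape[-1]
    Wq = rng.standard_normal((d, d_k)) / np.sqrt(d)
    Wk = rng.standard_normal((d, d_k)) / np.sqrt(d)
    Wv = rng.standard_normal((d, d_k)) / np.sqrt(d)
    Q, K, V = q_feats @ Wq, kv_feats @ Wk, kv_feats @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_k))  # (n_q, n_kv) spatial attention map
    return attn @ V                          # fused features for the query view

# Toy example: 16 spatial tokens per view, 64-dim features per token.
ego = np.random.default_rng(1).standard_normal((16, 64))    # egocentric view
third = np.random.default_rng(2).standard_normal((16, 64))  # third-person view
fused_ego = cross_view_attention(ego, third)    # ego attends to third-person
fused_third = cross_view_attention(third, ego)  # and vice versa
```

The two fused feature maps (one per direction of attention) would then be combined and passed to the RL policy as its observation.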
The experimental results show that multi-view methods are less prone to error, and the proposed mechanism improves precision in tasks that require fine-grained motor control.
Learning to solve precision-based manipulation tasks from visual feedback using Reinforcement Learning (RL) could drastically reduce the engineering efforts required by traditional robot systems. However, performing fine-grained motor control from visual inputs alone is challenging, especially with a static third-person camera as often used in previous work. We propose a setting for robotic manipulation in which the agent receives visual feedback from both a third-person camera and an egocentric camera mounted on the robot’s wrist. While the third-person camera is static, the egocentric camera enables the robot to actively control its vision to aid in precise manipulation. To fuse visual information from both cameras effectively, we additionally propose to use Transformers with a cross-view attention mechanism that models spatial attention from one view to another (and vice-versa), and use the learned features as input to an RL policy. Our method improves learning over strong single-view and multi-view baselines, and successfully transfers to a set of challenging manipulation tasks on a real robot with uncalibrated cameras, no access to state information, and a high degree of task variability. In a hammer manipulation task, our method succeeds in 75% of trials versus 38% and 13% for multi-view and single-view baselines, respectively.
Project website: https://jangirrishabh.github.io/lookcloser/.
Research paper: Jangir, R., Hansen, N., Ghosal, S., Jain, M., and Wang, X., “Look Closer: Bridging Egocentric and Third-Person Views with Transformers for Robotic Manipulation”, 2022. Link to the article: https://arxiv.org/abs/2201.07779.