Learning to understand grounded language—the language that occurs in the context of, and refers to, the broader world—is a popular area of research in robotics. The majority of current work in this area still operates on textual data, and that limits the ability to deploy agents in realistic environments.
Digital analysis of end-user speech (raw speech) is a vital part of robotics. Image credit: Kaufdex via Pixabay, free license
A recent article published on arXiv.org proposes to acquire grounded language directly from end-user speech using a relatively small number of data points instead of relying on intermediate textual representations.
The paper provides a detailed analysis of natural language grounding from raw speech to robotic sensor data of everyday objects, using state-of-the-art speech representation models. An analysis of the audio and speech qualities of individual participants shows that learning directly from raw speech improves performance for users with accented speech, compared with relying on automatic transcriptions.
Learning to understand grounded language, which connects natural language to percepts, is a critical research area. Prior work in grounded language acquisition has focused primarily on textual inputs. In this work we demonstrate the feasibility of performing grounded language acquisition on paired visual percepts and raw speech inputs. This will allow interactions in which language about novel tasks and environments is learned from end users, reducing dependence on textual inputs and potentially mitigating the effects of demographic bias found in widely available speech recognition systems. We leverage recent work in self-supervised speech representation models and show that learned representations of speech can make language grounding systems more inclusive towards specific groups while maintaining or even increasing general performance.
Research paper: Youssouf Kebe, G., Richards, L. E., Raff, E., Ferraro, F., and Matuszek, C., “Bridging the Gap: Using Deep Acoustic Representations to Learn Grounded Language from Percepts and Raw Speech”, 2021. Link: https://arxiv.org/abs/2112.13758