The Cone of Silence: Speech Separation by Localization

Humans can separate and localize sounds in noisy environments, yet following a conversation partner over background noise can still be difficult. Future earbuds could address this by selectively canceling the audio sources a user does not want to hear.

A recent paper on arXiv.org proposes a deep-neural-network technique that cancels all audio sources outside a specified angular region. By repeatedly halving that region and searching only the halves that still contain sound, the method handles an arbitrary number of potentially moving speakers in logarithmic time, and the same network serves for both sound localization and audio source separation.
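
The summary includes no code, but the core building block can be pictured as a direction-selective filter: given a multi-channel recording and an angle of interest, return only the sound arriving from around that angle. As a rough classical analog (not the authors' learned method), the sketch below uses a plain delay-and-sum beamformer; the function name, array geometry, and parameters are illustrative assumptions.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def steer_to_angle(mixture: np.ndarray, mic_xy: np.ndarray,
                   theta: float, sr: int) -> np.ndarray:
    """Delay-and-sum beamformer steered toward angle `theta` (radians).

    A crude classical stand-in for the paper's learned region isolation:
    it emphasizes sound arriving from `theta` and attenuates other
    directions, without the sharp angular cutoff the network provides.

    mixture: (n_mics, n_samples) multi-channel waveform
    mic_xy:  (n_mics, 2) microphone positions in meters
    sr:      sample rate in Hz
    """
    u = np.array([np.cos(theta), np.sin(theta)])  # unit vector toward theta
    # Far-field model: a mic at position r hears a plane wave from theta
    # (r . u) / c seconds earlier than the array origin, so delay it by
    # that amount to line all channels up before averaging.
    delays = mic_xy @ u / SPEED_OF_SOUND          # seconds (can be negative)
    shifts = np.round(delays * sr).astype(int)    # whole samples
    aligned = np.stack([np.roll(ch, s) for ch, s in zip(mixture, shifts)])
    return aligned.mean(axis=0)
```

The paper's network replaces this fixed filter with a waveform-domain model that also takes the window width as an input and suppresses everything outside the window far more aggressively.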

Background interference, such as music or ambient noise, is suppressed along with the unwanted speakers. In experiments, the technique was shown to work in real-life scenarios, such as separating people on simultaneous phone calls or two speakers walking around a table. The method could also serve as an alternative to camera-based tracking and recognition in robotics.

From the paper's abstract:

Given a multi-microphone recording of an unknown number of speakers talking concurrently, we simultaneously localize the sources and separate the individual speakers. At the core of our method is a deep network, in the waveform domain, which isolates sources within an angular region θ±w/2, given an angle of interest θ and angular window size w. By exponentially decreasing w, we can perform a binary search to localize and separate all sources in logarithmic time. Our algorithm allows for an arbitrary number of potentially moving speakers at test time, including more speakers than seen during training. Experiments demonstrate state-of-the-art performance for both source separation and source localization, particularly in high levels of background noise.
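
To make the logarithmic-time search concrete, here is a minimal sketch of the binary search over angular windows, assuming a hypothetical separate_region(mixture, theta, w) callable that returns only the audio inside θ±w/2 (a trained network in the paper; the beamformer above is a loose substitute). The energy test and window sizes are illustrative assumptions, not values from the paper.

```python
import numpy as np

def localize_and_separate(mixture, separate_region,
                          w0=2 * np.pi,         # start with the full circle
                          w_min=np.pi / 90,     # stop at a 2-degree window
                          energy_thresh=1e-3):  # "is anyone talking here?"
    """Binary search over angular regions, as described in the abstract.

    Halve the window at each level and recurse only into halves that
    still contain signal energy. Returns (angle, waveform) pairs, one
    per detected source.
    """
    results = []

    def search(theta, w):
        isolated = separate_region(mixture, theta, w)
        if np.mean(isolated ** 2) < energy_thresh:
            return  # no active source inside theta +/- w/2: prune
        if w <= w_min:
            results.append((theta, isolated))  # window small enough: report
            return
        # Recurse into the two half-windows covering the same region.
        search(theta - w / 4, w / 2)
        search(theta + w / 4, w / 2)

    search(0.0, w0)
    return results
```

Each level halves w, so pinning down one source takes O(log(w0/w_min)) region queries, and windows containing no energy are pruned immediately; this is the exponentially decreasing window schedule behind the abstract's logarithmic-time claim.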

Link: https://arxiv.org/abs/2010.06007
