Blog: Denoising the VOiCES Dataset
Regular readers of Gab 41 will know that Lab 41 has been studying monaural audio denoising (the removal of noise from audio signals collected with a single microphone) for some time. Our previous blog post on Source Contrastive Estimation describes our research approach in detail, and earlier this year we were invited to present our work at Interspeech in Hyderabad, India.
A key focus of our work to date has been the removal of dynamic noise — sources of noise which do not repeat, or do so only sporadically. This category of noise poses a problem for traditional denoising methods but can be removed with modern machine learning denoising systems. As described in the above post, we chose to simulate dynamic noise sources by combining short clips of street noise with clean recordings of people reading passages of text. This is known as ‘additive noise’ and is a standard method for simulating noisy audio data.
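To make the idea of additive noise concrete, here is a minimal sketch of mixing a noise clip into clean speech at a chosen signal-to-noise ratio. It is an illustration rather than our actual data pipeline: the function name `mix_at_snr`, the tiling of short noise clips, and the stand-in sine wave and Gaussian noise are all assumptions made for the example.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Overlay noise on speech, scaling the noise to hit a target SNR in dB.

    Hypothetical helper for illustration only.
    Returns the noisy mixture and the scaled noise that was added.
    """
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)

    # Repeat the (short) noise clip and crop it to the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)

    # Choose gain so that speech_power / (gain**2 * noise_power)
    # equals 10 ** (snr_db / 10).
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    scaled_noise = gain * noise
    return speech + scaled_noise, scaled_noise

# Stand-in signals: a 220 Hz tone for "speech", white noise for "street noise".
sample_rate = 16000
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 220 * np.arange(sample_rate) / sample_rate)
noise = rng.normal(size=4000)

noisy, scaled_noise = mix_at_snr(speech, noise, snr_db=-5.0)
achieved_db = 10.0 * np.log10(np.mean(speech ** 2) / np.mean(scaled_noise ** 2))
print(f"achieved SNR: {achieved_db:.2f} dB")
```

The mixture is literally the sum of two waveforms, which is what makes this approach easy to generate at scale and also what limits its realism.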
Simulated data produced in this way was great for training denoisers to remove non-stationary noise, but it couldn’t capture the intricacies of real-world room acoustics. Effects like reverberation can be hard to model, and synthesizing realistic far-field microphone data is notoriously challenging. This need for training data containing natural noise motivated us to create the Voices Obscured in Complex Environmental Settings (VOiCES) dataset in collaboration with SRI. You can read more about the VOiCES dataset here. The VOiCES corpus was collected by recording replayed speech and noise in real rooms with varying acoustic profiles. These far-field recordings, collected using several microphones placed throughout the room, capture natural reverberation, curated distractor noise (television, music or babble), non-curated dynamic noise (pipes, people walking outside the room, etc.), and static noise (mostly light fixtures and HVAC).
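For contrast with real rooms, the sketch below shows what a very simple synthetic reverberation looks like: the dry signal is convolved with a made-up room impulse response consisting of exponentially decaying noise. This is a toy model under our own assumptions (the helper names, the 0.3 second decay time, and the noise-based impulse response are all hypothetical) and is not how VOiCES was recorded. It is meant to show how much a real room does that such a model leaves out.

```python
import numpy as np

def synthetic_rir(sample_rate, rt60, rng):
    """Toy room impulse response: noise with a 60 dB decay over rt60 seconds.

    Hypothetical helper; real rooms have early reflections, frequency-dependent
    absorption, and microphone placement effects that this ignores.
    """
    n = int(round(sample_rate * rt60))
    t = np.arange(n) / sample_rate
    # Amplitude falls by a factor of 1000 (60 dB) after rt60 seconds.
    envelope = 10.0 ** (-3.0 * t / rt60)
    rir = rng.normal(size=n) * envelope * 0.1
    rir[0] = 1.0  # direct path from source to microphone
    return rir

def add_reverb(dry, rir):
    """Convolve the dry signal with the impulse response, keeping the input length."""
    return np.convolve(dry, rir)[: len(dry)]

sample_rate = 16000
rng = np.random.default_rng(0)
rir = synthetic_rir(sample_rate, rt60=0.3, rng=rng)

# Stand-in for clean speech: a one-second 220 Hz tone.
dry = np.sin(2 * np.pi * 220 * np.arange(sample_rate) / sample_rate)
wet = add_reverb(dry, rir)
```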
Equipped with our strong denoising models and our brand new realistically noisy dataset, the next step seemed obvious: let’s see how good our models are at removing this new kind of noise. As it turns out, they were significantly less effective when faced with complex natural noise rather than the simple overlay of two waveforms seen in additive noise.
Denoising models trained on additive noise performed poorly on far-field audio.
There are likely several reasons for this drop in performance. First, VOiCES makes use of a different set of noise sources than the additively generated data. Our models were originally trained using noise from the Urban8K dataset, a collection of recordings made on the streets of New York City. There’s some overlap with the ‘distractor’ noises used in VOiCES (for example, both have clips of traffic and car horns), but there are many kinds of sounds that appear in only one of the two datasets (e.g. jackhammers in Urban8K and the sound of television in VOiCES). Notably, VOiCES uses human speech as one category of distractor whereas our original additive noise dataset did not. Unsurprisingly, this is a particularly challenging class of noise to remove.
Additionally, the signal-to-noise ratio (SNR) ranges vary between the two datasets. In our applications, SNR describes the relative volumes of the speaker and the noise, with a higher SNR representing cleaner audio. In our original dataset SNRs ranged from -5 to +5 dB, while in VOiCES the average SNR for distant audio is 20 dB. That means VOiCES is, on average, less noisy, and yet our models performed worse. What’s up with that?
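Because decibels are logarithmic, the gap between these two ranges is bigger than the raw numbers suggest. The sketch below assumes SNR is computed as the ratio of mean speech power to mean noise power; the function names are ours, chosen for illustration.

```python
import numpy as np

def snr_db(signal, noise):
    """SNR in dB from the mean power of a signal and of the noise (illustrative)."""
    signal_power = np.mean(np.square(signal))
    noise_power = np.mean(np.square(noise))
    return 10.0 * np.log10(signal_power / noise_power)

def db_to_power_ratio(db):
    """Convert an SNR in dB back to a plain speech-to-noise power ratio."""
    return 10.0 ** (db / 10.0)

# SNR values mentioned in this post.
for db in (-5.0, 5.0, 20.0):
    print(f"{db:+.0f} dB -> speech power is {db_to_power_ratio(db):.2f}x noise power")
```

At 20 dB the speech carries 100 times the power of the noise, whereas across the -5 to +5 dB range it carries only about 0.32 to 3.2 times the noise power. A model that has only ever seen the latter has never been shown audio that clean.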
This finding actually makes sense: the key is that our models are trained on data with a particular range of SNRs and on a particular kind of non-stationary noise. At the risk of anthropomorphizing, they expect the data they receive to fall within that range, and if it doesn’t, they may over-correct and end up distorting the input rather than denoising it.
Lastly, and most fundamentally, the VOiCES dataset contains a variety of intricate noise types, such as natural reverberation, which simply aren’t present in our original near-field additive noise data. After all, capturing examples of these noise types was the motivation for creating VOiCES in the first place. Whatever the reasons, our existing models weren’t very good at denoising VOiCES. While this negative finding is noteworthy in its own right, we plan to continue to explore other approaches to denoise natural far-field recordings.