Blog: Data Augmentation for Speech Recognition
Automatic Speech Recognition (ASR)
The objective of speech recognition is to convert audio to text. This technology is widely applied in our lives. Google Assistant and Amazon Alexa are examples that take our voice as input and convert it to text to understand our intention.
As with many NLP problems, one critical challenge is the lack of an adequate volume of training data. It leads to overfitting and makes unseen data hard to handle. The Google Brain team, together with an AI Resident, tackled this problem by introducing several data augmentation methods for speech recognition. This story discusses SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition (Park et al., 2019), covering the data, the augmentation policies, the model setup and the experimental results.
To process the data, the waveform audio is converted to a spectrogram, which is fed to a neural network to generate the output. Data augmentation is traditionally applied to the waveform. Park et al. take a different approach and manipulate the spectrogram instead.
Given a spectrogram, you can view it as an image where the x axis is time and the y axis is frequency.
Intuitively, this speeds up training because the augmentation acts directly on the spectrogram, so augmented waveforms do not have to be transformed into spectrograms again.
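To make the "spectrogram as an image" view concrete, here is a minimal, dependency-free sketch that turns a waveform into a frequency-by-time grid. It is a simplification: a naive short-time DFT without windowing, mel filter banks or a log scale. The function name and the `frame_size` and `hop` parameters are my own assumptions, not something taken from the paper.

```python
import cmath
import math


def magnitude_spectrogram(samples, frame_size=64, hop=32):
    """Return spec[freq_bin][time_step] from a naive short-time DFT."""
    n_bins = frame_size // 2 + 1
    starts = range(0, len(samples) - frame_size + 1, hop)
    spec = [[0.0] * len(starts) for _ in range(n_bins)]
    for t, start in enumerate(starts):
        frame = samples[start:start + frame_size]
        for k in range(n_bins):
            # DFT coefficient of frequency bin k for this frame.
            coeff = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(frame)
            )
            spec[k][t] = abs(coeff)
    return spec


# A pure tone that completes 8 cycles per 64-sample frame.
wave = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
spec = magnitude_spectrogram(wave)
print(len(spec), "frequency bins x", len(spec[0]), "time steps")
```

Each row of `spec` is a frequency channel and each column is a time step, which is the layout the augmentations in this story operate on.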
Park et al. introduced SpecAugment for data augmentation in speech recognition. There are 3 basic ways to augment data: time warping, frequency masking and time masking. In their experiments, they combine these into 4 different policies: LibriSpeech basic (LB), LibriSpeech double (LD), Switchboard mild (SM) and Switchboard strong (SS).
In time warping, a random point on the horizontal line through the center of the spectrogram is selected and warped either to the left or to the right along that line by a distance w, which is chosen from a uniform distribution from 0 to the time warp parameter W.
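The idea of time warping can be sketched with a much simpler transform than a full image warp. The version below is an assumption on my part, not the paper's routine: it picks an anchor time step, shifts it by up to W steps, and stretches or squeezes the two sides with linear interpolation so the first and last time steps stay fixed. The name `time_warp` and the `spec[freq][time]` list layout are hypothetical.

```python
import random


def time_warp(spec, W, rng=random):
    """Piecewise-linear warp of spec[freq][time] along the time axis."""
    tau = len(spec[0])
    if W <= 0 or tau < 2 * W + 3:
        return [row[:] for row in spec]  # too short to warp safely
    center = rng.randint(W + 1, tau - W - 2)          # anchor time step
    shift = rng.choice((-1, 1)) * rng.randint(0, W)   # left or right by w
    dst = center + shift                              # where the anchor lands
    last = tau - 1
    warped = []
    for row in spec:
        new_row = []
        for j in range(tau):
            # Map output step j back to a (fractional) source position s.
            if j <= dst:
                s = j * center / dst
            else:
                s = center + (j - dst) * (last - center) / (last - dst)
            lo = min(int(s), last - 1)
            frac = s - lo
            new_row.append((1 - frac) * row[lo] + frac * row[lo + 1])
        warped.append(new_row)
    return warped


random.seed(0)
ramp = [[float(t) for t in range(20)] for _ in range(3)]
warped = time_warp(ramp, W=5)
```

The output has the same shape as the input. Only the position of content along the time axis changes.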
In frequency masking, f consecutive frequency channels [f0, f0 + f) are masked. f is chosen from a uniform distribution from 0 to the frequency mask parameter F, and f0 is chosen from [0, ν − f), where ν is the number of frequency channels.
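A minimal sketch of frequency masking, assuming the spectrogram is stored as `spec[freq][time]` in plain Python lists. The function name and the choice of `0.0` as the mask value are assumptions for illustration.

```python
import random


def frequency_mask(spec, F, rng=random, mask_value=0.0):
    """Return a copy of spec[freq][time] with one frequency band masked."""
    nu = len(spec)                          # number of frequency channels
    f = rng.randint(0, min(F, nu))          # band width, uniform in 0..F
    # f0 is drawn from [0, nu - f); fall back to 0 if the band spans everything.
    f0 = rng.randrange(nu - f) if f < nu else 0
    masked = [row[:] for row in spec]
    for channel in range(f0, f0 + f):
        masked[channel] = [mask_value] * len(masked[channel])
    return masked


random.seed(0)
example = frequency_mask([[1.0] * 6 for _ in range(10)], F=4)
```

The input is left untouched and a masked copy is returned, so the same spectrogram can be augmented differently in every epoch.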
In time masking, t consecutive time steps [t0, t0 + t) are masked. t is chosen from a uniform distribution from 0 to the time mask parameter T, and t0 is chosen from [0, τ − t), where τ is the number of time steps.
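Time masking can be sketched the same way, this time blanking columns instead of rows. As before, the `spec[freq][time]` list layout, the function name and the `0.0` mask value are assumptions for illustration.

```python
import random


def time_mask(spec, T, rng=random, mask_value=0.0):
    """Return a copy of spec[freq][time] with one block of time steps masked."""
    tau = len(spec[0])                      # number of time steps
    t = rng.randint(0, min(T, tau))         # block length, uniform in 0..T
    # t0 is drawn from [0, tau - t); fall back to 0 if the block spans everything.
    t0 = rng.randrange(tau - t) if t < tau else 0
    return [
        [mask_value if t0 <= step < t0 + t else value
         for step, value in enumerate(row)]
        for row in spec
    ]


random.seed(0)
example = time_mask([[1.0] * 20 for _ in range(4)], T=5)
```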
Combination of Basic Augmentation Policies
By combining time warping, frequency masking and time masking with different parameters, 4 augmentation policies are defined (LB, LD, SM and SS). The symbols denote:
- W: Time warping parameter
- F: Frequency masking parameter
- mF: Number of frequency masks applied
- T: Time masking parameter
- mT: Number of time masks applied
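To show how mF and mT interact with F and T, here is a small sketch that applies several masks in a row. Time warping is left out to keep it short. The `MaskPolicy` class, the helper names and the parameter values are hypothetical and are not one of the paper's four named policies.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class MaskPolicy:
    F: int    # frequency mask parameter
    m_F: int  # number of frequency masks
    T: int    # time mask parameter
    m_T: int  # number of time masks


def _random_span(limit, size, rng):
    """Pick [start, end) with width uniform in 0..limit, inside 0..size."""
    width = rng.randint(0, min(limit, size))
    start = rng.randrange(size - width) if width < size else 0
    return start, start + width


def apply_policy(spec, policy, rng=random, mask_value=0.0):
    """Apply m_F frequency masks and m_T time masks to spec[freq][time]."""
    nu, tau = len(spec), len(spec[0])
    out = [row[:] for row in spec]
    for _ in range(policy.m_F):
        lo, hi = _random_span(policy.F, nu, rng)
        for channel in range(lo, hi):
            out[channel] = [mask_value] * tau
    for _ in range(policy.m_T):
        lo, hi = _random_span(policy.T, tau, rng)
        for row in out:
            row[lo:hi] = [mask_value] * (hi - lo)
    return out


# Illustrative values only.
toy_policy = MaskPolicy(F=3, m_F=2, T=4, m_T=2)
random.seed(0)
augmented = apply_policy([[1.0] * 20 for _ in range(10)], toy_policy)
```

Because each mask is drawn independently, masks may overlap, so the total masked area is at most mF × F channels and mT × T time steps.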
Listen, Attend and Spell (LAS) Network Architecture
Park et al. use the LAS network architecture to compare performance with and without data augmentation. It includes 2 Convolutional Neural Network (CNN) layers, attention and stacked bi-directional LSTMs. Since the paper focuses on data augmentation and the model only serves to measure its impact, see Chan et al. (2015) for a deep dive into LAS.
Learning Rate Schedules
The learning rate schedule turns out to be a critical factor in model performance. Similar to slanted triangular learning rates (STLR), a non-static learning rate is applied: it ramps up from zero, is held at its maximum, and then decays exponentially until it reaches 1/100 of its maximum value, after which it stays constant. The parameters are:
- sr: Step at which the ramp-up (from zero learning rate) is complete
- si: Step at which the exponential decay starts
- sf: Step at which the exponential decay stops
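The schedule described by sr, si and sf can be sketched as a plain function of the training step. Two things here are my assumptions: the ramp-up is taken to be linear, and `peak` is a placeholder maximum learning rate. The step values in the usage lines follow the Basic schedule listed later in this story.

```python
def learning_rate(step, peak, s_r, s_i, s_f, floor_ratio=0.01):
    """Ramp up, hold, then decay exponentially to peak * floor_ratio."""
    if step < s_r:
        return peak * step / s_r               # ramp-up from zero (assumed linear)
    if step < s_i:
        return peak                            # hold at the maximum
    if step < s_f:
        progress = (step - s_i) / (s_f - s_i)  # 0 at s_i, 1 at s_f
        return peak * floor_ratio ** progress  # exponential decay
    return peak * floor_ratio                  # constant after s_f


# B(asic): s_r = 0.5k, s_i = 20k, s_f = 80k
for step in (250, 10_000, 50_000, 100_000):
    print(step, learning_rate(step, peak=1e-3, s_r=500, s_i=20_000, s_f=80_000))
```

Halfway through the decay window the rate is at one tenth of the peak, since the decay is exponential rather than linear.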
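Two further training settings accompany the schedule. First, uniform label smoothing is applied: the correct class label is assigned a confidence of 0.9, while the confidence of the other labels is increased accordingly. Second, variational weight noise is added during training, with one parameter:
- snoise: Step at which variational weight noise is turned on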
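Uniform label smoothing as described above can be sketched in a few lines. I assume here that the remaining 0.1 of probability mass is split evenly across the incorrect classes. The function name is hypothetical.

```python
def smooth_labels(correct_index, num_classes, confidence=0.9):
    """Return a target distribution with `confidence` on the correct class."""
    if num_classes < 2:
        raise ValueError("label smoothing needs at least two classes")
    # Spread the remaining probability mass evenly over the other classes.
    other = (1.0 - confidence) / (num_classes - 1)
    return [confidence if i == correct_index else other
            for i in range(num_classes)]


target = smooth_labels(correct_index=2, num_classes=5)
print(target)
```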
For the later experiments, three standard learning rate schedules are defined:
- B(asic): (sr, snoise, si, sf ) = (0.5k, 10k, 20k, 80k)
- D(ouble): (sr, snoise, si, sf ) = (1k, 20k, 40k, 160k)
- L(ong): (sr, snoise, si, sf ) = (1k, 20k, 140k, 320k)
Language Models (LM)
An LM is applied to further boost model performance. In general, an LM is designed to predict the next token given a sequence of previous tokens. Once a new token is predicted, it is treated as a "previous token" when predicting the following tokens. This autoregressive approach is used in modern NLP models such as GPT-2, while BERT is trained with a related masked-token objective instead.
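The "predicted token becomes a previous token" loop can be illustrated with a toy model. This is only a sketch: the lookup table below conditions on the last token alone and its probabilities are made up, whereas a real LM conditions on the whole prefix.

```python
def generate(next_token_probs, start, max_tokens=10, end="</s>"):
    """Greedily extend a sequence one token at a time."""
    tokens = [start]
    while len(tokens) < max_tokens:
        candidates = next_token_probs.get(tokens[-1])
        if not candidates:
            break
        best = max(candidates, key=candidates.get)  # most likely next token
        if best == end:
            break
        tokens.append(best)  # the prediction becomes part of the context
    return tokens


# Made-up probabilities for illustration.
toy_lm = {
    "turn": {"on": 0.7, "off": 0.3},
    "on": {"the": 0.9, "</s>": 0.1},
    "the": {"light": 0.6, "radio": 0.4},
    "light": {"</s>": 1.0},
}
print(generate(toy_lm, "turn"))  # ['turn', 'on', 'the', 'light']
```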
Model performance is measured by Word Error Rate (WER).
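WER is commonly computed as the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below follows that common definition. The function name and the whitespace tokenization are my assumptions, and the paper's exact scoring setup may differ.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("turn on the light", "turn the lights"))  # 0.5
```

In the example, one word is deleted ("on") and one is substituted ("light" becomes "lights"), giving 2 errors over 4 reference words.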
In the figure below, “Sch” denotes the learning rate schedule and “Pol” denotes the augmentation policy. We can see that LAS with 6 LSTM layers and a cell size of 1280 achieves the best result.
LAS-6-1280 with SpecAugment achieves the best result compared with the other models and with LAS without data augmentation.
On Switchboard 300h, LAS-4-1024 is used as the benchmark. We can see that SpecAugment further boosts model performance there as well.
- Time warping did not improve model performance much. If resources are limited, this augmentation can be dropped.
- Label smoothing introduces instability to training.
- Data augmentation converts an over-fitting problem into an under-fitting problem. In the figures below, you can see that the model without augmentation (None) performs nearly perfectly on the training set, while it does not achieve similar results on the other datasets.
- To facilitate data augmentation for speech recognition, nlpaug now supports SpecAugment methods.
I am a Data Scientist in the Bay Area, focusing on the state of the art in Data Science and Artificial Intelligence, especially NLP and platform-related topics. Feel free to connect with me on LinkedIn or follow me on Medium or GitHub.
- Data Augmentation library for text
- Official release of SpecAugment from Google
- Slanted triangular learning rates (STLR)
- Bidirectional Encoder Representations from Transformers
- Generative Pre-Training 2
- D. S. Park, W. Chan, Y. Zhang, C. C. Chiu, B. Zoph, E. D. Cubuk and Q. V. Le. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. 2019
- W. Chan, N. Jaitly, Q. V. Le and O. Vinyals. Listen, Attend and Spell. 2015