Which is better for Speech-to-Text (STT): CNNs or RNNs?
Speech-to-text can be handy for hands-free transcription! But which family of neural networks is better at the task: CNNs or RNNs?
Let’s decide by comparing the transcriptions of two well-known, pre-trained models: WaveNet (a CNN) and DeepSpeech (an RNN).
WaveNet
WaveNet is best known for its state-of-the-art performance in speech synthesis (text-to-speech). However, as described in the original paper, it can also be trained to recognize speech and transcribe audio (speech-to-text). Fortunately, helpful people out there have already configured and trained WaveNet for speech recognition. Let’s clone one such repo:
!git clone https://github.com/buriburisuri/speech-to-text-wavenet
Then install the dependencies it needs:
!pip3 install tensorflow==1.0.0
!pip3 install sugartensor
Now let’s download the pre-trained weights for the model and unzip them.
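The repo’s README links to a downloadable checkpoint archive. Here is a minimal sketch of the step; CHECKPOINT_URL is a placeholder, not a real address, so swap in the actual link from the README:
!wget -O wavenet_ckpt.zip CHECKPOINT_URL
!unzip wavenet_ckpt.zip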
Then copy the four checkpoint files into a new directory called “train”, and put “train” inside a new directory called “asset”. Move “asset” and its contents into the repo you just cloned: the model will look for the four files under “speech-to-text-wavenet/asset/train/*”. A sketch of this step is below.
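Here is a minimal sketch of that step in Python, assuming the archive unpacked into a standard TensorFlow checkpoint; the model.ckpt-xxxx names below are hypothetical, so match them to whatever files the download actually contains:
import os
import shutil

ckpt_files = ['checkpoint',                           # checkpoint index file
              'model.ckpt-xxxx.data-00000-of-00001',  # hypothetical names:
              'model.ckpt-xxxx.index',                # match them to the
              'model.ckpt-xxxx.meta']                 # unzipped files

target = 'speech-to-text-wavenet/asset/train'
os.makedirs(target, exist_ok=True)  # creates asset/ and asset/train/ in one go
for f in ckpt_files:
    shutil.copy(f, target)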
Now you can download some audio files and test it out. Here are some funny audio clips of people saying the phrase “cottage cheese with chives is delicious”:
!curl -O https://www.ee.columbia.edu/~dpwe/sounds/sents/sm3_cln.wav
!curl -O https://www.ee.columbia.edu/~dpwe/sounds/sents/sm1_cln.wav
!curl -O https://www.ee.columbia.edu/~dpwe/sounds/sents/sf3_cln.wav
!curl -O https://www.ee.columbia.edu/~dpwe/sounds/sents/sf1_cln.wav
!curl -O https://www.ee.columbia.edu/~dpwe/sounds/sents/sm2_cln.wav
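Before feeding these to either model, it’s worth a quick sanity check of the audio format, since most pre-trained STT models expect 16 kHz mono input. A small sketch using Python’s built-in wave module:
import wave

# print the sample rate and channel count of each downloaded clip
for name in ['sm1_cln.wav', 'sm2_cln.wav', 'sm3_cln.wav',
             'sf1_cln.wav', 'sf3_cln.wav']:
    with wave.open(name, 'rb') as f:
        print(name, f.getframerate(), 'Hz,', f.getnchannels(), 'channel(s)')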
We can test WaveNet’s STT by running the “recognize.py” script in the repo.
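For example, something like the following; the --file flag matches the repo’s README at the time of writing, but check recognize.py for the exact argument name. The script is run from inside the repo so it can find asset/train:
!cd speech-to-text-wavenet && python3 recognize.py --file ../sm3_cln.wav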
Not bad, considering the model uses neither a word vocabulary nor a language model: it predicts raw character sequences.
DeepSpeech
Now let’s try out DeepSpeech, which has become well known for its state-of-the-art STT performance.
Let’s install DeepSpeech and download the pre-trained model files:
!pip3 install --upgrade deepspeech
!wget https://github.com/mozilla/DeepSpeech/releases/download/v0.4.1/deepspeech-0.4.1-models.tar.gz
!tar xvfz deepspeech-0.4.1-models.tar.gz
Now let’s load up the model (the numeric arguments are the v0.4.1 defaults):
from deepspeech import Model

# graph file, n_features (26 MFCCs), n_context (9 frames), alphabet, beam width
ds = Model('models/output_graph.pbmm', 26, 9, 'models/alphabet.txt', 500)
# decoder: alphabet, LM binary, trie, LM weight, word-count weight
ds.enableDecoderWithLM('models/alphabet.txt', 'models/lm.binary', 'models/trie', 1.5, 2.1)
Let’s quickly write a couple of helper functions: one to load an audio file into the 16-bit integer format the model expects, and one to read its frame rate:
import wave
import numpy as np

def load_audio(audio_path):
    # read the raw frames as 16-bit signed integers
    fin = wave.open(audio_path, 'rb')
    audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)
    fin.close()
    return audio

def frame_rate(audio_path):
    # report the sample rate of the wav file
    fin = wave.open(audio_path, 'rb')
    sample_rate = fin.getframerate()
    fin.close()
    return sample_rate
And we are ready to test it out:
audio_file = "sm3_cln.wav"
ds.stt(load_audio(audio_file), frame_rate(audio_file))
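And to compare both models side by side, a small loop over all five clips, assuming they were downloaded to the current working directory as above:
# transcribe every clip with DeepSpeech using the helpers defined earlier
for clip in ['sm1_cln.wav', 'sm2_cln.wav', 'sm3_cln.wav',
             'sf1_cln.wav', 'sf3_cln.wav']:
    print(clip, '->', ds.stt(load_audio(clip), frame_rate(clip)))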
Conclusion: WaveNet or DeepSpeech?
So which is best? Below are the results of WaveNet and DeepSpeech on all five audio files we downloaded. I’ll leave it up to you to decide.
(Table: WaveNet vs. DeepSpeech transcriptions of the five clips.)