Which is better for Speech-to-Text (STT): CNNs or RNNs?

Speech-to-text can be handy for hands-free transcription! But which neural model is better at the task: CNNs or RNNs?

Figure: a deep CNN (left) vs a deep bi-directional RNN (right)

Let’s decide by comparing the transcriptions of two well-known, pre-trained models: WaveNet (a CNN) and DeepSpeech (an RNN).


WaveNet uses a CNN architecture (with dilated convolutions)
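The appeal of dilated convolutions is that the receptive field grows exponentially with depth while the parameter count grows only linearly. A rough sketch of the arithmetic (the dilation schedule below is illustrative; WaveNet stacks repeated blocks of such layers):

```python
# Receptive field of a stack of dilated causal convolutions (kernel size 2):
# each layer with dilation d extends the receptive field by d samples.
dilations = [1, 2, 4, 8, 16]  # illustrative schedule, not WaveNet's exact config
receptive_field = 1 + sum(dilations)
print(receptive_field)  # 32 samples seen by the top layer
```

Doubling the dilation at each layer is what lets a modest number of layers cover the thousands of samples needed for audio.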

WaveNet is best known for its state-of-the-art performance in speech synthesis (text-to-speech); however, it can also be trained to recognise speech and transcribe audio (speech-to-text), as described in the original paper! Fortunately, helpful people out there have already configured and trained WaveNet for speech recognition. Let’s clone one such repo:

!git clone

Then install the dependencies it needs:

!pip3 install tensorflow==1.0.0
!pip3 install sugartensor

Now let’s download the pre-trained weights for the model and unzip them.

Then copy the four files (shown above) into a new directory called “train”, and place “train” inside a new directory called “asset”. Move “asset” and its contents into the repo you just cloned; the model will look for these four files at the path “speech-to-text-wavenet/asset/train/*”.

The files in the cloned repository, including the newly created folder “asset” containing the downloaded files
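The layout step above can be sketched as shell commands (the checkpoint filenames are placeholders; use the four files you actually extracted):

```shell
# Create the directory layout the model expects
mkdir -p speech-to-text-wavenet/asset/train
# Then copy the four extracted checkpoint files into it, e.g.:
# cp checkpoint model.ckpt-* speech-to-text-wavenet/asset/train/
```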

Now you can download some audio files and test it out. Here are some funny audio clips of people saying the phrase “cottage cheese with chives is delicious”

!curl -O
!curl -O
!curl -O
!curl -O
!curl -O
‘cottage cheese with chives is delicious’

We can test the WaveNet STT by running the “” file in the repo.

WaveNet STT: generated the output “cottige cheese with chives is delicious” for the audio file “sf3_cln.wav”

Not bad, considering it doesn’t use a vocabulary of words or a language model!
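We can put a number on how close “cottige cheese with chives is delicious” is to the reference using character error rate (CER): the edit (Levenshtein) distance divided by the reference length. A minimal sketch:

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

ref = "cottage cheese with chives is delicious"
hyp = "cottige cheese with chives is delicious"
cer = levenshtein(ref, hyp) / len(ref)
print(f"{cer:.3f}")  # 0.026 — a single substituted character
```

A CER below 3% with no language model is a respectable result for raw acoustic decoding.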


Now let’s try out DeepSpeech, which has become well known for its state-of-the-art STT performance.

DeepSpeech uses an RNN architecture (bi-directional LSTMs)

Let’s download DeepSpeech along with its pre-trained weights:

!pip3 install --upgrade deepspeech
!tar xvfz deepspeech-0.4.1-models.tar.gz

Now let’s load up the model:

from deepspeech import Model

# 26 MFCC features, 9 frames of context, beam width 500
ds = Model('models/output_graph.pbmm', 26, 9, 'models/alphabet.txt', 500)
# Enable the language-model decoder (LM weight 1.5, word-insertion weight 2.1)
ds.enableDecoderWithLM('models/alphabet.txt', 'models/lm.binary', 'models/trie', 1.5, 2.1)

Let’s quickly write some helper functions to load audio files into the model in the correct format and to get their frame rates:

import wave
import numpy as np

def load_audio(audio_path):
    # Read the raw 16-bit samples into a numpy array
    fin = wave.open(audio_path, 'rb')
    audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)
    fin.close()
    return audio

def frame_rate(audio_path):
    # Return the sample rate of the file
    fin = wave.open(audio_path, 'rb')
    sample_rate = fin.getframerate()
    fin.close()
    return sample_rate
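DeepSpeech’s pre-trained model expects 16 kHz, 16-bit mono audio, so a quick sanity check before calling `ds.stt` can save a confusing transcript. A minimal sketch (it writes a tiny synthetic WAV so it runs anywhere; `check_wav` and `demo.wav` are illustrative names, not part of the DeepSpeech API):

```python
import wave

def check_wav(path, expected_rate=16000):
    """Return True if the WAV file matches what the model expects."""
    with wave.open(path, 'rb') as fin:
        return (fin.getframerate() == expected_rate
                and fin.getnchannels() == 1
                and fin.getsampwidth() == 2)  # 2 bytes = 16-bit samples

# Write 10 ms of 16 kHz mono silence to demonstrate the check
with wave.open('demo.wav', 'wb') as fout:
    fout.setnchannels(1)
    fout.setsampwidth(2)
    fout.setframerate(16000)
    fout.writeframes(b'\x00\x00' * 160)

print(check_wav('demo.wav'))  # True
```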

And we are ready to test it out:

audio_file = "sm3_cln.wav"
ds.stt(load_audio(audio_file), frame_rate(audio_file))
DeepSpeech STT: generated output for the audio file “sm3_cln.wav”

Conclusion: WaveNet or DeepSpeech?

So which is best? Here are the results of WaveNet and DeepSpeech on all five of the audio files we downloaded. I’ll leave it up to you to decide!

Source: Artificial Intelligence on Medium
