Blog Post #3
Part II: Wrangling with Google Cloud’s Speech-to-Text
One of my greater accomplishments during my time in General Assembly’s Data Science Immersive was a group project developed for a real-world client (General Assembly is well connected, and someone evidently has government contacts). The client was New Light Technologies, a small DevOps outfit in Washington, D.C. that “provides comprehensive information technology solutions for clients in government, commercial, and non-profit sectors.” All of the projects they proposed to the students focused on some form of disaster relief or preparedness. Our prompt was the following:
Currently, FEMA identifies areas that require immediate attention (for search and rescue efforts) either by responding to reports and requests put directly by the public or, recently, using social media posts. This tool will utilize live police radio reports to identify hot spots representing locations of people who need immediate attention. The tool will flag neighborhoods or specific streets where the police and first-respondents were called to provide assistance related to the event.
Before we can attempt to extract locations, we need to transcribe our audio. For this we used Google Cloud Speech-to-Text on our speech samples. Google’s is a well-regarded and easy-to-use transcription API. It is also about as “black box” as they come: Speech-to-Text is a completely proprietary neural network that we know only by the inputs it accepts and the outputs it gives us.
Notably, we are able to provide a vocabulary list, which is used as what are known as “contextual embeddings”. Contextual embeddings essentially skew the weights in favor of the terms provided, over whatever word the speech recognition client would otherwise predict was spoken. Since we know the names of all the roads we expect to hear, we use our scraped list as embeddings, thereby giving the otherwise “universal” speech recognition client a context. The embeddings do not override the network’s training, however, so we will see what impact our embeddings have on the confidence score that Google Cloud Speech-to-Text also provides.
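As a sketch of how such a vocabulary list is attached: in the REST form of a Speech-to-Text `recognize` request, the phrases travel in a `speechContexts` field of the config. The helper below simply assembles that JSON body — the function name, the bucket path, and the encoding/language settings are illustrative assumptions, not our exact configuration.

```python
def build_recognize_request(gcs_uri, street_names):
    """Assemble a Speech-to-Text `recognize` request body, attaching a
    scraped street-name list as a speech context (phrase hints)."""
    return {
        "config": {
            "encoding": "LINEAR16",      # assumed; depends on the audio files
            "languageCode": "en-US",
            "speechContexts": [{"phrases": street_names}],
        },
        "audio": {"uri": gcs_uri},
    }

# Example: bias recognition toward street names we scraped earlier.
request = build_recognize_request(
    "gs://radio-samples/sample_001.wav",   # hypothetical bucket/path
    ["Pennsylvania Avenue", "Wisconsin Avenue", "Benning Road"],
)
```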
Below we define two functions: one retrieves files, and the other transcribes them, expecting speech_contexts to be provided. We then run the two functions using multiprocessing’s Pool, which speeds up the calls. Our speech samples are transcribed and returned as a pandas DataFrame. Finally, we take the average of the confidence scores Google returns in order to score the run.
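A minimal sketch of that pipeline, with a stubbed-out `transcribe_file` standing in for the real Google client call — the function names, column names, and worker count here are illustrative assumptions:

```python
import pandas as pd
from multiprocessing import Pool

def transcribe_file(audio_path):
    """Stand-in for the real API call: the actual version would send
    `audio_path` to Speech-to-Text (with our speech_contexts) and pull
    the transcript and confidence out of the response."""
    return {"file": audio_path, "transcript": "", "confidence": 0.0}

def transcribe_all(paths, workers=4):
    """Transcribe files in parallel with a process pool and collect
    the results into a pandas DataFrame."""
    with Pool(workers) as pool:
        rows = pool.map(transcribe_file, paths)
    return pd.DataFrame(rows)

def score_run(df):
    """Score a run by the mean of Google's per-result confidence values."""
    return df["confidence"].mean()
```

Because each file is an independent API call, the work is embarrassingly parallel, which is why a simple `Pool.map` gives a near-linear speedup up to the worker count.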