Blog: 6 Questions for Doug Schumacher
The CEO of Arrovox riffs on the #VoiceFirst trend
1. Thanks for agreeing to field some, er, queries, Doug. Much appreciated! First off, how did you get into this whole voice space, and what are you up to now?
DS: I have a background in writing, especially for audio. I felt that, combined with a lot of experience in digital marketing and product development, it gels well with voice.
Currently, through my company Arrovox, I’m writing and producing the #VoiceFirst podcast Homie & Lexy and hosting the VoiceMarketing podcast. I recently launched the Audio Museum of Art voice app for the Alexa platform (or “Alexa, open Audio Museum of Art”), and I’m working on some other solutions integrating voice and marketing.
2. In the 90s, Interactive Voice Response (“Welcome to Moviefone, dial 1 for ‘Jurassic Park,’ dial 2 for ‘The Fugitive’…”) was all the rage. It remains standard if you ever deal with a power company or a bank over the phone. There’s been gradual improvement over the years with Automatic Speech Recognition, allowing consumers to fully state their intentions and be understood. Still, it’s a clunky system. How soon until we move toward an entirely tit-for-tat, word-for-word style interaction in what are now IVR-based systems?
DS: A long time, IMHO. I’m not the futurist on this, and Brian Roemmele will say we’re just about there, but I look at how hard it is just to get the right song to play sometimes, and I think we’re a ways off.
3. How is writing for voice different from traditional copywriting? In my own experience, writing for voice moves one closer to “microcopy”: extremely brief sentences that may even forgo most traditional punctuation, like periods and commas. Part of this is due to the unattractive and drone-y way synthetic speech sounds, which can subtly prompt people to “check out,” so brevity is preferred. Thoughts?
DS: I think of writing for voice as being very navigational, until the user reaches the experience they’re coming for, and then it’s more of a content experience. So I think of writing for voice as having multiple roles.
4. When I worked on Bixby, one way we’d experiment with correct text-to-speech before crafting replies to user queries was to utilize the “custom chat” feature included on higher-end Samsung phones; with custom chat, you can type a word and hear Bixby attempt to vocalize it. The names of cities and people were often a headache, so you’d experiment with different spellings until Bixby got it right, or well-enough approximated getting it right. How big a problem is the long tail of fairly unique “entity” names to the challenge of improving synthetic speech in voice agents?
DS: I’m not sure I totally understand the question, but poor voice recognition is going to hurt the accuracy of invocations. That seems like one situation where better ASR will help, but ASR is also needed throughout the app experience. Right now, it seems like I can’t get through the first 4–5 exchanges with most third-party apps.
5. I first came across the term “endpoint detection” in the O’Reilly book “Designing Voice User Interfaces” by Cathy Pearl. Endpoint detection is how a voice agent or ASR processor decides you’ve stopped talking, at which point it delivers its reply. In my experience, the amount of time given to a user to wrap up their thoughts is entirely too brief. “Where can I find a good Chinese…” and boom, Google Assistant responds, not allowing me to finish a query that included the words “restaurant” and “in Fremont.” This problem is especially severe (for some reason) when using speech-to-text in Messenger. Thoughts?
DS: That’s interesting, but isn’t something I’ve experienced.
6. What are your thoughts on voice technology in the e-commerce space? Shopping for tangible goods has been an uphill battle for voice-only experiences, in a way it hasn’t for chatbots, which can use cards and graphics to help with visual confirmation of products. The Echo Show is of course an example of a smart speaker that overcomes this problem, but when will everyone else catch up, either by adopting a screen à la Echo Show or by getting around the need for one entirely?
DS: I think voice as the nav for visual shopping is going to be huge. Shopping is mostly searching, and search is already one of the most popular use cases for voice.
More broadly, there are core shared values between the content marketing and voice app industries. Voice solutions have a personal feel too, certainly more than video. There’s a “theater of the mind” aspect at work that could be harnessed in ways that benefit marketers.