Blog: Hearing the Character in Things: Alibaba Improves Mandarin Speech Recognition

This article is part of the Academic Alibaba series and is taken from the ICASSP 2019 paper entitled “Investigation of Modeling Units for Mandarin Speech Recognition Using DFSMN-CTC-sMBR” by Shiliang Zhang, Ming Lei, Yuan Liu, and Wei Li. The full paper can be read here.

If you’ve ever tried to learn Mandarin or even wondered how anyone manages to, it might help to know that cutting-edge speech recognition technologies also find it extremely difficult. Like people, large vocabulary continuous speech recognition (LVCSR) systems struggle with Mandarin’s distinct character-syllable relationships and immense lexicon, resulting in frequent confusion over homophones and out-of-vocabulary (OOV) problems. As the ultimate goal of these tools is to convert oral speech to a written record, such issues can cripple their effectiveness.

To tackle these challenges, researchers at Alibaba have developed a DFSMN-CTC-sMBR neural network model to better transcribe human-to-human speech in Mandarin. Using hybrid Character-Syllable units that combine the most common Chinese characters with syllables, the model substantially reduces substitution errors and outperforms conventional hybrid models on a 20,000-hour Mandarin speech recognition task.

Reinterpreting Speech Recognition

As its name suggests, Alibaba’s DFSMN-CTC-sMBR model reflects innovation on several fronts of speech recognition.

Technical overview of the DFSMN-CTC-sMBR model

The model’s core training criterion, connectionist temporal classification (CTC), is a relatively recent alternative to conventional hybrid models, whose training criteria restrict the choice of acoustic modeling units. Whereas previous work with CTC has mainly employed long short-term memory (LSTM) neural networks, this model adopts a deep feedforward sequential memory network (DFSMN), which improves performance with both context-independent (CI) and context-dependent (CD) phones as target labels. Drawing on an optimization technique developed for LSTM-CTC, it further incorporates a state-level minimum Bayes risk (sMBR) criterion to support sequence-level discriminative training.
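To make the CTC idea concrete, here is a minimal sketch of the label-collapsing rule that CTC output relies on: consecutive repeated labels are merged, then blank labels are dropped. This is a generic illustration of greedy CTC decoding, not the paper's decoder; the function name `ctc_collapse`, the `<blank>` token name, and the example frame labels are assumptions made for this example.

```python
from itertools import groupby
from typing import List, Sequence

BLANK = "<blank>"  # hypothetical name for the CTC blank label


def ctc_collapse(frame_labels: Sequence[str], blank: str = BLANK) -> List[str]:
    """Collapse a per-frame label sequence into an output label sequence.

    Step 1: merge runs of identical consecutive labels.
    Step 2: remove blank labels.
    """
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]


# Illustrative per-frame labels using initial/final style units (made up for this sketch).
frames = ["<blank>", "n", "n", "<blank>", "i3", "i3", "i3", "<blank>", "h", "ao3", "ao3"]
print(ctc_collapse(frames))  # ['n', 'i3', 'h', 'ao3']
```

Because blanks are removed only after repeats are merged, a blank between two identical labels keeps them distinct: `["a", "a", "<blank>", "a"]` collapses to `["a", "a"]`.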

Taken as a whole, the key achievement of this design is its support for a greater range of acoustic modeling units well suited to Mandarin speech. A major consideration in Mandarin is the relationship between the initial and final parts of each syllable: there are 23 possible initials and 35 finals, and the finals vary by tone across Mandarin’s five tones, giving the 185 tonal finals used here. To account for this variance, the researchers incorporated a context-independent initial/final (CI-IF) unit that covers all 23 initials and 185 tonal finals, as well as a context-dependent (CD-IF) unit for an additional 7,951 pair relationships determined by a data-driven decision tree.
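The unit counts above translate directly into the size of the model's output layer. The sketch below simply adds up the figures quoted in this article; the extra blank output per inventory is an assumption based on how CTC is normally set up, and the helper names are hypothetical.

```python
# Counts as stated in the article.
NUM_INITIALS = 23
NUM_TONAL_FINALS = 185


def ci_if_inventory_size(initials: int = NUM_INITIALS,
                         tonal_finals: int = NUM_TONAL_FINALS) -> int:
    """Number of context-independent initial/final (CI-IF) units."""
    return initials + tonal_finals


def ctc_output_size(num_units: int) -> int:
    """Output-layer size, assuming one additional CTC blank label."""
    return num_units + 1


# CD-IF and Syllable counts are the figures quoted in the article.
unit_counts = {"CI-IF": ci_if_inventory_size(), "CD-IF": 7951, "Syllable": 1319}
for name, count in unit_counts.items():
    print(f"{name}: {count} units, {ctc_output_size(count)} CTC outputs")
```

Under these assumptions the CI-IF inventory comes to 208 units, far smaller than the CD-IF or Syllable inventories.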

Additionally, the model features a Syllable unit for individually modeling Mandarin’s 1,319 tonal syllables and two hybrid Character-Syllable units that target a set of 2,000 and a set of 3,000 commonly used Chinese characters, respectively. Because those 1,319 tonal syllables map onto tens of thousands of characters, modeling common characters directly greatly improves the model’s ability to distinguish among homophones, while the syllable units help eliminate OOV problems in which a syllable cannot be properly matched to the correct character.
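One way to picture a hybrid Character-Syllable label set is as a mapping with a fallback: characters in the common set are used as labels directly, and every other character is represented by its tonal syllable. The sketch below assumes that reading of the scheme; the function `to_hybrid_units`, the tiny character set, and the pinyin lexicon are all hypothetical stand-ins (a real system would use the 2,000 or 3,000 most frequent characters and a full pronunciation lexicon, and would have to handle characters with multiple readings).

```python
from typing import Dict, List, Set


def to_hybrid_units(text: str,
                    common_chars: Set[str],
                    char_to_syllable: Dict[str, str]) -> List[str]:
    """Map each character to itself if it is common, else to its tonal syllable."""
    units = []
    for ch in text:
        if ch in common_chars:
            units.append(ch)                    # frequent character: its own unit
        else:
            units.append(char_to_syllable[ch])  # rare character: syllable fallback
    return units


# Toy data for illustration only; tone 5 marks the neutral tone.
common_chars = {"我", "们"}
char_to_syllable = {"我": "wo3", "们": "men5", "饕": "tao1", "餮": "tie4"}

print(to_hybrid_units("我们饕餮", common_chars, char_to_syllable))
# ['我', '们', 'tao1', 'tie4']
```

With this kind of fallback, any character that has a pronunciation can be labeled, which is why a scheme like this avoids out-of-vocabulary targets while still modeling frequent characters directly.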

Overview of the DFSMN-CTC-sMBR model’s acoustic modeling units

Honing Results, Unit by Unit

To test the proposed model, Alibaba’s researchers used roughly 20,000 hours of Mandarin audio data from news, sports, tourism, gaming, literary, and educational settings, using character error rate (CER) percentage as the key performance metric. In trials, it faced off against various alternative configurations built with some but not all of the acoustic modeling units discussed in the previous section.
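For readers unfamiliar with the metric, CER is conventionally computed as the character-level edit distance (substitutions, insertions, and deletions) between the recognizer's output and the reference transcript, divided by the reference length. The following is a minimal sketch of that standard definition, not the evaluation code used in the paper; the function names and the example sentences are assumptions.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two sequences, using a rolling row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate as a percentage of the reference length."""
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# A homophone substitution: 气 and 汽 are both pronounced qi4.
reference = "今天天气很好"
hypothesis = "今天天汽很好"
print(f"{cer(reference, hypothesis):.2f}%")  # 16.67%
```

The example shows why homophones matter for this metric: a single wrong character with the correct pronunciation still counts as a full substitution error.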

Performance results for different modeling unit configurations with and without sMBR training; lower scores indicate better performance.

As shown above, the results indicate that adding sMBR training yields a relative improvement of more than 10% over base CTC training. More importantly, the DFSMN-CTC-sMBR model that incorporated all acoustic modeling units (CI-IF, CD-IF, Syllable, Char(2k)+Syllable, and Char(3k)+Syllable) achieved the lowest error rate, validating the efficacy of these units in tackling Mandarin-specific challenges.

Alibaba Tech

Source: Artificial Intelligence on Medium
