AI Speech Synthesis Evolution – Baidu and Realistic Human Text to Speech and More!

Baidu is truly challenging Google! Baidu recently announced plans to open source its self-driving car software to accelerate that technology’s development. A few months earlier, the tech titan released a text-to-speech system that it says synthesizes audio far faster than Google’s WaveNet. The project is called Deep Voice, and unlike WaveNet, Deep Voice can be trained to synthesize speech in just a few hours with minimal human interaction. Deep Voice is also intended to synthesize natural human speech that conveys emotion.

The biggest obstacle to building such a system thus far has been the speed of audio synthesis – previous approaches have taken minutes or hours to generate only a few seconds of speech. We solve this challenge and show that we can do audio synthesis in real-time, which amounts to an up to 400X speedup over previous WaveNet inference implementations. – Baidu
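To put the "real-time" claim in perspective, here is a minimal sketch of the arithmetic, using the common notion of a real-time factor (seconds of compute per second of audio). The baseline numbers are hypothetical and chosen only for illustration; they are not measurements from Baidu or from any WaveNet implementation, and the function names are our own.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Seconds of compute spent per second of generated audio.

    A value of 1.0 or lower means speech is produced at least as fast
    as it plays back, i.e. in real time.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def apply_speedup(synthesis_seconds: float, speedup: float) -> float:
    """Compute time after an N-times speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return synthesis_seconds / speedup


# Hypothetical baseline (an assumption, not a measured figure):
# 20 minutes of compute to generate 3 seconds of speech.
baseline_seconds = 20 * 60
audio_seconds = 3.0

baseline_rtf = real_time_factor(baseline_seconds, audio_seconds)
faster_rtf = real_time_factor(apply_speedup(baseline_seconds, 400), audio_seconds)

print(f"baseline real-time factor: {baseline_rtf}")
print(f"after a 400x speedup:      {faster_rtf}")
```

With these made-up numbers, a 400X speedup is exactly what it takes to move from "minutes per few seconds of speech" to real-time playback.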

Deep Voice Tech Explained

Baidu’s Deep Voice technology uses deep-learning techniques at every stage of converting text to sound. Other text-to-speech systems rely on complex multi-stage pipelines in which each stage needs its own hand-engineered processing. Deep Voice replaces each of those stages with a neural network, which avoids a huge amount of processing and engineering. This makes Baidu’s solution easier to apply to different problem domains in speech synthesis.

By using a neural network for each component, our system is simpler and more flexible than traditional text-to-speech systems, where each component requires laborious feature engineering and extensive domain expertise. – Deep Voice: Real-time Neural Text-to-Speech

The same technology also predicts the duration and fundamental frequency (pitch) of each phoneme in a word or sentence. Because it can combine and switch phonemes and alter syllables, Deep Voice is capable of conveying different emotions in synthesized speech.
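To make the idea of per-phoneme duration and frequency more concrete, here is a small sketch of how such predictions could be represented and expanded into a frame-by-frame pitch contour. This is not Baidu's code: the `PhonemePrediction` class, the helper functions, the 5 ms frame size, and every number below are assumptions made up for illustration, and scaling pitch and duration is only one simple way to change how an utterance sounds.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PhonemePrediction:
    """Hypothetical container for one phoneme's predicted prosody."""
    phoneme: str
    duration_ms: int
    f0_hz: float  # 0.0 marks an unvoiced phoneme (no pitch)


def to_frame_contour(preds: List[PhonemePrediction], frame_ms: int = 5) -> List[float]:
    """Repeat each phoneme's pitch for as many frames as its duration covers."""
    contour: List[float] = []
    for p in preds:
        n_frames = max(1, round(p.duration_ms / frame_ms))
        contour.extend([p.f0_hz] * n_frames)
    return contour


def scale_prosody(preds: List[PhonemePrediction],
                  pitch_scale: float,
                  duration_scale: float) -> List[PhonemePrediction]:
    """Return a copy with every pitch and duration multiplied by a factor."""
    return [
        PhonemePrediction(p.phoneme,
                          round(p.duration_ms * duration_scale),
                          p.f0_hz * pitch_scale)
        for p in preds
    ]


# Made-up predictions for the word "hello" (ARPAbet-style phoneme labels).
neutral = [
    PhonemePrediction("HH", 60, 0.0),
    PhonemePrediction("AH", 80, 120.0),
    PhonemePrediction("L", 70, 125.0),
    PhonemePrediction("OW", 150, 110.0),
]

# A faster, higher-pitched variant of the same phonemes.
excited = scale_prosody(neutral, pitch_scale=1.25, duration_scale=0.5)

neutral_contour = to_frame_contour(neutral)
excited_contour = to_frame_contour(excited)

print(len(neutral_contour), "frames for the neutral version")
print(len(excited_contour), "frames for the faster, higher version")
```

The same four phonemes yield a shorter contour at a higher pitch in the second version, which hints at why controlling duration and frequency per phoneme matters for expressive speech.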

The advanced speech technology market is a billion-dollar industry. Text-to-speech has practical uses in education, accessibility, and any device designed for convenience, which makes it integral to these products. It is not surprising that giants like Google and Baidu (which is more dominant than Google in China) are deploying artificial intelligence to make great strides in the text-to-speech market.

Give this technology a few more years and we could finally have multiple realistic human voices to choose from on our devices. We are excited to hear Deep Voice in actual products soon. For now, current devices cannot handle Deep Voice’s data processing requirements. All we can do is wait.

As a bonus, the TEDx interview with Andrew Ng of Baidu’s AI team is pretty impressive too! The video is unrelated to Deep Voice, but he does talk about AI in education and gamification.
