Google Uses AI in Learning Human Speech

Attempts to make machines emulate human speech date back to the 1700s, and the early results were robotic and barely intelligible. Scientists have come a long way in training computers to understand voice commands, but text-to-speech (TTS) technology has lagged behind. Even today, synthesized voices can still be distinguished from actual human speech.

Google’s DeepMind unit may be close to making computer speech indistinguishable from a human voice. Its researchers recently announced a breakthrough in using artificial intelligence to produce voices that come close to human speech. The voice samples released by the DeepMind team are impressive and are by far the best-quality computer-generated speech to date.

Most speech synthesis, or text-to-speech, systems are based on concatenative TTS, in which a very large database of short speech fragments is recorded from a single speaker and then recombined to form complete utterances. This makes it difficult to modify the voice, for example to switch to a different speaker or change the emphasis, without recording an entirely new database.
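To make the limitation concrete, here is a minimal sketch of the concatenative idea in Python. It is only an illustration: the `FRAGMENTS` table, the unit names, and the sample values are made up, and real systems select and smooth units far more carefully.

```python
# Toy concatenative synthesis: look up a pre-recorded fragment for each unit
# and join the fragments end to end. The "recordings" are made-up numbers.
from typing import Dict, List

# Hypothetical database for ONE speaker: unit name -> recorded samples.
FRAGMENTS: Dict[str, List[float]] = {
    "he": [0.1, 0.3, 0.2],
    "llo": [0.4, 0.1],
    "wor": [-0.2, -0.1, 0.0],
    "ld": [0.2, 0.05],
}

def synthesize(units: List[str], db: Dict[str, List[float]]) -> List[float]:
    """Concatenate the stored fragment for each requested unit, in order."""
    audio: List[float] = []
    for unit in units:
        if unit not in db:
            # Nothing can be produced for a unit that was never recorded.
            raise KeyError(f"no recording for unit {unit!r}")
        audio.extend(db[unit])
    return audio

waveform = synthesize(["he", "llo", "wor", "ld"], FRAGMENTS)
print(len(waveform))  # total number of samples in the joined fragments
```

Because every output sample comes straight from the recorded database, a different voice or speaking style means building a different `FRAGMENTS` table, which is the constraint described above.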

DeepMind’s approach uses WaveNet, a deep neural network that generates the audio signal directly instead of stitching together recorded fragments. It models the raw waveform at 16,000 samples per second: for each sample, the network predicts a probability distribution over possible values based on the samples that came before, draws a value from it, and feeds that value back in to predict the next one. The result is raw audio built up one sample at a time.
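The sample-by-sample, statistical generation described here can be pictured with a small Python sketch. This is not DeepMind's code: `predict_distribution` is a hypothetical stand-in for the trained network, and the 256 amplitude levels are an assumption made for the example.

```python
import math
import random
from typing import List

SAMPLE_RATE = 16_000  # samples per second, the figure quoted in the article
LEVELS = 256          # assumed number of quantized amplitude levels

def predict_distribution(history: List[int]) -> List[float]:
    """Stand-in for the network: favour levels close to the previous sample."""
    last = history[-1] if history else LEVELS // 2
    weights = [math.exp(-abs(level - last) / 4.0) for level in range(LEVELS)]
    total = sum(weights)
    return [w / total for w in weights]

def generate(num_samples: int, seed: int = 0) -> List[int]:
    """Build audio one sample at a time, feeding each sample back in."""
    rng = random.Random(seed)
    audio: List[int] = []
    for _ in range(num_samples):
        probs = predict_distribution(audio)  # depends on everything so far
        next_sample = rng.choices(range(LEVELS), weights=probs, k=1)[0]
        audio.append(next_sample)
    return audio

clip = generate(SAMPLE_RATE // 100)  # 10 ms of toy "audio" (160 samples)
print(len(clip), clip[:5])
```

One second of audio at this rate needs 16,000 passes through the loop, each of which would call the full network in a real system. That gives a feel for why the approach is computationally expensive.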

Because a single model produces raw audio directly, it can learn the characteristics of many different voices and switch between them, and the researchers say that adding emotions and accents will produce speech that sounds even more realistic. “To make sure [WaveNet] knew which voice to use for any given utterance, we conditioned the network on the identity of the speaker,” the researchers said.
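Conditioning on the speaker can be sketched in the same toy style: the predictor takes a speaker identity as an extra input, so the same generation loop yields different audio for different voices. The `SPEAKER_CENTER` table and speaker names below are hypothetical placeholders for whatever speaker representation the real network learns.

```python
import math
import random
from typing import Dict, List

LEVELS = 16  # small number of amplitude levels, chosen only for readability

# Hypothetical per-speaker value standing in for a learned speaker representation.
SPEAKER_CENTER: Dict[str, int] = {"speaker_a": 3, "speaker_b": 12}

def predict_distribution(history: List[int], speaker: str) -> List[float]:
    """Toy predictor: depends on past samples AND on the speaker identity."""
    center = SPEAKER_CENTER[speaker]
    last = history[-1] if history else center
    target = (last + center) / 2.0  # pulled toward the speaker's typical level
    weights = [math.exp(-abs(level - target)) for level in range(LEVELS)]
    total = sum(weights)
    return [w / total for w in weights]

def generate(num_samples: int, speaker: str, seed: int = 0) -> List[int]:
    """Same sample-by-sample loop, with the speaker passed in at every step."""
    rng = random.Random(seed)
    audio: List[int] = []
    for _ in range(num_samples):
        probs = predict_distribution(audio, speaker)
        audio.append(rng.choices(range(LEVELS), weights=probs, k=1)[0])
    return audio

clip_a = generate(200, "speaker_a")
clip_b = generate(200, "speaker_b")
print(sum(clip_a) / len(clip_a), sum(clip_b) / len(clip_b))
```

Only the `speaker` argument differs between the two calls, yet the two clips settle around different levels, which is the sense in which one model can cover several voices.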

The WaveNet-generated speech samples do sound almost human, and they are by far the most natural computer-generated speech I’ve heard. The time may come when we’ll hear computers talk like Jarvis in Iron Man. With WaveNet, DeepMind may one day make that piece of science fiction a reality.

WaveNet is not limited to speech, either. It can also make music. According to the researchers, “As well as yielding more natural-sounding speech, using raw waveforms means that WaveNet can model any kind of audio, including music.”

As impressive as the technology is, it requires massive computational power, which will constrain its commercial use for now. It is not a question of ‘if’ but of ‘when’ this speech synthesis technology will be perfected. Its mainstream release by DeepMind will have repercussions globally: it will benefit those who have lost the ability to speak, while it will adversely affect those whose jobs depend on their voice.

Thinking about voice-enabling your site or app? ResponsiveVoice is an HTML5-based text-to-speech library designed to add voice features to websites and apps across all smartphone, tablet and desktop devices. It supports 51 languages through 168 voices, has no dependencies, and weighs just 14 KB. Try ResponsiveVoice now!

Read More: “Google Just Got This Much Closer to Mimicking Human Speech Using AI” by Glenn Leibowitz