People are relying more and more on AI personal assistants such as Apple’s Siri, Amazon’s Alexa, Microsoft’s Cortana and the Google Assistant. Apple and Microsoft use concatenative text-to-speech, in which recordings of a real human voice are cut into small pieces, rearranged and recombined. The result is quite realistic, but no library contains recordings of every possible sound, and the audio must also stay consistent with the persona of Siri or Cortana.
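To make the concatenative idea concrete, here is a minimal sketch in Python. It is not how Siri or Cortana are actually built: the unit names, the tiny `UNIT_LIBRARY`, the sample rate and the crossfade length are all assumptions made for illustration, and plain sine tones stand in for recorded speech fragments.

```python
import math

SAMPLE_RATE = 16_000  # samples per second (assumed for this sketch)

def tone(freq_hz, duration_s):
    """Stand-in for a recorded snippet: a plain sine tone."""
    n = int(SAMPLE_RATE * duration_s)
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]

# Hypothetical unit inventory; a real system stores a very large
# number of recorded speech fragments instead of four tones.
UNIT_LIBRARY = {
    "HH": tone(200, 0.05),
    "EH": tone(300, 0.05),
    "L": tone(250, 0.05),
    "OW": tone(350, 0.05),
}

def concatenate_units(unit_names, crossfade=80):
    """Join units end to end, blending `crossfade` samples at each seam."""
    output = []
    for name in unit_names:
        if name not in UNIT_LIBRARY:
            # The weakness of the approach: a sound that was never
            # recorded cannot be produced.
            raise KeyError(f"no recording for unit {name!r}")
        unit = UNIT_LIBRARY[name]
        overlap = min(crossfade, len(output), len(unit))
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)  # fade-in weight of the new unit
            idx = len(output) - overlap + i
            output[idx] = (1 - w) * output[idx] + w * unit[i]
        output.extend(unit[overlap:])
    return output

speech = concatenate_units(["HH", "EH", "L", "OW"])
```

Asking for a unit that is not in the library, for example `concatenate_units(["ZH"])`, raises a `KeyError`. That is the practical limit of concatenation: the voice can only say what can be pieced together from existing recordings.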
The alternative is parametric text-to-speech, which does not require a human to record the sounds. Instead, the voice is generated entirely by computer from coded rules and parameters, and the result sounds robotic and stilted.
WaveNet, a model developed by DeepMind, Google’s artificial intelligence lab, offers a better option. It can produce some of the most realistic synthetic voices to date. It models audio waveforms taken from samples of actual human voices and then creates its own sounds, capturing the subtleties of human speech. It might not be as realistic as an actual person speaking, but so far it is much better than the sound produced by other text-to-speech programs. DeepMind has published generated samples that closely mimic a real human voice but are, on their own, just sounds with no content. As explained in their paper, WaveNet: A Generative Model for Raw Audio:
Because the model is not conditioned on text, it generates non-existent but human language-like words in a smooth way with realistic sounding intonations… We observed that the model also picked up on other characteristics in the audio apart from the voice itself. For instance, it also mimicked the acoustics and recording quality, as well as the breathing and mouth movements of the speakers.
To turn this into meaningful speech, the model must be guided by linguistic information about the text to be spoken, such as its syntax and grammar.
WaveNet has also shown that other audio signals can be synthesized, such as automatically generated piano music. This is less complicated than producing speech that sounds like a real human voice.
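The sample-by-sample generation behind the quoted passage can be sketched roughly as follows. This is a toy illustration, not DeepMind’s implementation: the WaveNet paper describes predicting each audio sample from the ones before it, with amplitudes quantized to 256 levels using mu-law companding, but `toy_next_sample_distribution` below is a hypothetical stand-in for the trained neural network, and the function names, the seed and the width of its distribution are assumptions.

```python
import math
import random

MU = 255  # 8-bit mu-law companding, giving 256 discrete amplitude levels

def mu_law_encode(x):
    """Map an amplitude in [-1, 1] to an integer class in [0, 255]."""
    compressed = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int(round((compressed + 1) / 2 * MU))

def mu_law_decode(k):
    """Inverse of mu_law_encode, up to quantization error."""
    compressed = 2 * k / MU - 1
    return math.copysign(((1 + MU) ** abs(compressed) - 1) / MU, compressed)

def toy_next_sample_distribution(history):
    """Hypothetical stand-in for the trained network.

    Returns a probability for each of the 256 classes, favouring values
    close to the previous sample so the output changes smoothly.
    """
    center = history[-1] if history else 128
    weights = [math.exp(-((k - center) ** 2) / 18.0) for k in range(256)]
    total = sum(weights)
    return [w / total for w in weights]

def generate(n_samples, seed=0):
    """Draw one sample at a time, each conditioned on everything before it."""
    rng = random.Random(seed)
    history = []
    for _ in range(n_samples):
        probs = toy_next_sample_distribution(history)
        k = rng.choices(range(256), weights=probs)[0]
        history.append(k)  # the new sample becomes context for the next one
    return [mu_law_decode(k) for k in history]

waveform = generate(200)
```

Because nothing here depends on text, the loop produces sound without content, much like the babbling described in the quote. A text-to-speech system would feed extra linguistic inputs into the predictor at every step so that the generated waveform says something specific.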
In blind tests, listeners rated WaveNet as the most realistic among text-to-speech programs. It has truly revolutionized text-to-speech. Before WaveNet, most computer experts were of the opinion that the greatest challenge in text-to-speech development was producing high-quality synthesis. With the results from Google DeepMind, that challenge has been overcome.
Currently, WaveNet is still not available commercially because of its enormous computing power requirements. Nevertheless, no one will dispute that it is ground-breaking. When Google has successfully harnessed WaveNet and is able to make the technology available for servers and small devices, we will have realized the dream of being able to talk with machines.