How convenient would it be if we could just talk to our computers and have them respond with some level of contextual accuracy? This is a goal of many AI engineers. However, it is not as easy as it sounds (no pun intended).

This is possible with the integration of the following concepts and technologies:

Text-to-speech:
This is the mechanism that enables the computer to speak. The software converts written text into synthesized audio, making it appear that the computer is responding verbally.

Speech recognition:
The computer needs to be able to understand and process verbal commands.

And lastly, Natural Language Processing:
The computer should be able to derive meaning from recognized verbal cues or written text so that it can process commands in a meaningful manner. Imagine your computer being able to “understand” what you are asking it to do.
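To make the idea of deriving meaning from recognized text more concrete, here is a minimal sketch in Python. It assumes the speech has already been converted to text, and it uses simple keyword overlap rather than real natural language processing. The intent names and keyword sets are hypothetical and exist only for illustration.

```python
import re
from typing import Optional, Set

# Hypothetical command vocabulary: each intent is described by a few keywords.
INTENT_KEYWORDS = {
    "play_music": {"play", "music", "song"},
    "navigate": {"navigate", "directions", "route"},
    "call_contact": {"call", "dial", "phone"},
}


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of words."""
    return set(re.findall(r"[a-z']+", text.lower()))


def guess_intent(text: str) -> Optional[str]:
    """Return the intent whose keywords overlap most with the text, or None."""
    words = tokenize(text)
    best_intent, best_score = None, 0
    for intent, keywords in INTENT_KEYWORDS.items():
        score = len(words & keywords)  # number of shared words
        if score > best_score:
            best_intent, best_score = intent, score
    return best_intent
```

Calling `guess_intent("Please play some music")` picks the `play_music` intent, while a phrase with no known keywords returns `None`. A real system would need far more than keyword matching to cope with paraphrases, accents, and ambiguity.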

Here’s what it looks like when you apply all these concepts:

This will feel familiar to all Star Trek fans out there! This program is called Space Nerds In Space. It is an open-source, multiplayer, networked starship simulator for Linux.

Text-to-speech applications have improved to the point where the computer can read aloud in a more conversational manner, with proper speech inflection. Most TTS applications are now available in several languages, in both male and female voices, and with appropriate intonation.

Compared to text-to-speech, speech recognition is much more challenging, as a scene in the 2009 Star Trek movie illustrates. Pavel Andreievich Chekov, the Russian Starfleet officer of the USS Enterprise, has to enunciate his authorization code before the computer will let him make his broadcast, because the computer does not recognize his thick Russian accent. That may be fiction, but it shows how daunting a task developing a perfect speech recognition program can be. Processing natural language is one of its seemingly insurmountable hurdles.

As of 2009, Ethnologue had catalogued 6,909 distinct languages, and English seems to hold the dubious distinction of being the global language. Even so, English varies from region to region. Factor that in, and developing speech recognition software that works for all users seems impossible. As seen in the video above, the commands have to be repeated several times before the computer responds. Note that the commands are given by a native speaker of the language, and yet there are times when the computer fails to understand. Be that as it may, the video shows that we have come a long way towards a speech recognition program that can be used globally.

The program's author has generously shared his scripts and ideas, which should help other developers in this field build robust software that synthesizes natural speech patterns and expressions. Currently, speech recognition powered by artificial intelligence is used in applications such as navigation with GPS-connected digital maps and voice dialing.

Ford has a technology that allows users to talk to their cars. It uses advanced voice technology that lets the user tell the car what to do, and the car can even offer options. For example, if the user wants to play music, the car asks what kind of music and lists the available playlists. Best of all, it only speaks when spoken to!
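The back-and-forth described above can be sketched as a small function that maps one user utterance to one spoken reply. This is only an illustration of the interaction pattern, not how Ford's system works. The playlist names, the `respond` function, and the reply wording are all hypothetical.

```python
from typing import List

# Hypothetical playlists; a real system would read these from the media library.
PLAYLISTS = ["Road Trip", "Jazz Classics", "Morning News"]


def respond(utterance: str, playlists: List[str] = PLAYLISTS) -> str:
    """Return the assistant's reply to one utterance; "" means it stays silent."""
    text = utterance.strip().lower()
    if not text:
        # Nothing was said, so the assistant says nothing either.
        return ""
    # If the request names a known playlist, play it directly.
    for name in playlists:
        if name.lower() in text:
            return f"Playing {name}."
    if "music" in text or "play" in text:
        # The request is ambiguous, so ask a follow-up and list the options.
        options = ", ".join(playlists)
        return f"What would you like to hear? Available playlists: {options}."
    return "Sorry, I did not understand that."
```

With this sketch, `respond("Play some music")` asks the follow-up question and lists the playlists, `respond("Play Jazz Classics")` starts that playlist, and `respond("")` returns an empty string because the assistant only speaks when spoken to.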

Speech recognition software can be speaker-dependent or speaker-independent. With speaker-dependent software, a new user must “train” the software until it recognizes that user’s unique cadence and inflection. Speaker-independent programs are designed to work with anyone’s voice, which comes at the cost of lower accuracy and more limited comprehension.

Even with its current limitations, speech recognition technology has proven very useful. People who cannot use their hands to type can dictate to the computer, which converts their spoken words into written text. Google has added it to its browser. The technology has not been perfected yet, but speech recognition will keep evolving until human symbiosis with machines, i.e. the computer, becomes a reality. Computers will understand and execute “Beam me up, Scotty!” spoken in different languages.

