Speech recognition technology has grown in sophistication and accuracy in recent years, and its success is winning it more attention and wider commercial adoption. The latest developments enable natural, human-like dialogue, so customers can engage in a more conversational way.

For instance, a speech-automated call centre system might ask, “How can I help you today?” The technology is advanced enough to derive meaning from callers’ unstructured responses, not just by recognising certain words but also by taking into account the context in which something is said.

Speech recognition has met with resistance in the past, largely due to people’s experiences of outdated versions or poor implementations of what is now state-of-the-art technology. The technology has made significant breakthroughs in recent years. It is able to adapt to local dialects and regional accents for improved recognition accuracy. It can also filter out background noise, improving speech clarity and raising accuracy rates dramatically in wireless, hands-free, and noisy environments.

Building a System

The art of speech recognition technology is in its simplicity. Its ease of use and near ubiquity belie the complex work that an army of talented developers and linguists puts into the backend. Building a system capable of understanding natural language input involves collecting 30,000 or more caller utterances that represent seasonal and monthly trends in call patterns, and using these to build a model of how people talk.
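
As a rough illustration of what "a model of how people talk" can mean, the sketch below counts which word tends to follow which in a handful of utterances (a simple bigram model). It is a minimal sketch, not the method any particular vendor uses: the three sample utterances and the function names `train_bigram_model` and `next_word_probability` are invented for illustration, and a production system would train far richer models on tens of thousands of real caller utterances.

```python
from collections import Counter, defaultdict


def train_bigram_model(utterances):
    """Count how often each word follows another across all utterances."""
    counts = defaultdict(Counter)
    for utterance in utterances:
        # <s> and </s> mark the start and end of an utterance.
        words = ["<s>"] + utterance.lower().split() + ["</s>"]
        for previous_word, next_word in zip(words, words[1:]):
            counts[previous_word][next_word] += 1
    return counts


def next_word_probability(model, previous_word, next_word):
    """Estimate P(next_word | previous_word) from the raw counts."""
    total = sum(model[previous_word].values())
    return model[previous_word][next_word] / total if total else 0.0


# Hypothetical sample data; a real deployment would use collected calls.
sample_utterances = [
    "i want to check my balance",
    "i want to pay my bill",
    "i need to pay my bill",
]

model = train_bigram_model(sample_utterances)

# Two of the three occurrences of "my" are followed by "bill".
print(next_word_probability(model, "my", "bill"))
```

The more utterances are fed in, the better these estimates reflect how callers actually phrase their requests, which is why the size and seasonal coverage of the collected sample matter.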

Interestingly, it is often easier for speech recognition to understand long words like “Supercalifragilisticexpialidocious” than short words that sound similar, like “hat” and “cat”, because a long word contains many more sounds that set it apart from other words. The system’s engineering is so intelligent that the technology can understand as much as a human can. And just as call centre agents might ask callers to repeat themselves for clarity, so can a speech recognition solution.
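
The point about short, similar-sounding words can be sketched in a few lines. The example below uses letter-level edit distance as a crude stand-in for how close two words sound, and asks the caller to repeat when the two best candidates are too close to call. This is an assumption-laden toy: real recognisers compare acoustic and phonetic evidence and produce confidence scores, and the names `edit_distance`, `recognise` and `min_margin` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character edits to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete from a
                current[j - 1] + 1,                   # insert into a
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def recognise(heard, vocabulary, min_margin=2):
    """Return (best_match, needs_reprompt).

    The caller is asked to repeat when the best and second-best candidates
    are separated by fewer than min_margin edits.
    """
    scored = sorted((edit_distance(heard, word), word) for word in vocabulary)
    best_distance, best_word = scored[0]
    runner_up_distance = scored[1][0] if len(scored) > 1 else float("inf")
    needs_reprompt = (runner_up_distance - best_distance) < min_margin
    return best_word, needs_reprompt


vocabulary = ["hat", "cat", "supercalifragilisticexpialidocious"]

# A slightly misheard short word is equally close to "hat" and "cat".
print(recognise("bat", vocabulary))

# A slightly misheard long word still has only one plausible match.
print(recognise("supercalifragilisticexpialidocius", vocabulary))
```

In this toy, the short input triggers a request to repeat, while the long input is accepted despite the error, because no other word in the vocabulary comes anywhere near it.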

Because the technology has been fine-tuned over a period of years, the time it takes to customise a system for a specific customer has contracted considerably, as has the cost. Although analysis must still be undertaken to establish the types of conversations that will take place, we are moving towards a system of building blocks in which data can be rapidly sourced from a speech database in order to implement the solution quickly.

The more deployments that are rolled out, the more data is collected about how people speak, and the more the underlying acoustic models (the algorithms that represent the sounds made by speech) can be refined to take into account a wider variety of speaking environments. By continually updating these models we can improve the intelligence of the system and so enable it to understand a wider variety and complexity of speech input.

Beyond these types of technological improvements, the speech industry also continues to strive towards natural-sounding synthesised speech, so that it can be used to read out information and instructions in a broader range of applications.

Synthesised speech is especially useful when it is difficult to record large volumes of audio files with a human voice artist, or when a device has limited memory for storing audio. The primary aim here is to make the emphasis and tonality of the system more human and less like a machine and, in doing so, more appealing to users listening to long passages of audio.

Beyond Speech

The sound of a brand is also a key factor for consideration. Some of the world’s biggest companies invest millions of pounds each year ensuring that the way their brands ‘look’ and ‘feel’ reflects the values and beliefs of the brand.

Yet very few organisations actually think about how their brand ‘sounds’, despite the fact that the vast majority of customer service communication and advertising is based on listening. We buy from people we trust, and the same applies when we hear a voice representing a company or selling to us. It’s one of the most important and overlooked areas of marketing.

Developments in the technology have allowed for its application across multiple modalities. Speech technology on mobile devices is already emerging as a killer application. For instance, some iPhone apps allow users to speak their text messages or social media updates.

As speech recognition technology evolves and learns, the boundaries are constantly pushed. Speech input on mobile phones, for instance, allows the user to say a variety of free-form commands, which is the first step on the long roadmap towards increasingly natural interaction with devices.

The greatest challenge for the future of speech recognition is, understandably, how to add more humanity to the technology. When a person reads a piece of text, the context or situation informs their tone and speed. This element is currently under development. The next generation of speech technology will not only understand what someone is trying to say, but will also analyse the tone and respond accordingly. The interaction between the system and the user will be seamless and natural.