Although speech recognition software has been around for decades and is now built into many smartphones and PCs, not everyone feels comfortable using it. Why? Possibly because getting a computer to ‘listen’ in the same way as a human being is a complex area of computer science, and it doesn’t always come off.

A recent news story about the perils of autocorrect when texting (‘Mother accidentally requests ‘wee blind girl’ on daughter’s 21st Birthday cake after autocorrect fail’) got me thinking about why speech technology is so often such an ordeal for the people who have to use it. Many people find that dealing with an IVR (Interactive Voice Response) telephone system, when calling a company with a query, gives them a huge headache.


Not only does the IVR ask you endlessly to ‘press 1’ for this option or ‘press 2’ for that option, but the recorded voice at the other end of the line now often wants you to give more detailed information: what your query is about, your reference number, what town you are calling from and so on. And that’s where the problems can start.

Any sort of background noise, a poor telephone line or even a ‘non-standard’ accent can challenge the speech recognition technology as it struggles to ‘understand’ what’s being said. Of course it has a database of vocabulary to draw on and a set of algorithms – self-contained, step-by-step sets of operations – which it can perform to interpret those words in a certain sequence, but sometimes what the person at the other end of the line is saying just doesn’t match anything it’s familiar with. (For a humorous example of this, see Barry Kripke in ‘The Big Bang Theory’, who has a bad case of rhotacism, trying to use Siri on his iPhone.)


The term ‘speech recognition’ suggests that the language is being ‘understood’. In fact, as well as learning a huge number of words, the technology also learns how often particular words occur in a particular sequence or in close proximity to one another. In this way it can be ‘trained’ to make an educated guess at which word was meant in a given sequence, even when the candidates are homophones such as ‘to’, ‘too’ and ‘two’.
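To make the idea concrete, here is a minimal sketch in Python of how word-pair frequencies can settle a homophone. It is an illustration only, not how any particular product works: the tiny corpus, the function names `train_bigrams` and `pick_homophone`, and the simple scoring rule are all assumptions made up for this example. Real systems learn from vastly more data and use far richer statistical models.

```python
from collections import Counter

# A tiny, made-up training corpus (an assumption for illustration only).
CORPUS = [
    "i would like to rent a car",
    "i would like to book a car for two days",
    "i need to speak to an agent",
    "i have two bags",
    "that is too expensive",
    "it is too late to cancel",
]


def train_bigrams(sentences):
    """Count how often each pair of adjacent words occurs."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        counts.update(zip(words, words[1:]))
    return counts


def pick_homophone(prev_word, next_word, candidates, bigrams):
    """Pick the candidate seen most often next to its neighbours.

    The score is simply the number of times the candidate followed
    prev_word plus the number of times it preceded next_word.
    """
    def score(word):
        return bigrams[(prev_word, word)] + bigrams[(word, next_word)]

    return max(candidates, key=score)


BIGRAMS = train_bigrams(CORPUS)
SOUNDS_LIKE_TOO = ["to", "too", "two"]

# The recogniser heard "for <?> days" and must choose a spelling.
print(pick_homophone("for", "days", SOUNDS_LIKE_TOO, BIGRAMS))
# The recogniser heard "is <?> late".
print(pick_homophone("is", "late", SOUNDS_LIKE_TOO, BIGRAMS))
```

Nothing here ‘understands’ the sentence. The choice falls out of counting which words have been seen side by side, which is why an unusual phrase or an unexpected accent can still send the guess astray.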

It’s a bit like hearing someone speaking in a foreign language. You only need to recognise a few words out of many and your brain will fill in the gaps for you from the context. When we speak, our voices generate little packages of sound called ‘phones’, and the distinct sounds that a language uses to tell one word from another are called ‘phonemes’. Spoken languages are built up from these phonemes. By some counts English uses about 46 phonemes, while Spanish only has about 24.


So, when we hear someone talking, our ears hear the sounds and our brains translate them instantaneously into logic – sentences, thoughts, ideas – so that most of the time it all makes perfect sense. Occasionally, there’s some confusion when our brains try to make sense of words that sound as though they might be ‘right’, but aren’t what was actually said. For example, there’s a well-known history of song lyrics being misheard and misunderstood – a phenomenon known as a ‘mondegreen’. (In the Jimi Hendrix song “Purple Haze”, you would be more likely to hear Hendrix singing that he is about to kiss this guy than that he is about to kiss the sky.)

So you can imagine that speech technology has even more of a challenge than human brains. It was only around a decade ago that systems learnt to distinguish between discrete and continuous speech – that is, words spoken slowly with pauses in between them and words run together as they usually are in everyday conversation. Almost all modern systems are now capable of understanding continuous speech in the way it is normally spoken, but the next step forward has been a while coming.


Recently our company, [24]7, developed a customer engagement platform that integrates Microsoft’s DNN (Deep Neural Networks) or Deep Learning technology and is expected to deliver 95 percent speech recognition accuracy – a huge improvement of 25 percent over previous models. It’s the very first on the market and is being put into practical use by car rental company Avis Budget Group.

This is a huge breakthrough in artificial intelligence (AI) and potentially a game changer for IVR and customer service. It draws on billions of vocalisations gleaned from Microsoft products, including customer-facing products such as Bing, Xbox and Cortana, making it the most advanced on the market. As this type of integrated platform becomes more widely used, the outcome should be a lot less stress for customers when dealing with IVRs and a much better reputation all round for self-service customer service.