
Limitations of Voice Recognition Technology


A couple of years ago, only a handful of devices used voice recognition technology. Now, our phones, cars, navigation systems, and even home appliances seem eager to take verbal direction.  Everywhere we turn, we encounter a system that has voice control capabilities, and for good reason: efficiency and safety can drastically improve when we interface with a system hands-free. Based on the current trend, our relationships with Siri and Alexa are due to grow in the near future.  

But how near is ‘near’?  Despite advanced technologies in personal AI systems, are we really that close to having voice recognition all around us?  Why on earth isn’t voice recognition more common than it already is?  

Well, despite how far it has come, voice recognition technology still has a long way to go.  The obstacles to overcome aren’t widely known, but they are numerous and they severely hinder the technology’s capabilities.  

Below we outline a few of the reasons voice recognition technology is actually quite difficult to employ in our daily lives.  As it turns out, human language is a sophisticated system that is difficult to translate to the binary languages of computers. Who knew?  These hurdles, while easily pointed out, are complex to solve and must be overcome before further advancement comes our way.

Vocabulary and Accents

There are over 150,000 words in the English dictionary, and while the average person uses only around 20,000 of them, programming a system to recognize each one is an enormous feat. Because so many words sound alike, computers can have a very hard time distinguishing one phrase from another.

Even more problematic, each person has their own unique vocabulary, dialect, accent, and even slang. The subtle differences from person to person may be noticeable to the human ear, but teaching a machine to pick up on them is a completely different story. Remember, our language and communication abilities have been developing for millions of years; modern computers have been around for about thirty. The sheer variability from one speaker to the next, and from one utterance to the next, is extremely difficult to accommodate.


Speed of Speech

As far as a computer is concerned, we talk very, very quickly. Because language comes so naturally to us and because we are immersed in it from a young age, we don’t notice how quickly information flows in ordinary speech. The average person speaks somewhere between 110 and 150 words per minute.

Keeping up with that is pretty difficult for computers. It isn’t that they can’t process information at that speed; it’s the way words run together when we speak that quickly. Some word boundaries come with pauses, but many do not: just because two words are separate on the page doesn’t mean we pause between them out loud. The result is a computer that doesn’t know where one word ends and the next begins. That’s like trying to read a written page with only some of the spaces included: not impossible, but certainly difficult.
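The missing-spaces analogy can be sketched in a few lines of code. Below, a naive greedy segmenter tries to recover word boundaries from unspaced text using a tiny, hypothetical vocabulary (the word list is invented for the example):

```python
# The word-segmentation problem in miniature: strip the spaces from a
# sentence, then try to recover the boundaries with a greedy
# longest-match against a tiny, made-up vocabulary.
VOCAB = {"the", "there", "rapist", "therapist", "is", "in"}

def greedy_segment(text):
    """Repeatedly take the longest vocabulary word that prefixes the text."""
    words = []
    i = 0
    while i < len(text):
        match = None
        for j in range(len(text), i, -1):  # try longest candidates first
            if text[i:j] in VOCAB:
                match = text[i:j]
                break
        if match is None:
            return None  # no segmentation found
        words.append(match)
        i += len(match)
    return words

print(greedy_segment("thetherapistisin"))
print(greedy_segment("therapistisin"))  # "therapist is in"? or "the rapist is in"?
```

Even with a six-word vocabulary, the second input has two perfectly valid readings, and the greedy rule silently commits to one of them; scale that up to tens of thousands of words and continuous audio, and the difficulty is clear.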

Sound Quality

Environmental and ambient noise can drastically reduce the effectiveness of speech recognition. Just as the signal-to-noise ratio limits what humans can perceive, it limits what computers can perceive. A sentence may have been spoken by the user, or it may have come from someone at the next table; unless it is a very sophisticated system, the computer simply doesn’t know what to listen to and what to ignore.
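Signal-to-noise ratio is easy to make concrete. Here is a toy sketch, with a sine wave standing in for speech and uniform random samples standing in for room noise (both are stand-ins, not real audio):

```python
import math
import random

# Toy signal-to-noise ratio (SNR) sketch: the power of the "speech"
# signal relative to the power of the background noise, in decibels.
def power(samples):
    return sum(s * s for s in samples) / len(samples)

def snr_db(signal, noise):
    return 10.0 * math.log10(power(signal) / power(noise))

random.seed(0)
n = 8000  # one second of audio at an 8 kHz sample rate
speech = [math.sin(2 * math.pi * 200 * t / n) for t in range(n)]  # 200 Hz tone
quiet_room = [random.uniform(-0.05, 0.05) for _ in range(n)]      # faint noise
loud_cafe = [random.uniform(-0.5, 0.5) for _ in range(n)]         # loud noise

print(f"quiet room SNR: {snr_db(speech, quiet_room):.1f} dB")
print(f"loud cafe SNR:  {snr_db(speech, loud_cafe):.1f} dB")
```

Making the noise ten times louder costs roughly 20 dB of SNR, and every decibel lost makes it harder for any listener, human or machine, to separate the words from the background.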

On a different note, there are times when people do not speak clearly enough for the computer to understand. When humans are under severe stress, for example, they are unlikely to enunciate or slow their speech. In fact, they will probably do the opposite.

Unfortunately, when the user is under stress, the use case is probably some type of emergency, exactly when correct recognition is absolutely critical. Yelling “Help! Find a doctor!” is something the user wants to say once, and only once. The added variability of these safety-critical moments immensely complicates the design of the system.


Task Compatibility

Lastly, voice recognition technology simply isn’t compatible with some tasks. Telling a washing machine to start a cycle isn’t a big deal because the task is static, but using voice control for continuous, dynamic tasks just doesn’t make sense. Two examples that come to mind are playing video games and driving a car: controlling an entity in a complicated, fast-changing environment simply doesn’t work with speech. What does “Turn left!” really mean? How far left? For how long?

It is partly for these reasons that we don’t see voice recognition as often as recent progress might suggest. Beyond the hurdles of improving recognition quality, many of the tasks in our daily lives simply aren’t compatible with speech input.


Voice recognition has come a long way, but it still has a huge distance to go before we see more of it. The variability of language, vocabulary, accents, and sound quality makes error-free use quite difficult. At the very least, these difficulties mean that tasks with even a modest degree of importance probably aren’t going to incorporate voice recognition software. As the future continues to arrive at our doorsteps, it will be interesting to see which breakthroughs have an impact on voice recognition capabilities. And before we know it, who is to say we won’t have cognitive control over our technologies?


Anders Orn Director, Human Factors

Anders Orn is a Senior Human Factors Scientist. He loves usability studies because he enjoys studying and learning from people. In his spare time, he can be found outside, on a mountain, or at home with his family. You can find Anders on LinkedIn.