Lip-reading program more accurate than humans could help hearing-impaired

Scientists have built a computer program that can lip-read better than humans. But why? Tech columnist Dan Misener has the answers.

Technology might also improve speech recognition and ultimately Siri, Alexa or Google Now

The lip-reading software could accurately recognize about 50 per cent of words, while human lip-readers correctly read fewer than one-quarter of them. (Photo courtesy of Joon Son Chung)

Lip-reading is a notoriously tricky task. But researchers at the University of Oxford in the U.K. have created a computer program called Watch, Attend and Spell to do just that.

They claim their lip-reading algorithm is more accurate than human professionals.

Dan Misener is our tech columnist.

Why teach a computer to lip-read?

There are a number of reasons you might want a computer to lip-read and many of those have to do with accessibility.

For instance, a lip-reading computer could transcribe or add captions to video, make it easier for people to talk to their devices in noisy environments, or fill in the gaps during a video conference.

But, as it turns out, lip-reading is a difficult task for both humans and computers.

That's because our mouths often make the same shapes for different words, according to Joon Son Chung, one of the researchers at Oxford.

"So, for example, pat, bat and mat are visually identical," Chung said. 

If you only see a mouth and don't hear a voice, it's very difficult to tell the difference between "bat" and "mat."

That's the challenge of getting a computer to lip-read.
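
To make the "pat, bat and mat" problem concrete, here is a minimal sketch in Python. It is not the Oxford team's method. The lip-shape table and the function names are hypothetical and heavily simplified; the sketch only shows how words that differ in sound can collapse into the same visual pattern.

```python
# Hypothetical, simplified table: which lip shape each starting consonant produces.
# Real lip-shape ("viseme") inventories are larger and depend on the speaker.
LIP_SHAPE_OF = {
    "p": "lips-closed",
    "b": "lips-closed",
    "m": "lips-closed",
    "f": "lip-to-teeth",
    "v": "lip-to-teeth",
}


def lip_shape_key(word):
    """Describe a word as (lip shape of its first letter, rest of the word)."""
    first, rest = word[0], word[1:]
    return (LIP_SHAPE_OF.get(first, first), rest)


def group_by_lip_shape(words):
    """Group words that a camera alone could not tell apart under this table."""
    groups = {}
    for word in words:
        groups.setdefault(lip_shape_key(word), []).append(word)
    return groups


groups = group_by_lip_shape(["pat", "bat", "mat", "cat"])
for key, lookalikes in groups.items():
    print(key, lookalikes)
```

Under this toy table, "pat," "bat" and "mat" land in one group while "cat" stands alone, which is why a lip-reader, human or machine, has to lean on context to choose between them.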

But the reason we're talking about this today is that there have been several recent improvements in this field.

And in some cases, computers can now lip-read better than humans.

How did the team at Oxford teach a computer to do this?

The researchers created what they call Watch, Attend and Spell. It's a new artificial intelligence software system.

Watch, Attend and Spell was created using an approach known as machine learning. The researchers created an algorithm — a neural network — that could learn over time.

They trained the algorithm by showing it thousands of hours of TV news footage from the BBC.

The advantage of TV news is that it's relatively high-quality video and it includes lots of different faces and speaking styles.

Plus, the TV shows they used to train the algorithm were already captioned by professionals. So they could match the mouth movements to transcriptions of what had been said on-screen.

Researchers trained the algorithm to watch mouth movements to identify words, such as these one-second clips that contain the word ‘about.’ (Photo courtesy of Joon Son Chung)

After the researchers trained their algorithm on these thousands of hours of TV, they put it to the test in the real world to see how it would perform on video without captions.

In other words, they wanted to see if their software could take what it had learned, and lip-read faces and mouths that it hadn't necessarily seen before.

How accurate was it?

It was surprisingly accurate.

It was able to get about 50 per cent of the words right.

Now, 50 per cent accuracy doesn't sound all that impressive until you compare it with human lip-reading experts. 

"We have given the same clips to the professional lip-readers and they seem to get less than one-quarter right," Chung said. 

So, the computer's performance is pretty impressive.
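
For a sense of what "50 per cent of the words right" means, here is a rough sketch in Python of one way to score a lip-read guess against a caption. This is an assumption for illustration, not necessarily how the researchers scored their system. The function name and both sentences are made up; the matching comes from Python's standard `difflib` module.

```python
from difflib import SequenceMatcher


def word_accuracy(reference, hypothesis):
    """Fraction of caption words that also appear, in order, in the guess."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    matcher = SequenceMatcher(None, ref_words, hyp_words, autojunk=False)
    # Each matching block is a run of words that is identical in both sequences.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_words)


# Made-up example sentences, not data from the study.
caption = "we will meet at the station"
machine_guess = "we will beat at a nation"

accuracy = word_accuracy(caption, machine_guess)
print(f"{accuracy:.0%} of words correct")  # prints "50% of words correct"
```

In this invented example, three of the six caption words ("we," "will" and "at") survive, so the guess scores 50 per cent even though the sentence as a whole no longer makes sense. The sketch assumes the caption is not empty.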

What privacy concerns do lip-reading computers raise?

When I first heard about this research, my mind immediately turned to that scene in 2001: A Space Odyssey, where they reveal that the HAL 9000 computer can lip-read.

I thought about all the cameras in the world around us that are constantly capturing video, such as smartphone cameras or security cameras.

If it's possible to figure out what someone is saying using only an image of their mouth, the possibilities for surveillance and eavesdropping seem pretty creepy.

I asked Chung about this, and he told me that the system doesn't pose a serious privacy risk right now.

That's partly because most security cameras don't capture video of high enough quality to make this type of lip-reading work.

He also pointed out the software's 50 per cent accuracy rate.

"Yes, it's true, it can lip-read better than a human, but it still gets half the words wrong when used without the audio. So it's not really useful for privacy intrusive scenarios," Chung said. 

Even if you got a clear, high-resolution video feed of someone, you couldn't know for certain exactly what they were saying.

Where might we see lip-reading computers in everyday life?

Like I said off the top, the researchers had accessibility in mind when designing this system. 

In particular, they thought about applications that could help people who are deaf or hard-of-hearing.

This technology also has the potential to significantly improve general-purpose speech recognition.

I don't know about you, but I'm often frustrated when I use voice-based services like Siri, Google Now or Alexa. Sometimes they work well for me, but other times, these voice assistants get things very wrong.

Technology like Watch, Attend and Spell could improve voice-based services such as Siri, Google Now or Alexa. (iStockphoto)

The researchers at Oxford believe that combining voice recognition with lip-reading technology could dramatically improve the accuracy of these virtual assistants.

And there's another thing to consider: we tend to think of understanding speech as an auditory skill. But humans also pick up on visual cues to understand what's being said. 

In that way, when we combine speech recognition technology with lip-reading technology, we're building computer systems that mirror how humans perceive speech.

And if that can help Siri understand me a little better, that's a bonus.


Dan Misener

CBC Radio technology columnist

Dan Misener is a technology journalist for CBC Radio. Find him on Twitter @misener.