
Tech Companies Are Training AI to Read Your Lips


A patient sits in a hospital bed, a bandage covering his neck with a small opening for the tracheostomy tube that supplies him with oxygen.

Because of his recent surgery, the man featured in this marketing video can’t vocalize. So a doctor holds up a smartphone and records the patient as he mouths a short phrase. An app called SRAVI analyzes the lip movements and in about two seconds returns its interpretation—“I need suction.”


It seems like a simple interaction, and in some respects, SRAVI (Speech Recognition App for the Voice Impaired) is still pretty simplistic. It can only recognize a few dozen phrases, and it does that with about 90 percent accuracy. But the app, which is made by the Irish startup Liopa, represents a massive breakthrough in the field of visual speech recognition (VSR), which involves training AI to read lips without any audio input. It will likely be the first lip-reading AI app available for public purchase.

Researchers have been working for decades to teach computers to lip-read, but it’s proven a challenging task even with the advances in deep learning systems that have helped crack other landmark problems. The research has been driven by a wide array of possible commercial applications—from surveillance tools to silent communication apps and improved virtual assistant performance.

Liopa is in the process of certifying SRAVI as a Class I medical device in Europe, and the company hopes to complete the certification by August, which will allow it to begin selling to healthcare providers.

Many of the tech giants are also working on lip-reading AI, though their intentions for the technology aren’t clear. Scientists affiliated with or working directly for Google, Huawei, Samsung, and Sony are all researching VSR systems and appear to be making rapid advances, according to interviews and Motherboard’s review of recently published research and patent applications. The companies either didn’t respond or declined to be interviewed for this story.

As lip-reading AI emerges as a viable commercial product, technologists and privacy watchdogs are increasingly worried about how it’s being developed and how it may one day be deployed. SRAVI, for example, is not the only application of lip-reading AI that Liopa is working on. The company is also in phase two of a project with a UK defense research agency to develop a tool that would allow law enforcement agencies to search through silent CCTV footage and identify when people say certain keywords.

Surveillance company Motorola Solutions has a patent for a lip-reading system designed to aid police. Skylark Labs, a startup whose founder has ties to the U.S. Defense Advanced Research Projects Agency (DARPA), told Motherboard that its lip-reading system is currently deployed in private homes and at a state-controlled power company in India to detect foul and abusive language.

“This is one of those areas, from my perspective, which is a good example of ‘just because we can do it, doesn’t mean we should,’” Fraser Sampson, the UK’s biometrics and surveillance camera commissioner, told Motherboard. “My principal concern in this area wouldn’t necessarily be what the technology could do and what it couldn’t do, it would be the chilling effect of people believing it could do what it says. If that then deterred them from speaking in public, then we’re in a much bigger area than simply privacy, and privacy is big enough.”

The emergence of lip-reading AI is reminiscent of facial recognition technology, which was a niche area of research for decades before it was quietly, but rapidly, commercialized as a surveillance tool beginning in the early 2000s.

Many of the problems with facial recognition have become public knowledge only within the last several years, due in large part to research and activism by people who were actively being harmed by it. The most prominent example is the landmark 2018 paper in which Joy Buolamwini and Timnit Gebru first revealed that facial recognition is less accurate for women and people of color.

By the time those concerns entered the mainstream discourse, facial recognition was ubiquitous in phones, private businesses, and the surveillance cameras nestled on street corners across many American cities. At least three Black men have been falsely arrested due to facial recognition—the real number is almost certainly higher—and the technology has been used to track Black Lives Matter protesters, among a variety of other questionable purposes. Over the past two years, and nearly 20 years after the first major public deployment of the technology, grassroots campaigns in more than a dozen cities and states have led to bans on police and private use of facial recognition.

The backlash against facial recognition is emblematic of a movement that’s driving a shift in thinking about how AI researchers should consider the future applications of their discoveries. The prestigious NeurIPS conference, for example, last year began requiring researchers for the first time to submit, alongside their papers, impact statements about how their findings might affect society.

“Research is terrific, but when we discover that a particular strand of knowledge or research has devastating consequences, then, as researchers, we have a responsibility to call a halt to it and implement policy changes,” Meredith Broussard, author of Artificial Unintelligence: How Computers Misunderstand the World, told Motherboard.

Lip-reading AI is still in its infancy as a commercial technology, but the early focus on surveillance is prompting concerns that the science is advancing so fast—and in some cases, behind closed corporate doors—that the consequences will once again become apparent too late.

“It is true that science moved too fast in the beginning, but in the last year there are several discussions in the published literature around ethical considerations for VSR technology,” said Stavros Petridis, who recently began working for Facebook but spoke to Motherboard about his previous research at Imperial College London. “Given that there are no commercial applications available yet, there are pretty good chances that this time ethical considerations will be taken into account before this technology is fully commercialized.”

Rodrigo Mira, a PhD candidate at Imperial College London (one of the leading groups studying lip-reading AI), told Motherboard that he and his colleagues “know that our field is controversial.” He compared the group’s work to penetration testing—the cybersecurity practice of finding vulnerabilities in computer systems in order to fix them. In other words, the research allows academic institutions bound by codes of ethics to discover new technologies before they can be deployed by bad actors like criminals.

“The main thing in AI is that people need to start talking about politics all the time,” Mira said. “It’s not about whether we should stop research, it’s that we have this power to make out what people are saying just by looking at them. What should we use it for? The way to stop [unethical uses of the technology] isn’t to shut down Imperial College. The way to deal with that is to deal with it as a political issue.”

AI ethicists agree that early and robust government regulation of biometric surveillance technologies like facial recognition and lip-reading AI is necessary to prevent discrimination and harm—but so far, many governments have failed to enact adequate laws. That’s why, they argue, researchers have a responsibility not just to consider the potential consequences of their work, but to proactively include the groups of people most likely to be harmed by the technology in their decision-making processes.

So far, experts say those considerations aren’t being made for visual speech recognition systems.

“This is about actively creating a technology that can be put to harmful uses rather than identifying and mitigating vulnerabilities in existing technology,” Sarah Myers West, a researcher for the AI Now Institute, told Motherboard. “Researchers aren’t always going to be well-placed to make these assessments on their own. That’s why it’s so important to involve the communities that are going to be affected by their research throughout the process to anticipate and mitigate potential harmful secondary uses.”

Liopa CEO Liam McQuillan told Motherboard that the company is at least a year away from having a system that can satisfactorily lip-read keywords from silent CCTV footage—a project that’s being funded by the UK’s Defence and Security Accelerator—and that the company has considered the possibility of a privacy backlash. “There may be concerns here that actually forbid the ultimate use of this technology. … We’re not betting Liopa, certainly, on this use case but it is providing funding.”

McQuillan also said that the company is proactively seeking to address the potential for racial or gender bias by training its algorithms on data collected from a diverse set of YouTube clips, volunteers who offer to contribute videos through a collection app, and a company that curates datasets specifically designed to include people from different races and ethnicities. The company has not yet published any research on how its systems perform across demographic groups.

Motherboard did find one company that claims to be actively selling lip-reading AI systems, and it has fully embraced the surveillance market. Amarjot Singh, founder and CEO of Skylark Labs, told Motherboard that the company initially pitched its technology suite—which also includes facial recognition and violence and weapon detection algorithms—to police agencies in India. But the company found little appetite for the lip-reading function given the challenges of deploying it in crowded public spaces.

Skylark has since pivoted to other uses. Singh said the company’s lip-reading AI technology is currently being piloted by the Punjab State Power Corporation Limited, a government-controlled utility, to detect instances of employees harassing each other. Several individuals have also purchased the technology to monitor their babysitters, he said.

Skylark says its lip-reading AI can detect about 50 different words associated with cursing, abuse, and violence. Singh has published research about violence detection and facial recognition, and Indian police have used Skylark’s drones to enforce social distancing, according to local media reports. But neither Singh nor the company has published any research on lip-reading AI.

Motherboard contacted the Punjab State Power Corporation Limited and an individual Singh said uses the technology at home, but did not receive responses before publication.

“We’re doing it in the wild and trying to solve use cases that have a direct implication for the safety of people,” Singh said. “I think there is merit since the designer can control the words the system should flag, so I think it’s still kind of OK. The risk over here is that once you start to calibrate the systems to pick up everyday speech in the wild, that is when it becomes very hairy [ethically].”

The researchers and company executives interviewed for this story told Motherboard that it will be years before lip-reading AI is advanced enough to interpret full conversations, if it ever happens at all.

The task is incredibly challenging—even expert human lip readers are actually pretty poor at word-for-word interpretation. In 2018, Google subsidiary DeepMind published research unveiling its latest full-sentence lip-reading system. The AI achieved a word error rate (the percentage of words it got wrong) of 41 percent on videos containing full sentences. Human lip readers viewing a similar sample of video-only clips had word error rates of 93 percent when given no context about the subject matter and 86 percent when given the video’s title, subject category, and several words in the sentence. That study was conducted using a large, custom-curated dataset.
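Strictly speaking, word error rate is a word-level edit distance rather than a simple percentage of misrecognized words: it counts the substitutions, deletions, and insertions needed to turn the system’s transcript into the reference, divided by the number of reference words. Below is a minimal Python sketch of that standard calculation; it is not Liopa’s or DeepMind’s evaluation code, and the example phrase is invented.

```python
# Minimal sketch of word error rate (WER): the word-level edit distance
# (substitutions + deletions + insertions) between a system transcript
# and a reference, divided by the reference length. It can exceed 1.0.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first j hypothesis words
    # into the first i reference words (Levenshtein distance).
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # all reference words deleted
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # all hypothesis words inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # match / substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution plus one deletion against a four-word reference: 0.5.
print(word_error_rate("i need suction now", "i need help"))  # 0.5
```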

The Imperial College London group presented a paper this month describing a full-sentence lip-reading system trained on a smaller, publicly available dataset of 400 hours of video; it achieves a word error rate as low as 37.9 percent.

When it comes to single-keyword lip reading—the kind of tool Liopa and Skylark Labs are pursuing—the accuracy is much higher and has seen significant improvements in just the last year. In 2017, the highest accuracy achieved on the benchmark Lip Reading in the Wild dataset was 83 percent. That mark stood essentially unchanged until 2020, when a number of groups in quick succession surpassed it. The record is currently 88.5 percent accuracy, achieved by the Imperial College London group in partnership with Samsung, according to a paper released this month.
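Those keyword numbers measure something simpler than sentence transcription: top-1 classification accuracy over a fixed vocabulary, where each short clip is labeled with a single target word and the model scores a hit when its best guess matches the label. A quick sketch of the metric, using invented predictions and labels rather than real benchmark output:

```python
# Sketch of top-1 accuracy, the metric behind keyword benchmarks such as
# Lip Reading in the Wild: the share of clips where the model's single
# best guess matches the labeled word. All data below is invented.

def top1_accuracy(predictions: list[str], labels: list[str]) -> float:
    assert len(predictions) == len(labels) and labels
    hits = sum(pred == label for pred, label in zip(predictions, labels))
    return hits / len(labels)

labels      = ["about", "police", "suction", "weather"]
predictions = ["about", "police", "section", "weather"]
print(top1_accuracy(predictions, labels))  # 0.75
```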

It’s hard to know what the true state of the art is, though. DeepMind—which many experts still consider the leading player in the field—hasn’t published any further research from its lip-reading program since the 2018 paper, and the company has declined to discuss that line of work.

Many of the researchers Motherboard spoke to were hesitant to speculate about what the big tech companies intend to do with this emerging technology, or where and when it will begin having noticeable effects on the broader public.

“One of the things the last 10 years in AI and [machine learning] have shown us is that there’s no way to predict the future in any meaningful way,” Mira said. “But it’s really unwise to underestimate things.”