Revealed: How Nvidia's 'backseat driver' AI learned to read lips

Driving assistant gives self-drivers a bit of Lip(Net)

Yearning for a driving companion who listens when you talk?

When Nvidia popped the bonnet on its Co-Pilot "backseat driver" AI at this year’s Consumer Electronics Show, most onlookers were struck by its ability to lip-read while tracking the actions of CES-going "motorists" inside the "car".

A slide taken at CES shows the Co-Pilot AI assistant demonstrating four features: facial recognition, head tracking, gaze tracking and lip-reading.

The automotive AI is part of the GPU-flinger's DRIVE PX 2 platform, which uses sensors and multiple neural networks powered by the grunt of Nvidia's processors.

An Nvidia spokesperson has since confirmed in an email to The Register that the lip-reading component was based on a research paper [PDF] written by academics from the University of Oxford, Google DeepMind and the Canadian Institute for Advanced Research.

"We are really happy to see LipNet in such an application and is the proof that our novel architecture is scalable to real-world problems," the research team added in an email to El Reg.

"Machine lip readers have enormous practical potential, with applications in speech recognition in noisy environments such as cars, improved hearing aids, silent dictation in public spaces (Siri will never have to hear your voice again), covert conversations, biometric identification, and silent-movie processing."

The paper was initially criticised. Although the neural network, LipNet, had an impressive accuracy rate of 93.4 per cent, it was only tested on a limited dataset of words and not coherent sentences. We're told LipNet was later retrained using a dataset of the 22 drivers to improve it.

"Since it is ongoing research we cannot disclose error rates," the LipNet team said of the retrained model. "But we can say that after less than a day of training, the performance was as good as expected."

Increasing the amount of useful training data improves AI models. For example, a second paper, unofficially published on arXiv by another team at Oxford, demonstrated that a better AI-based lip-reading system is possible. It could decipher complete sentences after it had been trained to watch the speech movements of BBC News presenters for several hours.

Nvidia’s Co-Pilot assistant shows LipNet has progressed further to pick up the spoken commands of drivers so it can process instructions such as choosing a song to play, even when loud music is already thumping in the background.

The head- and gaze-tracking and facial recognition capabilities were developed to provide better security and a safer driving experience, said Nvidia.

“[There is] an AI for face recognition, so the car knows who you are, setting personal preferences and eliminating the need for a key. An AI for gaze detection, so your car knows if you’re paying attention,” Nvidia wrote in a blog post.

Nvidia is mostly known for designing powerful GPUs for gaming and HPC but has lately been putting more of its efforts towards GPU-accelerated machine learning and AI.

Mercedes, Audi, Tesla and Toyota are current customers of the new technology, an Nvidia spokesperson confirmed to The Register. ®
