2016. AI boffins picked a hell of a year to train a neural net by making it watch the news
Lipreading software must think us humans are maniacs
LipNet, the lipreading network developed by researchers at the University of Oxford and DeepMind, can now lipread from TV shows better than professional lipreaders.
The first LipNet paper, currently under review for the International Conference on Learning Representations (ICLR) 2017, a machine learning conference, was criticised for testing LipNet's accuracy on a limited dataset. The GRID corpus consists of sentences that follow a strict, artificial word order and make no sense on their own.
The second paper, released on arXiv, is a better test of a machine's lipreading abilities, as it evaluates the system on hours of speech from speakers on the BBC's News, Question Time, Breakfast and Newsnight TV shows.
It’s an “open-world problem,” where sentences are unconstrained in content and length, and representative of natural human speech.
The Watch, Listen, Attend and Spell (WLAS) network has a lower word accuracy rate than LipNet – 46.8 per cent compared to 93.4 per cent – but it is a more complex system tackling a much harder task.
It works by pairing image and audio encoders, which focus on the mouth movements of a talking face and predict the characters being spoken, with a character decoder that spells out the words.
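The encoder-decoder shape described above can be sketched in a few lines. This is a toy illustration only: the array sizes, random features and simple dot-product attention are assumptions for the sketch, not the paper's actual architecture.

```python
import numpy as np

# Hypothetical sizes for illustration, not the paper's configuration.
T_video, T_audio, d = 6, 8, 4   # timesteps per stream, feature dimension
rng = np.random.default_rng(0)

# Stand-ins for the two encoders' outputs: mouth-crop features and audio features.
video_feats = rng.normal(size=(T_video, d))
audio_feats = rng.normal(size=(T_audio, d))

def attend(query, memory):
    """Dot-product attention: weight encoder states by similarity to the decoder state."""
    scores = memory @ query                  # one score per timestep
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax over timesteps
    return weights @ memory                  # context vector, shape (d,)

# The character decoder attends to each stream, then combines the two contexts;
# a real model would map `context` to a probability distribution over characters.
decoder_state = rng.normal(size=d)
context = np.concatenate([attend(decoder_state, video_feats),
                          attend(decoder_state, audio_feats)])
```

The point of the attention step is that the decoder can look back at different video or audio timesteps for each character it emits, rather than compressing the whole clip into one vector.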
Researchers sampled the TV shows to create the Lip Reading Sentences (LRS) dataset for visual speech recognition. The WLAS network analysed the speech movements from the LRS dataset, which contains over 100,000 natural sentences and 17,428 words.
Top row: Original still images from the BBC lipreading dataset – News, Question Time, Breakfast, Newsnight (from left to right). Bottom row: The mouth motions for ‘afternoon’ from two different speakers. The network sees the areas inside the red squares. (Photo credit: University of Oxford and Google Deepmind)
WLAS, like LipNet, still requires a lot of training, and only a small part of the LRS dataset was used to test it. Of the 17,428 words, 6,882 appeared in the test set, and 6,253 of those had already been encountered during training and validation.
The model was trained from scratch. When the researchers tried to train the WLAS network on whole sentences, learning was too slow and the network struggled to extract all the relevant information, so the sentences had to be broken down into single words.
As the machine learns more words over time, it can gradually piece them together to form sequences, and then the sentences which are in the dataset.
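That staged approach – single words first, then progressively longer sequences – can be sketched as a simple curriculum. The sentences and the fragment-growing rule here are invented for illustration; the paper's actual schedule will differ.

```python
# Hypothetical curriculum: at stage k, train on fragments of at most k words.
sentences = [
    "good afternoon",
    "welcome to the programme",
    "thank you for watching",
]

def curriculum(sentences, stage):
    """Return training targets of at most `stage` consecutive words."""
    fragments = []
    for s in sentences:
        words = s.split()
        for i in range(len(words)):
            fragments.append(" ".join(words[i:i + stage]))
    return fragments

stage1 = curriculum(sentences, 1)  # single words only
stage3 = curriculum(sentences, 3)  # short phrases up to three words
```

Early stages give the network many short, easy targets; later stages reuse the same data as longer sequences, so it learns to chain words it already recognises.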
From lipreading alone, WLAS's word error rate of 53.2 per cent is considerably better than the professional lipreaders' 73.8 per cent. Lipreading is a difficult task because of homophenes – words that sound different but look the same when spoken. It can be hard, for instance, to tell whether a word begins with a 'p' or a 'b', since both produce the same mouth shape.
The system struggles to deal with noise, however, and when it has to transcribe audio clips alone, the word error rate shoots up to 74.5 per cent. When audio and lipreading are combined, the word error rate drops to 50.8 per cent.
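For readers unfamiliar with the metric quoted throughout: word error rate is the standard speech-recognition score, computed as the word-level edit distance between what was said and what the system produced, divided by the length of the reference. A minimal implementation:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    via a standard Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One wrong word out of four gives 25 per cent WER.
wer = word_error_rate("good afternoon and welcome", "good afternoon and farewell")
```

So WLAS's 53.2 per cent means roughly half the words in its lipread transcripts need correcting – poor by dictation standards, but well ahead of the human professionals' 73.8 per cent.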
A lipreading neural network has many advantages. The researchers are interested in developing better hearing aids – but maybe it could be used in more sinister ways, such as listening in on private conversations captured by CCTV cameras. ®