Mozilla releases voice dataset and transcription engine

Baidu's Deep Speech with TensorFlow under the covers

By Richard Chirgwin

Posted in Emergent Tech, 30th November 2017 05:02 GMT

Mozilla has revealed an open speech dataset and a TensorFlow-based transcription engine.

Mozilla floated "Project Common Voice" back in July 2017, when it called for volunteers to either submit samples of their speech or check machine transcriptions of others' utterances.

The project has since collected 500 hours of samples (in the longer term, Common Voice wants 10,000 hours), comprising 400,000 recordings made by 20,000 people.

The project's Michael Henretty wrote that “most of us only have access to fairly limited collection of voice data; an essential component for creating high-quality speech recognition engines”. Even limited non-free data sets cost “upwards of tens of thousands of dollars”.

Mozilla's Sean White wrote that the job of extending Common Voice beyond English will begin in the first half of 2018.

Common Voice is available for download here, and if developers need more open source speech datasets, Mozilla helpfully links four other sets it was able to identify: LibriSpeech, the TED-LIUM Corpus, VoxForge, and Tatoeba.

Mozilla also announced an associated transcription effort based on Baidu's Deep Speech speech recognition project. Mozilla's Deep Speech “uses Google's TensorFlow project to make the implementation easier”, and claims a 6.5 per cent word error rate on the LibriSpeech test-clean dataset.
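
For readers unfamiliar with the metric, here is a minimal sketch of how an error rate of this kind is usually computed, assuming the quoted figure is a word error rate (WER): the word-level edit distance between the reference transcript and the engine's output, divided by the number of reference words. The function name and sample sentences are illustrative only and are not taken from Mozilla's code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name), not part of Deep Speech.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far (minus the current one) and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("open the pod bay", "open the pot bay"))
```

On that reading, a 6.5 per cent figure means roughly 6.5 word-level mistakes (substitutions, insertions, or deletions) per 100 reference words.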

Mozilla Deep Speech offers pre-built Python and Node.js packages and a command line binary.

In this post at Mozilla Hacks, Reuben Morais described Deep Speech as “an end-to-end trainable, character-level, deep recurrent neural network (RNN) … It can be trained using supervised learning from scratch, without any external 'sources of intelligence', like a grapheme to phoneme converter or forced alignment on the input.”

As Morais noted, with 120 million parameters in the Deep Speech model, the group needed one machine with four Titan X Pascal GPUs, and two more servers with eight of the GPUs each.

The result of all that work is that on a GPU-equipped MacBook Pro, Deep Speech can transcribe a little over three seconds of audio per second of processing. With just a CPU, a second of audio takes around 1.4 seconds to transcribe. ®
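
Those two speeds can be expressed as real-time factors, meaning processing time divided by audio duration. The sketch below assumes a factor of about 0.3 for the GPU case (inferred from "a little over three seconds of audio per second") and 1.4 for the CPU case; the function names are illustrative.

```python
def processing_seconds(audio_seconds: float, real_time_factor: float) -> float:
    """Time needed to transcribe a clip at a given real-time factor."""
    return audio_seconds * real_time_factor


def audio_per_processing_second(real_time_factor: float) -> float:
    """Seconds of audio transcribed per second of processing."""
    return 1.0 / real_time_factor


if __name__ == "__main__":
    GPU_RTF = 0.3  # assumed, inferred from the article's GPU figure
    CPU_RTF = 1.4  # from the article's CPU figure

    clip = 60.0  # a one-minute recording
    print(processing_seconds(clip, GPU_RTF))
    print(processing_seconds(clip, CPU_RTF))
    print(audio_per_processing_second(GPU_RTF))
```

A factor below 1 means the engine keeps up with live speech; above 1, it falls behind.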
