Google flu-finding service diagnosed with 'big data hubris'

Bad data contagion overwhelms prediction service

A paper in Science claims that Google Flu Trends, unveiled back in 2008 to become a poster-child of Big Data, has one teeny, tiny, fatal flaw: it's almost always wrong.

The paper finds that not only did Flu Trends completely miss the 2009 swine flu, but for 100 of the 108 weeks since 2011, Google's predictions of influenza outbreaks have been simply wrong. The reason is simple: nearly everybody thinks the slightest sniffle means they have influenza.

That's quite at odds with how Google described the project when it first unveiled it. As noted by Johns Hopkins professor Steven Salzberg, the Chocolate Factory originally claimed that “we can accurately estimate the current level of weekly influenza activity in each region of the United States, with a reporting lag of about one day”.

As Salzberg notes, for the most recent week for which the Centers for Disease Control and Prevention (CDC) has published data, only 8.8 per cent of specimens sent for testing returned a positive for influenza.

Talking to the Science podcast, researcher David Lazer of Northeastern University in Boston (and lead author of the paper) suggests one problem is that people – including highly trained Oompa Loompas – love pattern matching, and that Google “overfit” the data.

“They ... overfit the data. They had fifty million search terms, and they found some that happened to fit the frequency of the 'flu' over the preceding decade or so, but really they were getting idiosyncratic terms that were peaking in the winter at the time the 'flu' peaks … but wasn't driven by the fact that people were actually sick with the 'flu',” he tells the podcast.
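The effect Lazer describes can be reproduced in miniature. The sketch below is not Google's method; it is a toy under loudly stated assumptions: the "flu" series and every candidate "search term" are pure random noise, and the names (`pearson`, `best_spurious_term`) and sizes (2,000 terms, 30 training weeks, 200 held-out weeks) are invented for illustration. It picks whichever term best matches the flu series over the training weeks, then checks how that same term fares on weeks it never saw.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def best_spurious_term(n_terms=2000, n_train=30, n_test=200, seed=0):
    """Select the noise series that best fits 'flu' in training,
    then score the same series on held-out weeks."""
    rng = random.Random(seed)
    total = n_train + n_test
    # Hypothetical flu signal: random noise, unrelated to any term.
    flu = [rng.gauss(0, 1) for _ in range(total)]

    best_r, best_series = -1.0, None
    for _ in range(n_terms):
        # Each candidate "search term" is independent random noise too.
        term = [rng.gauss(0, 1) for _ in range(total)]
        r = pearson(term[:n_train], flu[:n_train])
        if r > best_r:
            best_r, best_series = r, term

    # Evaluate the winning term on weeks that played no part in selection.
    test_r = pearson(best_series[n_train:], flu[n_train:])
    return best_r, test_r


train_r, test_r = best_spurious_term()
print(f"correlation on training weeks: {train_r:.2f}")
print(f"correlation on held-out weeks: {test_r:.2f}")
```

With enough candidates to choose from, something will always appear to track the target over a short window, even when nothing is related to it. The real case was subtler, since seasonal terms share a winter peak with the 'flu', but the selection problem is the same.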

Having noted that this caused errors, Google applied a fix to Flu Trends, but since 2011 the system has been overestimating the number of 'flu' cases.

Calling this "big data hubris", Lazer says there were "certain assumptions baked into the analysis that doomed it in the long run". For example, he says, Google Flu Trends assumed a stable relationship between search terms and the incidence of influenza, which hasn't been the case.

Google's own search algorithms, which route someone from a 'flu' search to a suitable product, also play a part, Lazer says. Google search creates a kind of feedback loop which Google Flu Trends mistakenly interprets as an outbreak. ®
