What can The Simpsons teach us about stats algorithms? Glad you asked...

No, not the Simpsons – Simpson's Paradox

In The Simpsons, when his class is asked to give an example of a paradox, Bart offers: "You're damned if ya' do, and you're damned if ya' don't."

The dictionary defines a paradox as a seemingly absurd or contradictory statement that might nonetheless prove to be true. Just such a seemingly contradictory situation can arise when you're collecting and comparing data.

It's a rather subtle problem, something that can quite easily catch you out, and it's called Simpson’s Paradox*.

Imagine you work at an advertising agency and are closely involved in an ad campaign for client A. You measure the amount spent on the campaign in specific areas and also the effect it has (increased sales of product Z, predictably).

At the end of the campaign you plot advertising spend against sales and produce a graph with a lovely line showing a positive correlation: the more your client spends with you, the greater the effect of the advertising. Let’s take it as read that not only is there a correlation (as one number increases, so does the other), but that causation is also at work. That is, you also prove that the act of spending more on advertising really is directly causing the product to sell better.

You move on to work for client B, one with deeper pockets, and run another ad campaign. Again you measure spend and effect and again you produce a genuinely accurate plot showing a positive correlation; and again you can prove causation.

Now you’re pitching to client C and want to show that spending more gives better results, so what better illustration than to combine the results from the previous campaigns (anonymised, of course) into one show-stopping graph in your presentation? You plug in the numbers, create the graph and – aargh! Suddenly you have a negative correlation, showing that spending more money on advertising has a negative effect on sales.

How can this be? Two positives can’t combine into a negative, surely? And that’s Simpson’s Paradox in a nutshell.
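The effect is easy to reproduce with made-up numbers. The following is a minimal Python sketch, not real campaign data: the spend and sales figures for clients A and B are invented purely for illustration, and `pearson` is a small helper written here to compute the ordinary Pearson correlation coefficient.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical figures (arbitrary units), invented for this example.
# Client A: low spend, high sales. Client B: high spend, low sales.
spend_a, sales_a = [1, 2, 3, 4, 5], [10, 12, 11, 13, 14]
spend_b, sales_b = [11, 12, 13, 14, 15], [2, 4, 3, 5, 6]

r_a = pearson(spend_a, sales_a)            # within client A
r_b = pearson(spend_b, sales_b)            # within client B
r_pooled = pearson(spend_a + spend_b,      # both clients lumped together
                   sales_a + sales_b)

print(f"client A: {r_a:+.2f}")
print(f"client B: {r_b:+.2f}")
print(f"pooled:   {r_pooled:+.2f}")
```

With these invented figures, each client's correlation comes out positive while the pooled correlation comes out negative, which is exactly the reversal described above.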

To resolve the paradox it helps if we first colour the data points from each data set, so we can see what’s going on.

The first data set had a relatively low spend and a high return; the second had a much higher spend and, on average, a much lower return. To illustrate why this causes the negative correlation when they are combined, we can calculate an “average” point for each set of data. (I am not recommending this as a statistical test, merely a way of illustrating what is going on.)

Now you can see that across the two data sets, as average spend increases, the average return actually drops. This is because the average spend of the second set is higher than that of the first, while its average return is lower. This holds even though, as we already know, within each set the return increases as spend increases.
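The “average point” idea can be sketched in a few lines of Python. As before, the numbers are hypothetical and chosen only to mimic the shape of the two campaigns; `average_point` is an illustrative helper, and the line through the two averages is a way of seeing the effect, not a statistical test.

```python
def average_point(spend, sales):
    """Return the (mean spend, mean sales) point for one data set."""
    return sum(spend) / len(spend), sum(sales) / len(sales)

# Hypothetical figures (arbitrary units), invented for this example.
spend_a, sales_a = [1, 2, 3, 4, 5], [10, 12, 11, 13, 14]      # client A
spend_b, sales_b = [11, 12, 13, 14, 15], [2, 4, 3, 5, 6]      # client B

avg_a = average_point(spend_a, sales_a)
avg_b = average_point(spend_b, sales_b)

# Slope of the straight line joining the two average points:
# change in average sales divided by change in average spend.
slope_between = (avg_b[1] - avg_a[1]) / (avg_b[0] - avg_a[0])

print("average point A:", avg_a)
print("average point B:", avg_b)
print("slope between averages:", slope_between)
```

Client B's average point sits further right (more spend) and lower down (less return) than client A's, so the line joining them slopes downwards. The pooled data follows that downward trend even though each set, taken alone, trends upwards.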

The next question might be “Well, why are these averages so different?”

The answer is that while the two campaigns have certain similarities (we know the advertising spend and we can measure the effect), in reality a great number of other factors may be very different.

* The eponymous Simpson here is actually Brit statistician Edward Simpson – more on him on page two
