What can The Simpsons teach us about stats algorithms? Glad you asked...

No, not the Simpsons – Simpson's Paradox

They look alike

The clients are clearly different, the country may be different, so may the time of year, the weather and so on. It is worth remembering that when we combine data sets, we are (consciously or unconsciously) saying that we think they are similar enough to make combining them a sensible act. And really we have no evidence in this case that this is a reasonable assumption.

You can also see Simpson’s paradox in cases where percentages and/or actions are combined carelessly. Imagine two schools and two teachers. Sally works at school 1 and Brian at school 2, but they also occasionally teach at each other’s schools as part of a staff-sharing programme.

The teachers are rated by the percentage of exam passes among the children they teach.

            Brian         Sally
School 1    90 per cent   88 per cent
School 2    75 per cent   60 per cent

Brian beats Sally in both schools. But, by now, you know what is going to happen. When we calculate the percentage for all children, the situation reverses and Sally beats Brian:

            Brian         Sally
Overall     75 per cent   87 per cent

In this case the paradox arises because the schools themselves have very different pass rates and the teachers teach very different numbers in the two schools.

                                  Brian           Sally
                                  Pass    Fail    Pass    Fail
School 1   Number of children        9       1     700     100
           Percentages              90      10      88      13
School 2   Number of children      600     200       6       4
           Percentages              75      25      60      40
Combined   Number of children      609     201     706     104
           Percentages              75      25      87      13
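The reversal can be checked directly from the counts in the table. The following is a minimal Python sketch, not anything from the original analysis; the names `counts`, `pass_rate` and `combined_rate` are made up for illustration.

```python
# (passes, fails) per teacher and school, copied from the table above.
counts = {
    "Brian": {"School 1": (9, 1), "School 2": (600, 200)},
    "Sally": {"School 1": (700, 100), "School 2": (6, 4)},
}


def pass_rate(passes, fails):
    """Pass rate as a percentage of all children taught."""
    return 100.0 * passes / (passes + fails)


def combined_rate(per_school):
    """Pool the raw counts across schools, then take one percentage."""
    total_pass = sum(p for p, _ in per_school.values())
    total_fail = sum(f for _, f in per_school.values())
    return pass_rate(total_pass, total_fail)


for teacher, per_school in counts.items():
    for school, (p, f) in per_school.items():
        print(f"{teacher}, {school}: {pass_rate(p, f):.1f} per cent")
    print(f"{teacher}, combined: {combined_rate(per_school):.1f} per cent")
```

Brian is ahead within each school, yet the pooled figure puts Sally ahead, because the pooling is done on the raw counts rather than on the percentages.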

Brian achieves 90 per cent in school 1, but he only teaches ten children there so that percentage is based on a very small sample. He could easily have achieved only, say, 30 per cent had he taught a different set of children, but he lucked out. Likewise Sally taught only a few children in school 2 and perhaps she was unlucky in her sample.
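One rough way to see how much less a ten-child percentage tells us than an 800-child one is to put an interval around each. The sketch below uses the Wilson score interval with z = 1.96 (roughly 95 per cent); that choice, and the assumption that each class behaves like an independent random sample of children, are ours rather than the article's. The function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(passes, n, z=1.96):
    """Approximate Wilson score interval for a pass rate (as fractions)."""
    p = passes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Brian's two samples: 9 of 10 in school 1, 600 of 800 in school 2.
for label, passes, n in [("Brian, school 1", 9, 10),
                         ("Brian, school 2", 600, 800)]:
    low, high = wilson_interval(passes, n)
    print(f"{label}: {100 * low:.0f} to {100 * high:.0f} per cent")
```

The interval for the ten-child class comes out several times wider than the one for the 800-child class, which is the sense in which the small-sample percentage is the less trustworthy of the two.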

Both the large and small samples are being used to produce the same kind of value (a percentage) but when we combine the two samples for each teacher, the samples with the high numbers of students overwhelm the samples with low numbers, which is why, in each case, the percentages of the combined sets are very close to the percentages for the large samples.
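Put another way, each combined percentage is a weighted average of the per-school percentages, with weights proportional to the number of children taught. A short Python sketch of that arithmetic follows; the helper names `weights` and `weighted_rate` are illustrative, and the exact 87.5 per cent is used for Sally's school 1 figure (700 out of 800) rather than the rounded 88.

```python
def weights(samples):
    """Share of a teacher's children that falls in each sample."""
    total = sum(n for _, n in samples)
    return [n / total for _, n in samples]


def weighted_rate(samples):
    """Average of per-school rates, weighted by children taught."""
    return sum(w * rate for w, (rate, _) in zip(weights(samples), samples))


# (pass rate in per cent, number of children taught)
brian = [(90.0, 10), (75.0, 800)]
sally = [(87.5, 800), (60.0, 10)]

for name, samples in [("Brian", brian), ("Sally", sally)]:
    shares = ", ".join(f"{w:.3f}" for w in weights(samples))
    print(f"{name}: weights {shares}, combined {weighted_rate(samples):.1f} per cent")
```

With about 99 per cent of the weight sitting on the 800-child sample, the ten-child result barely moves the combined figure for either teacher.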

Simpson’s paradox is a well-known and well-studied problem and, whilst it had been identified previously, it was the British statistician Edward Simpson who brought it to prominence with a 1951 paper**. And, quite worryingly, you might encounter Simpson’s paradox more often than you think, according to Marios G Pavlides and Michael D Perlman.

Once you start to combine data sets you are almost certain to run into this paradox. As a general take-home rule, it’s a good idea to dissuade people from combining data from non-identical sources unless a great deal is known about all the factors that are involved.

As for whether you're damned if you do and damned if you don't? You'll have to find out for yourself. ®

** E H Simpson (1951), “The interpretation of interaction in contingency tables”, Journal of the Royal Statistical Society (Series B), 13: 238–241.
