Software


What can The Simpsons teach us about stats algorithms? Glad you asked...

No, not the Simpsons – Simpson's Paradox

Wed 28 May 2014 // 08:31 UTC

They look alike

The clients are clearly different, the country may be different, so may the time of year, the weather and so on. It is worth remembering that when we combine data sets, we are (consciously or unconsciously) saying that we think they are similar enough to make combining them a sensible act. And really we have no evidence in this case that this is a reasonable assumption.

You can also see Simpson’s paradox in cases where percentages and/or actions are combined carelessly. Imagine two schools and two teachers. Sally works at school 1 and Brian at school 2, but they also occasionally teach at each other’s schools as part of a staff-sharing programme.

The teachers are rated by the percentage of the children they teach who pass their exams.

	Brian	Sally
School 1	90 per cent	88 per cent
School 2	75 per cent	60 per cent

Brian beats Sally in both schools. But, by now, you know what is going to happen. When we calculate the percentage for all children, the situation reverses and Sally beats Brian:

	Brian	Sally
Overall	75 per cent	87 per cent

In this case the paradox arises because the schools themselves have very different pass rates and the teachers teach very different numbers of children in the two schools.

		Brian	Brian	Sally	Sally
		Pass	Fail	Pass	Fail
School 1	Number of children	9	1	700	100
	Percentages	90	10	87.5	12.5
School 2	Number of children	600	200	6	4
	Percentages	75	25	60	40
Combined	Number of children	609	201	706	104
	Percentages	75	25	87	13
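The reversal can be checked directly from the raw counts. The following is a minimal Python sketch, not part of the original article: the pass and fail counts are copied from the table above, while the names `counts`, `pass_rate`, `per_school` and `combined` are illustrative choices.

```python
# Pass/fail counts copied from the table above, as (passes, fails).
counts = {
    "Brian": {"School 1": (9, 1), "School 2": (600, 200)},
    "Sally": {"School 1": (700, 100), "School 2": (6, 4)},
}


def pass_rate(passes, fails):
    """Return the percentage of children who passed."""
    return 100 * passes / (passes + fails)


# Pass rate for each teacher within each school.
per_school = {
    teacher: {school: pass_rate(*pf) for school, pf in schools.items()}
    for teacher, schools in counts.items()
}

# Pass rate for each teacher after pooling the children from both schools.
combined = {
    teacher: pass_rate(
        sum(passes for passes, _ in schools.values()),
        sum(fails for _, fails in schools.values()),
    )
    for teacher, schools in counts.items()
}

for teacher in counts:
    for school, rate in per_school[teacher].items():
        print(f"{teacher} at {school}: {rate:.1f} per cent")
    print(f"{teacher} combined: {combined[teacher]:.1f} per cent")
```

Run as written, this shows Brian ahead within each school and Sally ahead once the children are pooled, which is the reversal the tables describe.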

Brian achieves 90 per cent in school 1, but he only teaches ten children there so that percentage is based on a very small sample. He could easily have achieved only, say, 30 per cent had he taught a different set of children, but he lucked out. Likewise Sally taught only a few children in school 2 and perhaps she was unlucky in her sample.

Both the large and small samples are being used to produce the same kind of value (a percentage), but when we combine the two samples for each teacher, the samples with high numbers of students overwhelm the samples with low numbers. That is why, in each case, the percentages of the combined sets are very close to the percentages for the large samples.
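One way to see the "overwhelming" effect is to treat each combined percentage as an average of the two school percentages, weighted by the number of children taught. The sketch below is an illustration under that framing rather than anything from the original article: `weighted_rate` and `plain_mean` are hypothetical helper names, and the class sizes (10 and 800) come from adding the pass and fail counts in the table.

```python
def weighted_rate(rates_and_sizes):
    """Average the percentages, weighting each by its number of children."""
    total = sum(size for _, size in rates_and_sizes)
    return sum(rate * size for rate, size in rates_and_sizes) / total


def plain_mean(rates_and_sizes):
    """Average the percentages while ignoring how many children each covers."""
    return sum(rate for rate, _ in rates_and_sizes) / len(rates_and_sizes)


# (pass percentage, number of children) for School 1 and School 2.
brian = [(90.0, 10), (75.0, 800)]
sally = [(87.5, 800), (60.0, 10)]

print(f"Brian weighted: {weighted_rate(brian):.1f}, unweighted: {plain_mean(brian):.1f}")
print(f"Sally weighted: {weighted_rate(sally):.1f}, unweighted: {plain_mean(sally):.1f}")
```

The weighted figures reproduce the combined row of the table, each sitting close to the percentage from that teacher's 800-child sample. The unweighted means, which give the ten-child samples as much say as the 800-child ones, leave Brian ahead instead.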

Simpson’s paradox is a well-known and well-studied problem. Although it had been identified previously, it was a British statistician, Edward H Simpson, who brought it to prominence with a 1951 paper**. And – quite worryingly – you might encounter Simpson’s Paradox more often than you think, according to a paper by Marios G Pavlides and Michael D Perlman.

Once you start to combine data sets you are almost certain to run into this paradox. As a general take-home rule, it’s a good idea to dissuade people from combining data from non-identical sources unless a great deal is known about all the factors that are involved.

As for whether you're damned if you do and damned if you don't? You'll have to find out for yourself. ®

** “The interpretation of interaction in contingency tables”, Journal of the Royal Statistical Society (Series B), 13: 238–241
