What can The Simpsons teach us about stats algorithms? Glad you asked...
No, not the Simpsons – Simpson's Paradox
The clients are clearly different, the country may be different, so may the time of year, the weather and so on. It is worth remembering that when we combine data sets, we are (consciously or unconsciously) saying that we think they are similar enough to make combining them a sensible act. And really we have no evidence in this case that this is a reasonable assumption.
You can also see Simpson’s paradox in cases where percentages and/or actions are combined carelessly. Imagine two schools and two teachers. Sally works at school 1 and Brian at school 2, but they also occasionally teach at each other’s schools as part of a staff-sharing programme.
The teachers are rated by the percentage of exam passes among the children they teach.
|          | Brian       | Sally       |
|----------|-------------|-------------|
| School 1 | 90 per cent | 88 per cent |
| School 2 | 75 per cent | 60 per cent |
Brian beats Sally in both schools. But, by now, you know what is going to happen. When we calculate the percentage for all children, the situation reverses and Sally beats Brian:
|         | Brian       | Sally       |
|---------|-------------|-------------|
| Overall | 75 per cent | 87 per cent |
In this case the paradox arises because the schools themselves have very different pass rates and the teachers teach very different numbers in the two schools.
| Number of children | Brian passed | Brian failed | Sally passed | Sally failed |
|--------------------|--------------|--------------|--------------|--------------|
| School 1           | 9            | 1            | 700          | 100          |
| School 2           | 600          | 200          | 6            | 4            |
| Combined           | 609          | 201          | 706          | 104          |
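Under the hood this is just arithmetic on the counts. A minimal Python sketch, using the (passed, failed) figures straight from the table above, shows the reversal:

```python
# Raw (passed, failed) counts for each teacher at each school,
# taken from the table above.
brian = {"School 1": (9, 1), "School 2": (600, 200)}
sally = {"School 1": (700, 100), "School 2": (6, 4)}

def pass_rate(passed, failed):
    """Percentage of children who passed."""
    return 100 * passed / (passed + failed)

# Per school: Brian's rate is higher both times.
for school in ("School 1", "School 2"):
    print(school, pass_rate(*brian[school]), pass_rate(*sally[school]))

# Combined: the ordering flips.
brian_all = pass_rate(9 + 600, 1 + 200)   # 609/810, about 75.2
sally_all = pass_rate(700 + 6, 100 + 4)   # 706/810, about 87.2
print(round(brian_all, 1), round(sally_all, 1))
```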
Brian achieves 90 per cent in school 1, but he only teaches ten children there so that percentage is based on a very small sample. He could easily have achieved only, say, 30 per cent had he taught a different set of children, but he lucked out. Likewise Sally taught only a few children in school 2 and perhaps she was unlucky in her sample.
Both the large and small samples are being used to produce the same kind of value (a percentage), but when we combine the two samples for each teacher, the samples with high numbers of students overwhelm the samples with low numbers. That is why, in each case, the percentage for the combined set is very close to the percentage for the large sample.
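Put another way, each combined percentage is a weighted average of the two per-school percentages, with weights proportional to class size. A short sketch, using the class sizes and rates from the example above:

```python
def weighted_rate(sizes, rates):
    """Weighted average of per-school pass rates, weighted by class size."""
    return sum(n * r for n, r in zip(sizes, rates)) / sum(sizes)

# Brian: 10 children at 90 per cent, 800 at 75 per cent.
brian_combined = weighted_rate([10, 800], [90.0, 75.0])
# Sally: 800 children at 87.5 per cent (88 rounded), 10 at 60 per cent.
sally_combined = weighted_rate([800, 10], [87.5, 60.0])

# Each combined figure sits almost on top of the large-sample figure.
print(round(brian_combined, 1), round(sally_combined, 1))
```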
Simpson’s paradox is a well-known and well-studied problem and, whilst it had been identified previously, it was a British statistician, Edward H Simpson, who brought it to prominence with his 1951 paper**. And – quite worryingly – you might encounter Simpson’s Paradox more often than you think, according to Marios G Pavlides and Michael D Perlman.
Once you start to combine data sets you are almost certain to run into this paradox. As a general take-home rule, it’s a good idea to dissuade people from combining data from non-identical sources unless a great deal is known about all the factors that are involved.
As for whether you're damned if you do and damned if you don't? You'll have to find out for yourself. ®
** “The interpretation of interaction in contingency tables” in the Journal of the Royal Statistical Society (Series B), 13: 238–241