## No, not the Simpsons – Simpson's Paradox

### They look alike

The clients are clearly different, the country may be different, so may the time of year, the weather and so on. It is worth remembering that when we combine data sets, we are (consciously or unconsciously) saying that we think they are similar enough to make combining them a sensible act. And really we have no evidence in this case that this is a reasonable assumption.

You can also see Simpson’s paradox in cases where percentages and/or actions are combined carelessly. Imagine two schools and two teachers. Sally works at school 1 and Brian at school 2, but they also occasionally teach at each other’s schools as part of a staff-sharing programme.

The teachers are rated by the percentage exam passes of the children they teach.

| | Brian | Sally |
|---|---|---|
| School 1 | 90 per cent | 88 per cent |
| School 2 | 75 per cent | 60 per cent |

Brian beats Sally in both schools. But, by now, you know what is going to happen. When we calculate the percentage for all children, the situation reverses and Sally beats Brian:

| | Brian | Sally |
|---|---|---|
| Overall | 75 per cent | 87 per cent |

In this case the paradox arises because the schools themselves have very different pass rates and the teachers teach very different numbers in the two schools.

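| | | Brian: pass | Brian: fail | Sally: pass | Sally: fail |
|---|---|---|---|---|---|
| School 1 | Number of children | 9 | 1 | 700 | 100 |
| | Percentages | 90 | 10 | 88 | 13 |
| School 2 | Number of children | 600 | 200 | 6 | 4 |
| | Percentages | 75 | 25 | 60 | 40 |
| Combined | Number of children | 609 | 201 | 706 | 104 |
| | Percentages | 75 | 25 | 87 | 13 |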
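The reversal can be checked directly from the counts in the table. What follows is a minimal Python sketch rather than anything from the original analysis; the `counts` dictionary and the `pass_rate` and `combined_rate` helpers are names made up here for illustration.

```python
# Pass/fail counts copied from the table above: (passes, fails).
counts = {
    "Brian": {"School 1": (9, 1), "School 2": (600, 200)},
    "Sally": {"School 1": (700, 100), "School 2": (6, 4)},
}


def pass_rate(passes, fails):
    """Return the pass rate as a percentage."""
    return 100 * passes / (passes + fails)


def combined_rate(teacher):
    """Pool a teacher's children across both schools, then take the rate."""
    passes = sum(p for p, _ in counts[teacher].values())
    fails = sum(f for _, f in counts[teacher].values())
    return pass_rate(passes, fails)


# Within each school Brian comes out ahead...
for school in ("School 1", "School 2"):
    brian = pass_rate(*counts["Brian"][school])
    sally = pass_rate(*counts["Sally"][school])
    print(f"{school}: Brian {brian:.1f}%, Sally {sally:.1f}%")

# ...but once the children are pooled, Sally does.
print(f"Combined: Brian {combined_rate('Brian'):.1f}%, "
      f"Sally {combined_rate('Sally'):.1f}%")
```

Run as written, this should show Brian ahead in each school and Sally ahead once the children are pooled, matching the rounded figures in the table.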

Brian achieves 90 per cent in school 1, but he only teaches ten children there so that percentage is based on a very small sample. He could easily have achieved only, say, 30 per cent had he taught a different set of children, but he lucked out. Likewise Sally taught only a few children in school 2 and perhaps she was unlucky in her sample.

Both the large and small samples are being used to produce the same kind of value (a percentage) but when we combine the two samples for each teacher, the samples with the high numbers of students overwhelm the samples with low numbers. That is why, in each case, the percentage for the combined set is very close to the percentage for the large sample.
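One way to see the overwhelming effect is to treat the combined percentage as a weighted average of the per-school percentages, with each school weighted by its share of the teacher's children. The sketch below does this for Brian only; the variable names are illustrative assumptions, and the figures are the ones given in the article.

```python
# (children taught, pass rate in per cent) for Brian at each school.
brian = {"School 1": (10, 90.0), "School 2": (800, 75.0)}

total = sum(n for n, _ in brian.values())

combined = 0.0
for school, (n, rate) in brian.items():
    weight = n / total  # share of Brian's children taught at this school
    combined += weight * rate
    print(f"{school}: weight {weight:.3f}, rate {rate:.0f}%")

print(f"Combined: {combined:.1f}%")
```

School 1 accounts for only ten of Brian's 810 children, so its 90 per cent barely moves the combined figure away from the 75 per cent he achieved in school 2.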

Simpson’s paradox is a well-known and well-studied problem. Although it had been identified previously, it was the British statistician Edward H Simpson who brought it to prominence with his 1951 paper**. And – quite worryingly – you might encounter Simpson’s Paradox more often than you think, according to Marios G Pavlides and Michael D Perlman.

Once you start to combine data sets you are almost certain to run into this paradox. As a general take-home rule, it’s a good idea to dissuade people from combining data from non-identical sources unless a great deal is known about all the factors that are involved.

As for whether you're damned if you do and damned if you don't? You'll have to find out for yourself. ®

** “The interpretation of interaction in contingency tables”, Journal of the Royal Statistical Society (Series B), 13: 238–241.