What can The Simpsons teach us about stats algorithms? Glad you asked...
No, not the Simpsons – Simpson's Paradox
Big Data's Big 5

When his class is asked to give an example of a paradox in The Simpsons, Bart offers: "You're damned if ya' do, and you're damned if ya' don't."
The dictionary defines a paradox as a seemingly absurd or contradictory statement that may nevertheless prove to be true. When it comes to data, just such a seemingly contradictory situation can arise when you're collecting and comparing data sets.
It's a rather subtle problem, something that can quite easily catch you out, and it's called Simpson’s Paradox*.
Imagine you work at an advertising agency and are closely involved in an ad campaign for client A. You measure the amount spent on the campaign in specific areas and also the effect it has (increased sales of product Z, predictably).
At the end of the campaign you plot advertising spend against sales and produce a graph with a lovely line showing a positive correlation: the more your client spends with you, the greater the effect of the advertising. Let’s take it as read that not only is there a correlation (as one number increases, so does the other), but that causation is also at work. That is, you also prove that the act of spending more on advertising really is directly causing the product to sell better.
You move on to work for client B, one with deeper pockets, and run another ad campaign. Again you measure spend and effect and again you produce a genuinely accurate plot showing a positive correlation; and again you can prove causation.
Now you’re pitching to client C and want to show that spending more gives better results, so what better illustration than to combine the results from the previous campaigns (anonymised, of course) into one show-stopping graph in your presentation? You plug in the numbers, create the graph and – aargh! Suddenly you have a negative correlation, showing that spending more money on advertising has a negative effect on sales.
How can this be? Two positives can’t combine into a negative, surely? And that’s Simpson’s Paradox in a nutshell.
To resolve the paradox it helps if we first colour the data points from each data set, so we can see what’s going on.
The first data set had a relatively low spend and a high return, the second a much higher spend and, on average, a much lower return. To illustrate why this causes the negative correlation when they are combined, we can calculate an "average" point for each set of data (I am not recommending this as a statistical test, merely as a way of illustrating what is going on).
Now you can see that across the two data sets, as average spend increases, the average return actually drops. This is because the average spend of the second set is higher than the first while its average return is actually lower. This is irrespective of the fact, which we know already, that within each set, the average return increases.
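To make this concrete, here is a minimal Python sketch with made-up spend/sales figures (the numbers are purely illustrative, not from any real campaign): each campaign shows a positive correlation on its own, yet pooling the points flips the sign, and the per-set "average" points show why.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical campaigns: client A spends little and sells a lot,
# client B spends much more for a lower return.
spend_a, sales_a = [1, 2, 3, 4], [10, 11, 12, 13]
spend_b, sales_b = [8, 9, 10, 11], [4, 5, 6, 7]

print(pearson(spend_a, sales_a))                      # positive within set A
print(pearson(spend_b, sales_b))                      # positive within set B
print(pearson(spend_a + spend_b, sales_a + sales_b))  # negative when combined!

# The "average" points: as mean spend rises from set A to set B,
# mean sales fall, which is what drives the combined negative slope.
print(sum(spend_a) / 4, sum(sales_a) / 4)
print(sum(spend_b) / 4, sum(sales_b) / 4)
```

Both campaigns correlate positively in isolation, but because set B sits down and to the right of set A, a line fitted through the pooled points slopes downwards.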
The next question might be “Well, why are these averages so different?”
The answer is that while the two have certain similarities (we know the advertising spend and we can measure the effect), in reality a great number of factors may be very different.
* The eponymous Simpson here is actually Brit statistician Edward Simpson – more on him on page two
They look alike
The clients are clearly different, the country may be different, so may the time of year, the weather and so on. It is worth remembering that when we combine data sets, we are (consciously or unconsciously) saying that we think they are similar enough to make combining them a sensible act. And really we have no evidence in this case that this is a reasonable assumption.
You can also see Simpson’s paradox in cases where percentages and/or actions are combined carelessly. Imagine two schools and two teachers. Sally works at school 1 and Brian at school 2, but they also occasionally teach at each other’s schools as part of a staff-sharing programme.
The teachers are rated by the percentage exam passes of the children they teach.
| | Brian | Sally |
|---|---|---|
| School 1 | 90 per cent | 88 per cent |
| School 2 | 75 per cent | 60 per cent |
Brian beats Sally in both schools. But, by now, you know what is going to happen. When we calculate the pass percentage across all the children each teacher taught, the situation reverses and Sally beats Brian:
| | Brian | Sally |
|---|---|---|
| Overall | 75 per cent | 87 per cent |
In this case the paradox arises because the schools themselves have very different pass rates and the teachers teach very different numbers in the two schools.
Number of children:

| | Brian: pass | Brian: fail | Sally: pass | Sally: fail |
|---|---|---|---|---|
| School 1 | 9 | 1 | 700 | 100 |
| School 2 | 600 | 200 | 6 | 4 |
| Combined | 609 | 201 | 706 | 104 |
Brian achieves 90 per cent in school 1, but he only teaches ten children there so that percentage is based on a very small sample. He could easily have achieved only, say, 30 per cent had he taught a different set of children, but he lucked out. Likewise Sally taught only a few children in school 2 and perhaps she was unlucky in her sample.
Both the large and small samples are used to produce the same kind of value (a percentage), but when we combine the two samples for each teacher, the samples with high numbers of students overwhelm those with low numbers. Which is why, in each case, the combined percentage sits very close to the percentage for the large sample.
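Using the pass/fail counts from the table above, a few lines of Python confirm both directions of the reversal: the per-school rates favour Brian, while the pooled rates favour Sally, because each teacher's big sample dominates their combined figure.

```python
# Pass/fail counts from the table: (passes, fails) per teacher per school.
brian = {"school 1": (9, 1), "school 2": (600, 200)}
sally = {"school 1": (700, 100), "school 2": (6, 4)}

def pass_rate(passes, fails):
    """Percentage of children who passed."""
    return 100 * passes / (passes + fails)

for school in ("school 1", "school 2"):
    b, s = pass_rate(*brian[school]), pass_rate(*sally[school])
    print(f"{school}: Brian {b:.0f}%, Sally {s:.0f}%")  # Brian wins both

def combined_rate(results):
    """Pool every school's children before taking the percentage."""
    passes = sum(p for p, f in results.values())
    fails = sum(f for p, f in results.values())
    return pass_rate(passes, fails)

# Pooled, the large samples (Brian's 800 children in school 2,
# Sally's 800 in school 1) swamp the small ones - and Sally wins.
print(f"Overall: Brian {combined_rate(brian):.0f}%, "
      f"Sally {combined_rate(sally):.0f}%")
```

Note that the combined figures (roughly 75 per cent for Brian, 87 per cent for Sally) land almost exactly on each teacher's large-sample rate, which is the whole mechanism of the reversal.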
Simpson’s paradox is a well-known and well-studied problem and, whilst it had been identified previously, it was a British statistician who brought it to prominence with his 1951 paper** (warning: PDF). And – quite worryingly – you might encounter Simpson’s Paradox more often than you think, according to Marios G Pavlides and Michael D Perlman here (PDF).
Once you start to combine data sets you are almost certain to run into this paradox. As a general take-home rule, it’s a good idea to dissuade people from combining data from non-identical sources unless a great deal is known about all the factors that are involved.
As for being damned if you do and damned if you don't? You'll have to find out for yourself. ®
** “The interpretation of interaction in contingency tables”, Journal of the Royal Statistical Society (Series B), 13: 238–241