Numbers war: How Bayesian vs frequentist statistics influence AI
Not all figures are equal
If you want to develop your ML and AI skills, you will need to pick up some statistics, and before you have got more than a few steps down that path you will find (whether you like it or not) that you have entered the Twilight Zone that is the frequentist/Bayesian religious war. I use the term "war" advisedly because war, by definition, has moved beyond debate and discussion. "Religious" because the war is based on belief systems, not information.
The frequentist world has been briefly described here. The Bayesian world is described in what follows.
Imagine that a zombie plague is sweeping the country. Infected people look healthy for a period and then turn into the living dead. We have a test to detect infected people before they turn into zombies and it is 99 per cent efficient in both directions. In other words, infected people test positive 99 per cent of the time and healthy people test negative 99 per cent of the time. Bob is tested and comes up positive. What are the chances that he has the disease?
A common (intuitive) answer is 99 per cent but let's work through an example to see if this is correct. In order to do this calculation, we need to know the efficiency of the test (which we do: 99 per cent). We also need a figure for the prevalence of the infection in the population; if we don't know it, we can start by guessing that half of the population is infected and half is healthy. We don't actually need to know the size of the population but it makes it easy to show you some numbers, so we'll assume a total population of 2,000, which gives us 1,000 infected and 1,000 healthy.
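In each group 99 per cent (which is 990 people) will be diagnosed correctly and one per cent (10 people) incorrectly. We can summarise the figures in this table:

| | Test positive | Test negative | Total |
|---|---|---|---|
| Infected | 990 | 10 | 1,000 |
| Healthy | 10 | 990 | 1,000 |
| Total | 1,000 | 1,000 | 2,000 |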
Let's consider simply those who, like Bob, tested positive. They total 1,000 people and, of those, 990 (99 per cent) are infected. So there is a 99 per cent chance that Bob is infected. So, on this showing, it looks as if the intuitive answer is correct. But what about the assumption we made about the ratio of infected to healthy people? What happens if we change it? Let's drop the proportion of infected people to 10 per cent (200 people out of a total of 2,000) and rerun the calculation. Those results are summarised below:

| | Test positive | Test negative | Total |
|---|---|---|---|
| Infected | 198 | 2 | 200 |
| Healthy | 18 | 1,782 | 1,800 |
| Total | 216 | 1,784 | 2,000 |
Interestingly, Bob's "chances" have improved. Of the 216 people who test positive (198 infected plus 18 healthy) only 198 are infected, which is 92 per cent (198÷216x100), so his chance of being infected has dropped. If the level of infection drops to 1 in 1,000 there is only a 9 per cent chance that Bob is infected; at 1 in 10,000 the chance is a mere 1 per cent.
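The head-counting above can be sketched in a few lines of Python. This is only an illustration: the function names and the dictionary keys are made up for this example, and the rounding assumes (as in the article's figures) that the counts come out as whole people.

```python
def screening_counts(population, prevalence, accuracy=0.99):
    """Split a population into the four possible test outcomes (head counts)."""
    infected = round(population * prevalence)
    healthy = population - infected
    true_pos = round(infected * accuracy)   # infected and test positive
    false_neg = infected - true_pos         # infected but test negative
    true_neg = round(healthy * accuracy)    # healthy and test negative
    false_pos = healthy - true_neg          # healthy but test positive
    return {
        "true_pos": true_pos,
        "false_neg": false_neg,
        "false_pos": false_pos,
        "true_neg": true_neg,
    }


def share_infected_among_positives(counts):
    """Of everyone who tested positive, what fraction is actually infected?"""
    positives = counts["true_pos"] + counts["false_pos"]
    return counts["true_pos"] / positives


# The two scenarios worked through in the text: 50 per cent and 10 per cent infected.
for prevalence in (0.5, 0.1):
    counts = screening_counts(2000, prevalence)
    share = share_infected_among_positives(counts)
    print(f"{prevalence:.0%} infected: {counts} -> {share:.0%} of positives are infected")
```

For very low infection rates a population of 2,000 no longer divides into whole people, which is one reason a formula is more convenient than a grid.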
It turns out that the only time the intuitive answer (99 per cent) is correct is when we have exactly equal numbers of healthy and infected people. Since that is unlikely, the intuitive answer is usually incorrect.
But why? Well, the intuitive answer is intuitive because we reason that, "The test is 99 per cent accurate so, if he tests positive, surely he has a 99 per cent chance of being a living-dead man walking?" But this line of reasoning ignores the fact that 1 per cent of the healthy people will test positive and, as the proportion of healthy people increases, the number of those healthy people who test positive begins to overwhelm those who are infected and also test positive.
In slightly more formal terms we would say that the number of false positives (healthy people being misdiagnosed) begins to overwhelm the true positives (infected people testing positive).
What has this got to do with the Reverend Thomas Bayes (1701-1761)? Well, he described this class of "conditional" interactions between probabilities; they are conditional in the sense that we are asking, "What are the chances that Bob is infected, given that we already know he tested positive?"
Bayes also worked out how conditional probabilities interact, so we no longer have to play around with grids of numbers; we have a formula:
P(A|B) = (P(B|A) x P(A)) ÷ P(B)
This translates (in our case) as follows:
- P(A) – probability of being infected
- P(B) – probability that any individual in the population will test positive (in this case this will be the probability of a positive test occurring in the population for an infected person PLUS the probability of a positive test occurring in the population for a healthy person)
- P(A|B) – probability of being infected if tested positive (The bar symbol can be read as "if")
- P(B|A) – probability of testing positive if infected
If we plug in the numbers for a 50:50 infected:healthy population we get:
Probability of being infected if test positive = (0.99 x 0.5) ÷ ((0.99 x 0.5) + (0.01 x 0.5)) = 0.495/0.5 = 0.99 = 99 per cent
The figures for one infected in 10,000 are:
Probability of being infected if test positive = (0.99 x 0.0001) ÷ ((0.99 x 0.0001) + (0.01 x 0.9999)) = 0.000099/0.010098 = 0.0098 = approximately 1 per cent.
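The same calculation can be written as a small Python function. This is a minimal sketch, not a standard library routine: the name posterior_infected and the parameter names sensitivity and specificity are labels chosen for this example (the article simply calls both "efficiency" and sets them to 99 per cent).

```python
def posterior_infected(prior, sensitivity=0.99, specificity=0.99):
    """P(infected | positive test), computed with Bayes' formula.

    prior        -- P(A): probability of being infected
    sensitivity  -- P(B|A): probability of testing positive if infected
    specificity  -- probability of testing negative if healthy
    """
    p_pos_given_infected = sensitivity
    p_pos_given_healthy = 1.0 - specificity
    # P(B): positives from infected people PLUS positives from healthy people.
    p_positive = p_pos_given_infected * prior + p_pos_given_healthy * (1.0 - prior)
    return p_pos_given_infected * prior / p_positive


# The infection rates discussed in the text: 1 in 2, 1 in 10, 1 in 1,000, 1 in 10,000.
for prior in (0.5, 0.1, 0.001, 0.0001):
    print(f"P(infected) = {prior}: P(infected | positive) = {posterior_infected(prior):.4f}")
```

Keeping the two error rates as separate parameters is a design choice for the sketch; it makes it easy to see what happens if the test is not equally good in both directions.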
You could be forgiven for thinking, "Big deal, he wrote an equation," but the crucial point is that not only did he quantify conditional probabilities, he also paved the way for a branch of statistics that acknowledges, and tries to account for, what are called "priors".
Priors are essentially mathematical quantifications of prior knowledge. In this case we know the efficiency of the test is 99 per cent. We have some prior knowledge of the infection rate in the population; we may not be sure about it but Bayes lets us allow for it even so. And priors are where the wars start with the frequentists. By now I hope I have convinced you that the infection rate so obviously affects the result that no one who knew it would be stupid enough NOT to take it into account.
But suppose you don't know what it is? Suppose you are working flat out to try to stop a zombie infection, people are dying (and eating each other) so you don't have time to do a survey of the population. But you think that maybe, say, 10 per cent are infected.
A Bayesian would say, "Great, let's use that figure. As we get more information, we'll update it."
However, a dyed-in-the-wool frequentist would say, "But you don't know the actual number. How can you possibly do statistics on a guess!? The next person might guess a totally different number! Are you going to do lots of different tests, each one based on a different person's guess? In that case you will get as many answers as you have people!"
To which the Bayesian would reply, "So, what are you going to do? Ignore the infection rate entirely? In which case you simply cannot make any estimate at all!"
And so on. And this is just the start. If you read more about the frequentist and Bayesian views of the world it turns out that they diverge much further and the debate becomes much more of a philosophical one about how you view the world.
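To make the argument over guessed priors concrete, here is a rough Python sketch. It rests on assumptions that are not in the story: the three guesses are hypothetical, the update function is invented for this example, and "more information" is modelled as Bob being retested with a second, independent test of the same 99 per cent accuracy. That is a narrower kind of update than revising the population figure itself, but the mechanism is the same: yesterday's answer becomes today's prior.

```python
def update(prior, positive, accuracy=0.99):
    """One Bayesian update of P(infected) after a single test result."""
    # How likely is this particular result if Bob is infected / if he is healthy?
    like_infected = accuracy if positive else 1.0 - accuracy
    like_healthy = 1.0 - accuracy if positive else accuracy
    evidence = like_infected * prior + like_healthy * (1.0 - prior)
    return like_infected * prior / evidence


# Hypothetical guesses at the infection rate from three different people.
guesses = (0.05, 0.10, 0.20)

for guess in guesses:
    after_one = update(guess, positive=True)      # Bob's first positive test
    after_two = update(after_one, positive=True)  # assumed independent retest, also positive
    print(f"guess {guess:.0%}: after one test {after_one:.1%}, after two tests {after_two:.1%}")
```

The frequentist's complaint shows up after the first test, where each guess gives a different answer. The Bayesian's reply shows up after the second, where the answers have moved much closer together.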
What do I think? Well, in my piece on frequentist statistics I referenced Pierre-Simon Laplace as someone who promoted the use of statistics in science and who actively promoted both the Bayesian and frequentist approaches. He saw no conflict and since he is rated as one of the greatest scientists of all time, I'll happily take my lead from him.
It is quite clear that both worlds have a huge amount to offer in terms of a range of tools/techniques/solutions. The intelligent, non-rabid people I know simply ignore the wars and use whatever tool is appropriate. ®