Who you callin' stoopid? No excuses for biz intelligence's poor stats

Note to industry: Must try harder

Business Intelligence (BI) systems are designed to turn raw data into useful information, so why don’t they do the job properly? Why do most of them fail so completely to make use of the huge range of capabilities that the analytics world has to offer?

Even at the most basic level, they fail catastrophically to take simple statistics for example. Suppose that you use your favourite BI visualisation tool for studying the sales of oranges for the last months to men and women in West Morton. You find that you sell more oranges to men than women but not by much:

Female 10,230

You would probably intuitively decide that the sales to men and women are effectively the same and that the tiny difference is due to the fact that numbers do vary a little in real life. (The technical name for numbers varying a little because of real life is Sampling Error.)

If, on the other hand, you saw:

Female 4

You might well conclude that there was a “real” difference here and that you needed to, for example, advertise specifically to women.

The above two cases are easy, but suppose you got:

Female 10,070

Is that a “real” difference, or are the sales to men and women essentially identical and the difference due simply to our old friend Sampling Error?

The average BI tool will simply show you the numbers and draw you a graph but it will leave you to decide how significant this difference really is.

Is this significant?

What it SHOULD do is perform a chi squared test, which will give you something both elegantly simple and incredibly useful - the probability that the difference you see is due to sampling error. With the numbers given above, that probability is 0.25 that immediately tells you that if you focus your advertising strategy on encouraging women to buy more oranges you have a 25 per cent chance of being wrong. As a general rule, do you want 1 in 4 of your business decisions to be incorrect?

Surely most rational people actually want to base their decisions on more than guesswork; we even have a name for it: Information-based decision making.

So why don’t BI tools offer this kind of information? Some time ago, I put this very question to one of the companies that develop such tools and was told that the average business user doesn’t understand statistics, an answer that amazingly manages to miss the point in a number of spectacular ways.

The “average” user may well not understand statistics but that surely puts the responsibility onto the designer of the BI package to do the appropriate stats in the background and present the valuable information that they reveal to the user.

So it shouldn’t present the stats as data like this:

ChiSquaredValue=1.324665

Degrees of freedom=1

P=0.249757

It should present the information that is inherent in the stats:

There is a 25 per cent chance that the difference in these figures is simply due to random chance.

Finally, just because “most” people don’t understand stats as this company’s representative claimed doesn’t mean that no one does, so the stats can be made available to those who want to see them. It’s called an “option” and you put it in the menu system.

And, of course, it isn’t just chi squared - there are many different stats that could automatically and meaningfully be applied to raw data. But even if a BI tool intelligently applied them all, that would still be just the beginning. These basic statistical tests are typically more than 100 years old and we have moved on so far since then.

Let’s fast-forward to how analytics could intelligently be incorporated into BI tools. The tools should be able to look at your data and, before even doing any stats, try to apply some context to the data. For example, imagine a table called Customer with a column called PostCode that contains values such as DD1 4HN and SW1A 4WW.

The tool should recognise this as UK location data. It should also be programmed to know that applying analytics to actual postcodes is usually a waste of time because individual postcodes are far too detailed. Unless you have millions of customers, the majority of them will have unique postcodes. But postcodes are at the bottom of a hierarchy and the system should be able to look up the town/county and country associated with each code and look for patterns associated with these much broader groups.

Now take DateOfBirth. You really don’t want to look for patterns on individual dates (see postcodes above) so the system could calculate the ages of the people concerned and plot the distribution.

Slicing the data

It could show these to the user and offer to:

• automatically bucket this into three groups 15-30, 35-60 and 65-70
• allow the user to apply domain knowledge and bucket the data some other way

What it shouldn’t do is to use the raw date of birth.

Now let’s get really radical (not really, these are still very basic analysis techniques). Most data is multi-dimensional. This wonderful sounding term simply means that we have numerical measures (the number of oranges, the value of orders) and these are affected by multiple dimensions (sex, location, age and so on).

The software could look at your data, perform multiple analyses on it without boring you with the details, work out which of the dimensions are the most important in terms of having the greatest effect. Techniques like PCA (Principal Component Reduction) and SVD (Singular Value Decomposition) spring to mind. After that the software simply suggests the most useful analytical possibilities.

Oh, and it could automatically look for outliers and missing data, and it could do some clustering in the background - the possibilities are almost endless.

If you think I am being hard on makers of BI tools, there is something equally bad going on here.

Microsoft’s Excel is arguably more widely used than the standard BI tool and the choice for most Microsoft users thanks to its presence in Office – for years, the desktop productivity suite for many.

However, if you graph the Orange numbers above then by default Excel displays the data as you see below, with the scale of the Y axis set to emphasise the difference.