How to catch a fraudster – using 'top cop' Benford and the power of maths

Nothing attracts 'unnatural' numbers like this smooth curve


Big Data's Big 5

Yes, we've been hit over the head enough times with the phrase "big data" to be aware of its presence, even though we've been up to our armpits in streams of huge unstructured datasets for years.

Those of you who are analysts or data scientists will have already picked up a set of tools that help you find hidden information buried deep in the data. Those tools may be languages (for example R), statistical tests (t-test, Analysis of Variance) and/or data mining techniques (clustering).

But there's a set of theorems, laws and simulations from the world of mathematics that can help you to solve more problems faster. As an added upside, you can increase your value - not that I am suggesting that a true artist, such as yourself, is concerned with anything as tacky as salary, of course.

The Reg has selected five such examples that we think are the most compelling for our purposes from the field of maths. Over the next few weeks we shall be looking at them from a high level to discover how they can potentially enhance and add value to what you do.

The five we will be looking at are:

  1. Benford’s Law: Numbers can be distributed in very unintuitive ways. Most fraudsters don’t understand that so their frauds can stick out like a sore thumb – as long as you know about Benford’s work.
  2. The German Tank Problem (and its solution): This can let you estimate data that people don’t want you to have.
  3. Nyquist–Shannon sampling theorem: Now this does sound obscure because it is about the minimum sampling rate of a continuous wave, but in practice it will tell you how frequently you need to collect that big data from sensors like smart meters.
  4. Simpson’s paradox: If you don’t know about it, one day it will bite you.
  5. Monte Carlo simulations: One of the best and yet least-used tools in a data scientist’s toolbox. They let you solve problems that probability calculations simply can’t touch.

For each one I’ll first give you a type of problem that can arise and then show you why the theorem helps to solve it. No difficult sums will be harmed in the making of this series.

Benford’s Law

So there you are, working with sales data and you have been given the job of detecting fraudulent transactions. A huge number of transactions are in the system and you have reason to believe that those originating from a particular country and credited to a particular sales person (J Smith) are fraudulent.

Your colleague: “OK, let’s check the mean and standard deviation of the transactions we suspect against those of the rest. Hmmm. No significant difference. Maybe we were wrong about poor old J Smith. She is kind to cats after all, she has about 12 rescued moggies that she looks after; perhaps we should look elsewhere for the evil perp.”

You: “Fair enough, but let’s do one more check. Take the value of all of the suspect transactions...






... and select just the leading number from each value:






Then, count the number of ones, the number of twos and so on (up to nine) and plot these as a frequency distribution.”
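For the curious, the check described above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used for the article: the function names and the sample values are made up, and going via `Decimal` is just one way of picking off the first non-zero digit without tripping over decimals such as 0.0457.

```python
from collections import Counter
from decimal import Decimal


def leading_digit(value):
    """Return the first non-zero digit of a number, or None if it is zero."""
    # str() first, so that 0.0457 is read as written rather than as binary noise.
    for digit in Decimal(str(value)).as_tuple().digits:
        if digit != 0:
            return digit
    return None


def leading_digit_counts(values):
    """Count how many values start with each digit from 1 to 9."""
    counts = Counter(leading_digit(v) for v in values)
    return {d: counts.get(d, 0) for d in range(1, 10)}


# Hypothetical transaction values, purely for illustration.
sample = [1234.50, 187.20, 92.10, 0.0457, 1020.00, 310.75]
counts = leading_digit_counts(sample)

# Crude text "plot" of the frequency distribution.
for d in range(1, 10):
    print(d, "#" * counts[d])
```

Swap `sample` for the real transaction amounts and the bars become the frequency distribution in question.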

Your colleague: “OK, if it makes you happy, but you owe me a pint if this doesn’t show anything.”

Later that same day.

Your colleague: “There is no pattern here, the distribution is essentially flat. So J Smith is off the hook and you owe me a pint.”

Suspect transactions minus Benford - nothing to see here

You: “Au contraire, my fine colleague, we need to find new homes for those felines and you owe me a pint.”

J Smith is about to be banged to rights... because she’d never heard of Benford’s law.

Benford’s Law (AKA First-Digit Law)

Benford’s Law comes to us courtesy of GE Research Laboratories physicist Frank Benford in the 1920s, who began looking into digit frequencies when he noticed his logarithm table books were unevenly worn. His law essentially says that the leading digits of numbers collected “from the wild” – real life – are not evenly distributed. Rather, they follow a predictable distribution where there are more ones than twos, more twos than threes and so on up to nine.


Benford's distortion

The differences are non-trivial. On average about 30 per cent of the numbers will start with a one, only about eight per cent with a five and a mere 4.6 per cent with a nine.
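Those percentages are not plucked from the air. The article does not spell the formula out, but the standard statement of Benford's Law gives the probability of leading digit d as log10(1 + 1/d). A minimal sketch that tabulates it (the variable name `benford` is ours):

```python
import math

# Standard Benford probabilities: P(d) = log10(1 + 1/d) for d = 1..9.
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for d, p in benford.items():
    print(f"{d}: {p:.1%}")
```

The nine probabilities sum to one, and the entries for one, five and nine round to the figures quoted above.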

We would, of course, have to check the distribution of invoice totals from the same country credited to other sales people, but I would confidently expect those to follow a Benford distribution.
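If eyeballing two bar charts feels a bit too much like guesswork, one way to put a number on "flat versus Benford" is a chi-square goodness-of-fit test. To be clear, the article does not prescribe this test; it is a swapped-in, textbook technique. The function name and the flat counts below are hypothetical, and 15.507 is the commonly tabulated five per cent critical value for eight degrees of freedom.

```python
import math


def benford_chi_square(observed):
    """Pearson chi-square statistic of leading-digit counts against Benford.

    `observed` maps each digit 1-9 to the number of values starting with it.
    """
    total = sum(observed.get(d, 0) for d in range(1, 10))
    stat = 0.0
    for d in range(1, 10):
        expected = total * math.log10(1 + 1 / d)
        stat += (observed.get(d, 0) - expected) ** 2 / expected
    return stat


# Hypothetical flat counts: 90 suspect invoices, 10 per leading digit.
flat_counts = {d: 10 for d in range(1, 10)}
CRITICAL_5PCT_8DF = 15.507  # tabulated chi-square value, 8 degrees of freedom

stat = benford_chi_square(flat_counts)
print(f"chi-square = {stat:.1f}, departs from Benford: {stat > CRITICAL_5PCT_8DF}")
```

Run the same function over the other sales people's invoices and, if they really are "wild" numbers, the statistic should come out far smaller. With small samples the test is weak, so treat it as a prompt for a closer look rather than a verdict.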

So, what is meant by “wild collected” numbers and why do we get such an odd distribution?

Wild collected numbers

If you plot random numbers, they DO come out as a flat distribution. Here I have plotted the leading digit of around 600 random numbers.


Random numbers come out flat
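You can reproduce the flat result at home. The sketch below assumes "random" means integers drawn uniformly from 1 to 999 (the article does not say how its 600 numbers were generated), and the seed is fixed only so that the run is repeatable.

```python
import random
from collections import Counter

rng = random.Random(42)  # fixed seed, chosen arbitrarily

# Assumption: uniform integers from 1 to 999, so every leading digit
# has exactly 111 candidates and is equally likely.
numbers = [rng.randint(1, 999) for _ in range(600)]
counts = Counter(int(str(n)[0]) for n in numbers)

for d in range(1, 10):
    print(d, counts[d], "#" * (counts[d] // 2))
```

Each digit should land somewhere in the region of 600/9, or about 67 hits, give or take sampling noise.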

Now you might think that numbers collected by actual observation of the real world (like the lengths of rivers, or their areas, or molecular masses of compounds or death rates or the heights of cities above sea level) would show the same flat distribution of leading digits, but in general they don’t; they show a distribution that approximates to a Benford distribution.

At this point you might be wondering if this is to do with the units in which you choose to measure, but no, this phenomenon is unit-independent. You can plot the leading digit of the height of each city above sea level in inches, feet, metres or cubits; it doesn’t matter, it still comes out as a Benford distribution.
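Unit independence is easy to try out. We don't have the city heights to hand, so the sketch below uses a synthetic stand-in: values spread evenly on a logarithmic scale across six orders of magnitude, which is an assumption made purely for illustration. The helper names are ours; the conversion factors are the usual metres-to-feet and metres-to-inches ones.

```python
import random
from collections import Counter
from decimal import Decimal


def leading_digit(x):
    """First non-zero digit of a positive float, from its exact decimal expansion."""
    for digit in Decimal(x).as_tuple().digits:
        if digit:
            return digit
    return None


def digit_shares(values):
    """Share of values starting with each digit from 1 to 9."""
    counts = Counter(leading_digit(v) for v in values)
    return {d: counts[d] / len(values) for d in range(1, 10)}


rng = random.Random(7)  # fixed seed, chosen arbitrarily
# Synthetic "heights in metres": log-uniform between 1 and 1,000,000.
metres = [10 ** rng.uniform(0, 6) for _ in range(20000)]

units = {"metres": 1.0, "feet": 3.28084, "inches": 39.3701}
shares = {name: digit_shares([m * factor for m in metres])
          for name, factor in units.items()}

for name, share in shares.items():
    print(f"{name:>6}:", [round(share[d], 3) for d in range(1, 10)])
```

The three rows should look much the same, with roughly 30 per cent of values starting with a one whichever unit you pick.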
