How to catch a fraudster – using 'top cop' Benford and the power of maths

Nothing attracts 'unnatural' numbers like this smooth curve

Big Data's Big 5

Yes, we've been hit over the head enough times with the phrase "big data" to be aware of its presence, even though we've been up to our armpits in streams of huge unstructured datasets for years.

Those of you who are analysts or data scientists will have already picked up a set of tools that help you find hidden information buried deep in the data. Those tools may be languages (for example R), statistical tests (t-test, Analysis of Variance) and/or data mining techniques (clustering).

But there's a set of theorems, laws and simulations from the world of mathematics that can help you to solve more problems faster. As an added upside, you can increase your value - not that I am suggesting that a true artist, such as yourself, is concerned with anything as tacky as salary, of course.

The Reg has selected five such examples that we think are the most compelling for our purposes from the field of maths. Over the next few weeks we shall be looking at them from a high level to discover how they can potentially enhance and add value to what you do.

The five we will be looking at are:

  1. Benford’s Law: Numbers can be distributed in very unintuitive ways. Most fraudsters don’t understand that so their frauds can stick out like a sore thumb – as long as you know about Benford’s work.
  2. The German Tank Problem (and its solution): This can let you estimate data that people don’t want you to have.
  3. Nyquist–Shannon sampling theorem: Now this does sound obscure because it is about the minimum sampling rate of a continuous wave, but in practice it will tell you how frequently you need to collect that big data from sensors like smart meters.
  4. Simpson’s paradox: If you don’t know about it, one day it will bite you.
  5. Monte Carlo simulations: One of the best and yet least-used tools in a data scientist’s toolbox. They let you solve problems that probability calculations simply can’t touch.

For each one I’ll first give you a type of problem that can arise and then show you why the theorem helps to solve it. No difficult sums will be harmed in the making of this series.

Benford’s Law

So there you are, working with sales data and you have been given the job of detecting fraudulent transactions. A huge number of transactions are in the system and you have reason to believe that those originating from a particular country and credited to a particular sales person (J Smith) are fraudulent.

Your colleague: “OK, let’s check the mean and standard deviation of the transactions we suspect against those of the rest. Hmmm. No significant difference. Maybe we were wrong about poor old J Smith. She is kind to cats after all, she has about 12 rescued moggies that she looks after; perhaps we should look elsewhere for the evil perp.”

You: “Fair enough, but let’s do one more check. Take the value of all of the suspect transactions...






... and select just the leading number from each value:






Then, count the number of ones, the number of twos and so on (up to nine) and plot these as a frequency distribution.”
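For the curious, the digit-counting step just described might look something like this in Python. It is a minimal sketch: the function names and the sample amounts are made up for illustration and do not come from any real sales system.

```python
from collections import Counter

def leading_digit(value):
    """Return the first non-zero digit of a number, or None for zero."""
    if value == 0:
        return None
    # Scientific notation puts the first significant digit up front,
    # e.g. 0.0472 -> '4.720000000000000e-02'.
    return int(f"{abs(value):.15e}"[0])

def leading_digit_counts(values):
    """Count how often each digit 1-9 appears as the leading digit."""
    counts = Counter(d for d in map(leading_digit, values) if d is not None)
    return {digit: counts.get(digit, 0) for digit in range(1, 10)}

# Hypothetical transaction values, purely for illustration
amounts = [1250.00, 310.75, 18.20, 942.10, 0.0472, 1999.99]

for digit, count in leading_digit_counts(amounts).items():
    # A crude text histogram stands in for the plotted frequency distribution
    print(digit, "#" * count)
```

Plotting the counts with a charting library would work just as well; the text histogram simply keeps the sketch free of dependencies.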

Your colleague: “OK, if it makes you happy, but you owe me a pint if this doesn’t show anything.”

Later that same day.

Your colleague: “There is no pattern here; the distribution is essentially flat. So J Smith is off the hook and you owe me a pint.”

Suspect transactions minus Benford - nothing to see here

You: “Au contraire my fine colleague, we need to find new homes for those felines and you owe me a pint.”

J Smith is about to be banged to rights... because she’d never heard of Benford’s law.

Benford’s Law (AKA First-Digit Law)

Benford’s Law comes to us courtesy of GE Research Laboratories physicist Frank Benford, who in the 1920s began looking into digit frequencies when he noticed that his logarithm table books were unevenly worn. His law essentially says that the leading digits of numbers collected “from the wild” – real life – are not evenly distributed. Rather, they follow a predictable distribution where there are more ones than twos, more twos than threes and so on up to nine.

Benford's distortion

The differences are non-trivial. On average about 30 per cent of the numbers will start with a one, only about eight per cent with a five and a mere 4.6 per cent with a nine.
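Those percentages fall out of a one-line formula: under Benford’s Law the expected share of leading digit d is log10(1 + 1/d). The Python sketch below prints the full set and adds a chi-squared comparison against observed counts. Treat the comparison as one common way of measuring the misfit, not necessarily the test a fraud team would settle on; the function names are hypothetical.

```python
import math

def benford_expected():
    """Expected share of each leading digit 1-9 under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def chi_square_vs_benford(counts):
    """Chi-squared statistic of observed digit counts against Benford.

    `counts` maps each digit 1-9 to how often it was seen as a leading
    digit. Larger values mean a worse fit.
    """
    total = sum(counts.values())
    stat = 0.0
    for digit, share in benford_expected().items():
        expected = total * share
        stat += (counts.get(digit, 0) - expected) ** 2 / expected
    return stat

for digit, share in benford_expected().items():
    print(f"{digit}: {share:.1%}")

# Assumed example: perfectly flat counts, like the suspect transactions
flat = {d: 100 for d in range(1, 10)}
print(round(chi_square_vs_benford(flat), 1))
```

With nine digits there are eight degrees of freedom, for which the usual five per cent critical value is roughly 15.5. A flat set of counts lands far above that, which is why a flat plot is the suspicious one.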

We would, of course, have to check the distribution of invoice totals from the same country credited to other sales people, but I would confidently expect those to follow a Benford distribution.

So, what is meant by “wild collected” numbers and why do we get such an odd distribution?

Wild collected numbers

If you plot random numbers, they DO come out as a flat distribution. Here I have plotted the leading digit of around 600 random numbers.

Random numbers come out flat
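You can repeat the experiment yourself. The sketch below assumes the random numbers are integers drawn uniformly between 100 and 999, since the article does not say how the original 600 were generated; the seed is fixed only so that the run is repeatable.

```python
import random
from collections import Counter

rng = random.Random(42)  # fixed seed so the run is repeatable

# Assumption: uniform integers across one whole decade (100 to 999)
numbers = [rng.randint(100, 999) for _ in range(600)]

# Leading digit of a positive integer is the first character of its string
counts = Counter(int(str(n)[0]) for n in numbers)

for digit in range(1, 10):
    # One '#' per five occurrences gives a rough text bar chart
    print(digit, f"{counts[digit]:3d}", "#" * (counts[digit] // 5))
```

The choice of range matters: a uniform draw is only flat in its leading digits when the range covers whole decades, such as 100 to 999. A range like 1 to 2,000 would over-represent the ones for reasons that have nothing to do with Benford.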

Now you might think that numbers collected by actual observation of the real world (like the lengths of rivers, or their areas, or molecular masses of compounds or death rates or the heights of cities above sea level) would show the same flat distribution of leading digits, but in general they don’t; they show a distribution that approximates to a Benford distribution.

At this point you might be wondering if this is to do with the units in which you choose to measure, but no, this phenomenon is unit-independent. You can plot the leading digit of the height of each city above sea level in inches, feet, metres or cubits; it doesn’t matter, it still comes out as a Benford distribution.
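Unit independence is easy to check for yourself. The sketch below uses synthetic data, not real city heights: it assumes a log-normal sample spread over several orders of magnitude as a stand-in for “wild” measurements, converts it from metres to feet and inches, and compares the leading-digit shares with Benford each time.

```python
import math
import random
from collections import Counter

def leading_digit(value):
    """First significant digit of a positive number."""
    return int(f"{abs(value):.15e}"[0])

def digit_shares(values):
    """Share of values starting with each digit 1-9, in digit order."""
    counts = Counter(leading_digit(v) for v in values)
    return [counts[d] / len(values) for d in range(1, 10)]

rng = random.Random(1)  # fixed seed so the run is repeatable

# Synthetic stand-in for measurements spanning many orders of magnitude
metres = [rng.lognormvariate(5, 3) for _ in range(20000)]

# Conversion factors from metres
units = {"metres": 1.0, "feet": 3.28084, "inches": 39.3701}

benford = [math.log10(1 + 1 / d) for d in range(1, 10)]
results = {
    name: digit_shares([m * factor for m in metres])
    for name, factor in units.items()
}

print(f"{'Benford':8}", " ".join(f"{p:.3f}" for p in benford))
for name, shares in results.items():
    print(f"{name:8}", " ".join(f"{s:.3f}" for s in shares))
```

Each row should sit close to the Benford row, give or take sampling noise. Data confined to a narrow range, such as adult human heights, would not behave this way, which is why the spread over several orders of magnitude is part of the assumption.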
