How to catch a fraudster – using 'top cop' Benford and the power of maths

Nothing attracts 'unnatural' numbers like this smooth curve

Random numbers aren't natural... and that's important

As another example, if you take a copy of a magazine like Reader’s Digest and read it through, noting down every number that is mentioned in the text, a Benford distribution is highly likely to appear before your eyes. Below is some real data collected by Benford himself in 1938 from a copy of Reader’s Digest.

The Reader's Digest, by the numbers

The bottom line is that random numbers don’t follow a Benford distribution but numbers that originate in the real world do. It isn’t an absolute rule, but it is a very good generalisation.

But why? Why on Earth would numbers be distributed like that?

This can be answered mathematically or by trying to give you an intuitive feel for why it happens. I prefer the latter, so think about a river. It starts at a spring and runs to the coast, so there is a straight-line distance between the two points. Across a set of rivers, those straight-line distances will show a flat distribution of leading digits. So why do the lengths of real rivers show a Benford distribution?

The reason is that as the water first makes its way to the sea, it hits real world obstacles. Perhaps a big rock around which it has to flow, then later a plain where it meanders. In other words, that original distance inevitably increases with each obstacle and by a different amount each time. The rock makes a very small difference, the plain a much larger one.

To try to model this we can start with a set of random numbers that represent those initial linear lengths of a set of rivers; the distribution of the leading digits is flat.

Initial linear lengths

Then we extend the river several times, each extension being a random per cent figure of the length of the river in question. (So, if we apply five extensions, the first might increase the length of the river by four per cent, the next by 17 per cent and so on.) The point here is that values in the wild are the outcome of combinations of factors: invoices are frequently made up of multiple items, plants grow to different heights depending on soil, climate, shelter, disease and so on.

The figures below are the results after five extensions, and already we can see a Benford distribution emerging.

After five random additions

Of course, we are not trying to model how rivers really form; we are illustrating a more fundamental property of numbers in general. If you make a series of percentage increases (or decreases) to a set of numbers, they will approximate to a Benford distribution.
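For anyone who wants to try the river experiment at home, here is a minimal Python sketch of it. The specifics are my assumptions rather than the figures used for the charts: 50,000 rivers, initial lengths drawn as whole numbers from 100 to 999 (so the leading digits start out flat), and each extension a random increase of between 0 and 100 per cent. The function names are made up for illustration. How closely the result hugs the Benford curve depends on those choices and on the number of extensions.

```python
import random
from collections import Counter


def leading_digit(value: float) -> int:
    """Return the first significant digit of a non-zero number."""
    # Scientific notation always puts the leading digit first.
    return int(f"{abs(value):.15e}"[0])


def digit_shares(values):
    """Share of values starting with each digit 1..9."""
    counts = Counter(leading_digit(v) for v in values)
    total = len(values)
    return {d: counts[d] / total for d in range(1, 10)}


def extend_rivers(lengths, extensions=5, max_percent=100.0, rng=None):
    """Apply several random percentage increases to every length.

    max_percent is an assumed upper bound for each extension.
    """
    rng = rng or random.Random()
    result = []
    for length in lengths:
        for _ in range(extensions):
            length *= 1 + rng.uniform(0, max_percent) / 100
        result.append(length)
    return result


rng = random.Random(2014)  # fixed seed so the run is repeatable
initial = [rng.randint(100, 999) for _ in range(50_000)]
extended = extend_rivers(initial, extensions=5, rng=rng)

before = digit_shares(initial)
after = digit_shares(extended)
for d in range(1, 10):
    print(f"{d}: before {before[d]:.3f}  after {after[d]:.3f}")
```

Raising `extensions` or `max_percent` should pull the "after" column closer to the Benford shape; shrinking them leaves it nearer the flat starting point.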

The reason for this property is that numbers with different leading digits respond differently when changed in size. For example, if you take a number that starts with a one (say, 100) and increase it by 20 per cent, it becomes 120, which still begins with a one. But if you take a number beginning with nine (say, 900), then a 20 per cent increase makes it 1,080, which also begins with a one.

To put that another way:

The number one needs to increase by 100 per cent to become a two, the number five needs to increase by only 20 per cent to become a six, and nine requires an increase of a measly 11.1 per cent to become 10, which starts with a one. So the proportion of numbers starting with one goes up as we make changes, while the proportion starting with nine goes down and the rest shift in between.
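The arithmetic is easy to check for every leading digit. This small Python sketch (the function name is my own) works out the percentage increase needed to push a number such as 100, 200 ... 900 up to the next leading digit. Going from d to d + 1 is an increase of 1/d, so the figure is simply 100 divided by the digit.

```python
def growth_needed_percent(leading_digit: int) -> float:
    """Percentage increase that moves d00... up to (d + 1)00...

    For nine, the next step is 10, which starts with a one.
    """
    if not 1 <= leading_digit <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    # (d + 1) / d - 1 simplifies to 1 / d.
    return 100.0 / leading_digit


for d in range(1, 10):
    print(f"{d}: {growth_needed_percent(d):.1f} per cent")
```

The bigger the leading digit, the smaller the nudge needed to leave it behind, which is why numbers spend longest starting with a one.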

So, once you can see the pattern, you realise that a Benford distribution isn’t an oddity, it is an inevitability. Is there a good, solid, mathematical underpinning to this?

Of course. Good resources for further study can be found by typing “Benford’s Law” into your preferred search engine and scanning the pages presented for mathematical equations.
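For the impatient, the usual statement of Benford’s Law is that a leading digit d turns up with probability log10(1 + 1/d). That formula is the standard textbook one rather than anything derived in this article, and the function name below is my own. A few lines of Python turn it into the expected share for each digit.

```python
import math


def benford_expected(digit: int) -> float:
    """Expected share of numbers whose leading digit is `digit`."""
    if not 1 <= digit <= 9:
        raise ValueError("digit must be between 1 and 9")
    return math.log10(1 + 1 / digit)


for d in range(1, 10):
    print(f"{d}: {benford_expected(d):.3f}")
```

The nine shares add up to one, with roughly three in 10 numbers starting with a one and fewer than one in 20 starting with a nine.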

But you really don’t need to understand the underlying mathematics in order to use and apply Benford’s Law. You just have to be able to see why the distribution is inevitable.

But what use is this knowledge?

Well, it’s good for fraud detection: you can ask the Arizona State Treasurer if you don’t believe me. In an example cited by accountancy journals for years afterwards, a state employee was found guilty of trying to defraud the State of Arizona of around $1.8m in 1993. The staffer reportedly kept most of the fraudulent transactions just below a $100,000 limit, with an unusually large number starting with sevens, eights and nines. This resulted in a very non-Benford distribution, and the example is held up as a classic case study in the effectiveness of using Benford’s Law to detect accounting fraud.

Arizona Treasury case: by the numbers

But fraud detection is simply one of many potential uses. Now that you know that numbers collected in the wild usually show this distribution, you can look for sets of numbers that deviate. For example, a colleague of mine found a distinctly non-Benford distribution of leading digits in some direct debit/catalogue payments.

There is no suggestion that this is fraudulent but it did tell him that the matter was worthy of further investigation because “something” is actively responsible for this deviation. And that something might just be the nugget of information that good data scientists are expected to find.
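One way to put a number on "deviates" is a chi-square comparison between the observed leading-digit counts and the Benford expectation. This is a common choice of test, not something prescribed above, and the sketch below makes that assumption explicit. The function names are my own, and both data sets are synthetic: one is built to follow Benford, the other mimics payments bunched just under $100,000.

```python
import math
import random
from collections import Counter


def leading_digit(value: float) -> int:
    """Return the first significant digit of a non-zero number."""
    return int(f"{abs(value):.15e}"[0])


def benford_chi_square(values) -> float:
    """Chi-square statistic of leading digits against Benford's Law.

    Zeros are skipped because they have no leading digit.
    """
    digits = [leading_digit(v) for v in values if v != 0]
    n = len(digits)
    if n == 0:
        raise ValueError("need at least one non-zero value")
    counts = Counter(digits)
    stat = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)
        stat += (counts[d] - expected) ** 2 / expected
    return stat


rng = random.Random(1938)  # fixed seed so the run is repeatable
# Synthetic "natural" data: spread evenly across five orders of magnitude.
natural = [10 ** rng.uniform(0, 5) for _ in range(5_000)]
# Synthetic "suspicious" data: everything between 70,000 and 99,999.
suspicious = [rng.uniform(70_000, 99_999) for _ in range(2_000)]

# With 8 degrees of freedom, a statistic above about 15.5 is unusual
# at the 5 per cent level.
print(f"natural:    {benford_chi_square(natural):.1f}")
print(f"suspicious: {benford_chi_square(suspicious):.1f}")
```

A big statistic is a prompt to go and look, not a verdict. With very large data sets even small, innocent departures can push it over the threshold.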

And that is the real take-home message. If you see a Benford distribution, then it really is a case of “move along folks, nothing to see here”. If you see anything else, investigate further. ®
