Feeds

Achtung! Use maths to smash the German tank problem – and your rival

Leaky data is a double agent in intelligence analysis

Intelligent flash storage arrays

Big Data's Big 5 You've employed Benford's Law to out fraudsters hidden in seemingly random numbers. Now what do you do if you need answers but some of your data is missing? Welcome to the German tank problem, the second in The Reg's guide to crafty techniques from the world of mathematics that can help you quickly solve niggling data problems.

The German Tank problem (and its solution) can allow you to estimate data that you don’t have in order to gain information that you want. A more formal statement of this process is: “The estimation of the maximum of a discrete uniform distribution by sampling (without replacement).” The less formal title, though, is snappier and gives a huge clue as to the provenance.

During the Second World War, the British became very interested in the number of tanks the Germans were producing. Intelligence sources suggested the production was about 1,000 to 1,500 tanks per month. In practice it was nowhere near that high and the wily British boffins found another much more accurate way of estimating production.

They got the squaddies on the battlefield to examine as many captured/disabled tanks as they could and collect all the serial numbers they could find. (In truth this must have been highly exasperating for the soldiers, who probably felt that knocking the tank out in the first place was good enough)

Let’s assume for a moment that tank serial numbers are simple. They all have numbers taken from the same series; the series starts at “1” and increments by one for each new tank.

If we find a tank with the serial number 24 then the Germans have produced at least 24 tanks. However we can be a bit smarter than that. Suppose we see the following numbers: 23, 16, 24, 17, 11, 6, 7, 9 and 12.

That’s nine numbers between seven and 24. If there were actually 4,000 tanks out there, it would seem highly unlikely that our sample would just happen to come from the first 24 produced. It is much more likely that there are only about 30-40 tanks in all out there. (To put that slightly more formally, it is likely that the entire population from which we have drawn our sample is around 30-40.)

And we can be even smarter. Suppose we examine five tanks (we’ll call the number of examined tanks E, so E=5) with serial numbers of: 432, 3245, 2187, 435, 1542.

The biggest serial number (which we’ll call B) is 3,245, and the smallest (s) is 432. We could reasonably expect as many ‘hidden’ numbers – serial numbers we guess must be out there but for which we have no data – above 3,245 as there are below 432.

In our nomenclature, n is our estimate of the number of tanks.

To put that graphically:

German tank line chart

And to put it mathematically:

we hypothesised that (n-B) is roughly equal to (s-1), so if:

n-B = s-1

then we can say that: n=B+(s-1)

which we can solve for n by slotting in the actual numbers.

In our case B=3,245, s=432, n=number of tanks produced, so:

n= 3,245 + (432-1) = 3,676

This is merely one formula for estimating the maximum of a discrete uniform distribution by sampling (without replacement). We can also use:

n = (1+1/E)*B

E, if you remember, is the number of tanks examined, which was five, so:

n = (1 +1/5)*3,245 = 3,894

This particular formula was proposed (PDF) by Roger Johnson of Carleton College in Northfield, Minnesota in the journal Teaching Statistics, Volume 16, Number 2, Summer 1994

You might also wish to try a Bayesian approach. I have no particular axe to grind here. I don’t mind how you do it; the important point is that you can.

Now, as it happens, there are several over-simplifications in my original description, both in the reality of WWII and also if we were trying to solve the same kind of problem today.

Providing a secure and efficient Helpdesk

More from The Register

next story
Preview redux: Microsoft ships new Windows 10 build with 7,000 changes
Latest bleeding-edge bits borrow Action Center from Windows Phone
Google opens Inbox – email for people too thick to handle email
Print this article out and give it to someone tech-y if you get stuck
Microsoft promises Windows 10 will mean two-factor auth for all
Sneak peek at security features Redmond's baking into new OS
FTDI yanks chip-bricking driver from Windows Update, vows to fight on
Next driver to battle fake chips with 'non-invasive' methods
UNIX greybeards threaten Debian fork over systemd plan
'Veteran Unix Admins' fear desktop emphasis is betraying open source
Entity Framework goes 'code first' as Microsoft pulls visual design tool
Visual Studio database diagramming's out the window
Google+ goes TITSUP. But WHO knew? How long? Anyone ... Hello ...
Wobbly Gmail, Contacts, Calendar on the other hand ...
prev story

Whitepapers

Why cloud backup?
Combining the latest advancements in disk-based backup with secure, integrated, cloud technologies offer organizations fast and assured recovery of their critical enterprise data.
A strategic approach to identity relationship management
ForgeRock commissioned Forrester to evaluate companies’ IAM practices and requirements when it comes to customer-facing scenarios versus employee-facing ones.
Security for virtualized datacentres
Legacy security solutions are inefficient due to the architectural differences between physical and virtual environments.
Reg Reader Research: SaaS based Email and Office Productivity Tools
Read this Reg reader report which provides advice and guidance for SMBs towards the use of SaaS based email and Office productivity tools.
New hybrid storage solutions
Tackling data challenges through emerging hybrid storage solutions that enable optimum database performance whilst managing costs and increasingly large data stores.