Don't believe the IT hype: Ye cannae change the laws of physics
Don't expect any miracles from the shiny new database you've bought (or compiled)
It's fun to be on the receiving end of IT advertising. The vendor's ads start by promising to solve your business problems better than the competition can, and then the superlatives begin to snowball until an answer to global warming and a solution for war in Iraq are both in there among the plug-ins you can buy to make your purchase extra-worthy.
There comes a point, however, where you have to ask yourself: why do I need to buy this? Is it really all it's cracked up to be? Could I do it just as well myself, or via some other means?
Let's take a real-life example. I used to consult for a variety of travel companies, and all of them wanted at some point to do what seems like a pretty simple database search: given a grid reference, show me the 10/20/50 nearest hotels/villas/towns to that grid reference. Sounds simple, doesn't it? But it isn't - quite the opposite, in fact.
The problem is that when you have a big database of destinations, you rely on table indexes to make your searches fast. Searching for a post code's easy - create an index based on the postcode and it'll fly. Same applies for the name of the town, or perhaps the county it's in. But we didn't want to do that: we wanted to search for places based on their distance from a user-supplied point. (If you're thinking at this point that we could have done it based on county: no, you can't, because a destination in the next county could be 50 miles closer than one at the other end of your own county). So we needed to do searches that included calculations, such as Great-Circle calculations - a nice bit of 3D trigonometry that gets the brain working.
Now, if you're doing this kind of calculation you can't rely on indexes since the nature of the search is that each time the user does a search you have to do the calculation on every item in the database table in order to figure out which ones are closest to your reference point. This means every search does a full table scan - not pretty when you have 100,000 or more destinations.
So we came up with a compromise: fudge the query so you limit the number of destinations you do the calculation on. If you store the longitude and latitude of the destinations, and index the table on those values, you can break the query into two:
- Figure out the minimum and maximum longitude and latitude to draw a "box" around the location you're interested in - a box that's big enough to include enough destinations but small enough not to include too many
- Do the calculation on only the destinations that qualify to sort them by distance
So this is what we did - a nice, index-led query that selected the destinations that were within a few minutes of latitude and longitude of the point we cared about, which was then fed into the query that did the Great Circle calculation. This would chop out maybe 99.5 per cent of the data set on average, and the search would fly.
This was, of course, just a bodge. In parallel with what we were doing, one of my clients engaged one of the big mapping software companies to implement its software in our systems. It was pretty speedy, though not that much faster than our version. One day I found myself in the pub with one of their techies (as you do) and I asked him how they did it. "Ah, it's not that hard," said he. "All you do is pick the bit of the data set that is an approximate fit, then do your calculation on the rest."
Turned out that their initial selection was a little more scientific than ours (so its candidate set was a bit better) and they had a slightly more accurate Great Circle algorithm that was a bit more clever about the Earth's curvature, but they'd pretty much done it the same way as we had.
Computations and limitations
Which brings us back to the title of this piece: no matter who you are or how clever your programmers, you can't do miracles in computer algorithms. Anyone who has ever done formal complexity analysis on algorithms (as I had to at university) will know that there's no magic wand.
A problem whose complexity grows exponentially (for example: it takes one second to work on one item, two seconds for two items, four seconds for three items, eight seconds for four, and so on) can't miraculously be solved quickly for a large data set.
The best you can do is find a cunning way to crunch the numbers faster - by disregarding bits of data that you know aren't relevant, for instance, or by using a parallel-processing system, or by using a heuristic algorithm that gives a slightly less (but still acceptably) accurate solution with a significantly smaller amount of work.
So why is this all relevant? Simple - it's what I alluded to in the first sentence. Namely that just because someone says their product is great doesn't mean that they're doing anything special or, for that matter, anything you couldn't do yourself. Most importantly, though, it reminds us that expensive commercial products don't necessarily have functionality that's proportionate to their cost.
Let's take an analogy: the Co-Op does a spectacular Fairtrade Malbec (a favourite among my dinner guests) for around nine quid a bottle. But is a £90 bottle of wine 10 times nicer, or appreciated 10 times as much by my guests? No. While some people would appreciate it proportionately, most don't.
Similarly, many clients I've worked with use MySQL, which costs precisely nothing to buy. Others use SQL Server and Oracle. In many cases with software today, the open-source offering is just as good as the multi-thousand-pound commercial alternative for many customers because in the average case it's really not that difficult to make something perform.
Of course, if you want tens of millions of rows and super-fast, concurrent processing then you'd choose the commercial option. But in line with the classic 80-20 rule, the free or inexpensive package is pretty much as good as the expensive one. There just aren't very many ways to represent and index your data, and they're all really well understood and taught at all good universities.
As Lieutenant Commander Montgomery Scott would remind us, were his remains not divided between Puget Sound and outer space: even the expensive commercial solutions cannae change the laws of physics. ®
Dave Cartwright is a senior network and telecoms specialist who has spent 20 years working in academia, defence, publishing and intellectual property. He is the founding and technical editor of Network Week and Techworld and his specialities include design, construction and management of global telecoms networks, infrastructure and software architecture, development and testing, database design, implementation and optimization. Dave and his family live in St Helier on the island paradise of Jersey.
Sponsored: Hyper-scale data management