Don't believe the IT hype: Ye cannae change the laws of physics
Don't expect any miracles from the shiny new database you've bought (or compiled)
It's fun to be on the receiving end of IT advertising. The vendor's ads start by promising to solve your business problems better than the competition can, and then the superlatives begin to snowball until an answer to global warming and a solution for war in Iraq are both in there among the plug-ins you can buy to make your purchase extra-worthy.
There comes a point, however, where you have to ask yourself: why do I need to buy this? Is it really all it's cracked up to be? Could I do it just as well myself, or via some other means?
Let's take a real-life example. I used to consult for a variety of travel companies, and all of them wanted at some point to do what seems like a pretty simple database search: given a grid reference, show me the 10/20/50 nearest hotels/villas/towns to that grid reference. Sounds simple, doesn't it? But it isn't - quite the opposite, in fact.
The problem is that when you have a big database of destinations, you rely on table indexes to make your searches fast. Searching for a postcode is easy - create an index based on the postcode and it'll fly. The same applies for the name of the town, or perhaps the county it's in. But we didn't want to do that: we wanted to search for places based on their distance from a user-supplied point. (If you're thinking at this point that we could have done it based on county: no, we couldn't, because a destination in the next county could be 50 miles closer than one at the other end of your own county.) So we needed searches that included calculations - specifically great-circle calculations, a nice bit of spherical trigonometry that gets the brain working.
Now, if you're doing this kind of calculation you can't rely on indexes: every time the user searches, you have to run the calculation on every row in the table to figure out which destinations are closest to the reference point. That means every search is a full table scan - not pretty when you have 100,000 or more destinations.
So we came up with a compromise: fudge the query so you limit the number of destinations you do the calculation on. If you store the longitude and latitude of the destinations, and index the table on those values, you can break the query into two:
- Figure out the minimum and maximum longitude and latitude to draw a "box" around the location you're interested in - a box that's big enough to include enough destinations but small enough not to include too many
- Do the calculation on only the destinations that qualify to sort them by distance
So this is what we did - a nice, index-led query that selected the destinations that were within a few minutes of latitude and longitude of the point we cared about, which was then fed into the query that did the Great Circle calculation. This would chop out maybe 99.5 per cent of the data set on average, and the search would fly.
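The two-stage trick can be sketched in a few lines of Python against an in-memory SQLite table. The table, column names and the 0.5-degree box size here are purely illustrative, and the distance function is the standard haversine formula (which assumes a spherical Earth) rather than whatever the original systems used:

```python
import math
import sqlite3

# Illustrative schema - the real systems' tables will differ.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE destinations (name TEXT, lat REAL, lon REAL)")
db.execute("CREATE INDEX idx_latlon ON destinations (lat, lon)")
db.executemany(
    "INSERT INTO destinations VALUES (?, ?, ?)",
    [("A", 51.50, -0.12), ("B", 51.48, -0.10),
     ("C", 55.95, -3.19), ("D", 51.45, -0.20)],
)

def great_circle_miles(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in miles (spherical-Earth assumption)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 3959 * 2 * math.asin(math.sqrt(a))

def nearest(lat, lon, n, box_degrees=0.5):
    # Stage 1: index-led bounding box chops out the bulk of the table.
    candidates = db.execute(
        "SELECT name, lat, lon FROM destinations "
        "WHERE lat BETWEEN ? AND ? AND lon BETWEEN ? AND ?",
        (lat - box_degrees, lat + box_degrees,
         lon - box_degrees, lon + box_degrees),
    ).fetchall()
    # Stage 2: the expensive great-circle maths runs on the survivors only.
    ranked = sorted(
        (great_circle_miles(lat, lon, r[1], r[2]), r[0]) for r in candidates
    )
    return [name for _, name in ranked[:n]]

print(nearest(51.50, -0.12, 2))  # ['A', 'B'] - "C" is never reached by stage 2
```

In production the fudge needs a little more care than this sketch: a degree of longitude shrinks as you move away from the equator, so the box width should widen with latitude, and queries near the poles or the date line need special handling.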
This was, of course, just a bodge. In parallel with what we were doing, one of my clients engaged one of the big mapping software companies to implement its software in our systems. It was pretty speedy, though not that much faster than our version. One day I found myself in the pub with one of their techies (as you do) and I asked him how they did it. "Ah, it's not that hard," said he. "All you do is pick the bit of the data set that is an approximate fit, then do your calculation on the rest."
Turned out that their initial selection was a little more scientific than ours (so its candidate set was a bit better) and they had a slightly more accurate Great Circle algorithm that was a bit more clever about the Earth's curvature, but they'd pretty much done it the same way as we had.
Computations and limitations
Which brings us back to the title of this piece: no matter who you are or how clever your programmers, you can't do miracles in computer algorithms. Anyone who has ever done formal complexity analysis on algorithms (as I had to at university) will know that there's no magic wand.
A problem whose cost grows exponentially - doubling with each item you add, so one second for one item, two seconds for two, four seconds for three, eight seconds for four, and so on - can't miraculously be solved quickly for a large data set.
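To put numbers on that doubling example (a hypothetical workload, chosen only to match the figures above):

```python
def seconds_for(items):
    """Doubling per extra item, as in the example: 2**(n-1) seconds for n items."""
    return 2 ** (items - 1)

print(seconds_for(10))                          # 512 - under ten minutes
print(seconds_for(60) // (60 * 60 * 24 * 365))  # roughly 18 billion years
```

Ten items are fine; sixty take longer than the age of the universe. No amount of hardware spend closes a gap like that, which is why the escape routes are the ones listed below: shrink the data, parallelise, or accept an approximate answer.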
The best you can do is find a cunning way to crunch the numbers faster - by disregarding bits of data that you know aren't relevant, for instance, or by using a parallel-processing system, or by using a heuristic algorithm that gives a slightly less (but still acceptably) accurate solution with a significantly smaller amount of work.
So why is this all relevant? Simple - it's what I alluded to in the first sentence. Namely that just because someone says their product is great doesn't mean that they're doing anything special or, for that matter, anything you couldn't do yourself. Most importantly, though, it reminds us that expensive commercial products don't necessarily have functionality that's proportionate to their cost.
Let's take an analogy: the Co-Op does a spectacular Fairtrade Malbec (a favourite among my dinner guests) for around nine quid a bottle. But is a £90 bottle of wine 10 times nicer, or appreciated 10 times as much by my guests? No. A few might appreciate it proportionately; most wouldn't.
Similarly, many clients I've worked with use MySQL, which costs precisely nothing to buy. Others use SQL Server and Oracle. In many cases with software today, the open-source offering is just as good as the multi-thousand-pound commercial alternative for many customers because in the average case it's really not that difficult to make something perform.
Of course, if you want tens of millions of rows and super-fast, concurrent processing then you'd choose the commercial option. But in line with the classic 80-20 rule, the free or inexpensive package is pretty much as good as the expensive one. There just aren't very many ways to represent and index your data, and they're all really well understood and taught at all good universities.
As Lieutenant Commander Montgomery Scott would remind us, were his remains not divided between Puget Sound and outer space: even the expensive commercial solutions cannae change the laws of physics. ®
Dave Cartwright is a senior network and telecoms specialist who has spent 20 years working in academia, defence, publishing and intellectual property. He is the founding and technical editor of Network Week and Techworld and his specialities include design, construction and management of global telecoms networks, infrastructure and software architecture, development and testing, database design, implementation and optimization. Dave and his family live in St Helier on the island paradise of Jersey.
You are of course 100% correct. In most cases the Open Source option performs just as well.
But when it breaks (and they all do), somebody is going to get kicked! But who?
If you've paid Oracle et al for that support along with the product, you get to kick them. If you pay for support for the Open Source, you get to kick the support supplier. If you didn't pay for support, time to slide a tea tray down your pants.
I'm not saying it's your fault, I'm saying you're getting the blame. Open Source can be cheaper - but never view it as free.
What's good enough for you?
And the rule that so many people miss is:
What's good enough for you?
The answer varies wildly, and in the same way that a county's libraries and schools don't all need expensive Cisco switches on every site, you often don't need much complexity, top-end software or huge expensive products to get a job done.
I get asked by people who work in IT (but not schools) why the schools I've worked for don't "just run Exchange", for example. They could. But they don't need it. So they don't have it. Their email is covered. Their calendaring is non-existent or very specialist. The other features they wouldn't understand, let alone use. And by the time you set it all up, what you've "saved" (in terms of some poor sod not having to make up a calendar of appointments once in a blue moon) is chicken-feed next to the overall cost of implementation.
There are some things where you can just say "Why?" and do it yourself quicker or cheaper than the off-the-shelf solutions, because you just don't know yet whether it will work, be used, have value, etc. Sometimes, yes, they expand and you find yourself having to migrate to a bigger product anyway, but a lot of the time the system deployed is perfectly adequate - or nobody even uses it, in which case it's cheaper to keep it running and/or throw it in the bin.
I put a Linux server into some of my schools - the initial reason varies, but I've put them in for everything from web-caching (overloaded external lines due to no decent on-site caching) to web-filtering, or even just things like internal web-serving without having to deploy IIS.
Sometimes they weren't sure about it, but the reason it was put in varied and I clearly stated "this is a test" - a free test to see if the feature they demand is what they want, works how they want, can be done more cheaply than existing products, will be used, etc. They are free to say it's inadequate if they can show it is (and, hell, then I'll be the first to rip it out and replace it with something else because that means PROGRESS).
And in more schools than not, years later the "free test" server (usually some old office machine or spare server) is still running and doing the job it was deployed for AND MORE. It ends up being just useful enough to warrant its continued existence, but not unsuitable enough to warrant replacing with dedicated hardware, support contracts and so on. Hell, in one school I was given the job of making their "network go faster", and in the first week I tripled their Internet speeds (measured by the Head using Speedtest, because he was obsessed with it) in the middle of the day, just with some transparent proxying/caching. And if the box ever broke, I demonstrated that pulling the "in" Ethernet and the "out" Ethernet out of the machine and joining them with a connector let everyone else carry on working exactly as they always had - so we could simply throw the machine away if it ever became a problem.
When I upgraded the machine three years later (without a single day of downtime), the connector was still in the same "dust shadow" on the top of the machine. It was even "good enough" to deploy more services and more things on it - some of them got used, some of them didn't, some of them were useful, some of them weren't, some of them were better than a commercial equivalent, some weren't.
Anyone who just spends a fortune on something, deploys it, and doesn't do small-scale tests to see if they even NEEDED it first isn't really doing their job. The best way is to build an in-house, temporary solution and see how far you get. If you can't get the results that you'd DEFINITELY get with a better product, buy the product - you've proved it's worth having. If you get similar or better results, why bother to change from in-house? You've proved your own solution is better.
"Good enough". Said the wrong way it sounds like an insult, but it means the exact opposite. I wish my bank account were "good enough" for the things I wanted to do. My current school has a MediaWiki Intranet. It's "good enough" that it hasn't changed in years, and we haven't bothered to trial any alternatives. It has an IT ticket system. It's "good enough" that it hasn't changed in years, and we haven't bothered to trial any alternatives. It runs all sorts of things that we would otherwise have had to buy, but a quick test of something in-house, custom-made or just open-source was often "good enough" that we didn't bother to go any further.
Sure, there are advantages depending on your scale and business to having outside support, or being able to hire people instantly familiar with the products you're using, etc. But for most stuff, an in-house or even a cheaper solution is "good enough", so why spend money on something that won't provide tangible benefits over that "good enough" solution? Even if you have to throw it out when the guy leaves and put in something else because you can't replicate his solution, surely that's saved you years of licensing and support that you would otherwise have had to pay for? And if his solution wasn't "good enough" while he was there, you should have been replacing it/him anyway.
Most small businesses do not need Oracle, Cisco or even Microsoft. They just don't. You can point out flaws, support issues or whatever else, but the fact of the matter is that if they are happy and the system is good enough for them, why not use it? Why spend money you don't have to? If they have to take contingency and staffing skills into account, they will - and even then, running some cheaper solution for X years before they determine they NEED something more has just saved them X years of support, licensing, maintenance, training, etc.
It's a disease rife within the upper echelons of IT where people who deal with millions of desktops tell people who have a few hundred what they "should" be doing. You can normally spot them by the way they give no thought to what extra licenses or vast systems cost. Some things just don't scale. And a bodge in an SQL script to save hours of database time for a one-off (or even only occasional) query? Yeah, I'll take that bodge, please. If it works adequately, I have no need to do anything else. If it doesn't, well, I've cut out some of the work in the meantime and will have to do EXACTLY what I would have needed to do anyway - go find a better solution. Even if it only works for a week, that's an extra week I can spend researching and testing alternatives.
Guess what? I object to paying for an SQL Server license and CALs just to run a piece of access control software that stores a few hundred entries a day of an ID card number and door number, or something that fires up the school boilers at 6am each day. EVEN if I already have an SQL Server doing other things. And especially when they have the option to use a MySQL database or whatever instead (I currently have one piece of access control software that uses Firebird as an internal SQL server if you don't have one of your own - it works perfectly, and with the right software you can query it like any other remote database using standard SQL).
I have several MySQL instances for various things. I have cron scripts giving my scanners a helping hand to archive scanned documents - the scanners don't have such functionality, so I have them save scans to a server, which copies them to the right destination and archives them for my purposes too. (Sure, we have one huge expensive photocopier that's capable of doing that itself, but it's easier to tie it into the same system that even the featureless scanners can benefit from.) I have fax-to-email gateways run by scripts on a Linux server, when I could just buy a proper fax add-on for the telephony system we have - but that's expensive, and a USB 56k fax modem was much cheaper and more powerful, even with the software setup. I have all sorts of similar time-saving, money-saving and effort-saving tricks running.
Good enough for myself, my employers and my purposes might not be good enough for you. But that's not your decision. It might have flaws, but they might not affect my usage. And it may not be ideal - but that depends on how much weight money, and in-house support rather than journeys through phone menus, carry in your vision of "ideal".
And, at worst, I have to do what you'd like me to anyway. If it buys me a year of having to do that, that's a year spent on other things (both intellectually and financially) - maybe I spend the money saved on the "proper" solution for something much more vital and expensive, for instance!
The only question is "What is good enough?".
Mr. Cartwright, you have heard of spatial indices, haven't you?
http://www.cs.bgu.ac.il/~atdb082/wiki.files/paper6.pdf (one of the seminal papers in the field).
Implemented by most if not all major DBMSs these days, to more than acceptable levels of performance. If we are talking about a travelling-salesman-style routing problem (as opposed to straight-line distances), that can also be arranged by using the proper tools.
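The idea behind a spatial index can be sketched as a toy grid file in pure Python. Real DBMSs use far more sophisticated structures (R-trees, as in the linked paper), and the class and parameter names below are entirely made up for illustration, but the principle is the same: the index lets the query touch only the cells near the search point instead of scanning the whole data set.

```python
from collections import defaultdict

class GridIndex:
    """Toy spatial index: bucket points into fixed-size lat/lon cells.
    Production systems use R-trees or similar, but the candidate-narrowing
    idea is identical - only inspect cells near the query point."""

    def __init__(self, cell_degrees=1.0):
        self.cell = cell_degrees
        self.buckets = defaultdict(list)

    def _key(self, lat, lon):
        # Floor division maps a coordinate to its cell.
        return (int(lat // self.cell), int(lon // self.cell))

    def insert(self, name, lat, lon):
        self.buckets[self._key(lat, lon)].append((name, lat, lon))

    def candidates(self, lat, lon, ring=1):
        """Points in the query's cell plus a ring of neighbouring cells
        (the ring catches points just over a cell boundary)."""
        ci, cj = self._key(lat, lon)
        out = []
        for di in range(-ring, ring + 1):
            for dj in range(-ring, ring + 1):
                out.extend(self.buckets.get((ci + di, cj + dj), []))
        return out

idx = GridIndex()
idx.insert("London", 51.50, -0.12)
idx.insert("Croydon", 51.37, -0.10)
idx.insert("Edinburgh", 55.95, -3.19)
names = [n for n, _, _ in idx.candidates(51.5, -0.1)]
print(sorted(names))  # ['Croydon', 'London'] - Edinburgh's cell is never touched
```

The exact distance calculation then only has to run over the returned candidates, which is precisely the shape of the two-stage fudge described in the article - the difference is that the index does the narrowing for you.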
I am sorry, but your example came across as so badly chosen that I could not follow your point.
As for your other remark regarding choosing the commercial option « if you want tens of millions of rows and super-fast, concurrent processing », may I point out that this is also a bit of an unfortunate generalisation? There comes a point when maintaining your own (possibly FOSS-based) solution makes more sense than going for a "commercial" (i.e., off-the-shelf) offering, as the likes of Google, Twitter, Facebook, Amazon, etc., etc., can attest.
What management is really paying for ...
...is someone to blame.
I've no experience of Oracle support, but my experience of IBM both as a customer and within the organisation was that unless you're a very important/rich customer you waste more time explaining the bug over and over again through various levels of support than it would take to just code around the problem.
When you finally get acknowledgement that the bug is really a bug, you then have to implement the workaround anyway because you can't afford to wait for it to be fixed.
Consequently, since I moved to mostly open source stacks I'm more productive. From my experience of Hibernate support a few years ago, at least the developers would tell you to fuck off immediately rather than waste your time.
Spatial indices on MySQL?
Must have missed it. PostGIS is the way to go if you want to do this sort of thing with open source. I'm getting increasingly sick of comparisons of MySQL (in whichever flavour) with real RDBMSs. The hacks and kludges are just too painful to endure for anything other than toy projects.
By all means get a support contract and pay for DBAs and programmers who know their shit.