SQL Server and the 7.5-day MTBF

How Oracle spun some benchmark stats


Database Myths and Legends (Part 5)

Press releases issued by software companies are one of the more common sources of myths and legends in the database world. No real surprise there, you may think, but therein lies a paradox. We all know that press releases are highly partisan, so we expect everyone to treat them with suspicion; yet we aren’t surprised when they influence opinion, as the following one certainly did.

If we go back to the year of our Lord two thousand and one, we find that many people, particularly in the Oracle world, were attacking SQL Server on the grounds that, when clustered, it was horribly unreliable. (True, many people in the Oracle world are still delighted to tell you that SQL Server is unreliable in general, but in 2001 there was significant debate about clustering in particular.) This debate was sparked, at least in part, by a press release issued by Oracle at the beginning of that year. It said, and I quote: “Patented Technology Delivers Virtually Limitless Scalability To Any Application; Mean time To Failure Estimated To Be More Than 100,000 Years For Oracle Versus 7.5 Days for Microsoft SQL Server Cluster Configuration”. Well, not too much equivocation there. Let’s see. The choice is a database that fails every 7.5 days or every 100,000 years. I know which one I am going to recommend to the board.

Now, cynics that we are, we might suspect that these numbers have been exaggerated slightly, but we can also think “But even if it is an exaggeration, there must be some truth in it, so SQL Server is certainly less reliable than Oracle, even if we aren’t sure by what factor.” The rather bizarre part about this particular case is that when I dug further into this story at the time, it turned out that Oracle’s ‘estimate’ of the disparity was actually much greater than these figures would suggest. In other words, Oracle’s numbers implied that its clusters were much, much more reliable than this. To see how, we need to take a look at the numbers that Oracle used for this extraordinary claim.

At that time Microsoft had just set a TPC-C benchmark using a cluster of 12 machines. According to Oracle, the way in which Microsoft had set up this cluster meant that if one node crashed, the entire cluster went down. Again according to Oracle, Microsoft’s own estimate was that a single node could be expected to crash, on average, once every 90 days. Clearly the more nodes you have in a cluster like this, the greater the chance that on any given day, one of them will crash and bring the cluster down. Given 12 nodes we’d expect 12 crashes in a 90-day period, which works out to one crash every 7.5 days. So that’s where that figure came from.
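The press release's arithmetic for that figure can be reconstructed in a few lines (the 12-node and 90-day figures come from the article; the "one node crash kills the cluster" assumption is Oracle's characterisation of Microsoft's setup):

```python
# 12 nodes, each expected to crash once every 90 days on average, and
# (per Oracle's description of the benchmark setup) any single node
# crash takes the whole cluster down.
nodes = 12
node_mtbf_days = 90

crashes_per_day = nodes / node_mtbf_days   # expected cluster-killing crashes per day
cluster_mtbf_days = 1 / crashes_per_day    # 90 / 12

print(cluster_mtbf_days)  # 7.5
```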

Oracle said that in contrast, the failure of one of its nodes didn’t bring an Oracle cluster down; instead the cluster continued to run, albeit slightly more slowly. So, in direct contrast to the Microsoft cluster, the more nodes you added to an Oracle cluster, the more resilient the cluster became. For the entire cluster to crash, all twelve machines had to turn up their toes at the same time.

Well… nearly the same time. It turns out that a crucial figure we need here is the recovery time of the nodes. Suppose that it takes a day for a node to be recovered after it crashes. As soon as a node does crash, a critical one-day period starts. If the remaining 11 nodes all just happen to crash during that 24-hour period then the cluster goes down. However, if even one of the nodes manages to stay up for that one day, then the cluster survives. If the recovery time is smaller, say an hour, then it is far less likely that all remaining nodes will fail during that period, so the overall resilience of the cluster improves.
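To see how sensitive the result is to recovery time, here is a back-of-envelope model of my own (not Oracle's published calculation, which we never got to see): treat each node's chance of crashing during a window of r days as roughly r/90, and require all 11 surviving nodes to crash inside the window opened by the first failure.

```python
def cluster_mtbf_years(recovery_days, node_mtbf_days=90, nodes=12):
    """Crude estimate of cluster MTBF for a shared-nothing-style cluster
    that only dies if ALL nodes are down at once."""
    # Some node crashes, on average, once every 90/12 = 7.5 days.
    trigger_interval_days = node_mtbf_days / nodes
    # Probability the remaining 11 nodes all fail inside the recovery window.
    p_rest_fail = (recovery_days / node_mtbf_days) ** (nodes - 1)
    return trigger_interval_days / p_rest_fail / 365.25

# With a one-day recovery window the answer is astronomically large --
# far beyond even the 7 trillion years the spokesperson quoted.
print(f"{cluster_mtbf_years(1.0):.2e} years")
# A one-hour recovery window makes the cluster more resilient still.
print(f"{cluster_mtbf_years(1 / 24):.2e} years")
```

The exact output depends entirely on the recovery time you plug in, which is precisely the point: without knowing the figure Oracle used, the 100,000-year claim cannot be checked at all.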

Sadly the Oracle spokesperson I contacted at the time was unable to tell me what recovery time the company had used for the calculation, but he did say that: “It would take a catastrophic event for all twelve nodes to be down at once, making the application unavailable to users. Using the same twelve node configuration, Oracle is up and running for 7 trillion years.” Now 7 trillion years is a big number in anyone’s book, but the press release used the much more modest figure of merely 100,000 years. When I asked why the lower figure had been used he told me that it had been used in order “to make the statement more plausible.”

I like this line of reasoning. It works for me. If your calculation yields a number that doesn’t sound convincing, change the number until it does. Think of the headlines I can now write.

Research proves that mechanics live twice as long as other people!

Recent research has shown that mechanics, on average, accidentally cut themselves once every 5 days. One cut doesn’t kill a mechanic, but 12 on the same day will be fatal in all cases. On any given day, the chance of a mechanic dying the death of a dozen cuts is 1 in 5 to the power 12. In other words, this will happen about once every 668,421 years. Hmmm. That sounds implausible; let’s call it 150 years which is about twice as long as a normal person lives.
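The parody arithmetic above checks out, for what little that is worth (the 1-in-5 daily cut rate and the dozen-cuts-kill rule are, of course, my own inventions):

```python
# A 1-in-5 daily chance of a cut, fatal only if all 12 land on the same
# day -- treating the cuts as independent, which is the joke's implicit
# (and equally dubious) assumption.
p_fatal_day = (1 / 5) ** 12            # 1 in 5^12 per day
mean_days_to_death = 5 ** 12           # 244,140,625 days
mean_years = mean_days_to_death / 365.25

print(round(mean_years))  # 668421 -- the figure quoted above
```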

You’ll notice that, in my rush to write an arresting headline, I have also assumed that mechanics ONLY die from cutting themselves. Let’s now think about database clusters. Does anything other than node failure ever stop them?

Well, think about prosaic things such as database engine and/or operating system patches that require a reboot. And then there are power outages, malicious employees, meteors and plagues of locusts. (Don’t you hate those plagues? The little devils will jam the cooling systems and cause overheating.) Come to think of it, most buildings have design lives of less than 100 years, so we’re going to have to rebuild around the database cluster about 1,000 times – probably best to put the cluster on the ground floor.

Let’s face it, if you want to think in such mind boggling time scales as 100,000 years, you need to think about the rate at which other technologies are changing. For example, alternating current is likely to be just a passing fad (it has only been around since Nikola Tesla’s time – about 120 years). In fact, in late breaking news we have just heard that, by as early as the year 2134, conversion to flugelrad power is expected to be completed. Sadly, as I am sure you are aware, flugelrad power is completely incompatible with 115/240 volts AC. So any clusters currently running on that old fashioned electricity nonsense are going down and staying down.

And let’s question another implicit assumption in these figures. How often, in practice, are production SQL Server clusters set up, like this, with zero redundancy between the nodes? In my experience, for the very reasons outlined above, very few are; most database people are more sensible than that. So, in practice, this whole press release, and the arguments that it sparked, were based on a scenario that is not likely to be met in the real world.

You will have gathered by now that I don’t, personally, find these figures from Oracle convincing. I don’t think that they contributed anything meaningful to what was an important issue five years ago; what they did do was to start (or, at the least, contribute to) the myth that SQL Server clusters were unreliable.

But please bear in mind that this doesn’t mean that I think I have just somehow proved that SQL Server clusters were incredibly stable in 2001. I don’t think that this press release proved anything useful about the real world, so it is equally true that showing it as flawed also proves nothing about the real world. It is important that I make that clear. After all, I want my point to be plausible. ®
