SQL Server and the 7.5-day MTBF

How Oracle spun some benchmark stats

Database Myths and Legends (Part 5)

Press releases issued by software companies are one of the more common sources of myths and legends in the database world. No real surprise there, you may think, but therein lies a paradox. We all know that press releases are highly partisan, so we expect everyone to treat them with suspicion; yet we aren’t surprised when they influence opinion, as the following one certainly did.

If we go back to the year of our Lord two thousand and one we find that many people, particularly in the Oracle world, were attacking SQL Server on the grounds that, when clustered, it was horribly unreliable. (True, many people in the Oracle world are still delighted to tell you that SQL Server is unreliable in general, but in 2001 there was significant debate about clustering in particular.) This debate was sparked, at least in part, by a press release issued by Oracle at the beginning of that year. It said, and I quote:

“Patented Technology Delivers Virtually Limitless Scalability To Any Application; Mean time To Failure Estimated To Be More Than 100,000 Years For Oracle Versus 7.5 Days for Microsoft SQL Server Cluster Configuration”

Well, not too much equivocation there. Let’s see. The choice is a database that fails every 7.5 days or every 100,000 years. I know which one I am going to recommend to the board.

Now, cynics that we are, we might suspect that these numbers have been exaggerated slightly, but we can also think “But even if it is an exaggeration, there must be some truth in it, so SQL Server is certainly less reliable than Oracle, even if we aren’t sure by what factor.” The rather bizarre part about this particular case is that when I dug further into this story at the time, it turned out that Oracle’s ‘estimate’ of the disparity was actually much greater than these figures would suggest. In other words, Oracle’s numbers implied that its clusters were much, much more reliable than this. To see how, we need to take a look at the numbers that Oracle used for this extraordinary claim.

At that time Microsoft had just set a TPC-C benchmark using a cluster of 12 machines. According to Oracle, the way in which Microsoft had set up this cluster meant that if one node crashed, the entire cluster went down. Again according to Oracle, Microsoft’s own estimate was that a single node could be expected to crash, on average, once every 90 days. Clearly the more nodes you have in a cluster like this, the greater the chance that on any given day, one of them will crash and bring the cluster down. Given 12 nodes we’d expect 12 crashes in a 90 day period, which is approximately once every 7.5 days. So that’s where that figure came from.
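For anyone who wants to check the sums, the arithmetic behind the 7.5-day figure can be sketched in a few lines of Python. The function name is mine, purely for illustration, and the only inputs are the two numbers quoted above (90 days per node, 12 nodes). The sketch assumes, as Oracle's argument did, that any single node crash stops the whole cluster.

```python
def series_cluster_mtbf(node_mtbf_days: float, nodes: int) -> float:
    """Average days between cluster failures when ANY one node crash
    brings the whole cluster down (the scenario Oracle described)."""
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    # Each node crashes once per node_mtbf_days on average, so `nodes`
    # of them produce `nodes` crashes in that same period.
    return node_mtbf_days / nodes


cluster_mtbf = series_cluster_mtbf(node_mtbf_days=90, nodes=12)
print(cluster_mtbf)  # 7.5
```

Note that the result scales inversely with the node count: under this assumption a 24-node cluster would manage only 3.75 days.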

Oracle said that in contrast, the failure of one of its nodes didn’t bring an Oracle cluster down; instead the cluster continued to run, albeit slightly more slowly. So, in direct contrast to the Microsoft cluster, the more nodes you added to an Oracle cluster, the greater the resilience the cluster displayed. For the entire cluster to crash, all twelve machines had to turn up their toes at the same time.

Well… nearly the same time. It turns out that a crucial figure we need here is the recovery time of the nodes. Suppose that it takes a day for a node to be recovered after it crashes. As soon as a node does crash, then a critical one day period starts. If the remaining 11 nodes all just happen to crash during that 24 hour period then the cluster goes down. However, if even one of the nodes manages to stay up for that one day, then the cluster survives. If the recovery time is smaller, say an hour, then it is far less likely that all remaining nodes will fail during that period, so the overall resilience of the cluster improves.
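To get a feel for how much the recovery time matters, here is a rough Python sketch of the reasoning in the paragraph above. It is emphatically not Oracle's calculation, which was never disclosed. It assumes that nodes fail independently with exponentially distributed lifetimes averaging 90 days, and that the cluster dies only if all the remaining nodes crash within the recovery window opened by the first crash. The function name and the two recovery times tried are illustrative choices of mine.

```python
import math

DAYS_PER_YEAR = 365.25


def all_nodes_down_mttf_days(node_mtbf_days: float, nodes: int,
                             recovery_days: float) -> float:
    """Rough mean time until every node is down at once.

    Assumptions (mine, not Oracle's): independent nodes, exponential
    lifetimes, and the cluster fails only if all the other nodes crash
    inside the recovery window that starts with the first crash.
    """
    # Chance that one given node crashes within the recovery window.
    p_one = -math.expm1(-recovery_days / node_mtbf_days)
    # Chance that all the remaining nodes do so.
    p_rest = p_one ** (nodes - 1)
    # How often a first crash opens such a window (crashes per day).
    first_crash_rate = nodes / node_mtbf_days
    return 1.0 / (first_crash_rate * p_rest)


for label, recovery_days in [("one day", 1.0), ("one hour", 1.0 / 24.0)]:
    days = all_nodes_down_mttf_days(90, 12, recovery_days)
    print(f"recovery time of {label}: about {days / DAYS_PER_YEAR:.3g} years")
```

Under these assumptions the result is astronomically large either way, and shortening the recovery time from a day to an hour makes it larger still. The output is extremely sensitive to the recovery time and to the independence assumption, which is one reason a headline figure of this kind deserves some suspicion.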

Sadly the Oracle spokesperson I contacted at the time was unable to tell me what recovery time the company had used for the calculation, but he did say that: “It would take a catastrophic event for all twelve nodes to be down at once, making the application unavailable to users. Using the same twelve node configuration, Oracle is up and running for 7 trillion years.” Now 7 trillion years is a big number in anyone’s book, but the press release used the much more modest figure of merely 100,000 years. When I asked why the lower figure had been used he told me that it had been used in order “to make the statement more plausible.”

I like this line of reasoning. It works for me. If your calculation yields a number that doesn’t sound convincing, change the number until it does. Think of the headlines I can now write.

Research proves that mechanics live twice as long as other people!

Recent research has shown that mechanics, on average, accidentally cut themselves once every 5 days. One cut doesn’t kill a mechanic, but 12 on the same day will be fatal in all cases. On any given day, the chance of a mechanic dying the death of a dozen cuts is 1 in 5 to the power 12. In other words, this will happen about once every 668,421 years. Hmmm. That sounds implausible; let’s call it 150 years which is about twice as long as a normal person lives.
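In the interests of rigour, the mechanic arithmetic can be checked with a few lines of Python. The variable names are mine, and the sketch assumes a year of 365.25 days, which is the value that reproduces the figure quoted above.

```python
DAYS_PER_YEAR = 365.25  # assumed year length; it reproduces the quoted figure

cut_interval_days = 5  # one accidental cut every 5 days, on average
fatal_cuts = 12        # a dozen cuts on the same day is fatal

# Treat "1 in 5 to the power 12" as the daily chance of a fatal dozen,
# so the expected wait is 5**12 days.
expected_wait_days = cut_interval_days ** fatal_cuts
expected_wait_years = expected_wait_days / DAYS_PER_YEAR

print(expected_wait_days)          # 244140625
print(round(expected_wait_years))  # 668421
```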

You’ll notice, in my rush to write an arresting headline, I have also assumed that mechanics ONLY die from cutting themselves. Let’s now think about database clusters. Does anything other than node failure ever stop them?

Well, think about prosaic things such as database engine and/or operating system patches that require a reboot. And then there are power outages, malicious employees, meteors and plagues of locusts. (Don’t you hate those plagues? The little devils will jam the cooling systems and cause overheating.) Come to think of it, most buildings have design lives of less than 100 years, so we’re going to have to rebuild around the database cluster about 1,000 times – probably best to put the cluster on the ground floor.

Let’s face it, if you want to think in such mind boggling time scales as 100,000 years, you need to think about the rate at which other technologies are changing. For example, alternating current is likely to be just a passing fad (it has only been around since Nikola Tesla’s time – about 120 years). In fact, in late breaking news we have just heard that, by as early as the year 2134, conversion to flugelrad power is expected to be completed. Sadly, as I am sure you are aware, flugelrad power is completely incompatible with 115/240 volts AC. So any clusters currently running on that old fashioned electricity nonsense are going down and staying down.

And let’s question another implicit assumption in these figures. How often, in practice, are production SQL Server clusters set up, like this, with zero redundancy between the nodes? In my experience, for the very reasons outlined above, very few are; most database people are more sensible than that. So, in practice, this whole press release, and the arguments that it sparked, were based on a scenario that is not likely to be met in the real world.

You will have gathered by now that I don’t, personally, find these figures from Oracle convincing. I don’t think that they contributed anything meaningful to what was an important issue five years ago; what they did do was to start (or, at the least, contribute to) the myth that SQL Server clusters were unreliable.

But please bear in mind that this doesn’t mean that I think I have just somehow proved that SQL Server clusters were incredibly stable in 2001. I don’t think that this press release proved anything useful about the real world, so it is equally true that showing it as flawed also proves nothing about the real world. It is important that I make that clear. After all, I want my point to be plausible. ®
