SQL Server and the 7.5-day MTBF
How Oracle spun some benchmark stats
Database Myths and Legends (Part 5)
Press releases issued by software companies are one of the more common sources of myths and legends in the database world. No real surprise there, you may think, but therein lies a paradox. We all know that press releases are highly partisan, so we expect everyone to treat them with suspicion; yet we aren’t surprised when they influence opinion, as the following one certainly did.
If we go back to the year of our Lord two thousand and one we find that many people, particularly in the Oracle world, were attacking SQL Server on the grounds that, when clustered, it was horribly unreliable. (True, many people in the Oracle world are still delighted to tell you that SQL Server is unreliable in general, but in 2001 there was significant debate about clustering in particular.) This debate was sparked, at least in part, by a press release issued by Oracle at the beginning of that year. It said, and I quote: “Patented Technology Delivers Virtually Limitless Scalability To Any Application; Mean time To Failure Estimated To Be More Than 100,000 Years For Oracle Versus 7.5 Days for Microsoft SQL Server Cluster Configuration” Well, not too much equivocation there. Let’s see. The choice is a database that fails every 7.5 days or every 100,000 years. I know which one I am going to recommend to the board.
Now, cynics that we are, we might suspect that these numbers have been exaggerated slightly, but we can also think “But even if it is an exaggeration, there must be some truth in it, so SQL Server is certainly less reliable than Oracle, even if we aren’t sure by what factor.” The rather bizarre part about this particular case is that when I dug further into this story at the time, it turned out that Oracle’s ‘estimate’ of the disparity was actually much greater than these figures would suggest. In other words, Oracle’s numbers implied that its clusters were much, much more reliable than this. To see how, we need to take a look at the numbers that Oracle used for this extraordinary claim.
At that time Microsoft had just set a TPC-C benchmark using a cluster of 12 machines. According to Oracle, the way in which Microsoft had set up this cluster meant that if one node crashed, the entire cluster went down. Again according to Oracle, Microsoft’s own estimate was that a single node could be expected to crash, on average, once every 90 days. Clearly the more nodes you have in a cluster like this, the greater the chance that on any given day, one of them will crash and bring the cluster down. Given 12 nodes, we’d expect 12 crashes in a 90-day period, which is one crash every 7.5 days on average. So that’s where that figure came from.
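The arithmetic behind the 7.5-day figure can be sketched in a few lines of Python. This is only an illustration of the reasoning described above; the function name is made up for the example, and it assumes the nodes fail independently at a constant rate, so that their failure rates simply add up.

```python
def series_cluster_mtbf(node_mtbf_days: float, nodes: int) -> float:
    """Mean time between failures of a cluster that goes down
    whenever any single node goes down.

    Assumption: nodes fail independently at a constant rate, so the
    cluster's failure rate is the sum of the node failure rates.
    """
    cluster_failures_per_day = nodes / node_mtbf_days
    return 1.0 / cluster_failures_per_day


# Figures as quoted in the article: 12 nodes, one crash per node every 90 days.
print(series_cluster_mtbf(90, 12))  # 7.5
```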
Oracle said that, in contrast, the failure of one of its nodes didn’t bring an Oracle cluster down; instead the cluster continued to run, albeit slightly more slowly. So, in direct contrast to the Microsoft cluster, the more nodes you added to an Oracle cluster, the greater the resilience the cluster displayed. For the entire cluster to crash, all twelve machines had to turn up their toes at the same time.
Well… nearly the same time. It turns out that a crucial figure we need here is the recovery time of the nodes. Suppose that it takes a day for a node to be recovered after it crashes. As soon as a node does crash, a critical one-day period starts. If the remaining 11 nodes all just happen to crash during that 24-hour period, then the cluster goes down. However, if even one of the nodes manages to stay up for that one day, then the cluster survives. If the recovery time is smaller, say an hour, then it is far less likely that all remaining nodes will fail during that period, so the overall resilience of the cluster improves.
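To get a feel for how strongly the recovery time drives the result, here is a deliberately crude Python sketch of the reasoning above. It is not the calculation Oracle used, which was never disclosed. The function names are invented for the example, the 90-day node figure is simply borrowed from the SQL Server discussion for illustration, and the model assumes independent nodes with exponentially distributed failure times.

```python
import math


def p_node_fails_within(window_days: float, node_mtbf_days: float) -> float:
    """Chance that one node crashes inside a window of the given length.

    Assumption: exponentially distributed time to failure.
    """
    return -math.expm1(-window_days / node_mtbf_days)


def rough_cluster_mttf_days(node_mtbf_days: float, nodes: int,
                            recovery_days: float) -> float:
    """Very rough mean time until every node is down at once.

    A first crash happens at a rate of nodes / node_mtbf_days per day.
    The cluster is then lost only if all the remaining nodes also
    crash before the first one has been recovered.
    """
    first_crash_rate = nodes / node_mtbf_days
    p_rest_crash = p_node_fails_within(recovery_days, node_mtbf_days) ** (nodes - 1)
    return 1.0 / (first_crash_rate * p_rest_crash)


# Compare a one-day recovery with a one-hour recovery (illustrative inputs).
for recovery_days in (1.0, 1.0 / 24.0):
    years = rough_cluster_mttf_days(90.0, 12, recovery_days) / 365.25
    print(f"recovery {recovery_days * 24:.0f} h -> roughly {years:.1e} years")
```

Whatever inputs you choose, shortening the recovery window shrinks the chance of eleven further crashes landing inside it, so the estimate grows enormously. That is why the undisclosed recovery time matters so much to any headline figure.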
Sadly the Oracle spokesperson I contacted at the time was unable to tell me what recovery time the company had used for the calculation, but he did say that: “It would take a catastrophic event for all twelve nodes to be down at once, making the application unavailable to users. Using the same twelve node configuration, Oracle is up and running for 7 trillion years.” Now 7 trillion years is a big number in anyone’s book, but the press release used the much more modest figure of merely 100,000 years. When I asked why the lower figure had been used he told me that it had been used in order “to make the statement more plausible.”
I like this line of reasoning. It works for me. If your calculation yields a number that doesn’t sound convincing, change the number until it does. Think of the headlines I can now write.
Research proves that mechanics live twice as long as other people!
Recent research has shown that mechanics, on average, accidentally cut themselves once every 5 days. One cut doesn’t kill a mechanic, but 12 on the same day will be fatal in all cases. On any given day, the chance of a mechanic dying the death of a dozen cuts is 1 in 5 to the power 12. In other words, this will happen about once every 668,421 years. Hmmm. That sounds implausible; let’s call it 150 years which is about twice as long as a normal person lives.
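For anyone who wants to check the mechanic arithmetic, a small Python sketch follows. The function name is made up for the example, and it assumes a year of 365.25 days, which is what reproduces the figure quoted above.

```python
def years_between_fatal_days(days_between_cuts: int, cuts_needed: int,
                             days_per_year: float = 365.25) -> float:
    """Average number of years between days with the fatal number of cuts.

    Follows the tongue-in-cheek reasoning in the text: the chance of a
    fatal day is taken to be 1 in days_between_cuts ** cuts_needed.
    """
    days_between_fatal_days = days_between_cuts ** cuts_needed
    return days_between_fatal_days / days_per_year


print(round(years_between_fatal_days(5, 12)))  # 668421
```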
You’ll notice, in my rush to write an arresting headline, I have also assumed that mechanics ONLY die from cutting themselves. Let’s now think about database clusters. Does anything other than node failure ever stop them?
Well, think about prosaic things such as database engine and/or operating system patches that require a reboot. And then there are power outages, malicious employees, meteors and plagues of locusts. (Don’t you hate those plagues? The little devils will jam the cooling systems and cause overheating.) Come to think of it, most buildings have design lives of less than 100 years, so we’re going to have to rebuild around the database cluster about 1,000 times – probably best to put the cluster on the ground floor.
Let’s face it, if you want to think in such mind-boggling time scales as 100,000 years, you need to think about the rate at which other technologies are changing. For example, alternating current is likely to be just a passing fad (it has only been around since Nikola Tesla’s time – about 120 years). In fact, in late-breaking news we have just heard that, by as early as the year 2134, conversion to flugelrad power is expected to be completed. Sadly, as I am sure you are aware, flugelrad power is completely incompatible with 115/240 volts AC. So any clusters currently running on that old-fashioned electricity nonsense are going down and staying down.
And let’s question another implicit assumption in these figures. How often, in practice, are production SQL Server clusters set up, like this, with zero redundancy between the nodes? In my experience, for the very reasons outlined above, very few are; most database people are more sensible than that. So, in practice, this whole press release, and the arguments that it sparked, were based on a scenario that is not likely to be met in the real world.
You will have gathered by now that I don’t, personally, find these figures from Oracle convincing. I don’t think that they contributed anything meaningful to what was an important issue five years ago; what they did do was to start (or, at the least, contribute to) the myth that SQL Server clusters were unreliable.
But please bear in mind that this doesn’t mean that I think I have somehow proved that SQL Server clusters were incredibly stable in 2001. I don’t think that this press release proved anything useful about the real world, so it is equally true that showing it to be flawed also proves nothing about the real world. It is important that I make that clear. After all, I want my point to be plausible. ®