HP hunts down 'rare' BladeSystem problem
Only one power supply domain, and if it fails...
A power supply failure in HP BladeSystem c7000 enclosures can cause the whole BladeSystem to fail, the firm has admitted.
According to an HP advisory note: "HP has identified a potential, yet extremely rare issue with HP BladeSystem c7000 Enclosure 2250W Hot-Plug Power Supplies manufactured prior to March 20, 2008.
"This issue is extremely rare; however, if it does occur, the power supply may fail and this may result in the unplanned shutdown of the enclosure, despite redundancy, and the enclosure may become inoperable."
So, the issue is extremely rare, says HP. But it applies to any HP BladeSystem c7000 Enclosure configured with an HP c7000 Power Supply, if the power supply was manufactured before March 20, 2008. Each enclosure can hold up to six supplies.
Our understanding is that all the power supplies in the enclosure are connected together, forming a single power domain. The blades in the system connect to a single power bus. If a power supply fails in the way HP describes, all the blades may stop working, meaning that all their applications, including any virtual machines, go offline. Effectively, the BladeSystem c7000 design has a single point of failure that limits its redundancy.
HP's advisory goes on to say that: "BladeSystem c7000 Enclosure Power Supplies manufactured on or after March 20, 2008, and DC-powered enclosures (typically utilized in an Integrity blade environment) are not affected. To ensure stability of your computing environment, HP is providing a power supply identification utility to enable customers to identify potentially affected power supplies. Supplies identified by the utility will be replaced by HP."
HP has published more information about the identification utility. Defective power supplies will be replaced free of charge.
The company provided a statement about the issue: "HP has been made aware of a very small number of incidents involving power supply failures in the BladeSystem c7000 enclosure. Because customer service and product quality are top priorities for HP, the company is working with HP BladeSystem customers to replace all potentially affected c7000 power supplies purchased by customers." ®
Having to lose two power supplies does not make a SPOF
To Anonymous Coward (posted Thursday 15th January 2009 13:13 GMT): SPOF means Single Point of Failure. Losing two power supplies is Multiple Points of Failure, which is far less probable. Even so, in the IBM BladeCenter, the two power supplies in each pair that must fail together to take the enclosure down are connected to separate power harnesses, with each harness meant to connect to a separate external power grid. Therefore the failure of any single power grid will only take out one member of each of the two power supply pairs, leaving the enclosure running.
HP and Dell have also adopted the same design in splitting the six power supply inputs over two power grids. Unfortunately, instead of extending this redundancy to the DC side of the power supplies, they all converge onto one DC bus on one midplane.
The IBM BladeCenter has two separate midplanes, each one with its own DC bus. That's why all IBM blade servers have two power connectors: they draw power from two separate DC buses. I don't know if the active components you speak of are hardware monitors or in the data and power paths. Whatever the case may be, there are two duplicate sets because there are two midplanes, so again, no SPOF.
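The difference between the two layouts described in this comment can be sketched as a toy model. This is only an illustration of the commenter's description, not a verified picture of either vendor's hardware: the `Supply` class, the bus names `dc0` and `dc1`, the grid names `A` and `B`, and the helper functions are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Supply:
    name: str
    grid: str  # external AC feed the supply plugs into (assumed names "A"/"B")
    bus: str   # DC bus the supply feeds inside the enclosure (assumed names)


def live_buses(supplies, failed_grids=(), faulted_buses=()):
    """DC buses with at least one supply on a working grid and no bus fault."""
    fed = {s.bus for s in supplies if s.grid not in failed_grids}
    return fed - set(faulted_buses)


def blade_powered(blade_buses, supplies, failed_grids=(), faulted_buses=()):
    """A blade stays up if any bus it draws from is still live."""
    return bool(set(blade_buses) & live_buses(supplies, failed_grids, faulted_buses))


# Single DC bus: six supplies split over two grids, all feeding one shared bus.
single_bus = [Supply(f"ps{i}", "A" if i < 3 else "B", "dc0") for i in range(6)]

# Two DC buses: one pair of supplies per bus, each pair split over both grids.
dual_bus = [
    Supply("ps0", "A", "dc0"), Supply("ps1", "B", "dc0"),
    Supply("ps2", "A", "dc1"), Supply("ps3", "B", "dc1"),
]

for label, supplies, blade in [
    ("single bus", single_bus, ["dc0"]),
    ("dual bus", dual_bus, ["dc0", "dc1"]),
]:
    grid_loss = blade_powered(blade, supplies, failed_grids={"A"})
    bus_fault = blade_powered(blade, supplies, faulted_buses={"dc0"})
    print(f"{label}: survives grid loss={grid_loss}, survives bus fault={bus_fault}")
```

In this model both layouts ride out the loss of one external grid, but only the two-bus layout keeps a blade powered when a fault takes down one DC bus, which is the distinction the comment is drawing.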
Was the HP power supply recall a result of a bad batch of power supplies? That does happen to every vendor from time to time, so it is plausible that this is just bad luck. However, the recall affects all power supplies for the c7000 manufactured before 20 March 2008, that is, since the launch of the c7000 in 2006. By their own calculation, HP claim to have shipped more than a million blades. Considering e-class, p-class and c-class, c-class is by far the most successful and would account for 500,000 blades or more. Assuming 6 power supplies for every 16 blades, that's around 187,500 power supplies! That is not a bad batch, it's an expensive design flaw. A profit-making company would not make that kind of recall unless not doing it would cost even more. It makes you wonder about HP's definition of "extremely rare".
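The arithmetic behind that estimate can be checked with a short sketch. The inputs are the commenter's own assumptions (500,000 c-class blades, fully populated enclosures of 16 blades with 6 supplies each), not figures published by HP, and the function name is made up for illustration.

```python
# Commenter's assumptions, not HP figures.
C_CLASS_BLADES = 500_000
BLADES_PER_ENCLOSURE = 16
SUPPLIES_PER_ENCLOSURE = 6


def estimate_supplies(blades, blades_per_enclosure, supplies_per_enclosure):
    """Estimate shipped power supplies, assuming every enclosure is fully populated."""
    enclosures = blades / blades_per_enclosure
    return enclosures * supplies_per_enclosure


estimate = estimate_supplies(
    C_CLASS_BLADES, BLADES_PER_ENCLOSURE, SUPPLIES_PER_ENCLOSURE
)
print(f"{estimate:,.0f} power supplies")  # prints "187,500 power supplies"
```

Partly filled enclosures would raise the number of supplies per blade, and enclosures fitted with fewer than six supplies would lower it, so the result is an order-of-magnitude guide at best.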
Unfortunately, the design flaw is not in the power supply (I would expect HP to be capable of making power supplies as good as IBM) but in not having a redundant DC bus. To fix this is a lot harder, because the midplane would need to be changed and a redundant power connector has to be added to every blade. This is a whole new architecture which would be incompatible with existing blades, something HP would be loath to do given that e-class, p-class and c-class blades are mutually incompatible.
So rather than fix the real problem, HP have elected to issue improved power supplies (probably with better DC fault isolation) to reduce the probability of failure. It's like issuing a recall on all cars to upgrade the suspension rather than fixing the potholes in the road that are causing the crashes in the first place. I can understand why they have done this, but it certainly convinces me that my VMware cluster is going to be deployed on rack mount servers rather than blades, or at least not on HP or Dell blades anyway.
I disagree with A/Coward. 2 power supplies with a redundant power domain is clearly proven to be more reliable than multiple power supplies in a single power domain. The reason HP/Dell use a single domain (single power connector on each blade) is to get a greater density. No matter how well you design components, sometimes bad things happen. That's why you have redundant everything, especially in a chassis that hosts multiple physical servers, each of them hosting multiple virtual servers. HP will "fix" the power supplies but the design compromise remains, and power supplies will still fail outside this "bad batch" problem.
The IBM blade chassis has 4 power supplies and if I lose the wrong two, I lose power to half my blades. There's no mystery - HP and Dell didn't copy this because it's an inferior design.
Furthermore, IBM also has a single active midplane, which is an even greater SPOF.