Original URL: http://www.theregister.co.uk/2008/09/30/data_centre_risk/

Get ready for the coming data centre crunch

Can you go short on a power hungry server?

By Guy Kewney

Posted in Servers, 30th September 2008 07:02 GMT

"If there's going to be a theme of the Press Summit this year," mused one delegate on the flight to Portugal, "then it's going to be power, and heat." He should have been right.

We covered femtocells, 100-gig Ethernet, managed wireless, specialised security-oriented operating software for network switches, media gateways and (briefly) "green computing". The green session, I expected, would reveal some hairy truths.

For example, everybody in the co-lo business in every major metro area knows that the internet is on a cliff-edge because of electricity problems. The costs of providing power for huge centres like Telehouse and Red Bus down in London's Docklands are huge - but it's not the cost that frightens the planners. The question which has them shaking their heads is: "What will we do when it all hits the ceiling?" And many of them believe it will happen soon. They say, quite simply, that there isn't enough power for the equipment that's already installed; and that new equipment will need even more power. And it's simply not available.

I was talking to one medium-sized ISP about their move to a Maidenhead centre. "What about getting peering links to other internet centres?" I asked him. "Don't you have to be there for that?" He shook his head with vigour: No. "Oh, transit - that's not a problem. Two years ago, if you'd asked me I would have said yes, transit is our main concern, but today, it's power. We can't stay in Docklands. We can't get the power." And it's not a question of "they'll put the price up" or "it's carbon-careless" but "they simply can't get more power into the buildings".

From the point of view of rival co-location centres, power problems in London are probably good news - more refugees fleeing the congestion means more customers. But there's another problem, and that's the spectre of "points of failure". In theory, the internet ignores single failures and routes round them. In reality, people have been cutting corners.

One BT engineer expressed his frustration: "It isn't the company I joined ten years ago. Then, we did things which needed to be done. Today, there's the collapse of a whole raft of ISPs all connected to the internet through a single exchange in Stepney, East London. Thieves stole the switches, and for nearly 24 hours, all those ISPs and their customers were off the Web. It should not have been possible, but it happened. And there are other examples which employees like me can't talk about publicly... but we all know where they are."

His fear, and the fears of others running big networks, amount to a stark prediction: if we carry on the way we are going, the system will start fracturing. Byte-outs which take thousands of internet users offline for days or even weeks at a time will become more frequent.

Intel recently ran a test on power consumption in a big server-switch farm, working on the idea that power might be saved on cooling.

The problem with cooling in a big co-lo is that the new generation of hardware runs much faster by dint of using a lot more power. It also generates a lot more heat, and then the operators need to spend even more power on cooling. The Intel experiment suggested that perhaps we're over-cooling - that instead of keeping the data centres so cool that humans have to wear sweaters, we could use ordinary ambient air at ambient temperatures. How about (said the experimenters) taking the cooling system right offline, and only starting to chill the air if it rose above 90 deg F (32 deg C)? Yes, we'd have to run the cooling fans in the racks faster, but that wouldn't be significant...
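That threshold policy is simple enough to sketch in a few lines of Python. The 32 deg C set point is the one quoted above; the fan duty cycles and the chiller on/off interface are my own illustrative assumptions, not Intel's actual control logic:

# A hypothetical sketch of threshold-based cooling: chillers stay off until the
# intake air passes the set point, and the rack fans run harder instead.
CHILL_THRESHOLD_C = 32.0   # set point from the article (90 deg F)
FAN_NORMAL = 0.6           # assumed baseline fan duty cycle
FAN_BOOST = 0.9            # assumed duty cycle when running on ambient air only

def cooling_plan(intake_temp_c: float) -> dict:
    """Return a (hypothetical) cooling plan for the current intake temperature."""
    if intake_temp_c > CHILL_THRESHOLD_C:
        # Too hot for ambient air alone: bring the chillers online.
        return {"chiller_on": True, "fan_duty": FAN_NORMAL}
    # Otherwise run on ambient air and lean on the rack fans instead.
    return {"chiller_on": False, "fan_duty": FAN_BOOST}

if __name__ == "__main__":
    for temp in (21.0, 28.5, 33.2):
        print(temp, cooling_plan(temp))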

The experiment wasn't really necessary, say some. One switch manufacturer confided that "in many of the data centres our customers operate in China and other parts of equatorial Asia, this is already standard practice. Many of them simply couldn't afford to cool their centres down to the custom-and-practice levels seen in London or New York, and temperatures are a lot higher in those centres. But the difference between knowing that the computers will work OK at higher temperatures and knowing what the cooling costs was important. The Intel experiment saved millions of pounds in power."

More to the point, from the perspective of big city co-los, it saved power, meaning that people can continue to run servers and switches when they might otherwise have had the plug pulled.

Logic says that you plan for this. You work out what the biggest demand spike will be and what the hottest day might be, and provide power for that. Then, when you get people saying "Please put another five servers in for us", you say "No, that will take us over our safety margin for peak demand on hot days", and turn them down.
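Reduced to code, that planning rule is a single inequality. A minimal sketch, with made-up figures for the site feed, the safety margin, the hottest-day cooling overhead and the draw of a new server - none of these numbers come from any real co-lo:

SITE_POWER_KW = 2000.0     # assumed total feed available to the building
SAFETY_MARGIN = 0.85       # never commit more than 85 per cent of the feed
COOLING_OVERHEAD = 1.6     # assumed multiplier for cooling on the hottest day
KW_PER_NEW_SERVER = 0.5    # assumed worst-case draw per additional server

def can_accept(current_it_load_kw: float, new_servers: int) -> bool:
    """Would adding new_servers keep the hottest-day draw inside the margin?"""
    projected_it_load = current_it_load_kw + new_servers * KW_PER_NEW_SERVER
    hottest_day_draw = projected_it_load * COOLING_OVERHEAD
    return hottest_day_draw <= SITE_POWER_KW * SAFETY_MARGIN

if __name__ == "__main__":
    print(can_accept(current_it_load_kw=1000.0, new_servers=5))   # inside the margin
    print(can_accept(current_it_load_kw=1050.0, new_servers=50))  # over the margin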

Obviously, as temperatures rise, the power drain rises. But the logic which says "Always limit your exposure to this risk" isn't one which the marketing department wants to hear. "Surely we can take a few more systems in? Do you want our customers to take their business to Maidenhead? Have we in fact had any days when high temperatures and unusually high demand coincided? Aren't you exaggerating the possible risks? You techies..."

London has had an unusually cold summer. Some observers have suggested that if the weather had been like it was in the hottest years, when the region suffered long-term droughts and old people had to be taken to hospital suffering from hyperthermia, we'd already have seen large-scale equipment switch-offs. Others say this is nonsense - scare tactics from equipment vendors. But the careful and experienced are moving.

There's a lot we could do to reduce power drain. For example, Extreme Networks say that they have figures showing that most Ethernet ports are using up to 40 watts when powered up - and they are powered up even when there's no traffic going through them. "Monitoring power to the device means two savings: First, we know what the device is, and how much power it needs, so we don't let Power over Ethernet (PoE) waste energy by over-supplying those devices which are low-power items. And also, we can tell when there's nothing attached to the port, and turn power to it off."
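The logic Extreme Networks describe amounts to two checks per port: classify what's attached so it isn't over-supplied, and cut power entirely where nothing is attached. A minimal sketch, with invented device classes, wattage budgets and data structures - this is not any vendor's actual API:

# Assumed per-class power budgets in watts (illustrative figures only).
POWER_BUDGET_W = {
    "ip_phone": 7.0,
    "wireless_ap": 13.0,
    "camera": 13.0,
}
DEFAULT_BUDGET_W = 15.4  # fall back to the full 802.3af allowance for unknown devices

def port_power_allocation(attached_device):
    """Return the wattage to allocate to a port given what (if anything) is attached."""
    if attached_device is None:
        return 0.0  # nothing attached: turn power to the port off entirely
    # Known low-power devices get only what their class needs, so PoE doesn't over-supply.
    return POWER_BUDGET_W.get(attached_device, DEFAULT_BUDGET_W)

if __name__ == "__main__":
    ports = {1: "ip_phone", 2: None, 3: "wireless_ap", 4: "unknown_gadget"}
    total = 0.0
    for port, device in ports.items():
        watts = port_power_allocation(device)
        total += watts
        print(f"port {port}: {device or 'empty':>15} -> {watts:4.1f} W")
    print(f"total allocated: {total:.1f} W")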

It's also been suggested that there are a lot of old, unused servers in data centres. "Nobody knows what they do, and nobody is prepared to say that if nobody knows, maybe they should be turned off," said one centre technician. "Some of them are antiques, generating enormous amounts of heat, which could be easily replaced by one new piece of kit which would do the work of dozens of those old ones, and use half the power of any one of them."

One supplier told me that his estimate was that as many as 40 per cent of servers were unused in long-established data centres. Unused, but switched on.

There are new approaches, too. Clustering, for example. I'm expecting to hear of a startup (now in stealth mode pre-launch) which is building servers from boxes the size of a Rubik's cube, running off a tenth of the power needed to operate a quad-core AMD or Intel box. "The power-MIPS curve can be changed with these things. As the demand rises, instead of the power rising exponentially, you just switch in another micro-server," said a source who knows the product plans.
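The "switch in another micro-server" idea can be caricatured in a few lines. A minimal sketch, with invented capacity and wattage figures for an unannounced product - nothing here is a real specification:

import math

MICRO_SERVER_CAPACITY_RPS = 100   # assumed requests per second each small box handles
MICRO_SERVER_WATTS = 45           # assumed draw per box, roughly a tenth of a quad-core

def servers_needed(demand_rps: float) -> int:
    """How many micro-servers to switch in for the current demand."""
    if demand_rps <= 0:
        return 0
    return math.ceil(demand_rps / MICRO_SERVER_CAPACITY_RPS)

def cluster_power(demand_rps: float) -> float:
    """Total draw in watts: roughly linear in demand, rather than a steep curve."""
    return servers_needed(demand_rps) * MICRO_SERVER_WATTS

if __name__ == "__main__":
    for demand in (50, 250, 1000):
        print(f"{demand:4d} req/s -> {servers_needed(demand)} boxes, {cluster_power(demand):.0f} W")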

The question of why people continue to expand their server designs when they must be aware of the problems they cause is an interesting one. "I know a customer who buys hundreds of servers a year, who approached one of the two big chip makers and said: 'Design us a lower power server' - and was told: 'No, that's not in line with our strategy.'"

Naturally, once the ceiling is breached and angry internet users are being evicted from cyberspace, panic will ensue and steps will start to be taken to bail out the co-lo centres. It will be rushed, it will cost far more than it should, and it will be impossible to do quickly anyway. As with the credit crunch, the people responsible could have predicted and avoided the problems if they'd started planning five years ago.

So the real question is: Are we taking the problem seriously now? And if not, shouldn't we be? ®