BOFH: Power corrupts, uninterrupted power corrupts absolutely
The best things in life are free, or at least come when there's no charge
"In the old days, when we had around 40 servers with a total power requirement of about 30 kVa – including Mission Control – the 100kVA generator fed the two redundant 30kVA UPS units with sufficient capacity left over to power one of the lifts," I reply.
"These days," the PFY says, getting in on the act, "with the number of servers we have, with their development, test, pre-deploy and production instances – even WITH virtualised platforms – there's barely enough capacity left over to power the La Marzocco GB/5."
"It's a CMS," I say, not muddying the water by telling the boss that in this instance, the C stands for coffee. "So because we're borderline like this, we've pulled power back to the absolutely crucial services – that way we get a chance to shut things down in a stable fashion, while maintaining a work platform from which to launch recovery services."
"So our UPS and generator don't work?"
"They would work perfectly well until they're needed, at which point the two – now non-redundant – 30kVA UPS units trip into bypass, being unable to deliver the 40-odd kVA that's demanded of each of them. The power goes out because the generator has a 30-second start delay built in and hasn't autostarted yet. About 25 seconds later, the generator autostarts, runs up until the power quality breaker is energised and determines that the voltage has been consistently stable for 10 seconds, at which time it trips the breaker to cutover from the dead mains to our generator supply. The breaker trips, the generator internal contactor shits itself – because the start current demand from everything downstream of the UPS is about twice the capacity of the generator – and the contactor opens, but not before sending about 1/2 a wave of around 160v down to every piece of downstream equipment, with a big spike when it chops off. With the contactor open, the power quality breaker de-energises open so that when the mains comes back on, the power to the equipment will too. Only the generator is still running at no load, and the contactor – which is a combination electromagnet and bimetallic thermal shutoff – will reset itself within 8 to 10 seconds. The power quality breaker sees a clean supply, cuts over, the contactor trips, the power goes off with another spike, the contactor resets and the process repeats."
"The equipment which isn't ruined by the spikes," the PFY says slowly, "is usually the stuff which is 'protected' by crap rack breakers which can't handle the start current either. But it's always the cheapest stuff which is in those racks: the good equipment is always in dual-fed racks with D-curve breakers which are more than happy to pass spikes, even though they claim to have surge protection. The bad news is that with those crap racks out, the contactor trip voltage now rises higher – sometimes as high as 220v – so the spike voltage increases with every reset."
"What can we do?" the Boss gasps. "Can we get a bigger generator and bigger UPS units?"
"The generator for this place is on the roof and it came with the building. About the only thing that could get that off the roof is an earthquake – and then it would only travel vertically."
"So what do we do?"
"We rate the services we consider most critical and keep them running. The rest we shut down."
"How do we choose?"
"I guess we discuss it. Over a pint. At the pub. Now?"
"At the pub?"
"Well nothing's happening here, is it? And with a dead transformer we're talking about at least a day to reinstate. So lets have a couple of quiet ones and discuss priorities."
"I..." the Boss says, thinking about it but seeing no alternatives. "I guess so."
Now we just need to ensure we get several pints into him before he realises that (a) we're not paying for our drinks, (b) the pub is the only building on the block apart from Mission Control with power, and (c) there's a thick cable which seems to join our two buildings.