The silence of the racks is deafening, production gear has gone dark – so which wire do we cut?
On Call Hit reset on the working week, for Friday has arrived, and with it another entry in The Register's long list of on-call shenanigans.
Today's story, from a reader The Reg's patented anonymiser has elected to call "Jon", is a cautionary tale for those tasked with keeping the data centre lights on.
Jon's employer had added some new racks in a tier-one data centre. While the DC naturally had its own generator to keep the power flowing, the bigwigs were worried that it would not kick in fast enough in the event of an outage, so had also fitted in-rack Uninterruptible Power Supplies (UPS).
The UPS vendor? A popular one that Jon and the team referred to as "Armenian Potato Clocks".
"Because the production equipment was most important, we naturally only protected the production racks," Jon told us. "The DC vendor allowed us to install the UPS units, but due to a fire code required us to connect to an Emergency Power Off (EPO) circuit."
Jon was not aware of the latter, and subsequently significant, fact until much, much later.
The call came in at the end of a Friday: "I got an alert that one of the UPS's had failed, so I decided to head down to the DC to check it out."
"Sure enough," said Jon, "the unit was emitting a beleaguered high-pitched whine, so I decided to reset it.
"Where once there had been a great deal of white noise from the whirring of fans and disks, suddenly there was great, terrible silence."
Every production rack had gone dark. "I took a look around the rest of our facility and found the same thing everywhere." The Unix machines, Windows boxes, SAN, switches, everything was dead.
Naturally, the non-production gear was still merrily humming along, but that was of scant comfort to Jon, who "fruitlessly tried to restart the UPS's, which remained stubbornly offline".
Jon made the call and got all hands on deck to work out what could have happened. "My manager came around, took a look at the back of the UPS, and saw a tell-tale indication of smoke residue on the back coming from the EPO junction."
This was the moment Jon learned about the EPO circuit.
The DC electricians were summoned "and determined that the failed UPS was shorting back into the EPO circuit… which of course was shared by all the other UPS's, which were all receiving an emergency power-off signal."
Whatever borkage had transpired within APC's finest had caused the EPO wire to fuse internally, so it couldn't be simply yanked out.
"I still remember a bunch of us standing around the back of the UPS while an electrician held a pair of wire cutters to the EPO wire like he was disarming a bomb in a terrible action movie, asking us 'Are you sure you want to do this?'"
We look forward to a future Bruce Willis actioner in the form of "Die Hard with a UPS".
With fingers crossed and, we fervently hope, some dramatic music rumbling in the background, the cable was snipped and… "everything powered right back up."
After the gang spent some quality time checking that everything was back online and nothing had been left too unhappy by the surprise reboot, Jon was interrogated over the sequence of events that led to the production outage.
"Some short time later, the UPS gear was removed."
Ever hit reset before the fragrance of burning insulation reached your nostrils? Or realised a tad too late that a critical bit of configuration information was missing from your training? We have, and have the burnt-out power supply to show for it. Share your experience via an email to On Call.
Where were you 20 years ago? Were you frantically cutting COBOL or adding a crucial extra byte or two to a date field? Or a bodge that might last to, oh, 2050 before it explodes? Who, Me? and On Call would also like to hear your sordid Y2K tales for a festive feast of near-failures and dodged bullets. ®