*Thunk* No worries, the UPS should spin up. Oh cool, it's in bypass mode
Power outage and no Plan B – hilarity ensues
This Damn War
Whatever can go wrong will go wrong. It's a law most IT people would understand and perhaps even fear.
It was my third day as the new network manager for a reasonably sized estate across several sites, most inhabited by weirdy beardies who had jobs like counting bats, frogs and other animals you may never have heard of.
It was the epitome of public-sector wastage: I was supervising the BT engineer installing a new leased line because the stand-in contractor/IT manager had ordered a 100Mbit pipe for the wrong site. No matter, we still had to pay for the other one as we were contractually obliged for the next several years.
While I was chatting to him about the setup, it all went dark. Worse, there were no more humming sounds. As we were right next to the data centre, this meant bad things had happened.
People came out of the offices, unsure what to do or what had happened. No one had the details for the local power company. No one had ever really thought about disaster planning and putting all the useful information into a useful format.
The only place with electricity was the local hospital. It was going to be a bad day. I hadn't even got to know most of the IT department as yet. Then it dawned on me: "Hang on, why did the UPS not take over?"
I got that sick feeling in my stomach. It transpired that the UPS was in bypass mode while waiting for a new board of some description. There had been no handover so I wasn't aware. The previous contractor had left the Friday before I started because the company was not prepared to pay for a single week of handover time. This was going to be ugly. A dirty shutdown on rack after rack of servers and storage arrays is not a good way to start the day.
Being the new IT manager, I asked for a copy of the company's emergency procedures. As you can guess, it didn't exist. There was no process or protocol. Even getting the message out to satellite sites proved difficult because Power-over-Ethernet phones failed to work.
It came to light that the company had only purchased the emergency phone units a few weeks earlier and none had been installed yet. All the site details were on desktop computers so we had to see if anyone in the helpdesk had a laptop with the site phone numbers. The helpdesk couldn't even answer the phones to the irate beardies who couldn't get to their obscure animal websites.
One of the biggest issues aside from communication was the fact that the DC was crowded with people but totally in the dark. Very little light in the room, except from two high-level windows. We had two torches and eight people. It was piss poor planning at its finest.
In between all this the Openreach engineer had disappeared – but that was the least of my issues.
Eventually power was restored and we had to turn the servers back on. One good decision we did make was to fully unplug the power from all the devices to prevent a spike/brown-out event on power restoration. Again it was a seat-of-my-pants effort to decide which servers were most critical and get them online.
Luckily the storage array came up after what seemed like an age. Disk checks on these servers took a fair old time and most of the machines were recovered without too many issues. We had the odd PSU and disk failure here and there but we got by relatively unscathed.
Perhaps just as important were the following lessons learnt – not just by me but by the entire department:
- Have a plan to deal with the most likely scenarios – e.g. total power loss – and work through it as though it had just happened. Have mobile phones charged and ready to go so you can get the word out. It may sound obvious but one of the big issues we had to work around was "What next?" and making it up on the fly. It's the perhaps obvious questions that you have to work out, such as which servers are key.
- Hard copies are essential. What if you can't access that computer? Print out copies of all the site addresses, phone numbers and contacts and put them somewhere safe where everyone can get to them.
- Just fix it already. Although the issue as a whole was outside of our control, there were items that should have been dealt with, irrespective of who was in charge. Putting a UPS into bypass mode is not good but it shouldn't take months to sort out a new circuit board.
Had the UPS been functioning, there would have been enough power to shut the infrastructure down properly and get word out to the satellite sites.
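The "which servers are key" question from the first lesson is the sort of thing that can be written down long before the lights go out. What follows is only a minimal sketch of one way to do it, not what we had at the time: every system name and dependency in it is hypothetical, and it assumes Python 3.9 or later for the standard library's graphlib module. Each system lists what must be running before it, and a topological sort turns that into a power-on order, with the reverse serving as a shutdown order.

```python
from graphlib import TopologicalSorter

# Hypothetical inventory: each system maps to the systems that must be
# running before it is powered on. Replace with your own estate.
DEPENDS_ON = {
    "core-switch": set(),
    "storage-array": {"core-switch"},
    "dns-dhcp": {"core-switch"},
    "directory": {"dns-dhcp", "storage-array"},
    "file-server": {"directory", "storage-array"},
    "helpdesk-phones": {"core-switch", "dns-dhcp"},
}


def power_on_order(depends_on):
    """Return a start-up order in which every system follows its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


def shutdown_order(depends_on):
    """Return the reverse of the start-up order, so dependants stop first."""
    return list(reversed(power_on_order(depends_on)))


if __name__ == "__main__":
    # Print a numbered checklist suitable for taping to the inside of a rack door.
    for step, name in enumerate(power_on_order(DEPENDS_ON), start=1):
        print(f"{step}. {name}")
```

In keeping with the second lesson, the useful output is the printed list, not the script: it does no good sitting on a desktop that has no power.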
I left a few years later but the last I heard the company had spent several million pounds on a new site built directly on a flood plain with the IT hardware in the basement.
Sometimes organisations just don't engage their brains – or it is someone else's problem. ®