3,500 servers go down – so my FIRST AID training kicks in
Night shifts have their upside – including the money
On-call Welcome again to On-Call, our getting-slightly-more-regular look at Reg readers' professional escapades at odd times of the day or night, usually in odd places.
This week's tale comes from Pat Dufresne, who tells us he spent two years on the night shift at a hosting company in Montreal, Canada, and rates the gig “the best two years I've spent working in IT”.
It was quiet at night, his boss wasn't there, things seldom went awry and his body clock was in synch with Montreal's party scene. In fact, Pat even scored a ten per cent pay top-up to compensate for the unsociable hours.
“It made for a rather pleasant experience overall, that I would not trade in for anything,” he told us. “I was also senior on the shift and in charge of all the large and VIP clients.”
Of course it couldn't last.
“I walk in around 23:45,” Pat recalls. “The senior from the previous shift comes over to me in a panic, looks at me and says 'good luck, I'm out of here' and hurriedly walks out the door. I had just walked into a complete clusterfuck, coffee in hand, and I didn't even realize it.”
“We had just lost a third of our main data centre not 10 minutes earlier because the transfer switch from grid to generator didn't trip. The generator was running fine but 3,500 servers were down. The grid died because of a fire in another part of the grid (close to the data centre) and firefighters called the Electrical Company to cut the power for safety reasons.”
“In one foul generator failure I went from simple Sysadmin to Crisis Manager, Customer Service Manager and Team Lead.”
“Luckily the previous shift Sysadmins, and first level phone guys along with the overnight shift sysadmins and level one support guys, were with me. I quickly got to work structuring everyone and moving from panic mode to something a bit ... calmer.”
First level support folk were assigned to live chat and phone support. Sysadmins scored the job of confirming which servers were down and working with Pat and the data centre.
Pat chatted to VIP and large clients, co-ordinated comms between the customer support centre and the data centre, and figured out how to prioritise support so the data centre could bring servers back online in an order that would satisfy customers. He also had to keep morale up while the company was taking angry calls from literally every part of the world.
“It helped that I had first aid, leadership and crisis training from being a first aid responder when I was 16,” Pat said.
“At around 04:30 (five hours after the transfer switch failure) the first representative of management showed up to the office as if we were an afterthought and started to assert themselves. Problem is, the situation is under control and already managed. We're in over our heads in tickets, but otherwise still swimming along. Cue hiding in the office without a peep until day shift comes in.”
“In the end, we reinstalled at least 1,000 servers because when the power was re-established, a lot of hard drives died. I finished my day around 13:00 and when I looked at the stats, I had done three per cent of total ticket communications in that period, which was just shy of 3,000 interactions (messages).”
Pat now rates the incident “by far the most fun I've had working as a sysadmin”, even more than the day on which he had to “reconfigure a client's network twice in one day (16-hour shift) because of a massive 20Gbps DDoS attack (saturating 2 of our Level 3 interlinks)”.
What's your idea of fun at work? We're always on the lookout for tales of on-call action, plus expat experiences when you move to another country. Share your tale by contacting me with this form. ®