Everest outage was caused by split brains
Bit barn says routers in stack ran different software versions
Updated Server farm Everest's blackout on 15 November was caused by a power outage combined with stacked routers each running different software versions.
A "reason for outage" document issued by Everest admitted to there having been a "loss of connectivity" for clients using IP network services between 0830 and 1030 on 15 November.
The trigger of the outage was a power failure, which Everest said it was investigating separately. However, the killer moment happened after techies started trying to restore services:
When power was restored, it turned on and off a few times in quick succession, which caused the routers to enter a failure recovery mode that needed manual intervention.
Everyone was back up and running by 1031, with the incident being declared over by 1130.
The report (PDF) explained how Everest's Reading data centre is split into four data halls, and how data hall 2 (DH2) was the one that was affected. Its Juniper aggregation routers – two stacks, each made up of an EX4550 and an EX4200 combined into one – lost power.
When the wiggly amps started flowing again after three minutes, they promptly stopped after 11 seconds. After five minutes the juice came back, only to give up for another eight before power was fully restored.
The software problems were described by Everest thus:
When the Juniper EX4550 and EX4200 in each stack booted, they could not correctly form a stack due to different software versions. In a stack in normal operation, one device will be elected as "master" and other devices as "slaves". When they have dissimilar software versions they will not form a stack, and there will be a split-brain scenario and loss of connectivity.
Engineers powered up the EX4550 first, ensuring it was running smoothly as the master device before powering on the EX4200 as the slave.
They later made sure that all routers in the aggregation stack were running the same software version, so that automatic reboots following a power supply failure would work.
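For readers who want to see the failure mode in miniature, here is a toy Python model of the behaviour the RFO describes. It is a sketch under stated assumptions, not Juniper's actual election logic: the `Switch` class, the `priority` field, the "highest priority wins" rule and the `release-A`/`release-B` version labels are all hypothetical stand-ins. The only thing taken from the report is the rule that devices with dissimilar software versions will not form a stack together.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Switch:
    name: str
    version: str   # placeholder release label, not a real software version
    priority: int  # hypothetical election priority; higher wins in this toy model


def form_stacks(members):
    """Group members by software version; each group elects its own master."""
    groups = {}
    for member in members:
        groups.setdefault(member.version, []).append(member)

    stacks = []
    for version, group in groups.items():
        master = max(group, key=lambda m: m.priority)
        slaves = [m.name for m in group if m is not master]
        stacks.append({"version": version, "master": master.name, "slaves": slaves})
    return stacks


def is_split_brain(stacks):
    """More than one stack means more than one device believes it is master."""
    return len(stacks) > 1


if __name__ == "__main__":
    mismatched = [
        Switch("EX4550", "release-A", 255),
        Switch("EX4200", "release-B", 128),
    ]
    matched = [
        Switch("EX4550", "release-A", 255),
        Switch("EX4200", "release-A", 128),
    ]
    print("mismatched split brain:", is_split_brain(form_stacks(mismatched)))
    print("matched split brain:", is_split_brain(form_stacks(matched)))
```

With mismatched versions each device ends up as master of its own one-member stack, which is the split-brain condition. With matching versions the model yields a single stack with one master and one slave, the state the engineers restored by hand.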
"Clients with layer 2 services, wavelength services (as per the RFO), and anyone taking network services from other carriers were unaffected," clarified Ed Butler of Everest DC.
The power outage hit some of Everest's data centre customers including Gov.UK's cloud computing provider Memset.
El Reg asked why the UPS didn't kick in. Memset's deputy CEO Annalisa O'Rourke explained: "Despite a rigorous testing regime for the generators at Everest, the generators started but failed to maintain the load placed upon them. Every expectation was that they are specified to manage the full load and capacity of the data centre for an extended period.
"We apologise to all our customers for the outage and the impact on their businesses. We would also like to relay our appreciation for the consideration and understanding they showed whilst we were dealing with the incident."
O'Rourke later said that Memset was "more than happy with the specification and suitability of the generators" at Everest. ®