OVH goes TITSUP again while trying to fix its last TITSUP
Attempt to harden network failed, badly, so the call's gone out to Cisco for help
Posted in Cloud, 7th December 2017 00:24 GMT
European web hosting outfit OVH has reported its second major outage and Total Inability To Support Usual Performance* in a month and admitted the new outage was caused by its attempts to fix the cause of the last one.
OVH attributed its November outages to power problems and cable cuts.
But this incident notice filed by CEO and founder Octave Klaba on Wednesday 6 December stated “the problem was related to a software bug on the equipment we use which caused the deletion of the configuration.”
The notice continued: “Since then we have updated the equipment on everything our network. Also to prevent this type of bug from never [sic] again causes a worry [sic] about our DCs, we have decided to divide equipment clusters into 3 on the RBX website. So, if we ever have again this bug, the configuration would only impact 30% traffic.”
The company planned to change to that new regime late on 6 December, 2017, European time. But the changeover to the new systems has failed and caused connectivity problems and outages in Europe and beyond.
“During the preparation of the maintenance that was to start at 23:00, the configuration disappeared again at 8:20 pm and all the links were down again!!!!!” Klaba's notice said.
Those are Klaba's exclamation marks, by the way. It's understandable he used so many because the next sentence is: “The database has been deleted while we are using the latest software version. So there is another bug!”
Next step? “We look with Cisco to understand why all the links are not UP while the configuration was delivery to RBX.”
The outage has made for an ugly Status page at OVH, as depicted below at the time of writing.
OVH's Status Page: red lights everywhere.
If there's a small piece of upside in this incident, it's that it struck late in the European evening and continued into the small hours, times when traffic is low and some customers may not notice massive impact on their operations.
But there will also be plenty who were impacted, and irritated, and wondering why they give their business to a company that has also experienced flood damage and can't configure routers well enough to avoid this sort of thing.
OVH has promised to send The Register a statement about the incident. ®
* Total Inability To Support Usual Performance = TITSUP