'No BS' web host Gandi emits outage postmortem, has 'only theories' on what went wrong
Also reckons that it should 'accurately document the data recovery procedure' for metadata corruption. Y'think?
Hosting outfit Gandi has published its postmortem regarding this month's outage and concluded that while it still has "no clear explanation", the main problem was "the duration".
So that's OK then.
The mystery incident took down a storage unit in the company's Luxembourg facility at 14:51 UTC on 8 January. It wasn't until 13 January that data was restored and services were all back online, according to the postmortem published yesterday.
At most 414 customers were affected, according to Gandi.
The problem was that after failover attempts failed, the company 'fessed up via its various social media orifices that customer snapshots could well be lost. The file system used, ZFS, allows these snapshots of disks to be taken, and a good number of customers had expected these to be preserved.
Not so, said Gandi, as its Twitter operatives twisted this way and that in justifying the company's take on things. It was, the outfit insisted, up to customers to maintain their backups.
The postmortem doubled down on this, stating: "Contractually, we don't provide a backup product for customers," before mumbling: "That may have not been explained clearly enough in our V5 documentation."
The technical timeline published would make a good candidate for The Register's Who, Me? or On Call columns, and while Gandi is to be commended for its honesty, the floundering of its technical team was palpable as the situation unfolded.
Though the discovery that the version of ZFS in use was too old to support some of the import options that turned up during frantic documentation searching is quite comical, those pointing smugly at their own storage and hosting setups would do well to take a careful look at Gandi's experience.
At least the story had a relatively happy ending (not counting the lengthy outage).
Unsurprisingly, Gandi plans to finish its ongoing upgrade of storage units to a newer version of ZFS and, in what will likely tip admins off their chairs, "accurately document the data recovery procedure in case of metadata corruption".
It added: "We have identified areas for improvement internally in order to be even more fluid and responsive in near-real time." We'd suggest that a slightly more sympathetic approach to customers panicking over lost data, and holding off on posting Game of Thrones GIFs, would be a start.
The actual cause of the metadata corruption that left those customers dangling remains a mystery. Gandi said it had ruled out fat-fingered keyboard jockeys, saying that it didn't have a clear explanation, "only theories".
Maybe it was the server RAM wot dunnit, the company wondered. We, in turn, wondered if it was using ECC memory. A company representative told us that, according to the company's self-proclaimed BOFH, the servers do indeed use the stuff.
Ultimately, it said: "We acknowledge the main problem was the duration."
Customers, likely surprised by the lack of a documented recovery procedure for duff metadata and the get-ready-to-restore messaging from Gandi, might not entirely agree. ®