One titsup server kills Brit-hosted Donhost websites for THREE DAYS
Where was the backup?
A fault in just one server at Brit web hosting biz Donhost took out thousands of websites and emails for more than three days.
The service slowly found its feet again on Wednesday afternoon after the company officially confirmed it fell over on Monday at 7am. Affected customers posting in a help forum put the start of the downtime even earlier - saying they first noticed problems on Sunday night.
Email processing was knackered, too. Donhost assured folks that no messages were lost, though users reported that emails to the affected domains were bouncing.
Donhost, a trading name for British company Webfusion Ltd, fessed up to the cock-up on its service status page:
Server 50 connectivity issues: Due to a full system failure our administrators have been unable to recover the server to deliver websites and associated services. We have created a new server to host the sites and services on and our system administrators are currently restoring all sites from our fail safe backups.
Donhost pinpointed the fault in just one server - doomed server 50 - leaving many users perplexed as to how one machine could cause so much trouble and why there wasn't sufficient backup kit in place.
According to posters in the forums, Donhost recommended new resellers move to Heart Hosting - another Webfusion-owned server business. Donhost primarily offers premium business hosting and dedicated servers, and sells its services to clients that resell them to punters.
Donhost says that no material has been lost, the flaw has been fixed, and email handling will return to normal.
We asked parent company Webfusion for a comment on why the problem took so long to fix and what assurances it can offer customers. We will update this piece if they get back to us. ®
Updated to add
A Webfusion spokesman has been in touch to say: "It was a technical failure that resulted in a complete system overhaul, which regrettably took longer than we had anticipated. All customers are now fully operational and our teams are helping customers on a case by case basis."
COMMENTS
Maximum incompetence
Not much else to say, really.
Maybe once they're laid off, they can find a job at Microsoft, managing Azure, they'd fit right in.
Clarification from Webfusion ..
How long does it take to swap out the hardware and restore from last night's backups? I would suspect that most of the delay was in finding someone technical enough to do the job, as they fired most/all of the technical staff - to save money.
"It was a technical failure that resulted in a complete system overhaul, which regrettably took longer than we had anticipated"
That's totally cleared up the issue for me.
Re: Was it a windows server?
I have to agree, this is a damagement issue.
