Webhost in five day server FAIL
HostV's virtual private server node goes very private
Updated: This story has been updated with additional facts from Cirtex CEO John Xie.
New York-based webhost HostV - a division of Cirtex - is five days into a server node outage that has left customer websites completely inaccessible.
London-based Register reader Alan Ayoub says the outage has brought down 10 of his sites, and many others are complaining of downed sites in the HostV forums.
Ayoub's sites have been inaccessible since Thursday. "People's business and livelihoods are going down the toilet," he says.
Cirtex CEO John Xie tells us that the outage has affected thirty to forty customers. Cirtex's HostV division offers virtual private server (VPS) as well as dedicated server hosting. On February 2, with a Twitter post to a feed that provides server status updates, the company indicated that its VPS infrastructure was under attack. "We are experiencing some serious issues," the post read. "It seems like some kind of attack on our servers. Several nodes are down at the moment."
Over the next 21 hours, regular Tweets alerted customers to failures and repairs of various server nodes. Then, at midday on Thursday, the feed went silent. According to the last three Tweets, one server node was still down, and it seems the failure was related to RAID problems rather than some sort of server attack.
"NODE-16 is still offline at the moment restoring from backup, we apologize for the inconvenience and are doing our best to restore service," the feed said, before going silent.
A day later, in a post to the HostV forums, a company representative said Node 16 was still down but that the staff was in the process of restoring data.
"The node16 is under maintenance due to hardware issues," the post read. "We are currently restoring data on node16. All Os ad [sic] related files have been restored. Now we are restoring vps's data. We hope everything will be setup and fine very soon. We appreciate your patience in this. Currently 28% has been restored."
But four days later, the node is still down.
The company's last public post came this Tuesday morning. "As for an ETA on the restore, we don't have one," a company representative said. "However, it does look like its going to take a fair amount of time. More than 24 hours. Possibly more than 48 hours."
In this post, the company offered to set up affected customers on another node in the meantime, but this would not provide access to data. "That would be pretty much useless," Ayoub tells The Reg. "We still won't have our websites."
Ayoub - like other customers - complains that the company has been far too slow to provide updates on the situation, with as much as 24 hours passing between notices. And he's worried the data restore won't be successful. ®
Update
Cirtex CEO John Xie says that the node in question had a hardware RAID failure after it was rebooted for a server patch. "It caused so much corruption from the single RAID Card failure that we had to restore from backup," he says.
But the company is also having problems with its backup system, from R1Soft. "The restore process is still going on, and the main issue is the speed of restore from encrypted and protected files," he continues. "We're closely working with R1soft developers...to bring this one particular node back online."
"There are no excuses for this but we have already offered a refund and migration for all clients on this server after this restore has been completed."
Update 2
Xie adds that HostV's data on the node in question "has some corruption in the backup that is causing glitches for baremetal restore, so we're pursuing manual restore through R1Soft."
COMMENTS
Ah, backups
Most people don't start making backups (including local copies of remotely hosted content) until they've suffered catastrophic data loss.
Then they don't start testing the integrity of their backups until they've suffered further catastrophic data loss.
RAID fail?
Maybe if they're doing business-class hosting, they should be using some clustering technologies underneath so that a single server doesn't cause these types of problems. N+1 or N+2 are pretty standard scenarios for business-class services.
If you're hosting in a virtualized environment, you should be using the technology correctly, including shared storage and high availability. A single server shouldn't take out your hosting environment.
webfusion are the worst
Webfusion recently migrated to a new datacentre in Nov - entire nodes of VPS (including my client sites) were offline for a week. Then there were still issues about it being in the wrong container (eg running Win2003 SP1 but being placed in a Win2003 set) as there were functionality issues on the machine.
This was raised for a support ticket (24/7 support my arse) and they "investigated" meaning they then knocked the server (completely inaccessible) offline for 27 days. Managed to get it online after a second migration and it's completely wiped. No configuration, no data, nothing.
As of yet, no apologies, no explanation and no compensation.
