Fat-fingered admin downs entire Joyent data center

Cloud operator now home to most mortified sysadmin in the USA

Updated Cloud operator Joyent went through a major failure on Tuesday when a fat-fingered admin brought down an entire data center's compute assets.

The cloud provider began reporting "transient availability issues" for its US-East-1 data center at around six-thirty in the evening, East Coast time.

"Due to an operator error, all compute nodes in us-east-1 were simultaneously rebooted," Joyent wrote. "Some compute nodes are already back up, but due to very high load on the control plane, this is taking some time. We are dedicating all operational and engineering resources to getting this issue resolved, and will be providing a full postmortem on this failure once every compute node and customer VM is online and operational. We will be providing frequent updates until the issue is resolved."

The problems were mostly fixed an hour or so later.

For those not familiar with the cloud, a forced reboot of every server in a data center is just about the worst thing that can happen to a provider, aside from the deletion of customer data or multiple data centers going down simultaneously.

"While the immediate cause was operator error, there are broader systemic issues that allowed a fat finger to take down a datacenter," explained Joyent's chief technology officer Bryan Cantrill in a post to Hacker News. "As soon as we reasonably can, we will be providing a full postmortem of this: how this was architecturally possible, what exactly happened, how the system recovered, and what improvements we are/will be making to both the software and to operational procedures to assure that this doesn't happen in the future."

Joyent has service-level agreements in place that will compensate customers for downtime, we understand.

In going through such a stomach-churning fault, Joyent has joined an illustrious group of service providers that includes Rackspace, Microsoft, Google, and Amazon, all of which have had similarly catastrophic failures.

"Anything that allows you to administer many, many machines will allow you to do this," Cantrill told The Reg in a phone conversation. "There was a silver lining here in the sense it was an opportunity to see how the system behaved. There are lots of ways it could have been much worse."

Joyent will try to learn from the experience and will publish a full post-mortem as well.

As for the fat-fingered administrator? "The operator that made the error is mortified, there is nothing we could do or say for that operator that is going to make it any worse, frankly," Cantrill said.

Nor would Joyent want to, he explained. The goal for the company is to learn from the problem and get better, not mete out punishment. "You don't teach dolphins with a shock collar," Cantrill explained.

Update

Joyent has now published a post-mortem on the incident.

The cause of the outage was that an admin was using a tool to remotely update the software on some new servers in Joyent's data center and, when trying to reboot them, accidentally rebooted all of the servers in the facility.

"The command to reboot the select set of new systems that needed to be updated was mis-typed, and instead specified all servers in the datacenter," Joyent wrote. "Unfortunately the tool in question does not have enough input validation to prevent this from happening without extra steps/confirmation, and went ahead and issued a reboot command to every server in us-east-1 availability zone without delay."

At this point we imagine the operator emitted a high-pitched "oh dear oh dear oh dear" before watching the inevitable brown-out occur.

Bringing the system back online took such a long time because the rebooted servers all flooded the boot infrastructure with configuration requests, Joyent explained.
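For readers wondering how a flood like that is usually tamed, one common approach is to have each client wait a random, growing interval before retrying, so that thousands of machines do not all ask at the same moment. The sketch below illustrates that general technique (exponential backoff with jitter). It is not taken from Joyent's post-mortem and says nothing about how Joyent's boot infrastructure actually works; the function names, the base delay and the cap are all hypothetical.

```python
import random


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Return a random wait in seconds before retry number `attempt` (0-based).

    The upper bound doubles with each attempt until it reaches `cap`;
    the actual wait is drawn uniformly between zero and that bound, so
    clients that failed together do not retry together.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def retry_schedule(attempts, base=1.0, cap=60.0, seed=None):
    """Build the list of waits one client would use across `attempts` retries.

    `seed` exists only to make the example reproducible; a real client
    would leave it unset so every machine draws different waits.
    """
    rng = random.Random(seed)
    return [backoff_delay(i, base, cap, rng) for i in range(attempts)]


if __name__ == "__main__":
    # Hypothetical example: eight retries, waits capped at 30 seconds.
    for attempt, wait in enumerate(retry_schedule(8, cap=30.0)):
        print(f"retry {attempt}: wait {wait:.2f}s")
```

The randomness is the point: a fixed retry interval would simply reproduce the stampede a few seconds later.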

"First, we will be dramatically improving the tooling that humans (and systems) interact with such that input validation is much more strict and will not allow for all servers, and control plane servers to be rebooted simultaneously," Joyent said. "We want to reiterate our apology for the magnitude of this issue and the impact it caused our customers and their customers. We will be working as diligently as we can, and as expediently as we can, to prevent an issue like this from happening again."
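Joyent has not published the tool or the fix, so what follows is only a rough sketch of the kind of input validation the statement describes, under our own assumptions: a reboot request is refused if it names every server, touches a control plane server, or covers a large share of the facility without an explicit confirmation. Every name here (`check_reboot`, `RebootRefused`, the `confirm` parameter, the 25 per cent threshold) is hypothetical.

```python
class RebootRefused(Exception):
    """Raised when a reboot request fails validation (hypothetical type)."""


def check_reboot(targets, inventory, control_plane,
                 max_fraction=0.25, confirm=None):
    """Validate a reboot request and return the sorted list of targets.

    targets        -- server names the operator asked to reboot
    inventory      -- every server name in the data center
    control_plane  -- names of servers that must not be rebooted in bulk
    max_fraction   -- share of the inventory allowed without confirmation
    confirm        -- operator must pass the exact target count to proceed
                      with a request larger than max_fraction
    """
    targets = set(targets)
    inventory = set(inventory)

    if not targets:
        raise RebootRefused("no targets specified")

    unknown = targets - inventory
    if unknown:
        raise RebootRefused(f"unknown servers: {sorted(unknown)}")

    if targets == inventory:
        raise RebootRefused("refusing to reboot every server at once")

    if targets & set(control_plane):
        raise RebootRefused("control plane servers must be rebooted separately")

    if len(targets) > max_fraction * len(inventory) and confirm != len(targets):
        raise RebootRefused(
            f"{len(targets)} servers selected; pass confirm={len(targets)} to proceed"
        )

    return sorted(targets)
```

Asking the operator to type back the number of affected machines is a cheap extra step: a mistyped selector that suddenly matches thousands of hosts produces a number nobody would confirm by accident.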

El Reg would like to commend Joyent for its transparency about the outage and has made one virtual Sorry You Borked A Bit Barn pint available to the operator that caused the error. Interested parties can provide additional pints by selecting the beer icon in the comments below. ®
