Fat-fingered admin downs entire Joyent data center

Cloud operator now home to most mortified sysadmin in the USA

Updated Cloud operator Joyent suffered a major failure on Tuesday when a fat-fingered admin brought down an entire data center's compute assets.

The cloud provider began reporting "transient availability issues" for its US-East-1 data center at around six-thirty in the evening, East Coast time.

"Due to an operator error, all compute nodes in us-east-1 were simultaneously rebooted," Joyent wrote. "Some compute nodes are already back up, but due to very high load on the control plane, this is taking some time. We are dedicating all operational and engineering resources to getting this issue resolved, and will be providing a full postmortem on this failure once every compute node and customer VM is online and operational. We will be providing frequent updates until the issue is resolved."

The problems were mostly fixed an hour or so later.

For those not familiar with the cloud, a forced reboot of every server in a data center is just about the worst thing that can happen to a provider, aside from the deletion of customer data or multiple data centers going down simultaneously.

"While the immediate cause was operator error, there are broader systemic issues that allowed a fat finger to take down a datacenter," explained Joyent's chief technology officer Bryan Cantrill in a post to Hacker News. "As soon as we reasonably can, we will be providing a full postmortem of this: how this was architecturally possible, what exactly happened, how the system recovered, and what improvements we are/will be making to both the software and to operational procedures to assure that this doesn't happen in the future".

Joyent has service-level agreements in place that will compensate customers for downtime, we understand.

In going through such a stomach-churning fault, Joyent has joined an illustrious group of service providers that includes Rackspace, Microsoft, Google, and Amazon, all of which have had similarly catastrophic failures.

"Anything that allows you to administer many, many machines will allow you to do this," Cantrill told The Reg in a phone conversation. "There was a silver lining here in the sense it was an opportunity to see how the system behaved. There are lots of ways it could have been much worse."

Joyent will try to learn from the experience and will publish a full post-mortem as well.

As for the fat-fingered administrator? "The operator that made the error is mortified, there is nothing we could do or say for that operator that is going to make it any worse, frankly," Cantrill said.

Nor would Joyent want to, he explained. The goal for the company is to learn from the problem and get better, not mete out punishment. "You don't teach dolphins with a shock collar," Cantrill said.

Update

Joyent has now published a post-mortem on the incident.

The outage began when an admin, using a tool to remotely update the software on some new servers in Joyent's data center, tried to reboot them and accidentally rebooted every server in the facility.

"The command to reboot the select set of new systems that needed to be updated was mis-typed, and instead specified all servers in the datacenter," Joyent wrote. "Unfortunately the tool in question does not have enough input validation to prevent this from happening without extra steps/confirmation, and went ahead and issued a reboot command to every server in us-east-1 availability zone without delay."

At this point we imagine the operator emitted a high-pitched "oh dear oh dear oh dear" before watching the inevitable brown-out occur.

Bringing the system back online took such a long time because the rebooted servers all flooded the boot infrastructure with configuration requests, Joyent explained.
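
A common mitigation for this sort of boot-time stampede, and not necessarily the one Joyent chose, is to have each client wait a random, exponentially growing interval before retrying its configuration request, so the retries spread out instead of arriving together. A minimal Python sketch, in which the function name `backoff_delays` and the `base` and `cap` values are purely illustrative assumptions:

```python
import random


def backoff_delays(attempts, base=1.0, cap=60.0, rng=None):
    """Return one randomized wait (in seconds) per retry attempt.

    Illustrative only: each wait is drawn uniformly between zero and an
    exponentially growing ceiling, which is itself limited to `cap`.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)  # 1s, 2s, 4s, ... up to cap
        delays.append(rng.uniform(0, ceiling))   # random point below ceiling
    return delays


if __name__ == "__main__":
    # Each rebooted server would compute its own schedule independently.
    print(backoff_delays(5))
```

Because every server draws its own random waits, a few thousand machines that reboot in the same second stop hammering the boot infrastructure in lockstep.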

"First, we will be dramatically improving the tooling that humans (and systems) interact with such that input validation is much more strict and will not allow for all servers, and control plane servers to be rebooted simultaneously," Joyent said. "We want to reiterate our apology for the magnitude of this issue and the impact it caused our customers and their customers. We will be working as diligently as we can, and as expediently as we can, to prevent an issue like this from happening again."

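For the curious, stricter input validation of the kind Joyent describes might look something like the following Python sketch. It is not Joyent's tooling: the function name `validate_reboot_targets`, the host names, and the 25 per cent threshold are all hypothetical, chosen only to show the idea of refusing a whole-facility reboot and demanding explicit confirmation for anything large or touching the control plane.

```python
def validate_reboot_targets(requested, inventory, control_plane,
                            max_fraction=0.25, confirmed=False):
    """Return the sorted hosts that may be rebooted, or raise ValueError.

    Hypothetical guard rails, not Joyent's actual rules:
      * never reboot every server in the inventory in one go;
      * require explicit confirmation for control-plane hosts or for
        more than `max_fraction` of the fleet.
    """
    requested = set(requested)
    inventory = set(inventory)
    control_plane = set(control_plane)

    if not requested:
        raise ValueError("no hosts specified")
    unknown = requested - inventory
    if unknown:
        raise ValueError(f"unknown hosts: {sorted(unknown)}")
    if requested == inventory:
        raise ValueError("refusing to reboot every server at once")

    needs_confirmation = (
        bool(requested & control_plane)
        or len(requested) > max_fraction * len(inventory)
    )
    if needs_confirmation and not confirmed:
        raise ValueError("large or control-plane reboot needs confirmation")
    return sorted(requested)


if __name__ == "__main__":
    fleet = ["cn1", "cn2", "cn3", "cn4", "cp1"]  # made-up host names
    print(validate_reboot_targets(["cn1"], fleet, ["cp1"]))
```

With a check like this in front of the reboot command, a mistyped target that expands to the whole fleet fails loudly rather than going ahead without delay.
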
El Reg would like to commend Joyent for its transparency about the outage and has made one virtual Sorry You Borked A Bit Barn pint available to the operator that caused the error. Interested parties can provide additional pints by selecting the beer icon in the comments below. ®
