Fat-fingered admin downs entire Joyent data center

Cloud operator now home to most mortified sysadmin in the USA

Updated Cloud operator Joyent went through a major failure on Tuesday when a fat-fingered admin brought down an entire data center's compute assets.

The cloud provider began reporting "transient availability issues" for its US-East-1 data center at around six-thirty in the evening, East Coast time.

"Due to an operator error, all compute nodes in us-east-1 were simultaneously rebooted," Joyent wrote. "Some compute nodes are already back up, but due to very high load on the control plane, this is taking some time. We are dedicating all operational and engineering resources to getting this issue resolved, and will be providing a full postmortem on this failure once every compute node and customer VM is online and operational. We will be providing frequent updates until the issue is resolved."

The problems were mostly fixed an hour or so later.

For those not familiar with the cloud, a forced reboot of every server in a data center is just about the worst thing that can happen to a provider, aside from the deletion of customer data or multiple data centers going down simultaneously.

"While the immediate cause was operator error, there are broader systemic issues that allowed a fat finger to take down a datacenter," explained Joyent's chief technology officer Bryan Cantrill in a post to Hacker News. "As soon as we reasonably can, we will be providing a full postmortem of this: how this was architecturally possible, what exactly happened, how the system recovered, and what improvements we are/will be making to both the software and to operational procedures to assure that this doesn't happen in the future."

Joyent has service-level agreements in place that will compensate customers for downtime, we understand.

In going through such a stomach-churning fault, Joyent has joined an illustrious group of service providers that includes Rackspace, Microsoft, Google, and Amazon, all of which have had similarly catastrophic failures.

"Anything that allows you to administer many, many machines will allow you to do this," Cantrill told The Reg in a phone conversation. "There was a silver lining here in the sense it was an opportunity to see how the system behaved. There are lots of ways it could have been much worse."

Joyent will try to learn from the experience and will publish a full post-mortem as well.

As for the fat-fingered administrator? "The operator that made the error is mortified, there is nothing we could do or say for that operator that is going to make it any worse, frankly," Cantrill said.

Nor would Joyent want to, he explained. The goal for the company is to learn from the problem and get better, not mete out punishment. "You don't teach dolphins with a shock collar," Cantrill explained.

Update

Joyent has now published a post-mortem on the incident.

The outage began when an admin, using a tool to remotely update the software on some new servers in Joyent's data center, tried to reboot them and accidentally rebooted every server in the facility.

"The command to reboot the select set of new systems that needed to be updated was mis-typed, and instead specified all servers in the datacenter," Joyent wrote. "Unfortunately the tool in question does not have enough input validation to prevent this from happening without extra steps/confirmation, and went ahead and issued a reboot command to every server in us-east-1 availability zone without delay."

At this point we imagine the operator emitted a high-pitched "oh dear oh dear oh dear" before watching the inevitable brown-out occur.

Bringing the system back online took such a long time because the rebooted servers all flooded the boot infrastructure with configuration requests, Joyent explained.
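Joyent's post-mortem, as quoted here, does not say how it will relieve that boot-time flood. One common way to soften this kind of stampede is for each client to wait a randomized, exponentially growing delay before retrying a request. The sketch below illustrates that general technique only; the function name, the base delay and the cap are hypothetical and are not taken from Joyent's tooling.

```python
import random


def backoff_delays(attempts, base=1.0, cap=60.0, rng=None):
    """Return one randomized retry delay (in seconds) per attempt.

    Hypothetical helper for illustration: each delay is drawn uniformly
    from 0 up to min(cap, base * 2**attempt), so thousands of servers
    booting at once do not all retry at the same instant.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)
        delays.append(rng.uniform(0, ceiling))
    return delays


if __name__ == "__main__":
    # Each rebooted server would compute its own schedule independently.
    for i, delay in enumerate(backoff_delays(5)):
        print(f"retry {i}: wait up to {delay:.2f}s")
```

Because every server draws its own random delays, the retries spread out over time rather than arriving in synchronized waves.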

"First, we will be dramatically improving the tooling that humans (and systems) interact with such that input validation is much more strict and will not allow for all servers, and control plane servers to be rebooted simultaneously," Joyent said. "We want to reiterate our apology for the magnitude of this issue and the impact it caused our customers and their customers. We will be working as diligently as we can, and as expediently as we can, to prevent an issue like this from happening again."
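To make the promised safeguard concrete, here is a minimal sketch of the kind of pre-flight check Joyent describes. It is not Joyent's tool: the function, the host names, the 10 per cent threshold and the confirmation flag are all assumptions made for illustration.

```python
class RebootRefused(Exception):
    """Raised when a reboot request is too broad to run as typed."""


def check_reboot_targets(targets, fleet, control_plane,
                         max_fraction=0.1, confirmed=False):
    """Return the sorted hosts to reboot, or raise RebootRefused.

    Hypothetical policy for illustration:
      * unknown or empty target sets are rejected;
      * control plane hosts are never rebooted by this bulk path;
      * more than max_fraction of the fleet needs explicit confirmation.
    """
    targets = set(targets)
    fleet = set(fleet)
    if not targets:
        raise RebootRefused("empty target set")
    unknown = targets - fleet
    if unknown:
        raise RebootRefused(f"unknown hosts: {sorted(unknown)}")
    hit_control = targets & set(control_plane)
    if hit_control:
        raise RebootRefused(
            f"control plane hosts in target set: {sorted(hit_control)}")
    if len(targets) > max_fraction * len(fleet) and not confirmed:
        raise RebootRefused(
            f"{len(targets)} of {len(fleet)} hosts selected; "
            "re-run with explicit confirmation")
    return sorted(targets)


if __name__ == "__main__":
    fleet = [f"cn{i}" for i in range(20)] + ["head0"]
    print(check_reboot_targets(["cn3", "cn4"], fleet, ["head0"]))
    try:
        check_reboot_targets(fleet, fleet, ["head0"])  # the fat-finger case
    except RebootRefused as err:
        print("refused:", err)
```

In this sketch a mistyped selector that matches the whole fleet fails before any reboot command is issued, which is the "extra steps/confirmation" the original tool lacked.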

El Reg would like to commend Joyent for its transparency about the outage and has made one virtual Sorry You Borked A Bit Barn pint available to the operator that caused the error. Interested parties can provide additional pints by selecting the beer icon in the comments below. ®
