AWS stops some EC2 servers without warning
‘Retirement’ notifications not entirely accurate or reliable
If you’re thinking about heading to the cloud for über-reliability and an environment in which anything that happens to hardware is someone else’s problem, think again: Amazon Web Services sometimes replaces the hardware virtual servers run on and switches those servers off without elegant or accurate notifications of what’s about to happen.
AWS calls this ‘Instance retirement’ and makes it happen when the physical server an Elastic Compute Cloud (EC2) instance runs on ‘degrades’ and is in danger of experiencing hardware failure. That is very useful indeed, but also a little worrying, as the cloud company does not always retire instances elegantly.
Social marketing analysis firm awe.sm recently blogged about the problem, describing the symptoms as follows:
“Virtual hardware doesn’t last as long as real hardware. Our average observed lifetime for a virtual machine on EC2 over the last 3 years has been about 200 days. After that, the chances of it being ‘retired’ rise hugely. And Amazon’s ‘retirement’ process is unpredictable: sometime they’ll notify you ten days in advance that a box is going to be shut down; sometimes the retirement notification email arrives 2 hours after the box has already failed.”
A trawl through AWS’ support forums suggests that the company isn’t switching off servers without notifications every day, but threads pop up quite regularly in which users complain about servers disappearing.
One such thread (which we shan’t link to because it includes some personal details) complains that “One of our EC2 instance[s] hung and retired an hour before receiving notification from AWS.”
Such an incident need not be too painful if the instance in question is backed by Elastic Block Store (EBS): users with that arrangement need only stop the instance and start it again, and it will resume operations on new hardware. That takes mere minutes, so if the hang didn’t interrupt important operations the disruption would be slight.
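For readers who script their infrastructure, the stop-then-start routine might look something like the sketch below. It assumes the boto3 library (not mentioned in any of the AWS material quoted here) and a client object created with boto3.client("ec2"); the function name is ours and purely illustrative. Note that it is a full stop followed by a start, not a reboot, since a reboot leaves the instance on the same host.

```python
def restart_on_new_hardware(ec2, instance_id):
    """Stop, then start, an EBS-backed instance and wait for it to run again.

    `ec2` is assumed to be a boto3 EC2 client, e.g. boto3.client("ec2").
    Do not use this on an instance store-backed instance: it cannot be stopped.
    """
    ids = [instance_id]

    # Ask EC2 to stop the instance, then block until it reports "stopped".
    ec2.stop_instances(InstanceIds=ids)
    ec2.get_waiter("instance_stopped").wait(InstanceIds=ids)

    # Start it again, then block until it reports "running".
    ec2.start_instances(InstanceIds=ids)
    ec2.get_waiter("instance_running").wait(InstanceIds=ids)
```

A call would then be along the lines of restart_on_new_hardware(boto3.client("ec2"), "i-0123456789abcdef0"), where the instance ID is a placeholder.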
But users whose instances boot from an Amazon Machine Image (AMI) held in the instance store have a harder task before them. AWS emails on the topic say “If your instance's root device is an instance store, it will be terminated after the retirement date. We recommend that you launch a replacement instance from your most recent AMI and migrate all necessary data to the replacement instance before this time.” It’s also possible to convert instance store-backed instances to EBS-backed ones, but that’s a bit of a chore.
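Which of the two situations an instance is in can be read from its root device type. The sketch below is a rough illustration rather than anything AWS prescribes: it assumes a response dictionary shaped like the one boto3's describe_instances call returns, and the function names and plan labels are hypothetical.

```python
def retirement_plan(instance):
    """Suggest what a retirement means for one instance description."""
    root = instance.get("RootDeviceType")
    if root == "ebs":
        # EBS-backed: a stop followed by a start moves it to new hardware.
        return "stop-start"
    if root == "instance-store":
        # Instance store-backed: it will be terminated, so relaunch from an AMI.
        return "relaunch-from-ami"
    return "unknown"


def plans_from_response(response):
    """Map each instance ID in a describe_instances-style response to a plan."""
    plans = {}
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            plans[instance["InstanceId"]] = retirement_plan(instance)
    return plans


# Assumed usage, with boto3 installed and credentials configured:
#   import boto3
#   print(plans_from_response(boto3.client("ec2").describe_instances()))
```

The sketch ignores pagination, so an account with many instances would need to loop over result pages.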
The need to revert to backups or convert to EBS instances is taking some users by surprise as they don’t have backups, as one forum thread shows.
Of course those without backups have only themselves to blame. EC2 users are also notified of imminent retirement by the AWS console, so again there’s an element of personal responsibility that needs to be considered here ... although the fact that some of AWS’ retirement notifications seem not to be timely is bothersome.
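Those who would rather not rely on email or on remembering to look at the console can poll for scheduled events themselves. The following is a minimal sketch, assuming a response dictionary of the shape boto3's describe_instance_status call returns, in which retirements appear as events with the code "instance-retirement"; the function name is ours. It can only report what AWS has already scheduled, so it would not have helped with a notification that arrives after the box has failed.

```python
def pending_retirements(status_response):
    """Return (instance_id, not_before) pairs for scheduled retirements.

    `status_response` is assumed to look like the dictionary returned by
    boto3's describe_instance_status call.
    """
    found = []
    for status in status_response.get("InstanceStatuses", []):
        for event in status.get("Events", []):
            if event.get("Code") == "instance-retirement":
                # NotBefore is the earliest time AWS says it will act.
                found.append((status["InstanceId"], event.get("NotBefore")))
    return found


# Assumed usage, with boto3 installed and credentials configured:
#   import boto3
#   ec2 = boto3.client("ec2")
#   response = ec2.describe_instance_status(IncludeAllInstances=True)
#   for instance_id, when in pending_retirements(response):
#       print(instance_id, "is due for retirement from", when)
```

Run from a scheduled job, a check like this gives a second channel alongside the email and the console notice.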
It’s also worth noting that the retirement process isn’t well-documented – we could find only seven mentions of it in the AWS support database and it doesn’t get a mention in the EC2 FAQ, although the terms and conditions for AWS go out of their way to point out the service can experience interruptions.
None of which means that AWS is placing users in peril. But the fact that cloud servers can be halted without prior notification and in ways that require a fair bit of work to repair is surely something to take into account when considering just what the cloud means for your operations. ®