Auto Recovery for Amazon EC2

CarlHoerberg · on Jan 19, 2015

Finally! It's crazy that they haven't implemented this earlier, and why isn't it enabled by default, like on GCE? For a long time we've had an app that just polls the EC2 API, looks for impaired instances, and automatically restarts them. We see about 2-10 impaired, scheduled-for-reboot, or on-deprecated-hardware instances per month, so that app is quite a time-saver.
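A minimal sketch of what such a poller could look like, assuming a boto3-style EC2 client is passed in. The function names are made up for illustration, pagination is ignored, and the reboot call is only the simplest placeholder for "restart" (the comment does not say how the real app does it).

```python
def impaired_instance_ids(ec2):
    """Return IDs of instances whose system or instance status check is impaired.

    `ec2` is assumed to be a boto3 EC2 client, or any object exposing the
    same describe_instance_status method. Only the first page is read.
    """
    response = ec2.describe_instance_status()
    impaired = []
    for status in response.get("InstanceStatuses", []):
        system = status.get("SystemStatus", {}).get("Status")
        instance = status.get("InstanceStatus", {}).get("Status")
        if "impaired" in (system, instance):
            impaired.append(status["InstanceId"])
    return impaired


def restart_impaired(ec2):
    """Reboot every impaired instance and return the IDs that were acted on."""
    ids = impaired_instance_ids(ec2)
    if ids:
        ec2.reboot_instances(InstanceIds=ids)
    return ids
```

For a failed system check on an EBS-backed instance, a stop followed by a start is usually what moves it to different hardware, so a real poller would likely do that instead of a reboot.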

bmurphy1976 · on Jan 19, 2015

Please note that this is for EBS backed instances only.

If you want something similar for ephemeral instances, do what we do: min 1 max 1 auto scaling groups. We've found that Amazon is pretty good at catching bad instances and terminating them, although on occasion we do have to terminate an instance manually. The autoscaling group takes care of the rest.
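As a rough illustration of the "min 1 max 1" idea, the sketch below builds the arguments for boto3's `create_auto_scaling_group`. The helper name and the example group, launch configuration, and subnet names are assumptions, not anything from the comment above.

```python
def single_instance_asg_params(name, launch_configuration, subnet_ids):
    """Build arguments for an Auto Scaling group pinned to exactly one instance.

    With min == max == 1 the group replaces a terminated or unhealthy
    instance but never scales beyond a single one.
    """
    return {
        "AutoScalingGroupName": name,
        "LaunchConfigurationName": launch_configuration,
        "MinSize": 1,
        "MaxSize": 1,
        "DesiredCapacity": 1,
        # Comma-separated subnet IDs the replacement instance may launch into.
        "VPCZoneIdentifier": ",".join(subnet_ids),
    }


# Hypothetical usage (requires boto3 and credentials):
#   boto3.client("autoscaling").create_auto_scaling_group(
#       **single_instance_asg_params("worker", "worker-lc", ["subnet-aaaa"]))
```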

oellegaard · on Jan 19, 2015

Heavy EC2 user here. This doesn't solve your problems. If you want to do this right, set up an EC2 Auto Scaling group and build an image each time you need to change your server. That is the proven way most large deployments work, including Netflix.

discodave · on Jan 20, 2015

Yes, but most people don't have large deployments like Netflix. The main reason Netflix and co. build entire images is that if you're scaling up and down by thousands of instances, the extra time for those instances to do all the build, configuration and so on is material. So you're better off doing it once and essentially CTRL+C, CTRL+V.

If you're only scaling up a few instances at a time then your number one consideration is probably making life easy for yourself.

bkirkbri · on Jan 20, 2015

We tend to take an in-between approach, baking base AMIs as needed but configuring and installing code with Ansible at boot time. We've got scripts to cut tarballs with an Ansible playbook plus support files from any tag in our repo. Then it's just a matter of specifying the tag in user-data when starting an instance. We do that through CloudFormation.
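A sketch of how the tag-in-user-data step might be rendered, under several assumptions: the tarball lives in S3, the bucket name, archive naming scheme, `/opt/deploy` path, and `site.yml` playbook name are all hypothetical, and the AWS CLI and Ansible are already baked into the base AMI.

```python
def render_user_data(tag, bucket="example-deploy-bucket"):
    """Return a boot script that fetches the tarball for `tag` and runs Ansible locally."""
    archive = f"playbook-{tag}.tar.gz"  # hypothetical naming scheme
    lines = [
        "#!/bin/bash",
        "set -euo pipefail",
        f"aws s3 cp s3://{bucket}/{archive} /tmp/{archive}",
        "mkdir -p /opt/deploy",
        f"tar -xzf /tmp/{archive} -C /opt/deploy",
        "cd /opt/deploy",
        # Run the playbook against the instance itself, without SSH.
        "ansible-playbook -i localhost, -c local site.yml",
    ]
    return "\n".join(lines) + "\n"
```

The returned string would then be passed as the instance's user-data, for example through a CloudFormation template parameter.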

edit: grammar

oellegaard · on Jan 19, 2015

I have a semi-old blog post about how we do it here: http://blog.kristian.io/aws/2014/06/28/how-we-moved-from-her...

jontro · on Jan 19, 2015

Is there any guide on how to integrate this in a deployment workflow? I would be very much interested in reading up on how to do this the best way.

lebski88 · on Jan 20, 2015

We use this approach at MixRadio, you can read about it on our blog: http://dev.mixrad.io/blog/2014/10/31/How-we-deploy-at-MixRad...

There is also a video of a talk about it: https://skillsmatter.com/skillscasts/6057-herding-cattle-wit...

We based our approach on Netflix but ended up building our own tools which we've now open sourced.

oellegaard · on Jan 21, 2015

If you're interested, shoot me an email and I will be happy to go through our setup. You can find my contact info at kristian.io

_ondq · on Jan 19, 2015

At the risk of being down voted, let me say that this is yet another AWS "feature" that is primarily a workaround for deficiencies in the platform.

regularfry · on Jan 20, 2015

If by "the platform" you mean "available computing hardware", then yes. I'm not sure that's a useful data point.

_ondq · on Jan 20, 2015

In the narrowest sense your sentence is correct. Perhaps you mean that you think this can be expected on any computing hardware, which is far from correct. If all you've ever used has been public cloud services you can be forgiven for having this misconception.

regularfry · on Jan 23, 2015

"Available" meaning "reasonable to expect to support a userbase the size of EC2's with." If you start with "gotta run standardish x86_64 Red Hat or Ubuntu by the million or so" and work outwards from there, you're not really in a space where bulletproof hardware looks tempting. VCPU lock-stepping might, though.

waitwaitwhay · on Jan 20, 2015

Can you explain that for me? I use Microsoft Azure a lot and there things like this just happen automatically. No configurations, alarms, etc.

biot · on Jan 19, 2015

Any reason why this isn't automatic? From the "Recover your instance" docs:

  Examples of problems that cause system status checks to
  fail include:

   * Loss of network connectivity
   * Loss of system power
   * Software issues on the physical host
   * Hardware issues on the physical host

All of these are on the physical host, which end users cannot control. So if AWS has an issue that kills your VM and you don't have this set up, your instance is effectively dead?

perlgeek · on Jan 19, 2015

Loss of network connectivity sounds like it could be temporary. If you have long-running calculations and want to wait for the result, it might make sense to wait a bit longer and see if the network comes up eventually.

And there's no indication that the hardware and software "issues" are permanent or even fatal.

biot · on Jan 19, 2015

The strategy of "wait a bit longer" can be configured with an exact value for how long to wait via this recovery feature. However, given that you don't have any information about the failure's permanence or whether it can be resolved by AWS staff without taking the host down I don't see why waiting indefinitely is a particularly good option.

In some ways, I guess this answers my own question. Amazon doesn't know how long you might want to wait or if you have a VM that you would even want to have recovered, so configuring this lets you tell Amazon what your parameters around recovery should be.
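To make the "how long to wait" knob concrete, here is a sketch of the CloudWatch alarm arguments one might pass to boto3's `put_metric_alarm`. The helper name, the alarm name, and the default of two minutes are assumptions; the metric and the recover action ARN follow the pattern AWS documents for this feature.

```python
def recovery_alarm_params(instance_id, region="us-east-1", minutes_to_wait=2):
    """Build put_metric_alarm arguments that recover an instance after
    `minutes_to_wait` consecutive minutes of failed system status checks."""
    return {
        "AlarmName": f"auto-recover-{instance_id}",  # hypothetical naming
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Minimum",
        "Period": 60,                          # one-minute evaluation windows
        "EvaluationPeriods": minutes_to_wait,  # the "wait a bit longer" knob
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }


# Hypothetical usage (requires boto3 and credentials):
#   boto3.client("cloudwatch").put_metric_alarm(**recovery_alarm_params("i-0123"))
```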

larrymcp · on Jan 20, 2015

You know, it's funny but reading the EC2 forums you see several occasions where Amazon does reboot instances automatically after a failure.

I recall several posts where a customer asked a question along the lines of "Why did my instance reboot" and then someone from Amazon replied something to the effect of "Sorry there was an issue with the underlying hardware but we did restart your server on new hardware".

moe · on Jan 20, 2015

> Any reason why this isn't automatic?

Perhaps it's a co-promotion for CloudWatch. I would guess quite a few of their users had never heard of or seen a use for CloudWatch. Some of them might now enable "detailed monitoring" for $3.50/mo per instance while they're at it.

alrs · on Jan 19, 2015

The ugly caveat isn't VPC, it's EBS.

This lands on the wrong side of pets-versus-cattle. AWS has been moving towards giving people what they want, but it's still best practice to use ephemeral storage and architect accordingly.

toomuchtodo · on Jan 19, 2015

> but it's still best practice to use ephemeral storage and architect accordingly

It's not worth the engineer time. Use EBS volumes, and clean them up when they're no longer in use after termination. The only time you need local/ephemeral storage is for swap or scratch space, or for throughput you can't get from general or provisioned EBS.

Plus, you get auto recovery now without having to have architected for it ;)
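The cleanup step mentioned above could start from something like this sketch, which only lists unattached volumes and leaves deletion to the operator. The function name is made up, the client is assumed to be a boto3 EC2 client, and pagination is ignored.

```python
def unattached_volume_ids(ec2):
    """Return IDs of EBS volumes in the 'available' state, i.e. not attached
    to any instance. Whether they are safe to delete is a separate decision."""
    response = ec2.describe_volumes(
        Filters=[{"Name": "status", "Values": ["available"]}]
    )
    return [volume["VolumeId"] for volume in response.get("Volumes", [])]
```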

gtaylor · on Jan 20, 2015

After I say this, they'll probably have a huge EBS-caused outage, but it feels like EBS is much more capable now than it was even a year or two ago.

The General Purpose SSDs and the Provisioned IOPS have more than handled our performance concerns. Since that last awful multi-AZ outage a year or two ago, there hasn't been much to deal with for us.

gst · on Jan 20, 2015

EBS is great if it works for your requirements, but unfortunately the maximum size of an EBS volume is limited to only 1 TB. With SSD instance storage you can get up to 6.4 TB.

Rapzid · on Jan 20, 2015

Fortunately AWS is releasing 16TB volumes in the near future :)

dice · on Jan 19, 2015

If you've drunk the pets-vs-cattle koolaid to the point where your instances are all (or mostly) ephemeral then you're probably also using auto-scaling groups and don't care if a particular single instance falls over.

peteretep · on Jan 20, 2015

> If you've drunk the pets-vs-cattle koolaid

Generally drinking the kool-aid is a bad thing - any reason to favour pets?

Rapzid · on Jan 20, 2015

Here is a fun story. Amazon's RDS services are backed by EBS. They also restart the instances automatically if they encounter issues.

But sure, "Best Practices™".

alrs · on Jan 20, 2015

You're right, RDS is backed by EBS.

saryant · on Jan 19, 2015

I've been having a lot of issues with r3.large instances becoming unreachable lately. Hoping this can serve as a stopgap.

jeffbarr · on Jan 19, 2015

Have you noted your problem in the EC2 Forum or consulted AWS Support?

saryant · on Jan 19, 2015

edit: I was wrong. Ignore what was here before, AWS did respond and I just didn't notice. Apologies to AWS for speaking out of turn.

jeffbarr · on Jan 19, 2015

Huh, that's no good. Can you email me (address is in profile) and I'll see what's going on?

saryant · on Jan 19, 2015

I spoke without double-checking our account. We did receive a response about the forum creation problem and that was the result of my own misunderstanding. I'll update my original post.

andr · on Jan 20, 2015

I think CodeDeploy is quite an undervalued AWS tool. It's a combination of Puppet for server config and Heroku-style deploys. Together with AutoScaling it makes it trivial to set up any number of identical servers, without relying on custom AMIs or recovery.

tedunangst · on Jan 20, 2015

Wouldn't transparent migration to new hardware be even better? Isn't one of the advantages of virtualization the ability to move a running image from one machine to another?

chacham15 · on Jan 20, 2015

In many cases that is not possible. If the machine loses network connectivity, has a disk error, is infinite looping, is out of memory, etc. how is another machine supposed to access its data?

fletchowns · on Jan 19, 2015

An important note if you want to use this right away:

This feature is currently available for the C3, C4, M3, R3, and T2 instance types running in the US East (Northern Virginia) region; we plan to make it available in other regions as quickly as possible. The instances must be running within a VPC, must use EBS-backed storage, but cannot be Dedicated Instances.
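The restrictions quoted above can be turned into a quick pre-flight check. This is a sketch only: the function name is invented, the instance dict is assumed to have the shape returned by boto3's `describe_instances`, and treating `RootDeviceType == "ebs"` as "EBS-backed" is an approximation.

```python
SUPPORTED_FAMILIES = {"c3", "c4", "m3", "r3", "t2"}
SUPPORTED_REGIONS = {"us-east-1"}  # US East (Northern Virginia) at launch


def recovery_blockers(instance, region):
    """Return the reasons an instance would not qualify for auto recovery,
    based on the launch-time restrictions; an empty list means none found."""
    blockers = []
    family = instance.get("InstanceType", "").split(".")[0]
    if family not in SUPPORTED_FAMILIES:
        blockers.append("unsupported instance type")
    if region not in SUPPORTED_REGIONS:
        blockers.append("unsupported region")
    if not instance.get("VpcId"):
        blockers.append("not running in a VPC")
    if instance.get("RootDeviceType") != "ebs":
        blockers.append("not EBS-backed")
    if instance.get("Placement", {}).get("Tenancy") == "dedicated":
        blockers.append("Dedicated Instance")
    return blockers
```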

wahnfrieden · on Jan 19, 2015

VPC-only sounds like a giant caveat, and it is, but this is a good opportunity to note that this is the trend now with AWS and the direction they're heading - (non-VPC) "EC2 Classic" is being gradually phased out, VPC is now the default for new accounts, and most new features are being added only to VPC. So, time for everyone to start thinking about migrating.

pjl · on Jan 19, 2015

AWS recently announced ClassicLink [1], which helps with migration: "In order to allow EC2-Classic instances to communicate with these resources, we are introducing a new feature known as ClassicLink. You can now enable this feature for any or all of your VPCs and then put your existing Classic instances in to VPC security groups."

[1] https://aws.amazon.com/blogs/aws/classiclink-private-communi...

mbell · on Jan 19, 2015

> this is the trend now with AWS and the direction they're heading - (non-VPC) "EC2 Classic" is being gradually phased out

New AWS accounts can't even use EC2 Classic, it's effectively deprecated at this point.

moe · on Jan 20, 2015

They will hopefully introduce a new "classic" (as an abstraction over VPC) at some point, unless they want to lose many low-end customers to "easier" clouds.

Having this level of control can be nice, but most of it really needs to be optional because for most deployments it does nothing other than add an excessive amount of unneeded complexity.

Some of the APIs are outright hostile, e.g. 'delete_vpc' which makes you track down half a dozen dependencies (without providing hints about which those might be) before you're allowed to delete a VPC.
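For what it's worth, the dependency hunt can be scripted. The sketch below is an assumption-laden illustration, not an exhaustive list: it checks only subnets, attached internet gateways, non-default security groups, and network interfaces, reads only the first page of each response, and expects a boto3-style EC2 client to be passed in.

```python
def vpc_dependencies(ec2, vpc_id):
    """Return a dict of resource IDs that commonly block deleting a VPC.

    Only resource kinds that actually have entries are included.
    """
    by_vpc = [{"Name": "vpc-id", "Values": [vpc_id]}]
    by_attachment = [{"Name": "attachment.vpc-id", "Values": [vpc_id]}]

    found = {
        "subnets": [
            s["SubnetId"]
            for s in ec2.describe_subnets(Filters=by_vpc)["Subnets"]
        ],
        "internet_gateways": [
            g["InternetGatewayId"]
            for g in ec2.describe_internet_gateways(Filters=by_attachment)["InternetGateways"]
        ],
        # The default security group goes away with the VPC, so skip it.
        "security_groups": [
            g["GroupId"]
            for g in ec2.describe_security_groups(Filters=by_vpc)["SecurityGroups"]
            if g["GroupName"] != "default"
        ],
        "network_interfaces": [
            n["NetworkInterfaceId"]
            for n in ec2.describe_network_interfaces(Filters=by_vpc)["NetworkInterfaces"]
        ],
    }
    return {kind: ids for kind, ids in found.items() if ids}
```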

idunno246 · on Jan 20, 2015

They kinda do: the default VPC they set up on new accounts has everything you need to use it without ever looking at VPC setup.

mbell · on Jan 20, 2015

> unless they want to lose many low-end customers to "easier" clouds.

I've never gotten the impression that AWS is interested in building a 'cloud' for those less technically inclined. Heroku and others fill that void, EC2 is where you move after Heroku doesn't fit the bill and before dedicated hardware does.

wahnfrieden · on Jan 20, 2015

Beanstalk tries to serve that Heroku-level market, I think.

waskosky · on Jan 19, 2015

For people not yet on one of the newer instance types (HVM VPC), it is possible to migrate directly assuming you can tolerate a few minutes of downtime to make the final switch. This was the guide that helped me: https://forums.aws.amazon.com/thread.jspa?threadID=155526

j-kidd · on Jan 20, 2015

This shall be a great fit for the NAT/Bastion instance, since the high-availability setup has a few drawbacks: https://aws.amazon.com/articles/2781451301784570

kolev · on Jan 20, 2015

If you rely on something like this, you rely on nothing. This is like crutches for your broken architecture. For singleton roles, you could do an autoscaling group of one and do better.

kolev · on Jan 20, 2015

Not sure why the downvotes - at least ASGs can be expanded later, unlike the one-offs. Plus, you should never have SPOFs anyway. There was a comment below about not all projects being of Netflix's scale - well, there's Digital Ocean for the smaller ones - EC2 is not the most cost-effective solution for small projects anyway.

halayli · on Jan 20, 2015

This makes me so happy.