Auto Recovery for Amazon EC2 (amazon.com)
167 points by tshtf on Jan 19, 2015 | 49 comments


Finally. It's crazy that they haven't implemented this earlier, and why isn't it enabled by default, like on GCE? For a long time we've had an app that just polls the EC2 API, looks for impaired instances, and automatically restarts them. We see about 2-10 impaired, scheduled-for-reboot, or on-deprecated-hardware instances per month, so that app is quite a time-saver.
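For anyone curious what the polling side of such an app could look like, here is a minimal sketch, not our actual code. It assumes the response has the shape returned by the EC2 DescribeInstanceStatus call (for example boto3's `describe_instance_status`); the function names and the sample instance IDs are made up for illustration.

```python
def impaired_instance_ids(status_response):
    """Return IDs of instances whose system or instance status check is 'impaired'."""
    impaired = []
    for entry in status_response.get("InstanceStatuses", []):
        system = entry.get("SystemStatus", {}).get("Status")
        instance = entry.get("InstanceStatus", {}).get("Status")
        if "impaired" in (system, instance):
            impaired.append(entry["InstanceId"])
    return impaired


def instances_with_scheduled_events(status_response):
    """Return {instance_id: [event codes]} for instances that have scheduled events."""
    scheduled = {}
    for entry in status_response.get("InstanceStatuses", []):
        codes = [event["Code"] for event in entry.get("Events", [])]
        if codes:
            scheduled[entry["InstanceId"]] = codes
    return scheduled


# Hypothetical sample in the DescribeInstanceStatus response shape.
sample_response = {
    "InstanceStatuses": [
        {"InstanceId": "i-11111111",
         "SystemStatus": {"Status": "ok"},
         "InstanceStatus": {"Status": "ok"}},
        {"InstanceId": "i-22222222",
         "SystemStatus": {"Status": "impaired"},
         "InstanceStatus": {"Status": "ok"}},
        {"InstanceId": "i-33333333",
         "SystemStatus": {"Status": "ok"},
         "InstanceStatus": {"Status": "impaired"},
         "Events": [{"Code": "instance-retirement"}]},
    ]
}

print(impaired_instance_ids(sample_response))
print(instances_with_scheduled_events(sample_response))
```

In a real poller you would feed this the live API response on a timer and then act on the returned IDs. Whether to reboot or stop/start each one is a policy decision this sketch leaves out.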


Please note that this is for EBS-backed instances only.

If you want something similar for ephemeral instances, do what we do: min 1 max 1 auto scaling groups. We've found that Amazon is pretty good at catching bad instances and terminating them, although on occasion we do have to terminate an instance manually. The autoscaling group takes care of the rest.
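A rough sketch of the "min 1 max 1" idea as a CloudFormation resource, built as a Python dict so it can be dumped to JSON. The helper name, the launch configuration name, and the availability zones are placeholders I made up; only the resource type and property names come from CloudFormation.

```python
import json


def singleton_asg_resource(launch_config_name, availability_zones):
    """Build an Auto Scaling group resource pinned to exactly one instance."""
    return {
        "Type": "AWS::AutoScaling::AutoScalingGroup",
        "Properties": {
            # Min and max of 1: a terminated instance is replaced,
            # but the group never grows beyond a single instance.
            "MinSize": "1",
            "MaxSize": "1",
            "LaunchConfigurationName": launch_config_name,
            "AvailabilityZones": list(availability_zones),
        },
    }


# Placeholder names for illustration only.
resource = singleton_asg_resource("example-launch-config", ["us-east-1a", "us-east-1b"])
print(json.dumps({"Resources": {"SingletonGroup": resource}}, indent=2))
```

Because the replacement is a fresh instance, anything on ephemeral disks is gone, so this only works if the instance can rebuild its state at boot.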


Heavy EC2 user here. This doesn't solve your problems. If you want to do this right, set up an EC2 Auto Scaling group and build an image each time you need to change your server. That is the proven way most large deployments work, including Netflix's.


Yes, but most people don't have large deployments like Netflix. The main reason Netflix and co build entire images is that if you're scaling up and down by thousands of instances, the extra time for those instances to do all the build, configuration and so on is material. So you're better off doing it once and essentially CTRL+C, CTRL+V.

If you're only scaling up a few instances at a time then your number one consideration is probably making life easy for yourself.


We tend to take an in-between approach, baking base AMIs as needed but configuring and installing code with Ansible at boot time. We've got scripts to cut tarballs with an Ansible playbook plus support files from any tag in our repo. Then it's just a matter of specifying the tag in user-data when starting an instance. We do that through CloudFormation.

edit: grammar


I have a semi-old blog post about how we do it here: http://blog.kristian.io/aws/2014/06/28/how-we-moved-from-her...


Is there any guide on how to integrate this in a deployment workflow? I would be very much interested in reading up on how to do this the best way.


We use this approach at MixRadio, you can read about it on our blog: http://dev.mixrad.io/blog/2014/10/31/How-we-deploy-at-MixRad...

There is also a video of a talk about it: https://skillsmatter.com/skillscasts/6057-herding-cattle-wit...

We based our approach on Netflix but ended up building our own tools which we've now open sourced.


If you're interested, shoot me an email and I will be happy to go through our setup. You can find my contact info at kristian.io


At the risk of being downvoted, let me say that this is yet another AWS "feature" that is primarily a workaround for deficiencies in the platform.


If by "the platform" you mean "available computing hardware", then yes. I'm not sure that's a useful data point.


In the narrowest sense your sentence is correct. Perhaps you mean that you think this can be expected on any computing hardware, which is far from correct. If all you've ever used has been public cloud services you can be forgiven for having this misconception.


"Available" meaning "reasonable to expect to support a userbase the size of EC2's with." If you start with "gotta run standardish x86_64 Red Hat or Ubuntu by the million or so" and work outwards from there, you're not really in a space where bulletproof hardware looks tempting. VCPU lock-stepping might, though.


Can you explain that for me? I use Microsoft Azure a lot and there things like this just happen automatically. No configurations, alarms, etc.


Any reason why this isn't automatic? From the "Recover your instance" docs:

  Examples of problems that cause system status checks to
  fail include:

   * Loss of network connectivity
   * Loss of system power
   * Software issues on the physical host
   * Hardware issues on the physical host

All of these are on the physical host, which end users cannot control. So if AWS has an issue that kills your VM and you don't have this set up, your instance is effectively dead?


Loss of network connectivity sounds like it could be temporary. If you have long-running calculations and want to wait for the result, it might make sense to wait a bit longer and see if the network comes up eventually.

And there's no indication that the hardware and software "issues" are permanent or even fatal.


The strategy of "wait a bit longer" can be configured with an exact value for how long to wait via this recovery feature. However, given that you don't have any information about the failure's permanence or whether AWS staff can resolve it without taking the host down, I don't see why waiting indefinitely is a particularly good option.

In some ways, I guess this answers my own question. Amazon doesn't know how long you might want to wait or if you have a VM that you would even want to have recovered, so configuring this lets you tell Amazon what your parameters around recovery should be.
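To make the "how long to wait" knob concrete, here is a sketch of the alarm parameters as I understand them: a CloudWatch alarm on the `StatusCheckFailed_System` metric whose action is the EC2 recover ARN. The helper names, the alarm name, and the instance ID are made up; the dict is shaped like the arguments to CloudWatch's PutMetricAlarm, and the wait is simply period times evaluation periods.

```python
def recovery_alarm_params(instance_id, region, period_seconds=60, evaluation_periods=2):
    """Build PutMetricAlarm-style parameters for recovering one instance."""
    return {
        "AlarmName": f"auto-recover-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Minimum",
        "Period": period_seconds,
        "EvaluationPeriods": evaluation_periods,
        # The metric is 0 when the system check passes and 1 when it fails.
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }


def wait_before_recovery_seconds(params):
    """How long the check must keep failing before the alarm fires."""
    return params["Period"] * params["EvaluationPeriods"]


patient = recovery_alarm_params("i-22222222", "us-east-1",
                                period_seconds=60, evaluation_periods=15)
print(wait_before_recovery_seconds(patient))
```

So someone with long-running calculations could pick a large `evaluation_periods`, while someone running a stateless web node would keep it small.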


You know, it's funny but reading the EC2 forums you see several occasions where Amazon does reboot instances automatically after a failure.

I recall several posts where a customer asked a question along the lines of "Why did my instance reboot" and then someone from Amazon replied something to the effect of "Sorry there was an issue with the underlying hardware but we did restart your server on new hardware".


> Any reason why this isn't automatic?

Perhaps it's a co-promotion for CloudWatch. I would guess quite a few of their users had never heard of or seen a use for CloudWatch. Some of them might now enable "detailed monitoring" for $3.50/mo per instance while they're at it.


The ugly caveat isn't VPC, it's EBS.

This lands on the wrong side of pets-versus-cattle. AWS has been moving towards giving people what they want, but it's still best practice to use ephemeral storage and architect accordingly.


> but it's still best practice to use ephemeral storage and architect accordingly

It's not worth the engineering time. Use EBS volumes, and clean them up when they're no longer in use after termination. The only time you need local/ephemeral storage is for swap or scratch space, or for throughput you can't get from general purpose or provisioned IOPS EBS.

Plus, you get auto recovery now without having to have architected for it ;)


After I say this, they'll probably have a huge EBS-caused outage, but it feels like EBS is much more capable now than it was even a year or two ago.

The General Purpose SSDs and the Provisioned IOPS have more than handled our performance concerns. Since that last awful multi-AZ outage a year or two ago, there hasn't been much to deal with for us.


EBS is great if it works for your requirements, but unfortunately the maximum size of an EBS volume is limited to only 1 TB. With SSD instance storage you can get up to 6.4 TB.


Fortunately AWS is releasing 16TB volumes in the near future :)


If you've drunk the pets-vs-cattle koolaid to the point where your instances are all (or mostly) ephemeral then you're probably also using auto-scaling groups and don't care if a particular single instance falls over.


> If you've drunk the pets-vs-cattle koolaid

Generally drinking the kool-aid is a bad thing - any reason to favour pets?


Here is a fun story. Amazon's RDS services are backed by EBS. They also restart the instances automatically if they encounter issues.

But sure, "Best Practices™".


You're right, RDS is backed by EBS.


I've been having a lot of issues with r3.large instances becoming unreachable lately. Hoping this can serve as a stopgap.


Have you noted your problem in the EC2 Forum or consulted AWS Support?


edit: I was wrong. Ignore what was here before, AWS did respond and I just didn't notice. Apologies to AWS for speaking out of turn.


Huh, that's no good. Can you email me (address is in profile) and I'll see what's going on?


I spoke without double-checking our account. We did receive a response about the forum creation problem and that was the result of my own misunderstanding. I'll update my original post.


I think CodeDeploy is quite an undervalued AWS tool. It's a combination of Puppet for server config and Heroku-style deploys. Together with AutoScaling it makes it trivial to set up any number of identical servers, without relying on custom AMIs or recovery.


Wouldn't transparent migration to new hardware be even better? Isn't one of the advantages of virtualization the ability to move a running image from one machine to another?


In many cases that is not possible. If the machine loses network connectivity, has a disk error, is stuck in an infinite loop, is out of memory, etc., how is another machine supposed to access its data?


An important note if you want to use this right away:

This feature is currently available for the C3, C4, M3, R3, and T2 instance types running in the US East (Northern Virginia) region; we plan to make it available in other regions as quickly as possible. The instances must be running within a VPC, must use EBS-backed storage, but cannot be Dedicated Instances.
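If you want to check a fleet against that list before wiring up alarms, something like the following sketch could work. It encodes only the launch-time constraints quoted above (so it will go stale as AWS expands the feature), and it assumes instance dicts shaped like EC2 DescribeInstances output; the function name and sample values are made up.

```python
# Constraints as quoted from the announcement; expect these to change over time.
SUPPORTED_FAMILIES = {"c3", "c4", "m3", "r3", "t2"}
SUPPORTED_REGIONS = {"us-east-1"}


def recovery_blockers(instance, region):
    """Return a list of reasons an instance cannot use auto recovery (empty if none)."""
    blockers = []
    family = instance.get("InstanceType", "").split(".")[0]
    if region not in SUPPORTED_REGIONS:
        blockers.append(f"region {region} not supported yet")
    if family not in SUPPORTED_FAMILIES:
        blockers.append(f"instance family {family or 'unknown'} not supported")
    if not instance.get("VpcId"):
        blockers.append("not running in a VPC")
    if instance.get("RootDeviceType") != "ebs":
        blockers.append("not EBS-backed")
    if instance.get("Placement", {}).get("Tenancy") == "dedicated":
        blockers.append("dedicated instance")
    return blockers


# Hypothetical instances for illustration.
eligible = {"InstanceType": "m3.large", "VpcId": "vpc-1234abcd",
            "RootDeviceType": "ebs", "Placement": {"Tenancy": "default"}}
classic = {"InstanceType": "m1.small", "RootDeviceType": "instance-store",
           "Placement": {"Tenancy": "dedicated"}}

print(recovery_blockers(eligible, "us-east-1"))
print(recovery_blockers(classic, "us-east-1"))
```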


VPC-only sounds like a giant caveat, and it is, but this is a good opportunity to note that this is the trend now with AWS and the direction they're heading - (non-VPC) "EC2 Classic" is being gradually phased out, VPC is now the default for new accounts, and most new features are being added only to VPC. So, time for everyone to start thinking about migrating.


AWS recently announced ClassicLink [1], which helps with migration: "In order to allow EC2-Classic instances to communicate with these resources, we are introducing a new feature known as ClassicLink. You can now enable this feature for any or all of your VPCs and then put your existing Classic instances in to VPC security groups."

[1] https://aws.amazon.com/blogs/aws/classiclink-private-communi...


> this is the trend now with AWS and the direction they're heading - (non-VPC) "EC2 Classic" is being gradually phased out

New AWS accounts can't even use EC2 Classic; it's effectively deprecated at this point.


They will hopefully introduce a new "classic" (as an abstraction over VPC) at some point, unless they want to lose many low-end customers to "easier" clouds.

Having this level of control can be nice, but most of it really needs to be optional because for most deployments it does nothing other than add an excessive amount of unneeded complexity.

Some of the APIs are outright hostile, e.g. 'delete_vpc' which makes you track down half a dozen dependencies (without providing hints about which those might be) before you're allowed to delete a VPC.


They kinda do: the default VPC they set up on new accounts has everything you need to use it without ever looking at VPC setup.


> unless they want to lose many low-end customers to "easier" clouds.

I've never gotten the impression that AWS is interested in building a 'cloud' for those less technically inclined. Heroku and others fill that void; EC2 is where you move after Heroku doesn't fit the bill and before dedicated hardware does.


Beanstalk tries to serve that Heroku-level market, I think.


For people not yet on one of the newer instance types (HVM VPC), it is possible to migrate directly assuming you can tolerate a few minutes of downtime to make the final switch. This was the guide that helped me: https://forums.aws.amazon.com/thread.jspa?threadID=155526


This should be a great fit for NAT/bastion instances, since the high-availability setup has a few drawbacks: https://aws.amazon.com/articles/2781451301784570


If you rely on something like this, you rely on nothing. This is like crutches for your broken architecture. For singleton roles, you could do an autoscaling group of one and do better.


Not sure why the downvotes - at least ASGs can be expanded later, unlike the one-offs. Plus, you should never have SPOFs anyway. There was a comment below about not all projects being of Netflix's scale - well, there's Digital Ocean for the smaller ones - EC2 is not the most cost-effective solution for small projects anyway.


This makes me so happy.



