Adding EC2 Instance Recovery Alarms with CloudFormation

2019-09-11 Automatically Push the “Recover” Button

Instance Recovery is a little-advertised, little-used feature of EC2. It doesn’t take long to set up and promises to recover your instance on the rare occasion that the underlying hardware fails. Recovery resumes the instance on new hardware, retaining its instance ID, private IP addresses, Elastic IP addresses, and all instance metadata.

I’ve deployed it on “snowflake” instances that don’t have the luxury of using Auto Scaling Groups. This gives me a little extra uptime assurance. I don’t think I’ve actually ever seen any EC2 instance get auto-recovered though.

Maybe I’m cargo-culting it, but it’s not much work to set up, so it feels like an easy (potential) win.

Update (2019-09-13): I asked on reddit for examples and Redditron-2000-4 replied. They have 1800 EC2 instances and see 1-3 automatic recoveries a month. This is a failover rate of 0.05%-0.15%. Small but significant if you’re looking for even 99.9% uptime!

You can click to set up a recover alarm for an instance on the console as per the documentation.

I like automating with CloudFormation though. I converted the resulting manually-created alarm into a template snippet some time ago, and copy paste it between projects. Let’s take a look at it.

If you have an EC2 instance in your CloudFormation template’s Resources like so:

  Type: AWS::EC2::Instance
      LaunchTemplateId: !Ref LaunchTemplateId
      Version: !Ref LaunchTemplateVersion
    - Key: Name
      Value: Snowflake-Production

…then a basic recovery alarm for it would look like this:

  # Recover the instance if its EC2 status checks fail, as per:
  Type: AWS::CloudWatch::Alarm
    - !Sub arn:aws:automate:${AWS::Region}:ec2:recover
    AlarmDescription: Recover instance if its status checks fail.
    Namespace: AWS/EC2
    MetricName: StatusCheckFailed_System
      - Name: InstanceId
        Value: !Ref Ec2Instance
    EvaluationPeriods: 2
    Period: 60
    Statistic: Minimum
    ComparisonOperator: GreaterThanThreshold
    Threshold: 0

The snippet EvaluationPeriods set to gives you 2 minutes of a failing instance before it’s recovered. This is as recommended in the manual setup documentation.

That documentation page also shows how to create alarms to stop, terminate, or reboot instances. I’ve not needed to do any of those, but if you do, you should be able to adapt this template snippet to match.


Hope this helps you recover,


🎉 My book Speed Up Your Django Tests is now up to date for Django 3.2. 🎉
Buy now on Gumroad

Subscribe via RSS, Twitter, or email:

One summary email a week, no spam, I pinky promise.

Related posts:

Tags: aws, cloudformation