Review: AWS Fault Injection Simulator (FIS) – Chaos as a Service?

Michael Wittig – Updated 15 Apr 2021

AWS allows us to run applications distributed across EC2 instances and Availability Zones. By adding load balancers or message queues to the architecture, we can achieve fault tolerance or high availability. But how can we test that our system survives faults in reality? Assume an application has five consumers and seven downstream dependencies. What happens if one of them fails? Are all timeouts configured correctly? Are all applications retrying? What happens if the network is slow?

So many things can go wrong that it is not possible to understand all consequences upfront. Therefore, a new approach emerged: chaos engineering. With chaos engineering, we simulate faults in our systems and observe the consequences. The trick is that we can simulate faults as often as we wish. We don’t have to wait for the one day a year when things go horribly wrong. AWS released Fault Injection Simulator (FIS) as a tool to run controlled fault experiments within our AWS accounts.

Chaos as a Service

In this blog post, you will get an in-depth understanding of what FIS can do for you today, not what marketing hopes it will become.

Concepts

Everything starts with an experiment template. The experiment template defines the targets that participate in the experiment. Supported targets are:

- EC2 instances
- ECS container instances
- EKS node groups
- RDS clusters & instances
- IAM roles

You can select targets by ARN, tag, or filter. I used the aws:autoscaling:groupName tag to select all EC2 instances launched by one Auto Scaling group. The selection mode then picks targets at random, either as an absolute number or as a percentage. I picked one EC2 instance randomly from the fleet of instances launched by my Auto Scaling group.

The actions define the injected faults. You can run actions in parallel or sequence. Possible actions are (see next section for details):

- AWS API level errors for the EC2 service
- Stop/reboot/terminate EC2 instances
- Run SSM commands on EC2 instances to stress CPU or memory, add network latency, or kill a process
- Reboot RDS instance
- Failover RDS cluster
- Drain ECS container instance
- Terminate EKS node group instance

My experiment looks like this:

- Terminate one EC2 instance from my Auto Scaling group
- Wait 8 minutes
- Stress CPU on one EC2 instance for 3 minutes
- Wait 5 minutes

The most important question is this: What do you expect based on the chaos you simulate? My architecture is using SQS to decouple producers from my EC2 consumers. I expect that if an instance is randomly terminated, the message is processed again by another EC2 instance. How long will it take? I expect the worst-case scenario to be: 2 * processing time + SQS visibility timeout. In my case, 5 minutes is a reasonable threshold that should never be reached.
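The worst-case estimate can be written down as a tiny helper. This is only a sketch: the function name and the example values (60 seconds of processing time, 120 seconds of visibility timeout) are assumptions chosen for illustration, not the numbers from my setup.

```python
def worst_case_seconds(processing_time_s, visibility_timeout_s):
    """Upper bound from the text: 2 * processing time + SQS visibility timeout."""
    return 2 * processing_time_s + visibility_timeout_s


# Hypothetical numbers, chosen only to illustrate the formula.
estimate = worst_case_seconds(processing_time_s=60, visibility_timeout_s=120)
threshold = 5 * 60  # the 5-minute threshold, in seconds

print(f"worst case: {estimate}s, below threshold: {estimate < threshold}")
```

If the estimate for your own values gets close to the threshold, either the threshold or the visibility timeout probably deserves a second look before you run the experiment.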

The stop condition references a CloudWatch alarm (or alarms) to stop and roll back the experiment if things go wrong. Based on my expectation, I created an alarm to monitor the age of the oldest message in my SQS queue with a threshold of 5 minutes. Suppose my experiment impacts the application so badly that the queue contains messages older than five minutes. In that case, the experiment is stopped, and I have to double-check my assumption or fix the problem.

Last but not least, you can run an experiment based on the template created before.

Supported chaos

Whether FIS is helpful for you depends on the technology you use to run your workloads. The following list summarizes the chaos you can simulate today:

EC2
- API
  - Simulate internal errors
  - Simulate throttle errors
  - Simulate unavailable errors
- Reboot EC2 instance
- Stop EC2 instance
- Terminate EC2 instance
- SSM commands (requires SSM agent):
  - CPU stress (Linux only)
  - Memory stress (Linux only)
  - Network latency (Linux only)
  - Kill process (Linux only)
ECS
- Drain cluster instance
EKS
- Terminate node group instance
RDS
- Reboot database instance
- Failover database cluster

In a nutshell, if your workloads run on EC2 Linux, you are lucky. Otherwise, FIS is not yet for you.

Things I miss in no particular order:

- API chaos for DynamoDB, SQS, S3, …
- Lambda
- EC2: Run out of disk space
- EC2/EBS: Slow disk
- EC2: Network packet loss
- EC2/Java: GC stress
- ECS/EKS/Fargate: Container level stress
- EC2: Spot Market capacity shortage
- EC2: AZ failure
- VPC: network partitions
- VPC: NAT failure

I bet you can add many items to this list as well! Feel free to reach out to me. I will add your ideas to the list.

Integrations

AWS FIS is like a lost island. No other service integrates with FIS yet. This leads to the problem: How can you start experiments in an automated way? There is no CodePipeline integration. You could start an experiment using the AWS CLI, but the CLI does not offer a wait helper that blocks until the experiment is complete. You need to implement the status polling yourself.
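A minimal sketch of such a polling loop in Python follows. It assumes a client that behaves like boto3.client("fis"), whose get_experiment call returns the status under experiment.state.status. The helper name, the poll interval, the timeout, and the set of terminal states are my assumptions; check them against the current API reference before relying on this.

```python
import time

# States in which the experiment will not change anymore (assumed list).
TERMINAL_STATES = {"completed", "stopped", "failed"}


def wait_for_experiment(client, experiment_id, poll_seconds=15,
                        timeout_seconds=3600, sleep=time.sleep):
    """Poll get_experiment until the experiment reaches a terminal state.

    `client` is expected to behave like boto3.client("fis").
    Returns the final status string or raises TimeoutError.
    """
    waited = 0
    while True:
        response = client.get_experiment(id=experiment_id)
        status = response["experiment"]["state"]["status"]
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(
                f"experiment {experiment_id} is still {status} "
                f"after {waited} seconds"
            )
        sleep(poll_seconds)
        waited += poll_seconds
```

With boto3 installed, you would pass boto3.client("fis") as the client and the experiment ID returned by start_experiment. The sleep parameter exists only so that the loop can be exercised without real waiting.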

The unbelievable fact that made me smile: CloudFormation support is available while Terraform support is missing. To save you some time and frustration, I share the CloudFormation template I used to set up my experiment template:

ExperimentTemplate:
  Type: 'AWS::FIS::ExperimentTemplate'
  Properties:
    Actions:
      'terminate-asg':
        ActionId: 'aws:ec2:terminate-instances'
        Targets:
          Instances: asg # Instances seems to be an undocumented magic value
      'await-terminate-asg':
        ActionId: 'aws:fis:wait'
        Parameters:
          duration: PT8M
        StartAfter:
        - 'terminate-asg'
      'cpu-stress-asg':
        ActionId: 'aws:ssm:send-command'
        Parameters:
          documentArn: !Sub 'arn:${AWS::Partition}:ssm:${AWS::Region}::document/AWSFIS-Run-CPU-Stress'
          documentVersion: 4
          documentParameters: '{"DurationSeconds":"180", "InstallDependencies":"True"}'
          duration: 'PT3M'
        StartAfter:
        - 'await-terminate-asg'
        Targets:
          Instances: asg # Instances seems to be an undocumented magic value
      'await-cpu-stress-asg':
        ActionId: 'aws:fis:wait'
        Parameters:
          duration: PT5M
        StartAfter:
        - 'cpu-stress-asg'
    Description: 'cloudonaut'
    RoleArn: !GetAtt 'Role.Arn'
    StopConditions:
    - Source: 'aws:cloudwatch:alarm'
      Value: !GetAtt 'Alarm.Arn'
    Tags: # Why are tags required? Only AWS knows.
      PLACE: HOLDER
    Targets:
      asg:
        ResourceTags:
          'aws:autoscaling:groupName': 'NAME_OF_YOUR_ASG'
        ResourceType: 'aws:ec2:instance'
        SelectionMode: 'COUNT(1)' # alternative 'PERCENT(50)'
Alarm:
  Type: 'AWS::CloudWatch::Alarm'
  Properties:
    AlarmDescription: 'FIS stop condition'
    Namespace: 'AWS/SQS'
    MetricName: ApproximateAgeOfOldestMessage
    Dimensions:
    - Name: QueueName
      Value: 'NAME_OF_YOUR_QUEUE'
    Statistic: Maximum
    Period: 60
    EvaluationPeriods: 1
    Threshold: 300 # 5 minutes
    ComparisonOperator: GreaterThanOrEqualToThreshold
    TreatMissingData: notBreaching
Role:
  Type: 'AWS::IAM::Role'
  Properties:
    AssumeRolePolicyDocument:
      Version: '2012-10-17'
      Statement:
      - Effect: Allow
        Principal:
          Service: 'fis.amazonaws.com'
        Action: 'sts:AssumeRole'
    Policies:
    - PolicyName: fis # Source https://docs.aws.amazon.com/fis/latest/userguide/getting-started-iam.html#getting-started-iam-service-role
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
        - Sid: AllowFISExperimentRoleReadOnly
          Effect: Allow
          Action:
          - 'ec2:DescribeInstances'
          - 'ecs:DescribeClusters'
          - 'ecs:ListContainerInstances'
          - 'eks:DescribeNodegroup'
          - 'iam:ListRoles'
          - 'rds:DescribeDBInstances'
          - 'rds:DescribeDbClusters'
          - 'ssm:ListCommands'
          Resource: '*'
        - Sid: AllowFISExperimentRoleEC2Actions
          Effect: Allow
          Action:
          - 'ec2:RebootInstances'
          - 'ec2:StopInstances'
          - 'ec2:StartInstances'
          - 'ec2:TerminateInstances'
          Resource: !Sub 'arn:${AWS::Partition}:ec2:${AWS::Region}:${AWS::AccountId}:instance/*'
        - Sid: AllowFISExperimentRoleECSActions
          Effect: Allow
          Action:
          - 'ecs:UpdateContainerInstancesState'
          - 'ecs:ListContainerInstances'
          Resource: !Sub 'arn:${AWS::Partition}:ecs:${AWS::Region}:${AWS::AccountId}:container-instance/*'
        - Sid: AllowFISExperimentRoleEKSActions
          Effect: Allow
          Action: 'ec2:TerminateInstances'
          Resource: !Sub 'arn:${AWS::Partition}:ec2:${AWS::Region}:${AWS::AccountId}:instance/*'
        - Sid: AllowFISExperimentRoleFISActions
          Effect: Allow
          Action:
          - 'fis:InjectApiInternalError'
          - 'fis:InjectApiThrottleError'
          - 'fis:InjectApiUnavailableError'
          Resource: !Sub 'arn:${AWS::Partition}:fis:${AWS::Region}:${AWS::AccountId}:experiment/*'
        - Sid: AllowFISExperimentRoleRDSReboot
          Effect: Allow
          Action: 'rds:RebootDBInstance'
          Resource: !Sub 'arn:${AWS::Partition}:rds:${AWS::Region}:${AWS::AccountId}:db:*'
        - Sid: AllowFISExperimentRoleRDSFailOver
          Effect: Allow
          Action: 'rds:FailoverDBCluster'
          Resource: !Sub 'arn:${AWS::Partition}:rds:${AWS::Region}:${AWS::AccountId}:cluster:*'
        - Sid: AllowFISExperimentRoleSSMSendCommand
          Effect: Allow
          Action: 'ssm:SendCommand'
          Resource:
          - !Sub 'arn:${AWS::Partition}:ec2:${AWS::Region}:${AWS::AccountId}:instance/*'
          - !Sub 'arn:${AWS::Partition}:ssm:${AWS::Region}::document/*' # AWS managed documents
          - !Sub 'arn:${AWS::Partition}:ssm:${AWS::Region}:${AWS::AccountId}:document/*'
        - Sid: AllowFISExperimentRoleSSMCancelCommand
          Effect: Allow
          Action: 'ssm:CancelCommand'
          Resource: '*'

Pricing

You pay $0.10 per minute for each action running.

Fun fact: FIS wins the award for the most expensive service while doing nothing:

- $0.0000105 Lambda function waiting for five minutes
- $0.000025 Step Functions waiting for five minutes
- $0.00035 EC2 instance t4g.nano waiting for five minutes
- $0.50 FIS waiting for five minutes
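The FIS number is simple arithmetic on the per-minute price. The sketch below reproduces it and estimates the timed actions of my experiment. It assumes that cost is strictly proportional to the minutes an action runs and ignores the short-lived terminate action and any rounding AWS applies; the function name is made up for illustration.

```python
PRICE_PER_ACTION_MINUTE = 0.10  # USD, as stated above


def fis_cost(action_minutes):
    """Estimated cost in USD for a list of per-action run times in minutes."""
    return round(PRICE_PER_ACTION_MINUTE * sum(action_minutes), 2)


print(fis_cost([5]))        # a single five-minute wait action
print(fis_cost([8, 3, 5]))  # both waits plus the CPU stress from my experiment
```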

Service Maturity Table

The following table summarizes the maturity of the service:

Criteria	Support	Score
Feature Completeness	✅	5
Chaos support: EC2	✅	7
Chaos support: Containers	⚠️	2
Chaos support: Serverless	❌	0
Documentation detailedness	✅	7
Tags (Grouping + Billing)	✅	10
CloudFormation + Terraform support	✅	4
Emits CloudWatch Events	❌	0
IAM granularity	✅	10
Auditing via AWS CloudTrail	✅	10
Available in all commercial regions	✅	10
Total Maturity Score (0-10)	⚠️	5.9

Our maturity score for Fault Injection Simulator (FIS) is 5.9 on a scale from 0 to 10. I want to highlight that the IAM granularity is excellent! We have resource-level constraints, and we can implement tag-based policies if needed.

Summary

Apart from the limited chaos support, FIS is mature compared to other newly launched services. If you run workloads on EC2 and your architecture is designed to have no single point of failure, I would recommend giving FIS a try. You will likely learn something about your architecture, and the chances that you discover a bug are high as well!

One thing to keep in mind with EC2-based workloads: if you terminate EC2 instances that sit behind user-facing load balancers, your experiment will impact your users. The load balancer does not retry a request if the target dies, and neither does the user’s browser. I wish ALBs could retry for us!

PS: I thought I had discovered a bug in FIS: the tag-based EC2 instance selection includes terminated instances. I reached out to AWS support to clarify. It turns out that this is expected behavior (not sure if I agree). If we want to exclude terminated instances, we had better use the filter-based approach.
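As a rough sketch, a target that combines the tag with a filter on the instance state could be built like this. The dictionary uses the key names of the FIS API as I understand them (resourceType, resourceTags, filters, selectionMode); the helper name is hypothetical, and I have not verified this exact combination against a live account.

```python
def running_instances_target(asg_name, selection_mode="COUNT(1)"):
    """Build a FIS target that only matches running EC2 instances of one ASG."""
    return {
        "resourceType": "aws:ec2:instance",
        "resourceTags": {"aws:autoscaling:groupName": asg_name},
        # Assumption: State.Name is the attribute path that exposes the
        # instance state, so terminated instances no longer match.
        "filters": [{"path": "State.Name", "values": ["running"]}],
        "selectionMode": selection_mode,
    }


target = running_instances_target("NAME_OF_YOUR_ASG")
print(target["filters"])
```

The same idea should carry over to the CloudFormation template above, where the target definition would need an equivalent filter entry.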

Michael Wittig

I’ve been building on AWS since 2012 together with my brother Andreas. We are sharing our insights into all things AWS on cloudonaut and have written the book AWS in Action. Besides that, we’re currently working on bucketAV, HyperEnv for GitHub Actions, and marbot.

Here are the contact options for feedback and questions.