Review: AWS Fault Injection Simulator (FIS) – Chaos as a Service?
AWS allows us to run applications distributed across EC2 instances and availability zones. By adding load balancers or message queues to the architecture, we can achieve fault tolerance or high availability. But how can we test that our system actually survives faults? Suppose an application has five consumers and seven downstream dependencies. What happens if one of them fails? Are all timeouts configured correctly? Do all applications retry? What happens if the network is slow? So many things can go wrong; it is impossible to understand all the consequences upfront. Therefore, a new approach emerged: chaos engineering. With chaos engineering, we simulate faults in our systems and observe the consequences. The trick is that we can simulate faults as often as we wish. We don’t have to wait for the one day a year when things go horribly wrong. AWS released Fault Injection Simulator (FIS) as a tool to run controlled fault experiments within our AWS accounts.
In this blog post, you will get an in-depth understanding of what FIS can do for you today, not what marketing hopes it will become.
Everything starts with an experiment template. The experiment template defines the targets that participate in the experiment. Supported targets are:
- EC2 instances
- ECS container instances
- EKS node groups
- RDS clusters & instances
- IAM roles
You can select targets by ARN, tag, or filter. I used the aws:autoscaling:groupName tag to select all EC2 instances launched by one auto scaling group. The selection mode then picks a random subset of those targets by an absolute count or a percentage. I picked one EC2 instance at random from the fleet of instances launched by my auto scaling group.
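The target selection above can be sketched as the targets map passed to the CreateExperimentTemplate API (for example via boto3's `fis.create_experiment_template`). The target name and the auto scaling group name are placeholders, not values from the post:

```python
# Sketch of a FIS target definition as passed to the CreateExperimentTemplate
# API. "my-asg" is a placeholder for your auto scaling group's name.
targets = {
    "oneRandomInstance": {
        "resourceType": "aws:ec2:instance",
        # Match all instances launched by the auto scaling group ...
        "resourceTags": {"aws:autoscaling:groupName": "my-asg"},
        # ... and randomly pick exactly one of them.
        "selectionMode": "COUNT(1)",
    }
}
```

`COUNT(n)` selects an absolute number of targets; `PERCENT(n)` selects a relative share instead.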
The actions define the injected faults. You can run actions in parallel or sequence. Possible actions are (see next section for details):
- AWS API level errors for the EC2 service
- Stop/reboot/terminate EC2 instances
- Run SSM commands on EC2 instances to stress CPU or memory, add network latency, or kill a process
- Reboot RDS instance
- Failover RDS cluster
- Drain ECS container instance
- Terminate EKS node group instance
My experiment looks like this:
- Terminate one EC2 instance from my auto scaling group
- Wait 8 minutes
- Stress CPU on one EC2 instance for three minutes
- Wait 5 minutes
The most important question is this: what do you expect based on the chaos you simulate? My architecture uses SQS to decouple producers from my EC2 consumers. I expect that if an instance is terminated at random, its in-flight message is processed again by another EC2 instance. How long will that take? I expect the worst case to be 2 * processing time + SQS visibility timeout. In my case, 5 minutes is a reasonable threshold that should never be reached.
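To make the estimate concrete, here is the worst-case formula with illustrative numbers (the processing time and visibility timeout below are assumptions, not values from the post):

```python
# Worst case for a message whose consumer is terminated mid-processing:
# the first processing attempt is lost, the message stays invisible until
# the SQS visibility timeout expires, then another consumer processes it
# from scratch. Numbers are illustrative assumptions.
processing_time = 60        # seconds per message (assumed)
visibility_timeout = 120    # seconds (assumed; should exceed processing time)

worst_case = 2 * processing_time + visibility_timeout
print(worst_case)  # 240 seconds, comfortably below the 5-minute threshold
```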
The stop condition references a CloudWatch alarm (or alarms) to stop and roll back the experiment if things go wrong. Based on my expectation, I created an alarm to monitor the age of the oldest message in my SQS queue with a threshold of 5 minutes. Suppose my experiment impacts the application so badly that the queue contains messages older than five minutes. In that case, the experiment is stopped, and I have to double-check my assumption or fix the problem.
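Such a stop-condition alarm can be sketched as the parameters for CloudWatch's PutMetricAlarm API (boto3: `cloudwatch.put_metric_alarm(**alarm)`). The alarm name and queue name are placeholders:

```python
# Sketch of a CloudWatch alarm on the age of the oldest SQS message,
# usable as a FIS stop condition. Names are placeholders.
alarm = {
    "AlarmName": "fis-stop-condition",
    "Namespace": "AWS/SQS",
    "MetricName": "ApproximateAgeOfOldestMessage",
    "Dimensions": [{"Name": "QueueName", "Value": "my-queue"}],
    "Statistic": "Maximum",
    "Period": 60,
    "EvaluationPeriods": 1,
    "Threshold": 300,  # 5 minutes, per the expectation above
    "ComparisonOperator": "GreaterThanThreshold",
}
```

The experiment template then references the alarm's ARN in a stop condition with source `aws:cloudwatch:alarm`.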
Last but not least, you can run an experiment based on the template created before.
Whether FIS is helpful for you depends on the technology you use to run your workloads. The following list summarizes the chaos you can simulate today:
- EC2 API errors:
  - Simulate internal errors
  - Simulate throttle errors
  - Simulate unavailable errors
- EC2 instances:
  - Reboot EC2 instance
  - Stop EC2 instance
  - Terminate EC2 instance
- SSM commands (requires SSM agent):
  - CPU stress (Linux only)
  - Memory stress (Linux only)
  - Network latency (Linux only)
  - Kill process (Linux only)
- ECS: Drain cluster instance
- EKS: Terminate node group instance
- RDS:
  - Reboot database instance
  - Failover database cluster
In a nutshell, if your workloads run on EC2 Linux, you are lucky. Otherwise, FIS is not yet for you.
Things I miss, in no particular order:
- API chaos for DynamoDB, SQS, S3, …
- EC2: Run out of disk space
- EC2/EBS: Slow disk
- EC2: Network packet loss
- EC2/Java: GC stress
- ECS/EKS/Fargate: Container level stress
- EC2: Spot Market capacity shortage
- EC2: AZ failure
- VPC: network partitions
- VPC: NAT failure
I bet you can add many items to this list as well! Feel free to reach out to me. I will add your ideas to the list.
AWS FIS is like a lost island: no other service integrates with FIS yet. This raises the question: how can you start experiments in an automated way? There is no CodePipeline integration. You could start an experiment using the AWS CLI, but the CLI does not provide a wait helper to block until the experiment completes. You need to implement the status polling yourself.
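The missing wait helper can be sketched as a small polling loop. The state fetcher is injectable so the loop can be tested without AWS access; with boto3 you would pass something like `lambda: fis.get_experiment(id=exp_id)["experiment"]["state"]["status"]`:

```python
import time

def wait_for_experiment(fetch_state, interval=30, max_attempts=60):
    """Poll until the FIS experiment reaches a terminal state.

    fetch_state: callable returning the experiment's current status string
    (e.g. "pending", "running", "completed", "stopped", "failed").
    """
    terminal = {"completed", "stopped", "failed"}
    for _ in range(max_attempts):
        status = fetch_state()
        if status in terminal:
            return status
        time.sleep(interval)  # back off between polls
    raise TimeoutError("experiment did not finish in time")
```

This is only a sketch of the polling you have to write yourself, not an official helper.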
The unbelievable fact that made me smile: CloudFormation support is available while Terraform support is missing. To save you some time and frustration, I shared the CloudFormation template I used to set up my experiment template.
You pay $0.10 per minute for each action running.
Fun fact: FIS wins the award for the most expensive service while doing nothing:
- $0.0000105 Lambda function waiting for five minutes
- $0.000025 Step Functions waiting for five minutes
- $0.00035 EC2 instance t4g.nano waiting for five minutes
- $0.50 FIS waiting for five minutes
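The FIS figure follows directly from the pricing above: one action idling for five minutes at $0.10 per action-minute:

```python
# FIS bills $0.10 per minute for each running action.
fis_cost = 0.10 * 5 * 1      # 1 action waiting for 5 minutes
t4g_nano = 0.00035           # EC2 t4g.nano for 5 minutes (from the list above)

print(fis_cost)                    # 0.5 dollars
print(round(fis_cost / t4g_nano))  # ~1429x the idle EC2 instance
```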
The maturity rating considers the following criteria:
- Chaos support: EC2
- Chaos support: Containers
- Chaos support: Serverless
- Tags (Grouping + Billing)
- CloudFormation + Terraform support
- Emits CloudWatch Events
- Auditing via AWS CloudTrail
- Available in all commercial regions
Our maturity score for Fault Injection Simulator (FIS) is 5.7 on a scale from 0 to 10. I want to highlight that the IAM granularity is excellent! We have resource-level constraints, and we can implement tag-based policies if needed.
Despite the limited chaos support, FIS is mature compared to other newly launched services. If you run workloads on EC2 and your architecture is designed to have no single point of failure, I recommend giving FIS a try. You will likely learn something about your architecture, and the chances are high that you will discover a bug as well!
One thing to keep in mind with EC2-based workloads: if you terminate EC2 instances that sit behind user-facing load balancers, your experiment will impact your users. The load balancer does not retry a request if the target dies, and neither does the user’s browser. I wish ALBs could retry for us!
PS: I think I discovered a bug in FIS: the tag-based EC2 instance selection includes terminated instances. I reached out to AWS Support to clarify. It turns out that this is expected behavior (not sure I agree). If we want to exclude terminated instances, we are better off using the filter-based approach.
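A filter-based target that excludes terminated instances can be sketched like this; filter paths follow the attribute structure of the EC2 DescribeInstances response, and the auto scaling group name is a placeholder:

```python
# Sketch of a filter-based FIS target that only matches running instances,
# avoiding the terminated-instance surprise described above.
targets = {
    "runningInstances": {
        "resourceType": "aws:ec2:instance",
        "resourceTags": {"aws:autoscaling:groupName": "my-asg"},  # placeholder name
        "filters": [
            # Keep only instances whose state is "running".
            {"path": "State.Name", "values": ["running"]}
        ],
        "selectionMode": "COUNT(1)",
    }
}
```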