Operations testing
Like products, IT operations should be tested end to end on a regular cadence. Enterprise customers have adopted operational testing for activities such as disaster recovery, but that testing should be extended to other operations domains, such as incident and event management. Game-day scenarios are exercises, similar to fire drills, that test how your processes, tools, and people react when an operations event occurs.
Here are some prescriptive game-day scenarios used to test incident and event management on AWS:
- Amazon Elastic Compute Cloud (Amazon EC2) CPU utilization stress test
- Amazon EC2 network stress test
- Amazon EC2 memory stress test
- Amazon Elastic Container Service (Amazon ECS) task failure scenarios
- AWS Lambda concurrency limits and cold start impact
- Amazon API Gateway throttling and latency injection
- Amazon Relational Database Service (Amazon RDS) memory stress test
- Amazon RDS failover testing
- Amazon RDS storage stress
- Amazon DynamoDB throttling and hot partition testing
- Availability Zone failure simulation
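As an illustration of the first scenario, the following is a minimal sketch of an AWS Fault Injection Service (AWS FIS) experiment template for an Amazon EC2 CPU stress test, built as a plain Python dictionary. It assumes the target instances are managed by AWS Systems Manager and that the AWS-provided `AWSFIS-Run-CPU-Stress` document is used through the `aws:ssm:send-command` action. The function name, tag values, role ARN, and alarm ARN are hypothetical placeholders, not values from this guide.

```python
import json


def build_cpu_stress_template(role_arn, region, tag_key, tag_value,
                              alarm_arn=None, duration_minutes=2):
    """Return an AWS FIS experiment template (as a dict) for an EC2 CPU stress test."""
    # Guardrail: stop the experiment if a CloudWatch alarm fires.
    # A source of "none" means the experiment has no stop condition.
    if alarm_arn:
        stop_conditions = [{"source": "aws:cloudwatch:alarm", "value": alarm_arn}]
    else:
        stop_conditions = [{"source": "none"}]

    return {
        "description": "Game day: EC2 CPU utilization stress test",
        "roleArn": role_arn,  # hypothetical IAM role that AWS FIS assumes
        "stopConditions": stop_conditions,
        "targets": {
            "stress-targets": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "COUNT(1)",  # limit the blast radius to one instance
            }
        },
        "actions": {
            "cpu-stress": {
                "actionId": "aws:ssm:send-command",
                "parameters": {
                    "documentArn": f"arn:aws:ssm:{region}::document/AWSFIS-Run-CPU-Stress",
                    # The SSM document expects its parameters as a JSON string.
                    "documentParameters": json.dumps(
                        {"DurationSeconds": str(duration_minutes * 60)}
                    ),
                    "duration": f"PT{duration_minutes}M",  # ISO 8601 duration
                },
                "targets": {"Instances": "stress-targets"},
            }
        },
    }


if __name__ == "__main__":
    template = build_cpu_stress_template(
        role_arn="arn:aws:iam::111122223333:role/ExampleFisRole",  # placeholder
        region="us-east-1",
        tag_key="GameDay",
        tag_value="cpu-stress",
    )
    print(json.dumps(template, indent=2))
```

Printing the template as JSON makes it easy to review the blast radius and stop conditions with the team before anyone creates the experiment.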
Consider using the following AWS services to run testing scenarios:
- AWS Fault Injection Service (AWS FIS) for controlled chaos engineering experiments
- Amazon CloudWatch Synthetics for application endpoint testing
- Automation, a capability of AWS Systems Manager, for orchestrating complex scenarios
- AWS Resilience Hub for assessing and improving application resiliency
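To show how one of these services might be driven from a script during a game day, here is a minimal sketch that starts an AWS FIS experiment and polls it until it finishes. It assumes a client that exposes the boto3 `fis` operations `start_experiment` and `get_experiment`. The function name, the experiment template ID, and the polling and timeout values are hypothetical, and the set of terminal status names is an assumption to verify against the current AWS FIS API reference.

```python
import time
import uuid

# Assumed terminal statuses for an AWS FIS experiment.
TERMINAL_STATES = {"completed", "stopped", "failed", "cancelled"}


def run_game_day_experiment(fis_client, template_id, poll_seconds=15,
                            timeout_seconds=1800, sleep=time.sleep):
    """Start an AWS FIS experiment and wait until it reaches a terminal state.

    Returns a tuple of (experiment_id, final_status).
    """
    started = fis_client.start_experiment(
        clientToken=str(uuid.uuid4()),  # idempotency token for the request
        experimentTemplateId=template_id,
    )
    experiment_id = started["experiment"]["id"]
    status = started["experiment"]["state"]["status"]

    waited = 0
    while status not in TERMINAL_STATES:
        if waited >= timeout_seconds:
            raise TimeoutError(
                f"Experiment {experiment_id} still '{status}' after {waited}s"
            )
        sleep(poll_seconds)
        waited += poll_seconds
        response = fis_client.get_experiment(id=experiment_id)
        status = response["experiment"]["state"]["status"]

    return experiment_id, status
```

With boto3 installed and credentials configured, a call could look like `run_game_day_experiment(boto3.client("fis"), "EXT1a2b3c4d5e6f7")`, where the template ID is a placeholder. Passing the client in as an argument keeps the function testable with a stub, which is useful for rehearsing the game-day tooling itself.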
As a best practice, start by testing incident and event management, and then extend testing to other operational domains. It's also crucial to set a game-day schedule in advance. Here are some example schedules:
Prod or non-prod schedule

Prod and non-prod schedule
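A predetermined game-day schedule can also be kept as code so that it is easy to review and regenerate. The sketch below rotates through a list of scenarios and environments at a fixed interval. The function name, the four-week cadence, the start date, and the environment labels are hypothetical choices for illustration and are not taken from the example schedules in this guide.

```python
from datetime import date, timedelta
from itertools import cycle


def build_game_day_schedule(start, weeks_between, scenarios, environments, count):
    """Return `count` game days, rotating through scenarios and environments."""
    scenario_cycle = cycle(scenarios)
    environment_cycle = cycle(environments)
    return [
        {
            "date": start + timedelta(weeks=weeks_between * i),
            "scenario": next(scenario_cycle),
            "environment": next(environment_cycle),
        }
        for i in range(count)
    ]


if __name__ == "__main__":
    schedule = build_game_day_schedule(
        start=date(2025, 1, 6),  # hypothetical first game day
        weeks_between=4,         # hypothetical cadence
        scenarios=[
            "Amazon EC2 CPU utilization stress test",
            "Amazon RDS failover testing",
        ],
        environments=["non-prod", "prod"],
        count=4,
    )
    for game_day in schedule:
        print(game_day["date"].isoformat(), game_day["environment"], game_day["scenario"])
```

Passing a single-item environment list, such as `["non-prod"]`, keeps every game day in one environment.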
