Operations testing - AWS Prescriptive Guidance

Operations testing

Like products, IT operations should be tested end to end on a regular cadence. Although enterprise customers have adopted operational testing for activities such as disaster recovery, it should be extended to other operations domains, such as incident and event management. Game-day scenarios, like fire drills, test how your processes, tools, and people react when an operations event occurs.

Here are some prescriptive game-day scenarios used to test incident and event management on AWS:

  • Amazon Elastic Compute Cloud (Amazon EC2) CPU utilization stress test

  • Amazon EC2 network stress test

  • Amazon EC2 memory stress test

  • Amazon Elastic Container Service (Amazon ECS) task failure scenarios

  • AWS Lambda concurrency limits and cold start impact

  • Amazon API Gateway throttling and latency injection

  • Amazon Relational Database Service (Amazon RDS) memory stress test

  • Amazon RDS failover testing

  • Amazon RDS storage stress test

  • Amazon DynamoDB throttling and hot partition testing

  • Availability Zone failure simulation
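As an illustration of how one scenario from the list above might be written down as a repeatable experiment, here is a minimal Python sketch for the Amazon EC2 CPU utilization stress test. It assumes AWS Fault Injection Service (AWS FIS) as the tool, which this page does not name. The function `build_cpu_stress_template`, the target name `GameDayInstances`, the action name `StressCpu`, and the tag, role, and Region values are all hypothetical. The action ID `aws:ssm:send-command` and the `AWSFIS-Run-CPU-Stress` document are given as understood from the AWS FIS documentation and should be verified before use. The sketch only builds the template as a dictionary; it makes no AWS calls.

```python
import json


def build_cpu_stress_template(role_arn, tag_key, tag_value,
                              region="us-east-1", duration_minutes=5,
                              alarm_arn=None):
    """Build a dict shaped like an AWS FIS experiment template (assumed shape).

    All names and default values here are illustrative assumptions.
    """
    # Stop the experiment when a CloudWatch alarm fires, if one is supplied;
    # otherwise declare that there is no stop condition.
    if alarm_arn:
        stop_conditions = [{"source": "aws:cloudwatch:alarm", "value": alarm_arn}]
    else:
        stop_conditions = [{"source": "none"}]

    return {
        "description": "Game day: Amazon EC2 CPU utilization stress test",
        "roleArn": role_arn,
        "stopConditions": stop_conditions,
        "targets": {
            # Select every EC2 instance carrying the game-day tag.
            "GameDayInstances": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "StressCpu": {
                "actionId": "aws:ssm:send-command",
                "parameters": {
                    "documentArn": (
                        f"arn:aws:ssm:{region}::document/AWSFIS-Run-CPU-Stress"
                    ),
                    # The document parameters are passed as a JSON string.
                    "documentParameters": json.dumps(
                        {"DurationSeconds": str(duration_minutes * 60)}
                    ),
                    # ISO 8601 duration, for example PT5M for five minutes.
                    "duration": f"PT{duration_minutes}M",
                },
                "targets": {"Instances": "GameDayInstances"},
            }
        },
    }


if __name__ == "__main__":
    template = build_cpu_stress_template(
        role_arn="arn:aws:iam::123456789012:role/example-fis-role",  # placeholder
        tag_key="GameDay",
        tag_value="true",
    )
    print(json.dumps(template, indent=2))
```

Keeping the scenario as data makes it easy to review before a game day and to rerun unchanged on the next one. Supplying an alarm as a stop condition is a reasonable safeguard when the experiment targets production.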

Consider using the following AWS services to run testing scenarios:

As a best practice, you should test your IT operations starting with incident and event management, and extend testing to other operational domains. It's also crucial to have a predetermined game-day schedule. Here are some example schedules:

Prod or non-prod schedule

[Figure: game-day schedule for production or non-production environments]

Prod and non-prod schedule

[Figure: game-day schedule for production and non-production environments]
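A predetermined schedule can also be generated in code. The following Python sketch is illustrative only: the monthly cadence, the scenario rotation, and the reading of the two schedule types are assumptions, not taken from the schedules on this page. It assumes that "or" means each game day runs in one environment, alternating between non-prod and prod, and that "and" means each game day runs in both. The function `build_schedule` and the `SCENARIOS` list are hypothetical names.

```python
from datetime import date
from itertools import cycle

# A few scenarios from the list on this page; the rotation order is arbitrary.
SCENARIOS = [
    "Amazon EC2 CPU utilization stress test",
    "Amazon RDS failover testing",
    "Availability Zone failure simulation",
]


def build_schedule(start, months, mode="or"):
    """Return (date, environment, scenario) rows, one game day per month.

    mode="or":  each game day runs in one environment, alternating.
    mode="and": each game day runs in both environments.
    """
    if mode not in ("or", "and"):
        raise ValueError("mode must be 'or' or 'and'")
    if start.day > 28:
        raise ValueError("use a day of month <= 28 so every month has it")

    environments = cycle(["non-prod", "prod"])
    scenarios = cycle(SCENARIOS)
    rows = []
    for offset in range(months):
        # Step forward one calendar month at a time, keeping the same day.
        month_index = start.month - 1 + offset
        day = date(start.year + month_index // 12, month_index % 12 + 1, start.day)
        scenario = next(scenarios)
        if mode == "and":
            rows.append((day, "non-prod", scenario))
            rows.append((day, "prod", scenario))
        else:
            rows.append((day, next(environments), scenario))
    return rows


if __name__ == "__main__":
    for day, environment, scenario in build_schedule(date(2025, 1, 15), 3, "or"):
        print(day.isoformat(), environment, scenario)
```

In this sketch, running each scenario in non-prod before prod lets the team rehearse the response before exercising it against live workloads.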