Operational Readiness - Operational Excellence Pillar

Evaluate the operational readiness of your workload, processes, procedures, and personnel to understand the operational risks related to your workload.

Use a consistent process (including manual or automated checklists) to know when you are ready to go live with your workload or a change. This process also helps you identify areas that you need to plan to address. Maintain runbooks that document your routine activities and playbooks that guide your processes for issue resolution.
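An automated checklist of the kind described above could be sketched as follows. This is a minimal illustration, not a prescribed format: the `Check` type, the checklist items, and the workload state fields are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Check:
    """One readiness checklist item (all names here are illustrative)."""
    name: str
    passed: Callable[[Dict], bool]  # predicate over the workload's state
    blocking: bool = True           # blocking items must pass before go-live


def review(checks: List[Check], state: Dict) -> Dict:
    """Run every check and summarise gaps that need a plan."""
    failed = [c for c in checks if not c.passed(state)]
    return {
        "ready": not any(c.blocking for c in failed),
        "blocking_gaps": [c.name for c in failed if c.blocking],
        "follow_ups": [c.name for c in failed if not c.blocking],
    }


# Hypothetical checklist and workload state, for illustration only.
CHECKLIST = [
    Check("runbooks documented", lambda s: s["runbooks"] > 0),
    Check("on-call rota staffed", lambda s: s["on_call_engineers"] >= 2),
    Check("game day held", lambda s: s["game_day_done"], blocking=False),
]

result = review(
    CHECKLIST,
    {"runbooks": 4, "on_call_engineers": 1, "game_day_done": False},
)
print(result)
```

Separating blocking items from follow-ups keeps the go-live decision explicit while still recording the areas that need a plan.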

Ensure personnel capability: Have a mechanism to validate that you have the appropriate number of trained personnel to provide support for operational needs. Train personnel and adjust personnel capacity as necessary to maintain effective support.

You will need enough team members to cover all activities (including on-call). Ensure that your teams have the skills they need to be successful by training them on your workload, your operations tools, and AWS.

AWS provides resources, including the AWS Getting Started Resource Center, AWS Blogs, AWS Online Tech Talks, AWS Events and Webinars, and the AWS Well-Architected Labs, that provide guidance, examples, and detailed walkthroughs to educate your teams. Additionally, AWS Training and Certification provides some free training through self-paced digital courses on AWS fundamentals. You can also register for instructor-led training to further support the development of your teams’ AWS skills.

Ensure consistent review of operational readiness: Ensure you have a consistent review of your readiness to operate a workload. Reviews must include, at a minimum, the operational readiness of the teams and the workload, and security requirements. Implement review activities in code and trigger automated review in response to events where appropriate, to ensure consistency and speed of execution, and to reduce errors caused by manual processes.

You should automate workload configuration testing by establishing baselines using AWS Config and checking your configurations against them using AWS Config Rules. You can evaluate security requirements and compliance using the services and features of AWS Security Hub. These services help you determine whether your workloads are aligned with best practices and standards.
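As a rough sketch of a configuration check, the evaluation logic of a custom AWS Config rule backed by AWS Lambda might look like the following. The required-tag baseline is an assumed policy chosen for illustration, and the sample event contains only the fields this sketch reads. In a deployed rule, the resulting record would typically be reported back through the AWS Config `put_evaluations` API together with the event's result token; that call is omitted here so the example runs without AWS access.

```python
import json

# Assumed baseline: tags every resource must carry (hypothetical policy).
REQUIRED_TAGS = {"environment", "owner"}


def evaluate(configuration_item: dict) -> str:
    """Return an AWS Config compliance type for one configuration item."""
    if configuration_item.get("configurationItemStatus") == "ResourceDeleted":
        return "NOT_APPLICABLE"
    tags = configuration_item.get("tags") or {}
    return "COMPLIANT" if REQUIRED_TAGS <= set(tags) else "NON_COMPLIANT"


def build_evaluation(event: dict) -> dict:
    """Turn a Config rule invocation event into one evaluation record."""
    # AWS Config delivers the invoking event as a JSON-encoded string.
    item = json.loads(event["invokingEvent"])["configurationItem"]
    return {
        "ComplianceResourceType": item["resourceType"],
        "ComplianceResourceId": item["resourceId"],
        "ComplianceType": evaluate(item),
        "OrderingTimestamp": item["configurationItemCaptureTime"],
    }


# Trimmed, hypothetical invocation event for local experimentation.
sample_event = {
    "invokingEvent": json.dumps({
        "configurationItem": {
            "resourceType": "AWS::EC2::Instance",
            "resourceId": "i-0123456789abcdef0",
            "configurationItemStatus": "OK",
            "configurationItemCaptureTime": "2024-01-01T00:00:00.000Z",
            "tags": {"environment": "prod"},  # "owner" tag is missing
        }
    }),
    "resultToken": "example-token",
}

print(build_evaluation(sample_event)["ComplianceType"])
```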

Use runbooks to perform procedures: Runbooks are documented procedures to achieve specific outcomes. Enable consistent and prompt responses to well-understood events by documenting procedures in runbooks. Implement runbooks as code and trigger the execution of runbooks in response to events where appropriate, to ensure consistency, speed responses, and reduce errors caused by manual processes.
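To illustrate what a runbook as code can look like, here is a minimal sketch in which a runbook is an ordered list of named steps executed against a shared context. The step functions are stand-ins: a real runbook would call your own tooling or AWS APIs, and the procedure shown (drain, restart, verify) is a hypothetical example.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[Dict], None]]


def run_runbook(steps: List[Step], context: Dict) -> List[str]:
    """Execute steps in order; stop at the first failure so a person can step in."""
    log = []
    for title, action in steps:
        try:
            action(context)
        except Exception as exc:
            log.append(f"FAILED {title}: {exc}")
            break
        log.append(f"OK {title}")
    return log


# Hypothetical step implementations that only update the context.
def drain(ctx: Dict) -> None:
    ctx["in_rotation"] = False


def restart(ctx: Dict) -> None:
    ctx["restarts"] = ctx.get("restarts", 0) + 1


def verify(ctx: Dict) -> None:
    if not ctx.get("healthy", False):
        raise RuntimeError("health check failed")
    ctx["in_rotation"] = True


RESTART_SERVICE = [
    ("drain instance", drain),
    ("restart service", restart),
    ("verify health and return to rotation", verify),
]

for line in run_runbook(RESTART_SERVICE, {"healthy": True}):
    print(line)
```

Because every run produces the same ordered log, the output can double as an audit trail of what was done and where a procedure stopped.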

Use playbooks to identify issues: Playbooks are documented processes to investigate issues. Enable consistent and prompt responses to failure scenarios by documenting investigation processes in playbooks. Implement playbooks as code and trigger playbook execution in response to events where appropriate, to ensure consistency, speed responses, and reduce errors caused by manual processes.
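A playbook as code can be sketched along similar lines: gather observations with a set of probes, then apply ordered rules to suggest a likely cause and the next step. The probes, thresholds, and findings below are hypothetical and would need to reflect your own workload.

```python
from typing import Callable, Dict, List, Tuple

Probe = Callable[[Dict], bool]
Rule = Tuple[Callable[[Dict[str, bool]], bool], str]


def investigate(probes: Dict[str, Probe], rules: List[Rule], signals: Dict) -> Dict:
    """Record every observation, then return the first finding whose rule matches."""
    observations = {name: probe(signals) for name, probe in probes.items()}
    for condition, finding in rules:
        if condition(observations):
            return {"observations": observations, "finding": finding}
    return {"observations": observations, "finding": "unknown: escalate to on-call"}


# Hypothetical probes over metrics that have already been collected.
PROBES = {
    "error_rate_high": lambda s: s["errors"] / s["requests"] > 0.05,
    "recent_deploy": lambda s: s["minutes_since_deploy"] < 30,
    "cpu_saturated": lambda s: s["cpu_percent"] > 90,
}

# Ordered rules: the first match wins.
RULES = [
    (lambda o: o["error_rate_high"] and o["recent_deploy"],
     "likely bad deployment: run the rollback runbook"),
    (lambda o: o["cpu_saturated"],
     "likely capacity shortfall: run the scale-out runbook"),
]

report = investigate(
    PROBES,
    RULES,
    {"errors": 80, "requests": 1000, "minutes_since_deploy": 12, "cpu_percent": 40},
)
print(report["finding"])
```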

AWS allows you to treat your operations as code, scripting your runbook and playbook activities to reduce the risk of human error. You can use Resource Tags or Resource Groups with your scripts to selectively execute based on criteria you have defined (for example, environment, owner, role, or version).
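The tag-based selection described above can be illustrated with a small filter over a resource inventory. The inventory here is hard-coded and hypothetical; in a real account it might instead be retrieved from a service such as the Resource Groups Tagging API.

```python
from typing import Dict, List


def select_by_tags(resources: List[Dict], **criteria: str) -> List[str]:
    """Return the IDs of resources whose tags match every given key/value pair."""
    return [
        r["id"]
        for r in resources
        if all(r.get("tags", {}).get(key) == value for key, value in criteria.items())
    ]


# Hypothetical inventory, for illustration only.
INVENTORY = [
    {"id": "i-0aaa", "tags": {"environment": "prod", "role": "web", "owner": "team-a"}},
    {"id": "i-0bbb", "tags": {"environment": "test", "role": "web", "owner": "team-a"}},
    {"id": "i-0ccc", "tags": {"environment": "prod", "role": "db", "owner": "team-b"}},
]

# Only act on production web servers.
targets = select_by_tags(INVENTORY, environment="prod", role="web")
print(targets)
```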

You can use scripted procedures to enable automation by triggering the scripts in response to events. By treating both your operations and workloads as code, you can also script and automate the evaluation of your environments.

You should script procedures on your instances using AWS Systems Manager (SSM) Run Command, use AWS Systems Manager Automation to script actions and create workflows on instances and other resources, or use AWS Lambda serverless compute functions to script responses to events across AWS service APIs and your own custom interfaces. You can also use AWS Step Functions to coordinate multiple AWS services scripted into serverless workflows. Automate your responses by triggering these scripts using Amazon CloudWatch Events, and route desired events to additional operations support systems using Amazon EventBridge.
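As one possible shape for an event-triggered response, the sketch below shows a Lambda-style handler that routes an incoming event to a scripted response based on its `source` and `detail-type` fields. The registry, the decorator, and the response itself are hypothetical, and the response only returns a description of the action rather than calling any AWS API.

```python
from typing import Callable, Dict, Tuple

# Registry mapping (source, detail-type) to a response function.
HANDLERS: Dict[Tuple[str, str], Callable[[Dict], str]] = {}


def on(source: str, detail_type: str):
    """Decorator that registers a response for one kind of event."""
    def register(fn: Callable[[Dict], str]) -> Callable[[Dict], str]:
        HANDLERS[(source, detail_type)] = fn
        return fn
    return register


@on("aws.ec2", "EC2 Instance State-change Notification")
def instance_state_changed(detail: Dict) -> str:
    # Hypothetical response: describe the action instead of performing it.
    if detail["state"] == "stopped":
        return f"start replacement for {detail['instance-id']}"
    return "no action"


def handler(event: Dict, context=None) -> str:
    """Entry point in the style of a Lambda handler; ignores unregistered events."""
    fn = HANDLERS.get((event.get("source"), event.get("detail-type")))
    if fn is None:
        return "ignored"
    return fn(event["detail"])


# Trimmed sample event for local experimentation.
sample = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},
}
print(handler(sample))
```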

You should test your procedures, failure scenarios, and the success of your responses (for example, by holding game days and testing prior to going live) to identify areas you need to plan to address.

On AWS, you can create temporary parallel environments to lower the risk, effort, and cost of experimentation and testing. Automate the deployment of these environments using AWS CloudFormation to ensure consistent implementations of your temporary environments. Perform failure injection testing in safe environments where there will be acceptable or no customer impact, and develop or revise appropriate responses.
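One way to keep temporary environments consistent is to generate their AWS CloudFormation templates from a single parameterized function. The sketch below builds a deliberately small template containing one Amazon SQS queue; the resource choice, the naming scheme, and the `ttl-hours` tag are assumptions made for illustration. The resulting JSON could then be passed to CloudFormation by whatever deployment tooling you already use.

```python
import json


def temp_environment_template(env_name: str, ttl_hours: int) -> str:
    """Build a small CloudFormation template (as JSON) for a throwaway environment."""
    template = {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": f"Temporary test environment {env_name}",
        "Resources": {
            "WorkQueue": {
                "Type": "AWS::SQS::Queue",
                "Properties": {
                    "QueueName": f"{env_name}-work",
                    # Tags let cleanup scripts find and remove expired environments.
                    "Tags": [
                        {"Key": "environment", "Value": env_name},
                        {"Key": "ttl-hours", "Value": str(ttl_hours)},
                    ],
                },
            }
        },
        "Outputs": {"QueueUrl": {"Value": {"Ref": "WorkQueue"}}},
    }
    return json.dumps(template, indent=2)


# Hypothetical environment name for a game day.
print(temp_environment_template("gameday-42", ttl_hours=8))
```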

Make informed decisions to deploy systems and changes: Evaluate the capabilities of the team to support the workload and the workload's compliance with governance. Evaluate these against the benefits of deployment when determining whether to transition a system or change into production. Understand the benefits and risks to make informed decisions.

Use “pre-mortems” to anticipate failure and create procedures where appropriate. When you make changes to the checklists you use to evaluate your workloads, plan what you will do with live systems that no longer comply.