Change Management - AWS Well-Architected Framework

Change Management

REL 6  How do you monitor workload resources?

Logs and metrics are powerful tools to gain insight into the health of your workload. You can configure your workload to monitor logs and metrics and send notifications when thresholds are crossed or significant events occur. Monitoring enables your workload to recognize when low-performance thresholds are crossed or failures occur, so it can recover automatically in response.

Best Practices:

  • Monitor all components for the workload (Generation): Monitor the components of the workload with Amazon CloudWatch or third-party tools. Monitor AWS services with Personal Health Dashboard

  • Define and calculate metrics (Aggregation): Store log data and apply filters where necessary to calculate metrics, such as counts of a specific log event, or latency calculated from log event timestamps

  • Send notifications (Real-time processing and alarming): Organizations that need to know, receive notifications when significant events occur

  • Automate responses (Real-time processing and alarming): Use automation to take action when an event is detected, for example, to replace failed components

  • Storage and Analytics: Collect log files and metrics histories and analyze these for broader trends and workload insights

  • Conduct reviews regularly: Frequently review how workload monitoring is implemented and update it based on significant events and changes

  • Monitor end-to-end tracing of requests through your system: Use AWS X-Ray or third-party tools so that developers can more easily analyze and debug distributed systems to understand how their applications and its underlying services are performing

REL 7  How do you design your workload to adapt to changes in demand?

A scalable workload provides elasticity to add or remove resources automatically so that they closely match the current demand at any given point in time.

Best Practices:

  • Use automation when obtaining or scaling resources: When replacing impaired resources or scaling your workload, automate the process by using managed AWS services, such as Amazon S3 and AWS Auto Scaling. You can also use third-party tools and AWS SDKs to automate scaling.

  • Obtain resources upon detection of impairment to a workload: Scale resources reactively when necessary if availability is impacted, to restore workload availability.

  • Obtain resources upon detection that more resources are needed for a workload: Scale resources proactively to meet demand and avoid availability impact.

  • Load test your workload: Adopt a load testing methodology to measure if scaling activity meets workload requirements.

REL 8  How do you implement change?

Controlled changes are necessary to deploy new functionality, and to ensure that the workloads and the operating environment are running known software and can be patched or replaced in a predictable manner. If these changes are uncontrolled, then it makes it difficult to predict the effect of these changes, or to address issues that arise because of them.

Best Practices:

  • Use runbooks for standard activities such as deployment: Runbooks are the predefined steps used to achieve specific outcomes. Use runbooks to perform standard activities, whether done manually or automatically. Examples include deploying a workload, patching it, or making DNS modifications.

  • Integrate functional testing as part of your deployment: Functional tests are run as part of automated deployment. If success criteria are not met, the pipeline is halted or rolled back.

  • Integrate resiliency testing as part of your deployment: Resiliency tests (as part of chaos engineering) are run as part of the automated deployment pipeline in a pre-prod environment.

  • Deploy using immutable infrastructure: This is a model that mandates that no updates, security patches, or configuration changes happen in-place on production workloads. When a change is needed, the architecture is built onto new infrastructure and deployed into production.

  • Deploy changes with automation: Deployments and patching are automated to eliminate negative impact.