Change management in the cloud - Change Management in the Cloud

Change management in the cloud

It's key to remember that all changes should be delivering business value and that change management should be focused on optimizing business risk in a way that maximizes productivity while minimizing wasted effort or cost. The AWS Cloud enables automation that optimizes this business risk by:

  • Minimizing the possibility of human error,

  • Enabling the creation of identical environments for predictable and testable outcomes to changes,

  • Removing the requirement to submit changes to scale infrastructure to meet business demand, and

  • Automatically recovering from failure and rolling back failed changes.

The benefits of automation can dramatically reduce the business risk associated with change and increase business agility, ultimately delivering more business value (which is what change is about).

The key concepts of change management remain the same in the AWS Cloud. Change delivers business value and it should be efficient. Agile methodologies and the automation capabilities of the AWS Cloud go hand in hand with the core principles of change management as they are also designed to deliver business value quickly and efficiently. There are some key areas that may require existing change processes to be modified to adapt to new methods of delivering change.

AWS Systems Manager Change Manager is an enterprise change management framework for requesting, approving, implementing, and reporting on operational changes to your application configuration and infrastructure. From a single delegated administrator account, if you use AWS Organizations, you can manage changes across multiple AWS accounts and across AWS Regions. Alternatively, using a local account, you can manage changes for a single AWS account. Use Change Manager for managing changes to both AWS resources and on-premises resources.

Configuration items in the cloud

For example, if an application suffers a fault in a traditional IT environment where application updates and operating system patches are installed or deployed on a server, an engineer may be tasked to investigate and either apply a fix or deploy a new server. Either of these tasks would at least require an emergency change, and could put the business at risk for a significant amount of time. In the AWS Cloud, you can use Auto Scaling groups to automate this process. Failures can be automatically detected using predefined health checks, and servers can be automatically replaced with exactly the same configuration. This simple scenario shows the clear benefits of automation: human error is eliminated, configuration drift is eliminated, and business risk is minimized as the time to recover is dramatically reduced.

Auto Scaling groups can also be used to automatically provision additional resources to meet business demand. Again, in a traditional environment, the addition of servers may have required several business processes before approaching change management, and then a standard or normal change to implement the requirement. In the best case scenario, a lot of work was done to increase capacity; in the worst case scenario, the business was impacted and put at risk by all of the business processes required to introduce additional capacity, and it may not have been possible to meet the business demand in the timescales require.

With this example in mind, any manual approval steps required to recover from failure or scale capacity to meet business demand inherently introduce risk to the business. A change is considered to be the addition, modification, replacement, or removal of a configuration item. When approval is required to make a change to a configuration item, existing change management processes may forbid these automated scenarios. This scenario is where it may help to redefine what items are considered to be configuration items.

In the previous example, it is not the servers themselves that are the configuration items when they are in an Auto Scaling group, because they are transient and potentially non-configurable items. The Auto Scaling group and the image that is used to create the servers should be considered as the configuration items because they are the items that may put the business at risk if they are configured incorrectly.

To manage configuration items In the AWS Cloud, AWS Config can be used to assess, audit and evaluate the configuration of AWS resources allowing you to continuously monitor and record AWS resource configurations. With AWS Config, you can track the relationships among resources and review resource dependencies prior to making changes. Once a change occurs, you can quickly review the history of the resource's configuration and determine what the resource’s configuration looked like at any point in the past. AWS Config provides you with information to assess how a change to a resource configuration would affect your other resources, which minimizes the impact of change-related incidents.

AWS CloudFormation change sets enable you to preview how proposed changes to a stack might affect your running resources; for example, to check whether your changes will delete or replace any critical resources. AWS CloudFormation makes the changes to your stack only after you decide to deploy the Change Set.

Automation

Another key consideration is understanding the business risk when deploying in the AWS Cloud. Regardless of whether or not a deployment is an application, a patch, or a configuration change, an optimized cloud configuration can automate the deployment process through an unchanged pipeline. This ensures repeatability and consistency across multiple environments, as well as enabling automation of software testing, compliance testing, security testing, and functional testing.

Although this does not guarantee against a change having an adverse impact, it does allow risks to be reduced, and those automated processes should not be reconsidered for every change. It is the actual configuration change itself that should have the focus.

For example, if an automated security test is approved for deployment purposes, the security review during the change approval process can be dramatically reduced or even eliminated entirely in the appropriate circumstances. Repeatability and consistency throughout the lifecycle of a workload and its deployment should reduce the burden on the examination of changes by the Change Approval Board. The focus should be on how changes are delivered (the pipeline) and the automation of tests that can reduce manual testing and scrutiny by the Board, both of which are prone to human error.

AWS CodePipeline automates your software release process, enabling you to rapidly release new features to your users. With CodePipeline, you can quickly iterate on feedback and get new features to your users faster. Automating your build, test, and release process enables you to quickly and easily test each code change, and catch bugs while they are small and simple to fix. You can ensure the quality of your application or infrastructure code by running each change through your staging and release process.

EC2 Image Builder significantly reduces the effort of keeping images up-to-date and secure by providing a simple graphical interface, built-in automation, and AWS-provided security settings. With Image Builder, there are no manual steps for updating an image, nor do you have to build your own automation pipeline. Creating a golden image using EC2 Image Builder significantly reduces the risk of non-compliant images being used, as well as improving security, consistency and compliance.

Using AWS Systems Manager, you can automate operational tasks to help make your teams more efficient. With automated approval workflows and runbooks with rich text descriptions, you can reduce human error and simplify maintenance and deployment tasks on AWS resources. You can use predefined automation runbooks, or build your own to share for common operational tasks such as stopping and restarting an EC2 instance. Systems Manager also has built-in safety controls, enabling you to incrementally roll out new changes and automatically halt the roll-out if errors occur.

Remediation

Changes should not be approved without considering the consequences of a failure. “Ideally, there will be a back-out plan, which will restore the organization to its initial situation” (ITIL Service Transition, AXELOS, 2011, page 48).

The AWS Cloud enables back-out plans to be fully automated using repeatable processes. Not all changes are reversible and “remediation may require a revisiting of the change itself in the event of failure” (ITIL Service Transition, AXELOS, 2011, page 48).

Deployments in the AWS Cloud that use an automated pipeline allow changes to be redeployed quickly and safely, minimizing risk and reducing business impact. In certain scenarios, it may not be possible to back-out changes or redeploy, in which case it might be that “it requires invoking the organization’s business continuity plan.” (ITIL Service Transition, AXELOS, 2011, page 48)

Even in the most severe cases, using continuous data protection in the cloud can enable sub-second recovery point objectives (RPOs), and recovery time objectives (RTOs) can be measured in minutes. Refer to CloudEndure Disaster Recovery for more information. Crucially, where it’s not possible to back out changes, the AWS Cloud provides methods to significantly reduce business risk and impact of a failed change by making it quicker and easier to redeploy or invoke disaster recovery plans.

Modern deployment methods in the cloud allow for fast or instant rollback. For example, with blue/green deployments, you can make a change to a workload by deploying an identical copy (green) of the live environment (blue) with the configuration change. Users can then be switched to the new environment (green) while the old live environment (blue) remains available, but idle.

In this scenario, if a failure is discovered, users can be instantly redirected back to the blue environment, and the business impact is greatly reduced. It is also possible to combine this approach with a canary release that is easily enabled in the cloud. With this approach, you can redirect a subset of users to the new deployment, assess its efficacy and gradually increase the number of users on the new deployment until all users are using the new deployment.

There are other considerations when choosing a method of deployment, but the key for change management is the risk to the business of a change deployed in a manner like this is greatly reduced.

AWS CodeDeploy helps maximize your application availability during the software deployment process. It introduces changes incrementally and tracks application health according to configurable rules. Software deployments can easily be stopped and rolled back if there are errors.

AWS CloudFormation rollback triggers monitor the state of your application during stack creation and updating, and enable you to roll back that operation if the application breaches the threshold of any of the alarms you've specified. For each rollback trigger you create, you specify the CloudWatch alarm that AWS CloudFormation should monitor. AWS CloudFormation monitors the specified alarms during the stack create or update operation, and for the specified amount of time after all resources have been deployed. If any of the alarms goes to ALARM state during the stack operation or the monitoring period, AWS CloudFormation rolls back the entire stack operation.

AWS AppConfig supports best practices by rolling out configuration changes instantly or gradually. The configuration change is monitored over a time period that customers define. If you configure alarms in Amazon CloudWatch, AWS AppConfig can automatically rollback configuration changes in the event that those alarms are triggered.

Adapting change management to the cloud

There are two areas in which change process may need to be adapted. Because the risk and impact to the business of a failed change is greatly reduced, changes can be made more frequently and with more confidence in the rollback plan. As a result, the second area for consideration is the acceptance of rolling back changes.

If failed changes have a much lower impact due to the speed and consistency of roll back, activating roll backs should be considered to be part of the normal process. This is particularly true if it is possible to quickly remediate the issue and push it through the same automated pipelines to quickly deliver the original intended business value of the change.

With these considerations in mind, if automation, pipelines, and deployment methods are in place, it may be possible to reconsider the approach to standard changes. A standard change is where there is a defined trigger to initiate the change request. In addition, in a standard change, actions are well known, documented, and proven; authority is given in advance (or pre-authorized); and the risk is usually low. If the appropriate automation, testing, and deployment strategies are put in place, it should result in a scenario where large, infrequent, and risky changes are transformed in to small, frequent, low risk changes.

By understanding the risk-reduction strategies that are enabled by the AWS Cloud, it should be possible and it may even be necessary to widen the scope of a standard change to include deployments that would have previously been considered as normal changes due to the risks associated with them in traditional IT environments.

As changes become more frequent due to agile methodologies and increased automation, there is a risk that change management becomes overburdened with normal changes which can lead to delaying changes due to bandwidth constraints or, important details are missed as changes are not properly scrutinized due to resource constraints. Both of these scenarios introduce business risk which change management aims to optimize. In an environment of small, frequent changes, standard changes should become the new normal so proper scrutiny can be given to normal changes, optimizing business risk and enabling the delivery of business value.

A reduction in the size of a change reduces the risk of disruption. Smaller changes also mean that change can happen more frequently. By changing more frequently, the organizations capability of changing is improved. Increased capability of change leads, in turn, to lower risk of disruption (ITIL 4: High-Velocity IT, AXELOS, 2020, page 89).

You can use deployment tools such as AWS CodePipeline, a continuous integration and continuous delivery service which define release process workflow. CodePipeline can pull source code directly from a repository, run builds and unit tests in AWS CodeBuild and then deploy changes into staging or production environment using AWS CodeDeploy if build and test cases are successful. CodePipeline automates the entire release process which reduces manual errors, increasing agility and business value allowing you to optimize business risk.