OPS10-BP01 Use a process for event, incident, and problem management - Operational Excellence Pillar

Your organization has processes to handle events, incidents, and problems. Events are things that occur in your workload but may not need intervention. Incidents are events that require intervention. Problems are recurring events that require intervention or cannot be resolved. You need processes to mitigate the impact of these events on your business and make sure that you respond appropriately.

When incidents and problems happen to your workload, you need processes to handle them. How will you communicate the status of the event with stakeholders? Who oversees leading the response? What are the tools that you use to mitigate the event? These are examples of some of the questions you need to answer to have a solid response process.

Processes must be documented in a central location and available to anyone involved in your workload. If you don’t have a central wiki or document store, a version control repository can be used. Keep these plans up to date as your processes evolve.

Problems are candidates for automation. These events take time away from your ability to innovate. Start with building a repeatable process to mitigate the problem. Over time, focus on automating the mitigation or fixing the underlying issue. This frees up time to devote to making improvements in your workload.

Desired outcome: Your organization has a process to handle events, incidents, and problems. These processes are documented and stored in a central location. They are updated as processes change.

Common anti-patterns:

  • An incident happens on the weekend and the on-call engineer doesn’t know what to do.

  • A customer sends you an email that the application is down. You reboot the server to fix it. This happens frequently.

  • There is an incident with multiple teams working independently to try to solve it.

  • Deployments happen in your workload without being recorded.

Benefits of establishing this best practice:

  • You have an audit trail of events in your workload.

  • Your time to recover from an incident is decreased.

  • Team members can resolve incidents and problems in a consistent manner.

  • There is a more consolidated effort when investigating an incident.

Level of risk exposed if this best practice is not established: High

Implementation guidance

Implementing this best practice means you are tracking workload events. You have processes to handle incidents and problems. The processes are documented, shared, and updated frequently. Problems are identified, prioritized, and fixed.

Customer example

AnyCompany Retail has a portion of their internal wiki devoted to processes for event, incident, and problem management. All events are sent to Amazon EventBridge. Problems are identified as OpsItems in AWS Systems Manager OpsCenter and prioritized to fix, reducing undifferentiated labor. As processes change, they’re updated in their internal wiki. They use AWS Systems Manager Incident Manager to manage incidents and coordinate mitigation efforts.
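
To illustrate how a tracked event such as a completed deployment might be sent to Amazon EventBridge, here is a minimal sketch using the boto3 `put_events` call. The source name `anycompany.retail.workload`, the detail type, the detail fields, and the use of the default event bus are all assumptions made for illustration; this is not AnyCompany Retail's actual implementation.

```python
import json
from datetime import datetime, timezone


def build_workload_event(detail_type, detail,
                         source="anycompany.retail.workload",  # hypothetical source name
                         bus_name="default"):
    """Build one entry in the shape expected by EventBridge PutEvents."""
    return {
        "Time": datetime.now(timezone.utc),
        "Source": source,
        "DetailType": detail_type,
        "Detail": json.dumps(detail),  # Detail must be a JSON string
        "EventBusName": bus_name,
    }


def publish_workload_event(entry, client=None):
    """Send the entry to EventBridge and return the assigned event ID."""
    if client is None:
        import boto3  # imported lazily so the builder can be used without AWS access
        client = boto3.client("events")
    response = client.put_events(Entries=[entry])
    if response["FailedEntryCount"] > 0:
        raise RuntimeError(f"EventBridge rejected the event: {response['Entries']}")
    return response["Entries"][0]["EventId"]


if __name__ == "__main__":
    # Hypothetical event: a deployment finished without human intervention.
    event = build_workload_event(
        "Deployment Completed",
        {"service": "checkout", "version": "1.4.2", "status": "SUCCEEDED"},
    )
    print(publish_workload_event(event))
```

Separating the entry builder from the publish call keeps the event format easy to review with stakeholders and lets you test it without AWS credentials.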

Implementation steps

  1. Events

    • Track events that happen in your workload, even if no human intervention is required.

    • Work with workload stakeholders to develop a list of events that should be tracked. Some examples are completed deployments or successful patching.

    • You can use services like Amazon EventBridge or Amazon Simple Notification Service to generate custom events for tracking.

  2. Incidents

    • Start by defining the communication plan for incidents. What stakeholders must be informed? How will you keep them in the loop? Who oversees coordinating efforts? We recommend standing up an internal chat channel for communication and coordination.

    • Define escalation paths for the teams that support your workload, especially if the team doesn’t have an on-call rotation. Based on your support level, you can also file a case with AWS Support.

    • Create a playbook to investigate the incident. This should include the communication plan and detailed investigation steps. Include checking the AWS Health Dashboard in your investigation.

    • Document your incident response plan. Communicate the incident management plan so internal and external customers understand the rules of engagement and what is expected of them. Train your team members on how to use it.

    • You can use AWS Systems Manager Incident Manager to set up and manage your incident response plan.

    • Enterprise Support customers can request the Incident Management Workshop from their Technical Account Manager. This guided workshop tests your existing incident response plan and helps you identify areas for improvement.

  3. Problems

    • Problems must be identified and tracked in your ITSM system.

    • Identify all known problems and prioritize them by effort to fix and impact to workload.

      
      Figure: Action priority matrix for prioritizing problems.

    • Solve problems that are high impact and low effort first. Once those are solved, move on to problems that fall into the low-impact, low-effort quadrant.

    • You can use Systems Manager OpsCenter to identify these problems, attach runbooks to them, and track them.
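
The prioritization described in the Problems step can be sketched as a small sorting routine. This is a hedged illustration, not a prescribed method: the `Problem` class, the 1 to 5 scoring scale, and the threshold of 3 are hypothetical. The order of the first two quadrants follows the guidance above; the order of the two high-effort quadrants is an assumption.

```python
from dataclasses import dataclass

# (impact, effort) quadrants in the order they are worked on.
# The first two follow the guidance above; the last two are assumed.
QUADRANT_ORDER = [
    ("high", "low"),   # high impact, low effort: solve first
    ("low", "low"),    # low impact, low effort: solve next
    ("high", "high"),  # assumption: larger projects come third
    ("low", "high"),   # assumption: lowest priority
]


@dataclass
class Problem:
    title: str
    impact: int  # 1 (minor) to 5 (severe), hypothetical scale
    effort: int  # 1 (trivial) to 5 (large), hypothetical scale


def quadrant(problem, threshold=3):
    """Place a problem in a quadrant of the action priority matrix."""
    impact = "high" if problem.impact >= threshold else "low"
    effort = "high" if problem.effort >= threshold else "low"
    return (impact, effort)


def prioritize(problems):
    """Sort by quadrant, then by higher impact, then by lower effort."""
    return sorted(
        problems,
        key=lambda p: (QUADRANT_ORDER.index(quadrant(p)), -p.impact, p.effort),
    )


if __name__ == "__main__":
    backlog = [
        Problem("Rename inconsistent log fields", impact=1, effort=4),
        Problem("Server needs a reboot every week", impact=5, effort=2),
        Problem("Migrate legacy database", impact=4, effort=5),
        Problem("Noisy disk usage alarm", impact=2, effort=1),
    ]
    for problem in prioritize(backlog):
        print(quadrant(problem), problem.title)
```

A ranking like this could inform the order in which tracked problems are picked up, whichever ITSM system or tracking tool holds them.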

Level of effort for the implementation plan: Medium. You need both a process and tools to implement this best practice. Document your processes and make them available to anyone associated with the workload. Update them frequently. You have a process for managing problems and mitigating them or fixing them.
