OPS10-BP01 Use a process for event, incident, and problem management - AWS Well-Architected Framework

The ability to efficiently manage events, incidents, and problems is key to maintaining workload health and performance. It's crucial to recognize and understand the differences between these elements to develop an effective response and resolution strategy. Establishing and following a well-defined process for each aspect helps your team swiftly and effectively handle any operational challenges that arise.

Desired outcome: Your organization effectively manages operational events, incidents, and problems through well-documented and centrally stored processes. These processes are updated regularly to reflect changes, which streamlines the handling of events, incidents, and problems and helps maintain high service reliability and workload performance.

Common anti-patterns:

  • You reactively, rather than proactively, respond to events.

  • Inconsistent approaches are taken to different types of events or incidents.

  • Your organization does not analyze and learn from incidents to prevent future occurrences.

Benefits of establishing this best practice:

  • Streamlined and standardized response processes.

  • Reduced impact of incidents on services and customers.

  • Expedited issue resolution.

  • Continuous improvement in operational processes.

Level of risk exposed if this best practice is not established: High

Implementation guidance

Implementing this best practice means you are tracking workload events. You have processes to handle incidents and problems. The processes are documented, shared, and updated frequently. Problems are identified, prioritized, and fixed.

Understanding events, incidents, and problems

  • Events: An event is an observation of an action, occurrence, or change of state. Events can be planned or unplanned and they can originate internally or externally to the workload.

  • Incidents: Incidents are events that require a response, like unplanned interruptions or degradations of service quality. They represent disruptions that need immediate attention to restore normal workload operation.

  • Problems: Problems are the underlying causes of one or more incidents. Identifying and resolving problems involves digging deeper into the incidents to prevent future occurrences.
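The relationship between the three terms can be sketched as a small data model. This is only an illustration of the definitions above: the class names, fields, and the `is_recurring` helper are hypothetical and do not correspond to any AWS API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class Event:
    """An observation of an action, occurrence, or change of state."""
    source: str          # where the event originated (internal or external)
    description: str
    planned: bool        # events can be planned or unplanned
    observed_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


@dataclass
class Incident:
    """One or more events that require a response."""
    title: str
    events: List[Event]
    resolved: bool = False


@dataclass
class Problem:
    """The underlying cause of one or more incidents."""
    summary: str
    incidents: List[Incident] = field(default_factory=list)

    def is_recurring(self) -> bool:
        # A problem linked to more than one incident points at a repeated cause.
        return len(self.incidents) > 1


# Example: one unplanned event becomes an incident, which is traced to a problem.
disk_full = Event(source="host-a", description="disk usage at 100%", planned=False)
outage = Incident(title="Checkout API unavailable", events=[disk_full])
log_rotation = Problem(summary="Log rotation not configured", incidents=[outage])
```

In this sketch, an event only becomes part of an incident when someone (or some rule) decides it needs a response, and a problem accumulates incidents over time as their shared cause is recognized.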

Implementation steps

Events

  1. Monitor events:

  2. Create processes:

    • Develop a process to assess which events are significant and require monitoring. This involves setting thresholds and parameters for normal and abnormal activities.

    • Determine criteria for escalating an event to an incident. This could be based on the severity, impact on users, or deviation from expected behavior.

    • Regularly review the event monitoring and response processes. This includes analyzing past incidents, adjusting thresholds, and refining alerting mechanisms.
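As a minimal sketch of the threshold and escalation criteria described above, the following classifies a metric reading as normal, abnormal, or an incident. The metric names, threshold values, and the `Threshold` and `classify` names are assumptions chosen for illustration; real values depend on your workload.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Threshold:
    abnormal_above: float   # outside normal activity, worth watching
    incident_above: float   # severe enough to require a response


# Hypothetical thresholds for metrics the team decided are significant.
THRESHOLDS: Dict[str, Threshold] = {
    "error_rate_percent": Threshold(abnormal_above=1.0, incident_above=5.0),
    "p99_latency_ms": Threshold(abnormal_above=500.0, incident_above=2000.0),
}


def classify(metric: str, value: float, thresholds: Dict[str, Threshold]) -> str:
    """Return 'untracked', 'normal', 'abnormal', or 'incident' for one reading."""
    rule = thresholds.get(metric)
    if rule is None:
        return "untracked"      # not a metric selected for monitoring
    if value > rule.incident_above:
        return "incident"       # escalation criterion met
    if value > rule.abnormal_above:
        return "abnormal"
    return "normal"


readings = [("error_rate_percent", 0.5), ("error_rate_percent", 7.5), ("cpu_steal", 3.0)]
results = [classify(name, value, THRESHOLDS) for name, value in readings]
```

Keeping the thresholds in one data structure makes the regular review step concrete: adjusting a threshold after an incident review is a single reviewed change rather than an edit scattered across alarms.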

Incidents

  1. Respond to incidents:

    • Use insights from observability tools to quickly identify and respond to incidents.

    • Implement AWS Systems Manager OpsCenter to aggregate, organize, and prioritize operational items and incidents.

    • Use services like Amazon CloudWatch and AWS X-Ray for deeper analysis and troubleshooting.

    • Consider AWS Managed Services (AMS) for enhanced incident management, leveraging its proactive, preventative, and detective capabilities. AMS extends operational support with services like monitoring, incident detection and response, and security management.

    • Enterprise Support customers can use AWS Incident Detection and Response, which provides continual proactive monitoring and incident management for production workloads.

  2. Create an incident management process:

    • Establish a structured incident management process, including clear roles, communication protocols, and steps for resolution.

    • Integrate incident management with tools like AWS Chatbot for efficient response and coordination.

    • Categorize incidents by severity, with predefined incident response plans for each category.

  3. Learn and improve:

    • Conduct post-incident analysis to understand root causes and resolution effectiveness.

    • Continually update and improve response plans based on reviews and evolving practices.

    • Document and share lessons learned across teams to enhance operational resilience.

    • Enterprise Support customers can request the Incident Management Workshop from their Technical Account Manager. This guided workshop tests your existing incident response plan and helps you identify areas for improvement.
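Categorizing incidents by severity with a predefined response plan per category could look like the following sketch. The severity levels, the categorization rule, and every field of `ResponsePlan` are hypothetical examples, not values prescribed by AWS.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Dict, Tuple


class Severity(IntEnum):
    SEV1 = 1   # critical: service down or most users affected
    SEV2 = 2   # major: noticeable degradation
    SEV3 = 3   # minor: limited impact


@dataclass(frozen=True)
class ResponsePlan:
    owner_role: str                   # who leads the response
    notify: Tuple[str, ...]           # who is informed
    acknowledge_within_minutes: int
    post_incident_review: bool


# One predefined plan per severity category (illustrative values).
RESPONSE_PLANS: Dict[Severity, ResponsePlan] = {
    Severity.SEV1: ResponsePlan("incident commander", ("on-call", "leadership"), 5, True),
    Severity.SEV2: ResponsePlan("on-call engineer", ("on-call",), 15, True),
    Severity.SEV3: ResponsePlan("owning team", ("team-channel",), 240, False),
}


def categorize(users_affected_percent: float, service_down: bool) -> Severity:
    """Map impact to a severity using simple, assumed cut-offs."""
    if service_down or users_affected_percent >= 50:
        return Severity.SEV1
    if users_affected_percent >= 5:
        return Severity.SEV2
    return Severity.SEV3


def plan_for(users_affected_percent: float, service_down: bool) -> ResponsePlan:
    return RESPONSE_PLANS[categorize(users_affected_percent, service_down)]


plan = plan_for(users_affected_percent=12.0, service_down=False)
```

Deciding roles, notification targets, and acknowledgement times ahead of an incident means responders look up the plan rather than negotiate it under pressure.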

Problems

  1. Identify problems:

    • Use data from previous incidents to identify recurring patterns that may indicate deeper systemic issues.

    • Leverage tools like AWS CloudTrail and Amazon CloudWatch to analyze trends and uncover underlying problems.

    • Engage cross-functional teams, including operations, development, and business units, to gain diverse perspectives on the root causes.

  2. Create a problem management process:

    • Develop a structured process for problem management, focusing on long-term solutions rather than quick fixes.

    • Incorporate root cause analysis (RCA) techniques to investigate and understand the underlying causes of incidents.

    • Update operational policies, procedures, and infrastructure based on findings to prevent recurrence.

  3. Continue to improve:

    • Foster a culture of constant learning and improvement, encouraging teams to proactively identify and address potential problems.

    • Regularly review and revise problem management processes and tools to align with evolving business and technology landscapes.

    • Share insights and best practices across the organization to build a more resilient and efficient operational environment.

  4. Engage AWS Support:

    • Use AWS Support resources, such as AWS Trusted Advisor, for proactive guidance and optimization recommendations.

    • Enterprise Support customers can access specialized programs like AWS Countdown for support during critical events.
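The "Identify problems" step above relies on spotting recurring patterns in past incidents. A minimal sketch, assuming incident records are available as dictionaries with hypothetical `service` and `cause` fields (for example, exported from a post-incident review log):

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def recurring_causes(
    incidents: Iterable[Dict[str, str]], min_occurrences: int = 2
) -> List[Tuple[Tuple[str, str], int]]:
    """Return (service, cause) pairs seen at least min_occurrences times,
    most frequent first."""
    counts = Counter((i["service"], i["cause"]) for i in incidents)
    return [(key, n) for key, n in counts.most_common() if n >= min_occurrences]


# Illustrative incident history; the records are made up for the example.
history = [
    {"service": "checkout", "cause": "expired certificate"},
    {"service": "search", "cause": "disk full"},
    {"service": "checkout", "cause": "expired certificate"},
    {"service": "search", "cause": "bad deploy"},
    {"service": "checkout", "cause": "expired certificate"},
    {"service": "search", "cause": "disk full"},
]

candidates = recurring_causes(history)
```

Each pair in `candidates` is a candidate problem record: a cause that has produced several incidents and is worth a root cause analysis and a long-term fix rather than another quick repair.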

Level of effort for the implementation plan: Medium
