Incident management with Incident Detection and Response

AWS Incident Detection and Response provides 24x7 proactive monitoring and incident management, delivered by a designated team of incident managers. The following diagram outlines the standard incident management process when an application alarm triggers an incident, including alarm generation, AWS Incident Manager engagement, incident resolution, and post-incident review.

[Diagram: standard flow of the incident management process]

Alarm generation: Alarms triggered on your workloads are sent through Amazon EventBridge to AWS Incident Detection and Response. AWS Incident Detection and Response automatically pulls up the runbook associated with your alarm and notifies an incident manager. If a critical incident occurs on your workload that is not detected by alarms monitored by AWS Incident Detection and Response, you can create a support case to request an Incident Response. For more information about requesting an Incident Response, see Request an Incident Response.
AWS Incident Manager engagement: The incident manager responds to the alarm and engages you on a conference call or as otherwise specified in the runbook. The incident manager verifies the health of the AWS services to determine whether the alarm is related to issues with AWS services used by the workload, and advises you on the status of the underlying services. If required, the incident manager then creates a case on your behalf and engages the right AWS experts for support.

Because AWS Incident Detection and Response monitors AWS services specifically for your applications, it might determine that the incident is related to an AWS service issue even before an AWS service event is declared. In this scenario, the incident manager advises you on the status of the AWS service, triggers the AWS Service Event Incident Management flow, and follows up with the service team on resolution. The information provided gives you the opportunity to implement your recovery plans or workarounds early to mitigate the impact of the AWS service event. For more information, see Incident management for service events.
Incident resolution: The incident manager coordinates the incident across the required AWS teams and makes sure that you remain engaged with the right AWS experts until the incident is mitigated or resolved.
Post-incident review (if requested): After an incident, AWS Incident Detection and Response can perform a post-incident review at your request and generate a post-incident report. The post-incident report includes a description of the issue, the impact, the teams that were engaged, and the workarounds or actions taken to mitigate or resolve the incident. The report might contain information that can be used to reduce the likelihood of the incident recurring, or to improve the management of a future occurrence of a similar incident. The post-incident report is not a Root Cause Analysis (RCA). You can request an RCA in addition to the post-incident report. An example of a post-incident report is provided in the following section.
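
To make the alarm generation step more concrete, the following is a minimal Python sketch of how a CloudWatch alarm state change event, as delivered through Amazon EventBridge, could be checked against a set of monitored alarms. It is an illustration only: the function name `should_engage`, the `MONITORED_ALARM_ARNS` set, and the sample event are assumptions made for this example, and the actual routing used by AWS Incident Detection and Response is not described in this section.

```python
# Hypothetical set of alarm ARNs onboarded for monitoring (assumption for this sketch).
MONITORED_ALARM_ARNS = {
    "arn:aws:cloudwatch:us-east-1:000000000000:alarm:alarm-prod-workload-impaired-useast1-P95",
}


def should_engage(event, monitored_arns=MONITORED_ALARM_ARNS):
    """Return True if the event is a monitored alarm entering the ALARM state."""
    # CloudWatch alarm events on EventBridge carry this source and detail-type.
    if event.get("source") != "aws.cloudwatch":
        return False
    if event.get("detail-type") != "CloudWatch Alarm State Change":
        return False
    # Only a transition into ALARM is treated as an incident trigger here.
    state = event.get("detail", {}).get("state", {}).get("value")
    if state != "ALARM":
        return False
    # The alarm ARN is listed in the top-level "resources" field.
    return any(arn in monitored_arns for arn in event.get("resources", []))


# Abbreviated sample event for illustration; real events contain more fields.
sample_event = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "resources": [
        "arn:aws:cloudwatch:us-east-1:000000000000:alarm:alarm-prod-workload-impaired-useast1-P95"
    ],
    "detail": {
        "alarmName": "alarm-prod-workload-impaired-useast1-P95",
        "state": {"value": "ALARM"},
        "previousState": {"value": "OK"},
    },
}

print(should_engage(sample_event))
```

An alarm that returns to the OK state, or an alarm that is not in the monitored set, would not match in this sketch.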

Important

The following report template is an example only.


Post Incident Report Template
Post Incident Report - 0000000123
Customer: Example Customer
AWS Support case ID(s): 0000000000
Customer internal case ID (if provided): 1234567890
Incident start: 2023-02-04T03:25:00 UTC
Incident resolved: 2023-02-04T04:27:00 UTC
Total incident time: 1:02:00 (h:mm:ss)
Source Alarm ARN: arn:aws:cloudwatch:us-east-1:000000000000:alarm:alarm-prod-workload-impaired-useast1-P95 

Problem Statement:
Outlines impact to end users and operational infrastructure impact.
 Starting at 2023-02-04T03:25:00 UTC, the customer experienced a large-scale outage of their workload that lasted one hour and two minutes and spanned all Availability Zones where the application is deployed. During impact, end users were unable to connect to the workload's Application Load Balancers (ALBs), which service inbound communications to the application.

Incident Summary:

Summary of the incident in chronological order and steps taken by AWS Incident Managers to direct the incident to a path to mitigation.
  At 2023-02-04T03:25:00 UTC, the workload impairments alarm triggered a critical incident for the workload. AWS Incident Detection and Response Managers responded to the alarm, checking AWS service health and following the steps outlined in the workload's runbook.
  At 2023-02-04T03:28:00 UTC, the alarm had not recovered. Per the runbook, the Incident Management team sent the engagement email to the customer's Site Reliability Engineering (SRE) team, created a troubleshooting bridge, and opened an AWS Support case on behalf of the customer.
  At 2023-02-04T03:32:00 UTC, the customer's SRE team and Support Engineering joined the bridge. The Incident Manager confirmed there was no ongoing AWS impact to services the workload depends on. The investigation shifted to the specific resources in the customer account.
  At 2023-02-04T03:45:00 UTC, the Cloud Support Engineer discovered that a sudden increase in traffic volume was causing a drop in connections. The customer confirmed this ALB was newly provisioned to handle an increase in workload traffic for an ongoing promotional event.
  At 2023-02-04T03:56:00 UTC, the customer instituted back-off and retry logic. The Incident Manager worked with the Cloud Support Engineer to raise an escalation to a higher support level to quickly scale the ALB per the runbook.
  At 2023-02-04T04:05:00 UTC, the ALB support team initiated scaling activities. The back-off and retry logic yielded mild recovery, but timeouts were still being seen for some clients.
  By 2023-02-04T04:15:00 UTC, scaling activities had completed and metrics and alarms returned to pre-incident levels. Connection timeouts subsided.
  At 2023-02-04T04:27:00 UTC, per the runbook, the call was spun down after 10 minutes of recovery monitoring. Full mitigation was agreed upon between AWS and the customer.

Mitigation:
Describes what was done to mitigate the issue. NOTE: this is not a Root Cause Analysis (RCA).
  Back-off and retries yielded mild recovery. Full mitigation happened after escalation to ALB support team (per runbook) to scale the newly provisioned ALB. 

Follow up action items (if any):
Action items to be reviewed with your Technical Account Manager (TAM), if required.
Review alarm thresholds to engage AWS Incident Detection and Response closer to the time of impact.
Work with AWS Support and the TAM team to ensure newly created ALBs are pre-scaled to accommodate expected spikes in workload traffic.
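
The "Total incident time" field in the example report follows from the incident start and resolved timestamps. The following Python sketch shows one way to derive it, assuming the timestamp format used in the template (`YYYY-MM-DDTHH:MM:SS UTC`). The helper names `parse_report_timestamp` and `incident_duration` are hypothetical and not part of any AWS tooling.

```python
from datetime import datetime, timezone


def parse_report_timestamp(text):
    """Parse a timestamp such as '2023-02-04T03:25:00 UTC' into an aware datetime."""
    parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%S UTC")
    return parsed.replace(tzinfo=timezone.utc)


def incident_duration(start_text, resolved_text):
    """Return the elapsed time between two report timestamps as h:mm:ss."""
    delta = parse_report_timestamp(resolved_text) - parse_report_timestamp(start_text)
    total_seconds = int(delta.total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


# Values taken from the example report above.
print(incident_duration("2023-02-04T03:25:00 UTC", "2023-02-04T04:27:00 UTC"))  # 1:02:00
```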
