Workload onboarding and alarm ingestion questionnaires in Incident Detection and Response
This page provides the questionnaires you need to complete when onboarding a workload to AWS Incident Detection and Response and when configuring alarms to ingest into the service. The workload onboarding questionnaire covers general information about your workload, its architecture details, and contacts for incident response. In the alarm ingestion questionnaire, you specify the critical alarms that should trigger incident creation in Incident Detection and Response for your workload, as well as runbook information on who should be contacted and what actions should be taken. Properly completing these questionnaires is a key step in setting up monitoring and incident response processes for your AWS workloads.
Download the Workload onboarding questionnaire
Download the Alarm ingestion questionnaire
Workload onboarding questionnaire - General questions
Question | Example Response |
---|---|
Enterprise Name | Amazon Inc. |
Name of this workload (include any abbreviations) | Amazon Retail Operations (ARO) |
Primary end user and the function of this workload. | This workload is an e-commerce application that allows end users to purchase various items. This workload is the primary revenue generator for our business. |
Applicable compliance and/or regulatory requirements for this workload and any actions required from AWS after an incident. | The workload deals with patient health records which must be kept secured and confidential. |
Workload onboarding questionnaire - Architecture questions
Question | Example Response |
---|---|
A list of AWS resource tags used to define resources that are part of this workload. AWS uses these tags to identify this workload's resources to expedite support during incidents. Note: Tags are case sensitive. If you provide multiple tags, all resources used by this workload must have the same tags. | appName: Optimax, environment: Production |
A list of AWS Services utilized by this workload and the AWS Account and Regions that they're in. Note: Create a new row for each service. | Route 53: Routes internet traffic to the ALB. Account: 123456789101 Region: US-EAST-1, US-WEST-2 |
A list of AWS Services utilized by this workload and the AWS Account and Regions that they're in. Note: Create a new row for each service. | ALB: Routes incoming traffic to a target group of ECS containers. Account: 123456789101 Region: N/A |
A list of AWS Services utilized by this workload and the AWS Account and Regions that they're in. Note: Create a new row for each service. | ECS: Compute infrastructure for main business logic fleet. Responsible for handling incoming user requests and making queries to persistence layer. Account: 123456789101 Region: US-EAST-1 |
A list of AWS Services utilized by this workload and the AWS Account and Regions that they're in. Note: Create a new row for each service. | RDS: Amazon Aurora cluster stores user data accessed by ECS business logic layer. Account: 123456789101 Region: US-EAST-1 |
A list of AWS Services utilized by this workload and the AWS Account and Regions that they're in. Note: Create a new row for each service. | S3: Stores website static assets. Account: 123456789101 Region: N/A |
Detail any upstream/downstream components not being onboarded that could affect this workload if experiencing an outage. | Authentication Microservice: Will prevent users from loading their health records as they will be unauthenticated. |
Are there any on-premises or non-AWS components for this workload? If so, what are they and what functions do they perform? | All internet-based traffic in/out of AWS is routed via our on-premises proxy service. |
Provide details of any manual or automated failover/disaster recovery plans at the Availability Zone and regional level. | Warm standby. Automated failover to US-WEST-2 during sustained drop in success rate. |
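The tag requirement above (tags are case sensitive, and every resource in the workload must carry all of the listed tags) can be checked before you submit the questionnaire. The following is a minimal sketch in Python, not an AWS tool: the `inventory` dictionary and the resource names in it are hypothetical, and the function name `missing_workload_tags` is an assumption made for illustration. Only the tag keys and values come from the example response.

```python
from typing import Dict, List

# Workload tags from the example response above.
WORKLOAD_TAGS = {"appName": "Optimax", "environment": "Production"}


def missing_workload_tags(resource_tags: Dict[str, str],
                          workload_tags: Dict[str, str]) -> List[str]:
    """Return the workload tag keys that a resource lacks or carries with a different value.

    The comparison is exact, so both keys and values are case sensitive.
    """
    return [
        key for key, value in workload_tags.items()
        if resource_tags.get(key) != value
    ]


# Hypothetical inventory: resource identifier -> tags found on that resource.
inventory = {
    "ecs-service/main-fleet": {"appName": "Optimax", "environment": "Production"},
    "s3-bucket/static-assets": {"appName": "Optimax", "environment": "production"},
    "rds-cluster/user-data": {"appName": "Optimax"},
}

# Keep only the resources that fail the check, with the offending tag keys.
untagged = {}
for name, tags in inventory.items():
    problems = missing_workload_tags(tags, WORKLOAD_TAGS)
    if problems:
        untagged[name] = problems

print(untagged)
```

In this sample data, the S3 bucket fails because `production` differs in case from `Production`, and the RDS cluster fails because it has no `environment` tag at all.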
Workload onboarding questionnaire - AWS Service Event questions
Question | Example Response |
---|---|
Provide the contact details (name/email/phone) of your company's internal major incident/IT crisis management team. | Major Incident Management Team mim@example.com +61 2 3456 7890 |
Provide details of any static incident/crisis management bridge established by your company. If you utilize non-static bridges, then specify your preferred application and AWS will request these details during an incident. Note: If one isn't provided, then AWS will reach out during an incident and provide a Chime bridge for you to join. | Amazon Chime https://chime.aws/1234567890 |
Alarm ingestion questionnaire
Question | Example Response |
---|---|
AWS will engage workload contacts through the AWS Support Case. Who is the primary contact when an alarm triggers for this workload? Specify your preferred conferencing application and AWS will request these details during an incident. Note: If a preferred conferencing application isn't provided, then AWS will reach out during an incident and provide a Chime bridge for you to join. | Application Team app@example.com +61 2 3456 7890 |
If the primary contact is unavailable during an incident, please provide escalation contacts and timeline in the preferred communication order. | 1. After 10 minutes, if no response from Primary Contact, engage: John Smith - Application Supervisor john.smith@example.com +61 2 3456 7890 2. After 10 minutes, if no response from John Smith, contact: Jane Smith - Operations Manager jane.smith@example.com +61 2 3456 7890 |
AWS communicates updates through the support case at regular intervals throughout the incident. Are there additional contacts that should receive these updates? | john.smith@example.com, jane.smith@example.com |
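The escalation response above describes a chain with a wait time at each step. As a way to sanity-check that a chain reads the way you intend, here is a small Python sketch that models it. The `EscalationStep` type and the `contacts_engaged` function are hypothetical names introduced for illustration, and the sketch assumes that "after 10 minutes" means 10 minutes since the previous contact was engaged.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EscalationStep:
    contact: str
    wait_minutes: int  # minutes to wait for a response before moving to the next step


# Chain taken from the example response above.
CHAIN = [
    EscalationStep("Application Team", 10),
    EscalationStep("John Smith - Application Supervisor", 10),
    EscalationStep("Jane Smith - Operations Manager", 0),  # last step, nothing follows
]


def contacts_engaged(chain: List[EscalationStep], minutes_without_response: int) -> List[str]:
    """Return every contact who should have been engaged so far, in order."""
    engaged = []
    engage_at = 0  # cumulative minutes at which the current step is engaged
    for step in chain:
        if minutes_without_response < engage_at:
            break
        engaged.append(step.contact)
        engage_at += step.wait_minutes
    return engaged


for minutes in (0, 10, 20):
    print(minutes, contacts_engaged(CHAIN, minutes))
```

With this chain, the primary contact is engaged immediately, John Smith at the 10-minute mark, and Jane Smith at the 20-minute mark.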
Alarm matrix
Provide the following information to identify the set of alarms that will engage AWS Incident Detection and Response to create incidents on behalf of your workload. Once engineers from AWS Incident Detection and Response have reviewed your alarms, additional onboarding steps will be delivered.
AWS Incident Detection and Response Critical Alarm Criteria:
AWS Incident Detection and Response alarms should only enter "Alarm" state upon significant business impact to the monitored workload (loss of revenue/degraded customer experience) that requires immediate operator attention.
AWS Incident Detection and Response alarms must also engage your resolvers for the workload at the same time as, or before, AWS is engaged. AWS Incident Managers collaborate with your resolvers in the mitigation process, and do not serve as first-line responders who then escalate to you.
AWS Incident Detection and Response alarm thresholds and durations must be set so that every time an alarm fires, an investigation is warranted. If an alarm is moving between the "Alarm" and "OK" states, sufficient impact is occurring to warrant operator response and attention.
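One way to judge an alarm against the criteria above is to count how often it entered the "Alarm" state recently, which is also a figure the alarm matrix asks for in its Notes column. The sketch below is an illustration only. It assumes you already have a list of state changes as `(timestamp, new_state)` pairs; where that list comes from is left open, and the function name `alarm_entries` is hypothetical. The state names `ALARM` and `OK` follow the values Amazon CloudWatch uses.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

StateChange = Tuple[datetime, str]  # (time of the change, new state)


def alarm_entries(history: List[StateChange], now: datetime, days: int = 7) -> int:
    """Count transitions into the ALARM state within the last `days` days.

    Assumes each record is a state change, so consecutive records never
    repeat the same state.
    """
    cutoff = now - timedelta(days=days)
    return sum(
        1 for when, state in history
        if state == "ALARM" and cutoff <= when <= now
    )


# Hypothetical history for a single alarm.
now = datetime(2024, 1, 8, 12, 0)
history = [
    (datetime(2023, 12, 30, 9, 0), "ALARM"),  # older than one week, not counted
    (datetime(2023, 12, 30, 9, 20), "OK"),
    (datetime(2024, 1, 3, 2, 0), "ALARM"),
    (datetime(2024, 1, 3, 2, 15), "OK"),
    (datetime(2024, 1, 6, 2, 0), "ALARM"),
    (datetime(2024, 1, 6, 2, 10), "OK"),
]

print(alarm_entries(history, now))  # 2
```

A high count for an alarm that did not correspond to real business impact suggests the threshold or duration needs review before onboarding.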
AWS Incident Detection and Response Policy for Criteria Violations:
These criteria can only be evaluated on a case-by-case basis as events occur. The Incident Management team works with your technical account managers (TAMs) to adjust alarms and, in rare cases, disable monitoring if it suspects that customer alarms do not adhere to these criteria and are regularly engaging the Incident Management team unnecessarily.
Important
Provide group distribution email addresses when supplying contact addresses, so that you can add and remove recipients without runbook updates.
Provide the contact phone number for your site reliability engineering (SRE) team if you would like the AWS Incident Detection and Response team to call them after sending an initial engagement email.
Metric name / ARN / Threshold | Description | Notes | Actions requested |
---|---|---|---|
Workload volume / CallCount < 100000 for 5 datapoints within 5 minutes, treat missing data as missing | This metric represents the number of incoming requests coming to the workload, measured at the Application Load Balancer level. This alarm is important because significant drops in incoming requests may indicate issues with upstream network connectivity, or issues with our DNS implementation that result in users not being able to access the workload. | The alarm has entered the "Alarm" state 10 times in the last week. This alarm is at risk of false positives. Threshold review is planned. Issues? No or Yes (if No, leave blank): This alarm flips frequently during a particular batch job execution. Resolvers: Site Reliability Engineers | Engage the Site Reliability Engineering team by sending an email to … Create an AWS Premium Support case for our ELB and Route 53 services. If IMMEDIATE action is needed: Check EC2 free memory/disk space and inform the … |
Workload Request Latency / p90 Latency > 100ms for 5 datapoints within 5 minutes, treat missing data as missing | This metric represents the p90 latency for HTTP requests to be fulfilled by the workload. This alarm represents latency (an important measure of customer experience for the website). | The alarm has entered the "Alarm" state 0 times in the last week. Issues? No or Yes (if No, leave blank): This alarm flips frequently during a particular batch job execution. Resolvers: Site Reliability Engineers | Engage the Site Reliability Engineering team by sending an email to … Create an AWS Premium Support case for our ECS and RDS services. If IMMEDIATE action is needed: Check EC2 free memory/disk space and inform the … |
Workload Request Availability / Availability < 95% for 5 datapoints within 5 minutes, treat missing data as missing | This metric represents the availability for HTTP requests to be fulfilled by the workload: (# of HTTP 200 / # of Requests) per period. This alarm represents the availability of the workload. | The alarm has entered the "Alarm" state 0 times in the last week. Issues? No or Yes (if No, leave blank): This alarm flips frequently during a particular batch job execution. Resolvers: Site Reliability Engineers | Engage the Site Reliability Engineering team by sending an email to … Create an AWS Premium Support case for our ELB and Route 53 services. If IMMEDIATE action is needed: Check EC2 free memory/disk space and inform the … |
New Relic Alarm Example | | | |
End to End Integration test / 3% failure rate for 1-minute metrics over a 3-minute duration, treat missing data as missing. Workload Identifier: End to End Test Workflow, AWS Region: US-EAST-1, AWS Account ID: 012345678910 | This metric tests if a request can traverse each layer of the workload. If this test fails, it represents a critical failure to process business transactions. This alarm represents the ability to process business transactions for the workload. | The alarm has entered the "Alarm" state 0 times in the last week. Issues? No or Yes (if No, leave blank): This alarm flips frequently during a particular batch job execution. Resolvers: Site Reliability Engineers | Engage the Site Reliability Engineering team by sending an email to … Create an AWS Premium Support case for our ECS and DynamoDB services. If IMMEDIATE action is needed: Check EC2 free memory/disk space and inform the … |
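To make the threshold wording in the matrix concrete, the sketch below works through the availability example: it computes availability per period as (# of HTTP 200 / # of Requests) and then applies "Availability < 95% for 5 datapoints within 5 minutes, treat missing data as missing". This is a deliberately simplified model written for illustration, not the evaluation logic of Amazon CloudWatch or any other monitoring tool. In particular, it looks at a single fixed window and skips missing datapoints, without the look-back behavior a real alarm engine may apply. The function names and sample counts are hypothetical.

```python
from typing import List, Optional


def availability(http_200: int, requests: int) -> Optional[float]:
    """Availability for one period, in percent; None when the period had no requests."""
    if requests == 0:
        return None  # no data for this period
    return 100.0 * http_200 / requests


def evaluate(datapoints: List[Optional[float]], threshold: float,
             datapoints_to_alarm: int) -> str:
    """Simplified "less than threshold" evaluation over one window.

    Missing datapoints (None) are skipped: they count neither as breaching
    nor as not breaching.
    """
    present = [d for d in datapoints if d is not None]
    if not present:
        return "INSUFFICIENT_DATA"
    breaching = sum(1 for d in present if d < threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


# Hypothetical (HTTP 200 count, request count) pairs for five 1-minute periods.
counts = [(940, 1000), (930, 1000), (900, 1000), (945, 1000), (920, 1000)]
window = [availability(ok, total) for ok, total in counts]
print(evaluate(window, threshold=95.0, datapoints_to_alarm=5))  # ALARM

# Same window, but one period received no requests and is therefore missing.
window_with_gap = window[:2] + [availability(0, 0)] + window[3:]
print(evaluate(window_with_gap, threshold=95.0, datapoints_to_alarm=5))  # OK
```

The second case shows why the missing-data setting belongs in the matrix: under this simplified model, a single gap is enough to keep a "5 out of 5" alarm from firing.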