Workload onboarding and alarm ingestion questionnaires - AWS Incident Detection and Response User Guide

Download the Workload onboarding questionnaire.

Download the Alarm ingestion questionnaire.

Workload onboarding questionnaire - General questions

General questions
Question Example Response
Enterprise Name

Amazon Inc.

Name of this workload (include any abbreviations)

Amazon Retail Operations (ARO)

Primary end user and the function of this workload.

This workload is an e-commerce application that allows end users to purchase various items. This workload is the primary revenue generator for our business.

Applicable compliance and/or regulatory requirements for this workload and any actions required from AWS after an incident.

The workload deals with patient health records which must be kept secured and confidential.

Workload onboarding questionnaire - Architecture questions

Architecture questions
Question Example Response

A list of AWS resource tags used to define resources that are part of this workload. AWS uses these tags to identify this workload's resources to expedite support during incidents.

Note

Tags are case sensitive. If you provide multiple tags, all resources used by this workload must have the same tags.

appName: Optimax

environment: Production

A list of AWS Services utilized by this workload and the AWS Account and Regions that they're in.

Note

Create a new row for each service.

Route 53: Routes internet traffic to the ALB.

Account: 123456789101

Region: US-EAST-1, US-WEST-2

ALB: Routes incoming traffic to a target group of ECS containers.

Account: 123456789101

Region: N/A

ECS: Compute infrastructure for main business logic fleet. Responsible for handling incoming user requests and making queries to persistence layer.

Account: 123456789101

Region: US-EAST-1

RDS: Amazon Aurora cluster stores user data accessed by ECS business logic layer.

Account: 123456789101

Region: US-EAST-1

S3: Stores website static assets.

Account: 123456789101

Region: N/A

Detail any upstream/downstream components not being onboarded that could affect this workload if they experience an outage.

Authentication Microservice: Will prevent users from loading their health records as they will be unauthenticated.

Are there any on-premises or non-AWS components for this workload? If so, what are they and what functions are performed?

All internet-based traffic in/out of AWS is routed via our on-premises proxy service.

Provide details of any manual or automated failover/disaster recovery plans at the Availability Zone and regional level.

Warm standby. Automated failover to US-WEST-2 during sustained drop in success rate.
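
Because the tags in the architecture questions are case sensitive and every workload resource must carry all of them, it can help to check your resource tags before submitting the questionnaire. The following is a minimal sketch of such a check. The function name and the sample resource tags are hypothetical, and the exact matching that AWS performs isn't described in this guide, so the all-tags, exact-match rule below is an assumption based on the note above.

```python
def has_all_workload_tags(resource_tags, workload_tags):
    """Return True only if every workload tag key and value appears verbatim on the resource.

    Assumption: matching is exact and case sensitive for both keys and values.
    """
    return all(resource_tags.get(key) == value for key, value in workload_tags.items())


# Tags from the example response in the questionnaire.
workload_tags = {"appName": "Optimax", "environment": "Production"}

# Hypothetical resources: extra tags are fine, but a key with different casing does not match.
tagged_correctly = {"appName": "Optimax", "environment": "Production", "team": "retail"}
wrong_case = {"appname": "Optimax", "environment": "Production"}

print(has_all_workload_tags(tagged_correctly, workload_tags))  # True
print(has_all_workload_tags(wrong_case, workload_tags))        # False
```

A resource that fails this check would need its tags corrected, or it would not be identified as part of the workload under the assumed rule.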

Workload onboarding questionnaire - AWS Service Event questions

AWS Service Event questions
Question Example Response
Provide the contact details (name/email/phone) of your company's internal major incident/IT crisis management team.

Major Incident Management Team

mim@example.com

+61 2 3456 7890

Provide details of any static incident/crisis management bridge established by your company. If you utilize non-static bridges, then specify your preferred application and AWS will request these details during an incident.

Note

If one isn't provided, then AWS will reach out during an incident and provide a Chime bridge for you to join.

Amazon Chime

https://chime.aws/1234567890

Alarm Ingestion Questionnaire

Runbook Questions
Question Example Response

AWS will engage workload contacts through the AWS Support Case. Who is the primary contact when an alarm triggers for this workload?

Specify your preferred conferencing application and AWS will request these details during an incident.

Note

If a preferred conferencing application isn't provided, then AWS will reach out during an incident and provide a Chime bridge for you to join.

Application Team

app@example.com

+61 2 3456 7890

If the primary contact is unavailable during an incident, please provide escalation contacts and timeline in the preferred communication order.

1. After 10 minutes, if no response from Primary Contact, engage:

John Smith - Application Supervisor

john.smith@example.com

+61 2 3456 7890

2. After 10 minutes, if no response from John Smith, contact:

Jane Smith - Operations Manager

jane.smith@example.com

+61 2 3456 7890

AWS communicates updates through the support case at regular intervals throughout the incident. Are there additional contacts that should receive these updates?

john.smith@example.com, jane.smith@example.com
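
The escalation order in the example response above can be written down as a simple timeline. The following sketch is illustrative only: the `Contact` class and `escalation_schedule` function are hypothetical names, and the model assumes each contact is engaged only after the previous contact's wait time passes with no response.

```python
from dataclasses import dataclass


@dataclass
class Contact:
    name: str
    email: str
    wait_minutes: int  # minutes to wait for a response before engaging the next contact


def escalation_schedule(contacts):
    """Return (minutes_after_first_engagement, contact) pairs in engagement order."""
    schedule = []
    elapsed = 0
    for contact in contacts:
        schedule.append((elapsed, contact))
        elapsed += contact.wait_minutes
    return schedule


# Contacts and wait times taken from the example response.
chain = [
    Contact("Application Team", "app@example.com", 10),
    Contact("John Smith", "john.smith@example.com", 10),
    Contact("Jane Smith", "jane.smith@example.com", 0),
]

for minutes, contact in escalation_schedule(chain):
    print(f"T+{minutes} min: engage {contact.name} <{contact.email}>")
```

Writing the timeline out this way makes it easy to confirm that the last escalation contact is reached within a delay that your team considers acceptable.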

Alarm matrix

Provide the following information to identify the set of alarms that will engage AWS Incident Detection and Response to create incidents on behalf of your workload. Once engineers from AWS Incident Detection and Response have reviewed your alarms, additional onboarding steps will be delivered.

AWS Incident Detection and Response Critical Alarm Criteria:

  • AWS Incident Detection and Response alarms should only enter "Alarm" state upon significant business impact to the monitored workload (loss of revenue/degraded customer experience) that requires immediate operator attention.

  • AWS Incident Detection and Response alarms must also engage your resolvers for the workload at the same time as, or prior to, engaging AWS Incident Detection and Response. AWS Incident Managers collaborate with your resolvers in the mitigation process, and do not serve as first-line responders who then escalate to you.

  • AWS Incident Detection and Response alarm thresholds must be set to an appropriate threshold and duration so that any time an alarm fires an investigation must take place. If an alarm is moving between the "Alarm" and "OK" state, sufficient impact is occurring to warrant operator response and attention.
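
To judge whether an alarm meets the criteria above, it can help to count how often it has entered the "Alarm" state recently. The following sketch counts transitions into the alarm state over the last week. The `count_alarm_entries` function and the sample history are hypothetical, and the sketch assumes you have already exported the alarm's state history as a list of timestamped states sorted oldest first.

```python
from datetime import datetime, timedelta


def count_alarm_entries(history, now, window=timedelta(days=7)):
    """Count transitions into "ALARM" that happened within the window ending at `now`.

    history: list of (timestamp, state) tuples sorted oldest first.
    Consecutive "ALARM" entries count once, because the alarm never left the state.
    """
    count = 0
    previous_state = None
    for timestamp, state in history:
        entered_alarm = state == "ALARM" and previous_state != "ALARM"
        if entered_alarm and now - window <= timestamp <= now:
            count += 1
        previous_state = state
    return count


# Hypothetical state history for one alarm.
now = datetime(2024, 1, 8)
history = [
    (datetime(2023, 12, 30), "ALARM"),  # outside the 7-day window
    (datetime(2024, 1, 2), "OK"),
    (datetime(2024, 1, 3), "ALARM"),
    (datetime(2024, 1, 3, 1), "ALARM"),  # still in alarm, not a new entry
    (datetime(2024, 1, 4), "OK"),
    (datetime(2024, 1, 5), "ALARM"),
]

print(count_alarm_entries(history, now))  # 2
```

A high count for an alarm that did not correspond to real business impact suggests the threshold or duration needs review before the alarm is onboarded.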

AWS Incident Detection and Response Policy for Criteria Violations:

These criteria can only be evaluated on a case-by-case basis as events occur. The Incident Management team works with your technical account managers (TAMs) to adjust alarms and, in rare cases, disable monitoring if it is suspected that customer alarms do not adhere to these criteria and are regularly engaging the Incident Management team unnecessarily.

Important

Provide group distribution email addresses when supplying contact addresses, so that you can control recipient additions and deletions without runbook updates.

Provide the contact phone number for your site reliability engineering (SRE) team if you would like the AWS Incident Detection and Response team to call them after sending an initial engagement email.

Alarm Matrix table
Metric name / ARN / Threshold Description Notes Actions requested

Workload volume /

CW Alarm ARN /

CallCount < 100000 for 5 datapoints within 5 minutes, treat missing data as missing

This metric represents the number of incoming requests coming to the workload, measured at the Application Load Balancer level.

This alarm is important because significant drops in incoming requests may indicate issues with upstream network connectivity, or issues with our DNS implementation that result in users not being able to access the workload.

The alarm has entered the "Alarm" state 10 times in the last week. This alarm is at risk of false positives. Threshold review is planned.

Issues? No or Yes (if No, leave blank): This alarm flips frequently during a particular batch job execution.

Resolvers: Site Reliability Engineers

Engage the Site Reliability Engineering team by sending an email to SRE@xyz.com

Create an AWS Premium Support case for our ELB and Route 53 services.

If IMMEDIATE action is needed: Check EC2 Free memory/disk space and inform the XYZ Team through email to restart the instance, or run a log flush. (if immediate action is not needed, leave blank)

Workload Request Latency /

CW Alarm ARN /

p90 Latency > 100ms for 5 datapoints within 5 minutes, treat missing data as missing

This metric represents the p90 latency for HTTP requests to be fulfilled by the workload.

This alarm represents latency (an important measure of customer experience for the website).

The alarm has entered the "Alarm" state 0 times in the last week.

Issues? No or Yes (if No, leave blank): This alarm flips frequently during a particular batch job execution.

Resolvers: Site Reliability Engineers

Engage the Site Reliability Engineering team by sending an email to SRE@xyz.com

Create an AWS Premium Support case for our ECS and RDS services.

If IMMEDIATE action is needed: Check EC2 Free memory/disk space and inform the XYZ Team through email to restart the instance, or run a log flush. (if immediate action is not needed, leave blank)

Workload Request Availability /

CW Alarm ARN /

Availability < 95% for 5 datapoints within 5 minutes, treat missing data as missing.

This metric represents the availability for HTTP requests to be fulfilled by the workload. (# of HTTP 200 / # of Requests) per period.

This alarm represents the availability of the workload.

The alarm has entered the "Alarm" state 0 times in the last week.

Issues? No or Yes (if No, leave blank): This alarm flips frequently during a particular batch job execution.

Resolvers: Site Reliability Engineers

Engage the Site Reliability Engineering team by sending an email to SRE@xyz.com

Create an AWS Premium Support case for our ELB and Route 53 services.

If IMMEDIATE action is needed: Check EC2 Free memory/disk space and inform the XYZ Team through email to restart the instance, or run a log flush. (if immediate action is not needed, leave blank)
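
The threshold definitions in the example rows above follow the same pattern: a comparison, a number of breaching datapoints, and a rule for missing data. The following is a simplified sketch of that pattern, including the availability formula (# of HTTP 200 / # of Requests). It is not the evaluation algorithm that Amazon CloudWatch uses, which is more involved. The function names and sample numbers are hypothetical, and the sketch assumes that missing datapoints are simply skipped.

```python
import operator


def availability(http_200_count, request_count):
    """Availability for one period: HTTP 200 responses divided by total requests."""
    if request_count == 0:
        return None  # no requests in the period, so treat the datapoint as missing
    return http_200_count / request_count


def evaluate(datapoints, breaches, threshold, datapoints_to_alarm):
    """Simplified alarm evaluation over one window of datapoints.

    Assumption: missing datapoints (None) are ignored rather than counted
    as breaching or not breaching.
    """
    present = [value for value in datapoints if value is not None]
    if not present:
        return "INSUFFICIENT_DATA"
    breaching = sum(1 for value in present if breaches(value, threshold))
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


# Workload volume: CallCount < 100000 for 5 datapoints within 5 minutes.
call_counts = [90_000, 85_000, 70_000, 95_000, 60_000]
print(evaluate(call_counts, operator.lt, 100_000, 5))  # ALARM

# Workload Request Availability: Availability < 95% for 5 datapoints within 5 minutes.
# Each tuple is (HTTP 200 count, request count) for a one-minute period.
periods = [(940, 1000), (960, 1000), (0, 0), (930, 1000), (990, 1000)]
availabilities = [availability(ok, total) for ok, total in periods]
print(evaluate(availabilities, operator.lt, 0.95, 5))  # OK
```

Running your own recent metric data through a check like this is one way to see how often a proposed threshold and duration would have fired.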

New Relic Alarm Example

End to End Integration test /

CW Alarm ARN /

3% failure rate for 1-minute metrics over a 3-minute duration, treat missing data as missing

Workload Identifier: End to End Test Workflow, AWS Region: US-EAST-1, AWS Account ID: 012345678910

This metric tests if a request can traverse each layer of the workload. If this test fails, it represents a critical failure to process business transactions.

This alarm represents the ability to process business transactions for the workload.

The alarm has entered the "Alarm" state 0 times in the last week.

Issues? No or Yes (if No, leave blank): This alarm flips frequently during a particular batch job execution.

Resolvers: Site Reliability Engineers

Engage the Site Reliability Engineering team by sending an email to SRE@xyz.com

Create an AWS Premium Support case for our ECS and DynamoDB services.

If IMMEDIATE action is needed: Check EC2 Free memory/disk space and inform the XYZ Team through email to restart the instance, or run a log flush. (if immediate action is not needed, leave blank)