Appendix: AWS service-specific FMEA implementation - AWS Prescriptive Guidance

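This appendix provides pre-analyzed example failure modes for common AWS services, each with a Risk Priority Number (RPN) and recommended mitigation strategies. Each RPN is the product of the Severity, Occurrence, and Detection scores listed for that failure mode (for example, 8 × 4 × 5 = 160), and a higher Detection score means the failure is harder to detect. Use these entries as a starting point for your failure mode and effects analysis (FMEA) and customize them for your specific application architecture and business requirements.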
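
The RPN values in this appendix can be reproduced by multiplying the three scores of each entry. The following is a minimal sketch of that arithmetic, using scores copied from four entries in this appendix. The `FailureMode` class and the `rank_by_rpn` helper are illustrative names, not part of any AWS tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    service: str
    name: str
    severity: int    # impact if the failure occurs
    occurrence: int  # how often the failure is expected
    detection: int   # higher means harder to detect

    @property
    def rpn(self) -> int:
        # RPN is the product of the three scores.
        return self.severity * self.occurrence * self.detection


# Scores copied from entries in this appendix.
FAILURE_MODES = [
    FailureMode("Application Load Balancer", "Target health check failures", 8, 4, 5),
    FailureMode("Amazon ECR", "Image pull failures during deployment", 7, 3, 4),
    FailureMode("Amazon RDS", "Oracle license compliance violations", 8, 3, 8),
    FailureMode("Route 53", "Latency-based routing misconfigurations", 6, 4, 6),
]


def rank_by_rpn(modes):
    """Return failure modes sorted from highest to lowest RPN."""
    return sorted(modes, key=lambda mode: mode.rpn, reverse=True)


if __name__ == "__main__":
    for mode in rank_by_rpn(FAILURE_MODES):
        print(f"{mode.rpn:>4}  {mode.service}: {mode.name}")
```

Sorting by RPN gives a first-pass priority order. You might still choose to treat a high-severity entry first even when its RPN is lower.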

Application Load Balancer

Application Load Balancer high-priority failure modes affect traffic routing and availability at the edge of your application. Health check misconfigurations can silently pull healthy targets out of rotation, expired SSL certificates can make your site unreachable, and listener rule errors can send requests to the wrong backend.

Target health check failures

  • Severity: 8 (Service unavailability)

  • Occurrence: 4 (Common with application issues)

  • Detection: 5 (Detected through Application Load Balancer monitoring)

  • RPN: 160

  • Root causes: Application startup delays, health check endpoint issues, network connectivity

  • Mitigation strategies:

SSL certificate expiration

Listener rule misconfigurations

  • Severity: 7 (Traffic routing issues)

  • Occurrence: 4 (Common during configuration changes)

  • Detection: 6 (Detected through monitoring or user reports)

  • RPN: 168

  • Root causes: Manual configuration errors, rule priority conflicts, condition mismatches

  • Mitigation strategies:

    • Implement infrastructure as code (IaC) for Application Load Balancer configurations

    • Create listener rule validation procedures

    • Establish configuration change management

    • Configure traffic routing monitoring
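
A listener rule validation procedure can start as a small check that runs against your rule definitions before they are deployed. The following sketch assumes rules have already been parsed into a hypothetical `ListenerRule` structure (for example, from an IaC template). It flags two of the root causes listed above: priority conflicts, and path conditions that can never match because an earlier rule already captures them. The shadowing check is a heuristic based on Python's `fnmatch` glob matching and does not reproduce the load balancer's own matching logic.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class ListenerRule:
    # Hypothetical in-memory form of a rule, for example parsed from a template.
    priority: int
    path_pattern: str
    target_group: str


def validate_rules(rules):
    """Return a list of human-readable problems found in the rule set."""
    problems = []
    # Rules are evaluated from the lowest priority value to the highest.
    ordered = sorted(rules, key=lambda rule: rule.priority)

    # Check 1: every priority should identify exactly one rule.
    by_priority = {}
    for rule in ordered:
        first = by_priority.setdefault(rule.priority, rule)
        if first is not rule:
            problems.append(
                f"priority {rule.priority} is used by both "
                f"{first.path_pattern!r} and {rule.path_pattern!r}"
            )

    # Check 2 (heuristic): an earlier rule whose pattern matches the text of a
    # later pattern probably captures that later rule's traffic.
    for index, later in enumerate(ordered):
        for earlier in ordered[:index]:
            if earlier.priority == later.priority:
                continue  # already reported by check 1
            if earlier.target_group == later.target_group:
                continue  # same destination, so shadowing is harmless
            if fnmatchcase(later.path_pattern, earlier.path_pattern):
                problems.append(
                    f"rule {later.priority} ({later.path_pattern!r}) is "
                    f"shadowed by rule {earlier.priority} "
                    f"({earlier.path_pattern!r})"
                )
    return problems
```

Running a check like this in a deployment pipeline turns a configuration error that would otherwise surface through user reports into a failed build.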

Amazon ECS

Amazon Elastic Container Service (Amazon ECS) high-priority failure modes typically involve tasks failing to launch, scaling policies not responding to load, or misconfigurations in task definitions. These issues can be difficult to detect because Amazon ECS abstracts much of the underlying infrastructure, so problems often surface as degraded application performance rather than clear infrastructure alerts.

Tasks stuck in PENDING state

Service auto-scaling failures

Task definition misconfigurations

  • Severity: 6 (Application functionality impacted)

  • Occurrence: 5 (Common during deployments)

  • Detection: 5 (Detected during deployment or runtime)

  • RPN: 150

  • Root causes: Manual configuration errors, missing environment variables

  • Mitigation strategies:
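
The missing environment variables root cause lends itself to a check that runs before a task definition is registered. The following sketch reads a task definition in the JSON shape that Amazon ECS uses (`containerDefinitions`, each with optional `environment` and `secrets` lists) and reports variables that a container needs but does not define. The `required` mapping and the variable names are assumptions for illustration. In practice, you would maintain that mapping alongside your application code.

```python
def missing_env_vars(task_definition, required):
    """Map each container name to the required variables it does not define.

    `required` is a hypothetical mapping of container name to the variable
    names that the application expects at startup.
    """
    missing = {}
    for container in task_definition.get("containerDefinitions", []):
        name = container.get("name", "<unnamed>")
        # Variables can be supplied as plain values or as secret references.
        defined = {entry["name"] for entry in container.get("environment", [])}
        defined |= {entry["name"] for entry in container.get("secrets", [])}
        gaps = sorted(set(required.get(name, [])) - defined)
        if gaps:
            missing[name] = gaps
    return missing


if __name__ == "__main__":
    task_definition = {
        "family": "example-web",
        "containerDefinitions": [
            {
                "name": "web",
                "image": "example/web:latest",
                "environment": [{"name": "DB_HOST", "value": "db.internal"}],
                "secrets": [{"name": "DB_PASSWORD", "valueFrom": "example-arn"}],
            }
        ],
    }
    required = {"web": ["DB_HOST", "DB_PASSWORD", "LOG_LEVEL"]}
    print(missing_env_vars(task_definition, required))
```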

Amazon ECR

Amazon Elastic Container Registry (Amazon ECR) high-priority failure modes center on the security and availability of container images in your deployment pipeline. Vulnerability scanning gaps and unsigned images can introduce security risk that goes undetected until an audit, while image pull failures can block deployments entirely.

Vulnerability scanning failures

Malicious image deployment

  • Severity: 9 (Critical security breach)

  • Occurrence: 2 (Rare but possible)

  • Detection: 10 (Very difficult to detect)

  • RPN: 180

  • Root causes: Compromised build pipeline, insider threats, supply chain attacks

  • Mitigation strategies:

Image pull failures during deployment

  • Severity: 7 (Deployment failures, service disruption)

  • Occurrence: 3 (Occasional network or permission issues)

  • Detection: 4 (Quickly detected during deployment)

  • RPN: 84

  • Root causes: Network connectivity, IAM permissions, registry availability

  • Mitigation strategies:

Amazon EFS

Amazon Elastic File System (Amazon EFS) high-priority failure modes revolve around data protection, network connectivity, and performance configuration. Backup failures are particularly high risk because they often go undetected until a restore is needed.

Backup failures

Mount target connectivity issues

Performance mode misconfigurations

  • Severity: 6 (Performance impact)

  • Occurrence: 4 (Common during initial setup)

  • Detection: 5 (Detected through performance monitoring)

  • RPN: 120

  • Root causes: Incorrect performance mode selection, throughput mode misconfigurations

  • Mitigation strategies:

Amazon RDS

Amazon Relational Database Service (Amazon RDS) high-priority failure modes span licensing, storage, performance, and configuration. Because the database layer underpins most application functionality, failures here tend to score high on severity.

Oracle license compliance violations

  • Severity: 8 (Legal and financial risk)

  • Occurrence: 3 (Possible during scaling events)

  • Detection: 8 (Difficult to detect without proper monitoring)

  • RPN: 192

  • Root causes: Auto-scaling beyond license limits, manual instance modifications, license tracking gaps

  • Mitigation strategies:

    • Implement automated license usage monitoring

    • Configure scaling limits based on license capacity

    • Create license compliance dashboards

    • Establish regular license audit procedures
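
A scaling limit based on license capacity can be expressed as a guard that runs before any instance modification. The following sketch assumes, for illustration only, that license capacity is tracked as a total vCPU count and that the vCPU count of each instance class is kept in a local lookup table. Actual Oracle license counting rules depend on your agreement and are not modeled here. The names `VCPUS_BY_CLASS`, `vcpus_in_use`, and `fits_license` are hypothetical.

```python
# Illustrative lookup table; maintain your own from the instance classes you use.
VCPUS_BY_CLASS = {
    "db.r5.large": 2,
    "db.r5.xlarge": 4,
    "db.r5.2xlarge": 8,
}


def vcpus_in_use(instance_classes):
    """Sum the vCPUs of every instance class in the fleet."""
    return sum(VCPUS_BY_CLASS[instance_class] for instance_class in instance_classes)


def fits_license(instance_classes, licensed_vcpus):
    """Return True if the fleet stays within the licensed vCPU total."""
    return vcpus_in_use(instance_classes) <= licensed_vcpus


if __name__ == "__main__":
    licensed_vcpus = 8  # assumed license capacity for this example
    current = ["db.r5.xlarge", "db.r5.large"]
    proposed = ["db.r5.xlarge", "db.r5.2xlarge"]  # second instance scaled up
    print("current fits:", fits_license(current, licensed_vcpus))
    print("proposed fits:", fits_license(proposed, licensed_vcpus))
```

The same function can feed a compliance dashboard by publishing `vcpus_in_use` next to the licensed total.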

Database storage exhaustion

Performance degradation

  • Severity: 7 (Application slowness, user impact)

  • Occurrence: 4 (Common during peak usage)

  • Detection: 6 (Detected through performance monitoring)

  • RPN: 168

  • Root causes: Query inefficiencies, resource constraints, parameter misconfigurations

  • Mitigation strategies:

Parameter group misconfigurations

  • Severity: 7 (Performance or functionality issues)

  • Occurrence: 4 (Common during configuration changes)

  • Detection: 6 (Detected through monitoring or testing)

  • RPN: 168

  • Root causes: Manual configuration errors, incompatible parameter combinations, version upgrade issues

  • Mitigation strategies:

    • Implement IaC for parameter groups

    • Create parameter validation and testing procedures

    • Establish configuration change management processes

    • Configure automated rollback capabilities
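
Parameter validation and rollback preparation can both be done before a change is applied. The following sketch checks proposed values against a team-defined policy of allowed ranges and records the current values of any parameter that would change, so they can be restored later. The policy ranges shown are assumptions for illustration and are not engine limits.

```python
def validate(proposed, policy):
    """Return a list of policy violations for the proposed parameter values."""
    errors = []
    for name, value in proposed.items():
        if name not in policy:
            # Unknown parameters are flagged instead of silently accepted.
            errors.append(f"{name}: no policy entry, review manually")
            continue
        low, high = policy[name]
        if not low <= value <= high:
            errors.append(f"{name}={value} is outside {low}..{high}")
    return errors


def rollback_values(current, proposed):
    """Return the current value of every parameter the change would modify."""
    return {
        name: current[name]
        for name, value in proposed.items()
        if name in current and current[name] != value
    }


if __name__ == "__main__":
    # Hypothetical policy: allowed (minimum, maximum) per parameter.
    policy = {"max_connections": (50, 500), "work_mem": (1024, 65536)}
    current = {"max_connections": 100, "work_mem": 4096}
    proposed = {"max_connections": 1000, "work_mem": 4096}
    print(validate(proposed, policy))
    print(rollback_values(current, proposed))
```

Storing the output of `rollback_values` with the change record gives an automated rollback step a known-good set of values to reapply.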

Route 53

Amazon Route 53 failure modes tend to score high on severity because DNS sits at the front of every request path. A misconfigured failover policy or a false-positive health check can make your entire application unreachable or silently route traffic to an unhealthy endpoint. These issues are rare but difficult to detect before customers are affected.

DNS failover mechanism failures

Health check false positives or negatives

  • Severity: 8 (Unnecessary failovers or missed failures)

  • Occurrence: 4 (Common with complex health checks)

  • Detection: 5 (Detected through monitoring)

  • RPN: 160

  • Root causes: Network latency, application-specific health check logic, timeout configurations

  • Mitigation strategies:

Latency-based routing misconfigurations

  • Severity: 6 (Suboptimal user experience)

  • Occurrence: 4 (Common during configuration changes)

  • Detection: 6 (Detected through performance monitoring)

  • RPN: 144

  • Root causes: Incorrect latency measurements, routing policy conflicts, geographic misconfigurations

  • Mitigation strategies:

    • Implement continuous latency monitoring

    • Create routing policy validation procedures

    • Establish A/B testing for routing changes

    • Configure automated rollback mechanisms
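
Continuous latency monitoring and automated rollback can be connected through a simple decision rule: compare latency samples collected before and after a routing change, and roll back if a high percentile regresses beyond a tolerance. The following sketch uses a nearest-rank percentile and a 10 percent tolerance. Both are assumptions, as are the function names. How the samples are collected is outside the scope of this example.

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer 1..100."""
    ordered = sorted(samples)
    # Integer arithmetic avoids floating-point rounding in the rank.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def should_roll_back(baseline_ms, candidate_ms, pct=95, tolerance=1.10):
    """Return True if the candidate percentile exceeds baseline * tolerance."""
    return percentile(candidate_ms, pct) > percentile(baseline_ms, pct) * tolerance


if __name__ == "__main__":
    # Example latency samples in milliseconds (made up for illustration).
    baseline = [100] * 19 + [200]
    candidate = [100] * 10 + [150] * 10
    print("roll back:", should_roll_back(baseline, candidate))
```

A rule like this pairs well with staged rollouts of routing changes, where only part of the traffic is exposed to the new policy while samples are collected.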