Appendix: AWS service-specific FMEA implementation
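This reference guide provides example pre-analyzed failure modes for common AWS services, including Risk Priority Number (RPN) calculations and recommended mitigation strategies. Each RPN is the product of the three ratings listed for a failure mode (RPN = Severity × Occurrence × Detection), where a higher Detection rating means the failure is harder to detect. Use this as a starting point for your FMEA and customize it based on your specific application architecture and business requirements.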
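As a rough illustration of how the ratings combine, the following sketch computes RPN values and ranks failure modes by them. The `FailureMode` class and `rank_by_rpn` function are hypothetical names introduced here for illustration; the sample ratings are taken from the Application Load Balancer entries in this appendix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """One FMEA row; field names are illustrative, not from an AWS API."""
    name: str
    severity: int
    occurrence: int
    detection: int

    @property
    def rpn(self) -> int:
        # RPN is the product of the three ratings.
        return self.severity * self.occurrence * self.detection


def rank_by_rpn(modes):
    """Return failure modes sorted from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


modes = [
    FailureMode("Target health check failures", 8, 4, 5),
    FailureMode("SSL certificate expiration", 9, 2, 8),
    FailureMode("Listener rule misconfigurations", 7, 4, 6),
]
for mode in rank_by_rpn(modes):
    print(f"{mode.rpn:>4}  {mode.name}")
```

Sorting by RPN is only one way to prioritize; teams often also review any failure mode with a very high Severity regardless of its product.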
Application Load Balancer
Application Load Balancer high-priority failure modes affect traffic routing and availability at the edge of your application. Health check misconfigurations can silently pull healthy targets out of rotation, expired SSL certificates can make your site unreachable, and listener rule errors can send requests to the wrong backend.
Target health check failures
- Severity: 8 (Service unavailability)
- Occurrence: 4 (Common with application issues)
- Detection: 5 (Detected through Application Load Balancer monitoring)
- RPN: 160
- Root causes: Application startup delays, health check endpoint issues, network connectivity
- Mitigation strategies:
  - Create automated target registration procedures
  - Establish health check troubleshooting guides
SSL certificate expiration
- Severity: 9 (Service unavailability, security warnings)
- Occurrence: 2 (Preventable with proper management)
- Detection: 8 (Often detected too late)
- RPN: 144
- Root causes: Manual certificate management, notification failures, renewal process gaps
- Mitigation strategies:
  - Create certificate management procedures
  - Establish automated renewal validation
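One possible shape for automated renewal validation is a scheduled check that compares each certificate's expiry timestamp against a warning window. The sketch below is a minimal example under stated assumptions: the function names and the 30-day default are hypothetical, and you would supply the expiry timestamp (as a timezone-aware `datetime`) from whatever system manages your certificates.

```python
from datetime import datetime, timedelta, timezone


def days_until_expiry(not_after, now=None):
    """Whole days remaining before the certificate's expiry timestamp."""
    now = now or datetime.now(timezone.utc)
    return (not_after - now).days


def needs_renewal(not_after, warn_days=30, now=None):
    """True when the certificate expires within warn_days or has already expired."""
    return days_until_expiry(not_after, now) < warn_days
```

Running a check like this daily and alerting on `needs_renewal` is one way to lower the Detection rating, because expiry is flagged well before it affects users.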
Listener rule misconfigurations
- Severity: 7 (Traffic routing issues)
- Occurrence: 4 (Common during configuration changes)
- Detection: 6 (Detected through monitoring or user reports)
- RPN: 168
- Root causes: Manual configuration errors, rule priority conflicts, condition mismatches
- Mitigation strategies:
  - Implement infrastructure as code (IaC) for Application Load Balancer configurations
  - Create listener rule validation procedures
  - Establish configuration change management
  - Configure traffic routing monitoring
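A listener rule validation procedure could start with a pre-deployment check of the rule set defined in your templates. The following sketch assumes a simplified, hypothetical rule representation (a dict with `priority`, `path`, and `target`) in which the path pattern is the only condition; real rules can carry several conditions, so treat this as a starting point only.

```python
from collections import Counter


def find_rule_problems(rules):
    """Return human-readable problems found in a list of listener rules.

    Each rule is a hypothetical dict: {"priority": int, "path": str, "target": str}.
    """
    problems = []
    counts = Counter(rule["priority"] for rule in rules)
    for priority, count in sorted(counts.items()):
        if count > 1:
            problems.append(f"priority {priority} is used by {count} rules")

    # Rules are evaluated from the lowest priority value upward, so when the
    # path is the only condition, a repeated path at a later priority never matches.
    seen = {}
    for rule in sorted(rules, key=lambda r: r["priority"]):
        first = seen.setdefault(rule["path"], rule)
        if first is not rule and first["priority"] != rule["priority"]:
            problems.append(
                f"path {rule['path']} at priority {rule['priority']} is shadowed "
                f"by priority {first['priority']}"
            )
    return problems
```

A check like this fits naturally into a CI step that fails the build when the returned list is not empty.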
Amazon ECS
Amazon Elastic Container Service (Amazon ECS) high-priority failure modes typically involve tasks failing to launch, scaling policies not responding to load, or misconfigurations in task definitions. These issues can be difficult to detect because Amazon ECS abstracts much of the underlying infrastructure, so problems often surface as degraded application performance rather than clear infrastructure alerts.
Tasks stuck in PENDING state
- Severity: 8 (Service unavailable, customer impact)
- Occurrence: 4 (Monthly occurrence possible)
- Detection: 7 (Difficult to detect until customer reports)
- RPN: 224
- Root causes: IAM permissions, resource constraints, network issues
- Mitigation strategies:
  - Create detailed troubleshooting runbooks
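A troubleshooting runbook can be backed by a small check that flags tasks sitting in PENDING for too long. The sketch below assumes each task is a dict with `taskArn`, `lastStatus`, and a timezone-aware `createdAt`, which mirrors the fields returned when you describe Amazon ECS tasks; the function name and the 10-minute default are hypothetical choices, not recommended values.

```python
from datetime import datetime, timedelta, timezone


def stuck_pending_tasks(tasks, max_pending=timedelta(minutes=10), now=None):
    """Return ARNs of tasks that have been PENDING longer than max_pending."""
    now = now or datetime.now(timezone.utc)
    return [
        task["taskArn"]
        for task in tasks
        if task["lastStatus"] == "PENDING" and now - task["createdAt"] > max_pending
    ]
```

Alerting on a non-empty result surfaces the problem before customers report it, which is the main lever for reducing the Detection rating of 7.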
Service auto-scaling failures
- Severity: 7 (Performance degradation during peak load)
- Occurrence: 3 (Quarterly occurrence)
- Detection: 6 (Detected through performance monitoring)
- RPN: 126
- Root causes: Incorrect scaling policies, resource limits, metric delays
- Mitigation strategies:
  - Set up load testing for scaling validation
  - Create capacity planning procedures
Task definition misconfigurations
- Severity: 6 (Application functionality impacted)
- Occurrence: 5 (Common during deployments)
- Detection: 5 (Detected during deployment or runtime)
- RPN: 150
- Root causes: Manual configuration errors, missing environment variables
- Mitigation strategies:
  - Create automated validation pipelines
  - Establish peer review processes for configuration changes
  - Use configuration management tools
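An automated validation pipeline might include a step that checks a task definition for missing environment variables before it is registered. This sketch assumes the task definition is loaded as a dict with the standard `containerDefinitions`, `environment`, and `secrets` keys; the `missing_environment` function and the `required` mapping are hypothetical and would be maintained by your team.

```python
def missing_environment(task_definition, required):
    """Map container name -> sorted list of required variables it does not define.

    `required` maps a container name to the variable names that container needs.
    """
    missing = {}
    for container in task_definition.get("containerDefinitions", []):
        # Variables can be supplied as plain values or as secret references.
        defined = {item["name"] for item in container.get("environment", [])}
        defined |= {item["name"] for item in container.get("secrets", [])}
        absent = sorted(set(required.get(container["name"], ())) - defined)
        if absent:
            missing[container["name"]] = absent
    return missing
```

Failing the pipeline when the result is non-empty catches the error at review time rather than at deployment or runtime.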
Amazon ECR
Amazon Elastic Container Registry (Amazon ECR) high-priority failure modes center on the security and availability of container images in your deployment pipeline. Vulnerability scanning gaps and unsigned images can introduce security risk that goes undetected until an audit, while image pull failures can block deployments entirely.
Vulnerability scanning failures
- Severity: 8 (Security compliance risk)
- Occurrence: 4 (Regular occurrence with new images)
- Detection: 7 (May not be detected until audit)
- RPN: 224
- Root causes: Scanner limitations, new vulnerability databases, image complexity
- Mitigation strategies:
  - Establish vulnerability remediation procedures
  - Configure continuous monitoring for new vulnerabilities
Malicious image deployment
- Severity: 9 (Critical security breach)
- Occurrence: 2 (Rare but possible)
- Detection: 10 (Very difficult to detect)
- RPN: 180
- Root causes: Compromised build pipeline, insider threats, supply chain attacks
- Mitigation strategies:
  - Create secure build environments with limited access
  - Establish image provenance tracking
  - Configure runtime security monitoring
Image pull failures during deployment
- Severity: 7 (Deployment failures, service disruption)
- Occurrence: 3 (Occasional network or permission issues)
- Detection: 4 (Quickly detected during deployment)
- RPN: 84
- Root causes: Network connectivity, IAM permissions, registry availability
- Mitigation strategies:
  - Configure multiple registry endpoints
  - Create automated retry mechanisms
  - Establish network connectivity monitoring
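An automated retry mechanism for transient pull failures can be as simple as exponential backoff around the pull step. The sketch below is generic: `operation` stands in for whatever callable performs the pull in your deployment tooling, and the function name and defaults are hypothetical. It retries on any exception for brevity; in practice you would likely retry only on errors that can heal on their own, because a permissions failure will not succeed on a later attempt.

```python
import time


def retry_with_backoff(operation, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call operation(); on failure wait base_delay * 2**n seconds and try again."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error.
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is injectable so the backoff schedule can be tested without waiting.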
Amazon EFS
Amazon Elastic File System (Amazon EFS) high-priority failure modes revolve around data protection, network connectivity, and performance configuration. Backup failures are particularly high risk because they often go undetected until a restore is needed.
Backup failures
- Severity: 9 (Data loss risk)
- Occurrence: 3 (Possible with configuration issues)
- Detection: 7 (May not be detected until restore needed)
- RPN: 189
- Root causes: IAM permission issues, backup policy misconfigurations, storage constraints
- Mitigation strategies:
  - Configure multiple backup strategies (AWS Backup, custom scripts)
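Because backup failures tend to stay hidden until a restore is needed, a freshness check can complement the backup strategies themselves. The following sketch assumes you can collect the completion times of successful backups as timezone-aware `datetime` values; the function name and the 24-hour default are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def backup_is_stale(completion_times, max_age=timedelta(hours=24), now=None):
    """True when there is no successful backup newer than max_age."""
    now = now or datetime.now(timezone.utc)
    if not completion_times:
        return True  # No successful backup at all is the worst case.
    return now - max(completion_times) > max_age
```

A freshness check does not prove that a backup can be restored, so periodic restore tests remain necessary.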
Mount target connectivity issues
- Severity: 8 (File system unavailable)
- Occurrence: 2 (Rare network issues)
- Detection: 6 (Detected through application errors)
- RPN: 96
- Root causes: Network connectivity, security group misconfigurations, DNS resolution issues
- Mitigation strategies:
  - Create automated mount target health checks
  - Establish network troubleshooting procedures
Performance mode misconfigurations
- Severity: 6 (Performance impact)
- Occurrence: 4 (Common during initial setup)
- Detection: 5 (Detected through performance monitoring)
- RPN: 120
- Root causes: Incorrect performance mode selection, throughput mode misconfigurations
- Mitigation strategies:
  - Implement performance testing during setup
  - Create performance mode selection guidelines
  - Establish performance optimization procedures
Amazon RDS
Amazon Relational Database Service (Amazon RDS) high-priority failure modes span licensing, storage, performance, and configuration. Because the database layer underpins most application functionality, failures here tend to score high on severity.
Oracle license compliance violations
- Severity: 8 (Legal and financial risk)
- Occurrence: 3 (Possible during scaling events)
- Detection: 8 (Difficult to detect without proper monitoring)
- RPN: 192
- Root causes: Auto-scaling beyond license limits, manual instance modifications, license tracking gaps
- Mitigation strategies:
  - Implement automated license usage monitoring
  - Configure scaling limits based on license capacity
  - Create license compliance dashboards
  - Establish regular license audit procedures
Database storage exhaustion
- Severity: 9 (Complete database unavailability)
- Occurrence: 2 (Rare with proper monitoring)
- Detection: 10 (Often detected too late)
- RPN: 180
- Root causes: Unexpected data growth, failed cleanup procedures, monitoring gaps
- Mitigation strategies:
  - Create automated data archival procedures
  - Establish capacity planning processes
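A capacity planning process can include a simple forecast of when storage will run out, so that exhaustion is flagged long before it happens. The sketch below is one hedged approach: it fits a straight line to daily usage samples and projects forward from the most recent sample. The function name and the `(day, used_gib)` sample format are hypothetical, and linear growth is an assumption that will not hold for bursty workloads.

```python
def days_until_full(samples, capacity_gib):
    """Estimate days until storage fills from (day, used_gib) samples.

    Fits a least-squares line to get the daily growth rate, then projects
    from the most recent sample. Returns None if usage is flat or shrinking.
    """
    days = {day for day, _ in samples}
    if len(days) < 2:
        raise ValueError("need samples from at least two distinct days")
    n = len(samples)
    mean_day = sum(day for day, _ in samples) / n
    mean_used = sum(used for _, used in samples) / n
    covariance = sum((day - mean_day) * (used - mean_used) for day, used in samples)
    variance = sum((day - mean_day) ** 2 for day, _ in samples)
    growth_per_day = covariance / variance
    if growth_per_day <= 0:
        return None
    _, latest_used = max(samples)  # sample with the largest day value
    return (capacity_gib - latest_used) / growth_per_day
```

For example, usage of 100, 110, and 120 GiB on three consecutive days against a 200 GiB allocation projects to about eight days of headroom.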
Performance degradation
- Severity: 7 (Application slowness, user impact)
- Occurrence: 4 (Common during peak usage)
- Detection: 6 (Detected through performance monitoring)
- RPN: 168
- Root causes: Query inefficiencies, resource constraints, parameter misconfigurations
- Mitigation strategies:
  - Implement automated query optimization
  - Create performance baseline and alerting
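Baseline alerting can be sketched as a comparison of the current value of a metric against its recent history. The example below flags a value that exceeds the historical mean by a chosen number of standard deviations. The function name, the metric, and the three-sigma default are all hypothetical; a static threshold like this ignores daily and weekly seasonality.

```python
from statistics import mean, stdev


def exceeds_baseline(history, current, sigmas=3.0):
    """True when current is more than `sigmas` standard deviations above the mean."""
    if len(history) < 2:
        raise ValueError("need at least two historical values")
    return current > mean(history) + sigmas * stdev(history)
```

With query latencies of 10, 12, 11, 13, and 9 ms as history, a reading of 20 ms is flagged while 14 ms is not.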
Parameter group misconfigurations
- Severity: 7 (Performance or functionality issues)
- Occurrence: 4 (Common during configuration changes)
- Detection: 6 (Detected through monitoring or testing)
- RPN: 168
- Root causes: Manual configuration errors, incompatible parameter combinations, version upgrade issues
- Mitigation strategies:
  - Implement IaC for parameter groups
  - Create parameter validation and testing procedures
  - Establish configuration change management processes
  - Configure automated rollback capabilities
Amazon Route 53
Amazon Route 53 failure modes are high-severity because DNS sits at the front of every request path. A misconfigured failover policy or a false-positive health check can make your entire application unreachable or silently route traffic to an unhealthy endpoint. Failover failures are rare but difficult to detect before customers are affected, whereas health check and routing policy errors are more common and usually surface through monitoring.
DNS failover mechanism failures
- Severity: 9 (Complete service unavailability)
- Occurrence: 3 (Rare but critical)
- Detection: 8 (Difficult to detect until customer impact)
- RPN: 216
- Root causes: Health check misconfigurations, DNS propagation delays, routing policy errors
- Mitigation strategies:
  - Create geographic routing validation procedures
Health check false positives or negatives
- Severity: 8 (Unnecessary failovers or missed failures)
- Occurrence: 4 (Common with complex health checks)
- Detection: 5 (Detected through monitoring)
- RPN: 160
- Root causes: Network latency, application-specific health check logic, timeout configurations
- Mitigation strategies:
  - Configure multiple health check types (HTTP, TCP, calculated)
  - Implement application-specific health endpoints
  - Set appropriate timeout and retry values
  - Create health check validation procedures
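To see why retry values matter, it can help to replay a sequence of check results against a consecutive-result threshold, which is the general idea behind a health check failure threshold. The sketch below is a simplified model for validating settings offline, not a reproduction of how Route 53 evaluates health checks; the function name and parameters are hypothetical.

```python
def health_status(results, failure_threshold=3, initially_healthy=True):
    """Replay check results (True = passed) and return the final status.

    The status flips only after `failure_threshold` consecutive results
    that disagree with the current status.
    """
    healthy = initially_healthy
    streak = 0
    for passed in results:
        if passed == healthy:
            streak = 0  # An agreeing result resets the count.
        else:
            streak += 1
            if streak >= failure_threshold:
                healthy = passed
                streak = 0
    return healthy
```

With a threshold of 3, an isolated timeout does not trigger a failover, whereas a threshold of 1 would flip the status on every transient blip. The tradeoff is that higher thresholds delay the detection of real failures.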
Latency-based routing misconfigurations
- Severity: 6 (Suboptimal user experience)
- Occurrence: 4 (Common during configuration changes)
- Detection: 6 (Detected through performance monitoring)
- RPN: 144
- Root causes: Incorrect latency measurements, routing policy conflicts, geographic misconfigurations
- Mitigation strategies:
  - Implement continuous latency monitoring
  - Create routing policy validation procedures
  - Establish A/B testing for routing changes
  - Configure automated rollback mechanisms