Reliability pillar
The reliability pillar of the AWS Well-Architected Framework addresses how well a system maintains its intended functionality and performance levels during expected operational periods throughout its lifespan. It provides comprehensive guidelines for building and maintaining dependable systems on AWS, including strategies for testing and validation across all stages of the workload lifecycle.
Key focus areas for applying this pillar to your AppStream 2.0 streaming environment:
-
Fleet management and scaling
-
Session reliability
-
Application availability
-
Recovery procedures
Automatically recover from failure
Monitor key performance indicators (KPIs) that reflect business value, and use them to trigger automated responses that anticipate, prevent, or recover from failures before they affect operations.
-
Make sure that your IP subnet allocation accounts for expansion and availability.
-
Monitor critical CloudWatch metrics to ensure service availability and performance, including fleet capacity metrics such as AvailableCapacity and InUseCapacity, and streaming quality metrics such as StreamingSessionLatency.
-
Configure alerts for capacity thresholds, session health metrics, performance degradation, and fleet health status changes.
-
Use built-in AppStream 2.0 automatic scaling capabilities to:
-
Configure minimum and maximum fleet capacity.
-
Set scaling policies based on capacity utilization.
-
Define scale-out and scale-in thresholds based on user experience metrics and business requirements instead of only technical metrics.
-
Build a disaster recovery environment for your AppStream 2.0 environment. For more information, see the AWS blog post Disaster recovery considerations with Amazon AppStream 2.0.
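As an illustration of the capacity alerting described above, the following Python sketch builds the arguments for a CloudWatch alarm on the AvailableCapacity metric. The helper names (low_capacity_alarm, create_alarm), the alarm naming scheme, the three-period evaluation window, and the SNS topic ARN are assumptions chosen for the example, not values from this guidance. Adjust them to your own thresholds.

```python
from typing import Any, Dict


def low_capacity_alarm(fleet_name: str, min_available: int, topic_arn: str) -> Dict[str, Any]:
    """Build keyword arguments for the CloudWatch PutMetricAlarm call.

    The alarm fires when the fleet's AvailableCapacity stays below
    min_available for three consecutive one-minute periods.
    The name pattern and evaluation window are illustrative choices.
    """
    return {
        "AlarmName": f"appstream-{fleet_name}-low-available-capacity",
        "Namespace": "AWS/AppStream",
        "MetricName": "AvailableCapacity",
        "Dimensions": [{"Name": "Fleet", "Value": fleet_name}],
        "Statistic": "Minimum",
        "Period": 60,
        "EvaluationPeriods": 3,
        "Threshold": float(min_available),
        "ComparisonOperator": "LessThanThreshold",
        # Treat missing data as a problem so a silent fleet still alerts.
        "TreatMissingData": "breaching",
        "AlarmActions": [topic_arn],
    }


def create_alarm(params: Dict[str, Any]) -> None:
    """Send the alarm definition to CloudWatch (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    boto3.client("cloudwatch").put_metric_alarm(**params)
```

Keeping the alarm definition as plain data makes it easy to review and reuse across fleets before anything is sent to CloudWatch.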
Test recovery procedures
Cloud environments enable automated testing of failure scenarios and recovery procedures. These capabilities help you identify and fix vulnerabilities before real failures occur.
-
Fleet recovery testing. Implement comprehensive fleet recovery testing across multiple scenarios:
-
Simulate instance termination to verify automatic scaling response.
-
Validate fleet minimum capacity maintenance.
-
Test instance replacement timing and user redirection.
-
Validate the effectiveness of scaling policies.
-
Test fleet capacity limits and overflow handling.
-
Session recovery testing. Implement session recovery validation procedures:
-
Test disconnect and reconnect scenarios.
-
Verify application state preservation.
-
Test various network interruption scenarios.
-
Validate session timeout behaviors.
-
Verify user authentication persistence.
-
Verify temporary storage handling.
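One way to automate the fleet recovery checks above is to measure how long the fleet takes to return to its minimum running capacity after a simulated failure. The following Python sketch is a minimal example under stated assumptions: wait_for_capacity and running_capacity are hypothetical helper names, and the default timeout and polling interval are placeholders. The capacity reader is passed in as a function so the timing logic can be exercised without AWS access.

```python
import time
from typing import Callable, Optional


def wait_for_capacity(
    get_running: Callable[[], int],
    minimum: int,
    timeout_s: float = 900.0,
    poll_s: float = 15.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll until running capacity is at least `minimum`.

    Returns the elapsed seconds on recovery, or None if the timeout passes.
    """
    start = clock()
    while True:
        if get_running() >= minimum:
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(poll_s)


def running_capacity(fleet_name: str) -> int:
    """Read the running instance count through DescribeFleets.

    Needs boto3 and AWS credentials; imported lazily so the
    polling logic above can be tested offline.
    """
    import boto3

    fleet = boto3.client("appstream").describe_fleets(Names=[fleet_name])["Fleets"][0]
    return fleet["ComputeCapacityStatus"]["Running"]
```

In a real test run, you might call wait_for_capacity(lambda: running_capacity("my-fleet"), minimum=2) after terminating an instance and record the returned recovery time against your recovery objective. The fleet name here is a placeholder.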
Scale horizontally to increase aggregate workload availability
Distribute your workload across multiple, smaller resources to minimize the impact of individual failures and to eliminate single points of failure.
-
Deploy fleet instances across multiple Availability Zones.
-
Configure appropriate minimum fleet capacity.
-
Configure automatic scaling for fleets and set appropriate scaling thresholds.
-
Monitor capacity utilization across the fleet.
-
Deploy AppStream 2.0 stacks across multiple Regions. For more information, see the AWS blog post Optimize user experience with latency-based routing for Amazon AppStream 2.0.
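When you spread a fleet across Availability Zones, it can help to check that the remaining subnets still hold enough IP addresses if one zone becomes unavailable. The following Python sketch shows the arithmetic. It assumes that each fleet instance consumes one IP address and that nothing else uses the subnets, which is a simplification. The function names and CIDR ranges are hypothetical.

```python
import ipaddress
from typing import Dict

# AWS reserves five addresses in every VPC subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable_ips(cidr: str) -> int:
    """Number of addresses in the subnet that instances can use."""
    total = ipaddress.ip_network(cidr).num_addresses
    return max(total - AWS_RESERVED_PER_SUBNET, 0)


def survives_az_loss(subnets_by_az: Dict[str, str], max_instances: int) -> bool:
    """True if the subnets that remain after losing any single
    Availability Zone can still host max_instances.

    Assumes one IP address per instance and otherwise empty subnets.
    """
    capacity = {az: usable_ips(cidr) for az, cidr in subnets_by_az.items()}
    if len(capacity) < 2:
        return False  # a single zone is a single point of failure
    total = sum(capacity.values())
    return all(total - lost >= max_instances for lost in capacity.values())
```

For example, two /24 subnets provide 251 usable addresses each, so a fleet maximum of 250 instances passes this check while a maximum of 300 does not.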
Stop guessing capacity
Use the automatic scaling capabilities of the cloud to dynamically adjust resources based on demand. This helps prevent resource saturation while maintaining optimal efficiency.
-
Monitor key metrics such as CapacityUtilization, AvailableCapacity, and InUseCapacity to understand capacity needs.
-
Track fleet utilization trends across different time periods. Monitor daily patterns, weekly variations, monthly trends, and seasonal peaks.
-
Set up scaling policies and configure scaling thresholds.
-
Make sure that a sufficient gap exists between the current quotas and the maximum usage to accommodate failover.
-
Accommodate fixed service quotas and constraints through your architecture.
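The scaling policy step above can be sketched with Application Auto Scaling, which manages AppStream 2.0 fleet capacity. The following Python example builds the request arguments for a scalable target and a target tracking policy on average capacity utilization. The helper names, the policy name, and the 75 percent target are assumptions for illustration. Choose limits and targets that match your own usage patterns and quotas.

```python
from typing import Any, Dict

SERVICE_NAMESPACE = "appstream"
SCALABLE_DIMENSION = "appstream:fleet:DesiredCapacity"


def scalable_target(fleet_name: str, min_capacity: int, max_capacity: int) -> Dict[str, Any]:
    """Arguments for RegisterScalableTarget: the fleet's capacity bounds."""
    if min_capacity > max_capacity:
        raise ValueError("min_capacity must not exceed max_capacity")
    return {
        "ServiceNamespace": SERVICE_NAMESPACE,
        "ResourceId": f"fleet/{fleet_name}",
        "ScalableDimension": SCALABLE_DIMENSION,
        "MinCapacity": min_capacity,
        "MaxCapacity": max_capacity,
    }


def target_tracking_policy(fleet_name: str, target_utilization: float = 75.0) -> Dict[str, Any]:
    """Arguments for PutScalingPolicy: hold utilization near the target.

    The 75 percent default is an illustrative value, not a recommendation.
    """
    return {
        "PolicyName": f"{fleet_name}-capacity-utilization",
        "ServiceNamespace": SERVICE_NAMESPACE,
        "ResourceId": f"fleet/{fleet_name}",
        "ScalableDimension": SCALABLE_DIMENSION,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_utilization,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "AppStreamAverageCapacityUtilization"
            },
        },
    }


def apply_scaling(fleet_name: str, min_capacity: int, max_capacity: int) -> None:
    """Register the target and attach the policy (needs boto3 and credentials)."""
    import boto3  # imported lazily so the builders work without boto3

    client = boto3.client("application-autoscaling")
    client.register_scalable_target(**scalable_target(fleet_name, min_capacity, max_capacity))
    client.put_scaling_policy(**target_tracking_policy(fleet_name))
```

Target tracking is only one option. Step scaling policies driven by CloudWatch alarms are an alternative when you need different scale-out and scale-in behavior.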
Manage change through automation
Implement infrastructure changes through automation, including version-controlled changes to the automation code itself.
-
Use infrastructure as code (IaC) for fleet configuration.
-
Implement consistent scaling policies.
-
Use the Image Assistant CLI for consistent image creation.
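As one possible approach to IaC for fleet configuration, the following Python sketch generates a minimal AWS CloudFormation template for a fleet and renders it as stable JSON that can be committed to version control. The function names, the logical ID StreamingFleet, and the ON_DEMAND fleet type are assumptions for the example. A production template would typically add security groups, session timeouts, and an associated stack.

```python
import json
from typing import Any, Dict, Iterable


def fleet_template(
    name: str,
    image_name: str,
    instance_type: str,
    desired_instances: int,
    subnet_ids: Iterable[str],
) -> Dict[str, Any]:
    """Build a minimal CloudFormation template for one AppStream 2.0 fleet."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "StreamingFleet": {  # hypothetical logical ID
                "Type": "AWS::AppStream::Fleet",
                "Properties": {
                    "Name": name,
                    "ImageName": image_name,
                    "InstanceType": instance_type,
                    "FleetType": "ON_DEMAND",
                    "ComputeCapacity": {"DesiredInstances": desired_instances},
                    "VpcConfig": {"SubnetIds": list(subnet_ids)},
                },
            }
        },
    }


def render(template: Dict[str, Any]) -> str:
    """Serialize with sorted keys so version-control diffs stay small."""
    return json.dumps(template, indent=2, sort_keys=True)
```

Generating the template from code keeps fleet settings consistent across environments, because each environment differs only in the arguments passed to fleet_template.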