AWS Well-Architected design framework
We designed this solution with best practices from the AWS Well-Architected Framework.
This section describes how we applied the design principles and best practices of the Well-Architected Framework when building this solution.
Operational Excellence
-
Use of Solutions Constructs - The Maintaining Personalized Experiences with Machine Learning solution is a fully serverless automation workflow, which avoids a single point of failure. Where applicable, it also uses AWS Solutions Constructs, a library of pre-built, multi-service, well-architected patterns for defining solutions in code to create predictable and repeatable infrastructure.
-
Monitoring systems - The solution publishes CloudWatch metrics for each Dataset Group and Solution Version combination to track offline metrics over time. Every metric that Amazon Personalize reports is published by the solution as a CloudWatch metric and can optionally be baselined against the Popularity-Count offline metrics.
These metrics can be visualized in a CloudWatch dashboard, which can be enabled through a configuration change.
-
Continual improvement - Customers can effectively test out changes, troubleshoot problems and test new hypotheses when creating new solution versions.
-
Automating changes - The solution can be updated to activate new functionality by updating the CloudFormation template in place.
AWS Solutions Constructs patterns and the AWS CDK are used to automate the creation of the CloudFormation templates required by the solution.
-
Responding to events - The solution sends updates to the operator by email through a subscription to an Amazon SNS topic. This allows an operator to take action as issues are discovered during solution version and campaign creation (for example, insufficient data or misconfiguration), and to be notified when specific conditions are met (for example, when a solution version and campaign are ready).
-
Standards for daily ops - By standardizing how Amazon Personalize resources are created, and how bulk data is ingested into the service, operational procedures can be built around the Maintaining Personalized Experiences with Machine Learning solution. In the future, this will enable operators to integrate with other business systems (e.g. Segment as a CDP, Optimizely for A/B testing).
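As an illustration of the monitoring practice above, the following is a minimal sketch of how offline metrics could be shaped into CloudWatch metric entries with one dimension per Dataset Group and Solution Version. It is not the solution's actual code: the namespace `PersonalizeSolution`, the dimension names, and the function names `build_metric_data` and `publish` are assumptions made for this example. Only the `put_metric_data` call and its `Namespace` and `MetricData` parameters come from the boto3 CloudWatch client.

```python
from typing import Dict, List

# Hypothetical namespace, chosen for illustration only.
NAMESPACE = "PersonalizeSolution"


def build_metric_data(dataset_group: str, solution_version_arn: str,
                      offline_metrics: Dict[str, float]) -> List[dict]:
    """Turn a dict of offline metrics into CloudWatch MetricData entries."""
    # Dimension names are assumptions; they pair each metric with the
    # Dataset Group and Solution Version it was computed for.
    dimensions = [
        {"Name": "DatasetGroup", "Value": dataset_group},
        {"Name": "SolutionVersion", "Value": solution_version_arn},
    ]
    return [
        {
            "MetricName": name,
            "Dimensions": dimensions,
            "Value": float(value),
            "Unit": "None",
        }
        for name, value in sorted(offline_metrics.items())
    ]


def publish(metric_data: List[dict]) -> None:
    """Send the entries to CloudWatch (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without AWS access

    boto3.client("cloudwatch").put_metric_data(
        Namespace=NAMESPACE, MetricData=metric_data
    )
```

Keeping the payload builder separate from the network call makes the metric shape easy to check locally before anything is sent to CloudWatch.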
Security
-
Confidentiality of data - Data is encrypted at rest using SSE-S3, both in the customer bucket and on the Amazon Personalize side. Users can optionally configure the solution to use one of their own AWS KMS keys instead.
-
Integrity of data - This solution leverages the AWS global infrastructure, which is built around AWS Regions and Availability Zones.
AWS Regions provide multiple physically separated and isolated Availability Zones, which are connected by low-latency, high-throughput, and highly redundant networking.
In addition, Amazon S3, which the solution uses, is designed for 99.999999999% (11 nines) of durability.
-
Access management - Access is restricted to users through the use of IAM. Amazon Personalize uses an IAM role and policy to load data from S3. The use of AWS managed policies is avoided where possible to follow the principle of least privilege.
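To make the least-privilege point concrete, here is a minimal sketch of a policy document that grants read-only access to a single bucket, which is the kind of scoped policy a data-loading role might use instead of a broad AWS managed policy. This is an assumption-based example, not the policy the solution deploys: the function name `personalize_s3_read_policy` and the bucket name are hypothetical, and the exact actions the solution grants may differ.

```python
import json


def personalize_s3_read_policy(bucket_name: str) -> dict:
    """Build an IAM policy document allowing read-only access to one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Only the actions needed to list and read objects.
                "Action": ["s3:GetObject", "s3:ListBucket"],
                # Scoped to one bucket and its objects, not to all of S3.
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
            }
        ],
    }


if __name__ == "__main__":
    # "my-personalize-data" is a placeholder bucket name.
    print(json.dumps(personalize_s3_read_policy("my-personalize-data"), indent=2))
```

Scoping both the actions and the resources keeps the role from reading other buckets or writing data, which a general-purpose managed policy would typically allow.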
Reliability
-
Common failures - Protection against task failures is provided by detection of common failures and sending alerts to an SNS topic, which provides a user notification via email.
-
AWS state machines - The Maintaining Personalized Experiences with Machine Learning solution workflow uses AWS Step Functions. Step Functions states can fail for a variety of reasons, including state machine definition errors, task failures, and transient issues. To protect against transient issues, individual tasks in the state machine are idempotent where possible, so that they can be retried safely.
-
Exponential backoff to prevent throttling - Exponential backoff is used on Amazon Personalize service calls to reduce the likelihood of throttling caused by service-side rate limiting. The service-side limits include:
-
Rate limiting for more than 5 pending or in-progress dataset import jobs
-
Rate limiting for more than 5 pending or in-progress batch inference jobs
-
Rate limiting for more than 5 pending or in-progress solution versions
-
Furthermore, the AWS SDK for Python (Boto3), which is used to develop the solution, implements client-side API throttling.
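The exponential backoff described above can be sketched as follows. This is a generic illustration of the technique with full jitter, not the solution's own retry code: `ThrottledError`, `backoff_delays`, `call_with_backoff`, and the default base, cap, and attempt values are all hypothetical names and numbers chosen for this example.

```python
import random
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


class ThrottledError(Exception):
    """Hypothetical stand-in for a service-side rate-limit error."""


def backoff_delays(base: float = 1.0, cap: float = 60.0,
                   attempts: int = 5) -> Iterator[float]:
    """Yield the delay ceiling for each retry: base * 2**n, capped at cap."""
    for n in range(attempts):
        yield min(cap, base * (2 ** n))


def call_with_backoff(fn: Callable[[], T], base: float = 1.0, cap: float = 60.0,
                      attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying on ThrottledError with full-jitter backoff.

    fn is tried up to attempts + 1 times; the last try lets errors propagate.
    """
    for ceiling in backoff_delays(base, cap, attempts):
        try:
            return fn()
        except ThrottledError:
            # Full jitter: sleep a random time between 0 and the ceiling so
            # that concurrent callers do not retry in lockstep.
            sleep(random.uniform(0, ceiling))
    return fn()
```

The `sleep` parameter is injectable so the retry logic can be exercised without real waiting. Because retried calls may run more than once, this approach relies on the wrapped call being idempotent.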
Performance Efficiency
-
Serverless operations - The Maintaining Personalized Experiences with Machine Learning solution uses a serverless architecture to eliminate the dependency on long-running EC2 instances. This significantly reduces the cost of running workflows within AWS.
-
Default usage - Where possible, expensive operations are disabled by default (for example, API Gateway caching is not enabled by default).
Cost Optimization
The use of a serverless architecture in the Maintaining Personalized Experiences with Machine Learning solution eliminates the dependency on long-running EC2 instances. This significantly reduces the cost of running workflows within AWS. Combining the solution with the Amazon Personalize Monitor allows campaign transactions per second (TPS) to be monitored efficiently and adjusted as required.
Sustainability
Sustainability in the cloud is a continuous effort focused primarily on energy reduction and efficiency across all components of a workload by achieving the maximum benefit from the resources provisioned and minimizing the total resources required.
The Maintaining Personalized Experiences with Machine Learning solution uses serverless services (such as AWS Lambda and Amazon DynamoDB), which are aimed at reducing the carbon footprint compared with continually operating on-premises servers or long-running EC2 instances. This eliminates idle resources, processing, and storage to reduce the total energy required to power customer workloads.