Domain 4: ML Solution Monitoring, Maintenance, and Security (24% of the exam content)
Topics
Task 4.1: Monitor model inference
Knowledge of:
Drift in ML models
Techniques to monitor data quality and model performance
Design principles for ML lenses relevant to monitoring
Skills in:
Monitoring models in production (for example, by using SageMaker Model Monitor)
Monitoring workflows to detect anomalies or errors in data processing or model inference
Detecting changes in the distribution of data that can affect model performance (for example, by using SageMaker Clarify)
Monitoring model performance in production by using A/B testing
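As an illustration of the drift-detection skill listed above, the following is a minimal sketch of one common technique, the population stability index (PSI), for comparing a baseline feature distribution with production data. It is not how SageMaker Model Monitor or SageMaker Clarify compute drift internally. The function name, the equal-width binning, and the smoothing constant are assumptions made for this example.

```python
import math

def population_stability_index(baseline, current, bins=10):
    """Compare two samples of one numeric feature (illustrative helper)."""
    lo, hi = min(baseline), max(baseline)
    # Equal-width bins over the baseline range; fall back to 1.0 if all values are equal.
    width = (hi - lo) / bins or 1.0

    def bucket_shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the baseline range fall into the first or last bin.
            idx = min(max(idx, 0), bins - 1)
            counts[idx] += 1
        eps = 1e-6  # assumed smoothing constant so empty bins do not produce log(0)
        return [max(c / len(values), eps) for c in counts]

    expected = bucket_shares(baseline)
    actual = bucket_shares(current)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))

baseline = list(range(100))                 # training-time feature values (made up)
shifted = [v + 50 for v in baseline]        # production values after a shift (made up)
psi_same = population_stability_index(baseline, baseline)
psi_shifted = population_stability_index(baseline, shifted)
print(psi_same, psi_shifted)
```

A frequently cited rule of thumb treats a PSI above roughly 0.25 as a significant shift, but the right threshold depends on the feature and the model, so treat that cutoff as a starting point rather than a fixed standard.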
Task 4.2: Monitor and optimize infrastructure and costs
Knowledge of:
Key performance metrics for ML infrastructure (for example, utilization, throughput, availability, scalability, fault tolerance)
Monitoring and observability tools to troubleshoot latency and performance issues (for example, X-Ray, Amazon CloudWatch Lambda Insights, Amazon CloudWatch Logs Insights)
How to use CloudTrail to log, monitor, and invoke re-training activities
Differences between instance types and how they affect performance (for example, memory optimized, compute optimized, general purpose, inference optimized)
Capabilities of cost analysis tools (for example, Cost Explorer, Billing and Cost Management, Trusted Advisor)
Cost tracking and allocation techniques (for example, resource tagging)
Skills in:
Configuring and using tools to troubleshoot and analyze resources (for example, CloudWatch Logs, CloudWatch alarms)
Creating CloudTrail trails
Setting up dashboards to monitor performance metrics (for example, by using Amazon QuickSight, CloudWatch dashboards)
Monitoring infrastructure (for example, by using EventBridge events)
Rightsizing instance families and sizes (for example, by using SageMaker Inference Recommender and AWS Compute Optimizer)
Monitoring and resolving latency and scaling issues
Preparing infrastructure for cost monitoring (for example, by applying a tagging strategy)
Troubleshooting capacity concerns that involve cost and performance (for example, provisioned concurrency, service quotas, auto scaling)
Optimizing costs and setting cost quotas by using appropriate cost management tools (for example, Cost Explorer, Trusted Advisor, Budgets)
Optimizing infrastructure costs by selecting purchasing options (for example, Spot Instances, On-Demand Instances, Reserved Instances, SageMaker Savings Plans)
Task 4.3: Secure resources
Knowledge of:
IAM roles, policies, and groups that control access to AWS services (for example, AWS Identity and Access Management [IAM], bucket policies, SageMaker Role Manager)
SageMaker security and compliance features
Controls for network access to ML resources
Security best practices for CI/CD pipelines
Skills in:
Configuring least privilege access to ML artifacts
Configuring IAM policies and roles for users and applications that interact with ML systems
Monitoring, auditing, and logging ML systems to ensure continued security and compliance
Troubleshooting and debugging security issues
Building VPCs, subnets, and security groups to securely isolate ML systems
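As an illustration of the least-privilege skill listed above, the following sketch builds an IAM policy document that grants read-only access to model artifacts under a single S3 prefix. The bucket name, prefix, statement IDs, and helper function are hypothetical and chosen only for this example. A real policy would be scoped to the actual resources and principals involved.

```python
import json

def read_only_artifact_policy(bucket, prefix):
    """Return an IAM policy document limited to reading one S3 prefix (illustrative helper)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Allow downloading objects only under the given prefix.
                "Sid": "ReadModelArtifacts",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
            {
                # Allow listing the bucket, restricted to keys under the same prefix.
                "Sid": "ListArtifactPrefix",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
        ],
    }

# Hypothetical bucket and prefix names.
policy = read_only_artifact_policy("example-ml-artifacts", "models")
print(json.dumps(policy, indent=2))
```

The policy names specific actions and resources instead of wildcards such as `s3:*` or `*`, which is the core idea behind least privilege: a role that only needs to load a model cannot overwrite or delete it.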