Domain 1: Data Ingestion and Transformation (36% of the exam content)
Task 1.1: Perform data ingestion
Knowledge of:
Throughput and latency characteristics for services that ingest data
Data ingestion patterns (for example, frequency and data history)
Streaming data ingestion
Batch data ingestion (for example, scheduled ingestion, event-driven ingestion)
Replayability of data ingestion pipelines
Stateful and stateless data transactions
Skills in:
Reading data from streaming sources (for example, Amazon Kinesis, Amazon Managed Streaming for Apache Kafka [Amazon MSK], Amazon DynamoDB Streams, AWS Database Migration Service [AWS DMS], AWS Glue, Amazon Redshift)
Reading data from batch sources (for example, Amazon S3, AWS Glue, Amazon EMR, AWS DMS, Amazon Redshift, AWS Lambda, Amazon AppFlow)
Implementing appropriate configuration options for batch ingestion
Consuming data APIs
Setting up schedulers by using Amazon EventBridge, Apache Airflow, or time-based schedules for jobs and crawlers
Setting up event triggers (for example, Amazon S3 Event Notifications, EventBridge)
Calling a Lambda function from Amazon Kinesis (see the sketch after this list)
Creating allowlists for IP addresses to allow connections to data sources
Implementing throttling and overcoming rate limits (for example, DynamoDB, Amazon RDS, Kinesis)
Managing fan-in and fan-out for streaming data distribution
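A minimal sketch of the Kinesis-to-Lambda pattern referenced above: an event source mapping batches stream records and invokes the function with them. This assumes JSON-encoded producer records; the processing step is a placeholder.

```python
import base64
import json

def handler(event, context):
    """Invoked by an event source mapping on a Kinesis data stream."""
    for record in event["Records"]:
        # Kinesis record payloads arrive base64-encoded.
        payload = base64.b64decode(record["kinesis"]["data"])
        message = json.loads(payload)  # assumes the producer writes JSON
        # Placeholder processing step.
        print(f"partition_key={record['kinesis']['partitionKey']} data={message}")
```

Note that batch size, starting position, and retry behavior live on the event source mapping, not in the handler code.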
Task 1.2: Transform and process data
Knowledge of:
Creation of ETL pipelines based on business requirements
Volume, velocity, and variety of data (for example, structured data, unstructured data)
Cloud computing and distributed computing
How to use Apache Spark to process data
Intermediate data staging locations
Skills in:
Optimizing container usage for performance needs (for example, Amazon Elastic Kubernetes Service [Amazon EKS], Amazon Elastic Container Service [Amazon ECS])
Connecting to different data sources (for example, Java Database Connectivity [JDBC], Open Database Connectivity [ODBC])
Integrating data from multiple sources
Optimizing costs while processing data
Implementing data transformation services based on requirements (for example, Amazon EMR, AWS Glue, Lambda, Amazon Redshift)
Transforming data between formats (for example, from .csv to Apache Parquet); see the sketch after this list
Troubleshooting and debugging common transformation failures and performance issues
Creating data APIs to make data available to other systems by using AWS services
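To make the CSV-to-Parquet conversion above concrete, here is a PySpark sketch; the bucket paths and the order_date partition column are illustrative placeholders, not part of the exam guide.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read CSV with a header row, letting Spark infer column types.
df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3://example-bucket/raw/orders/")  # placeholder input path
)

# Write the same data as columnar Parquet, partitioned for downstream queries.
(
    df.write
    .mode("overwrite")
    .partitionBy("order_date")  # assumes the source data has this column
    .parquet("s3://example-bucket/curated/orders/")  # placeholder output path
)
```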
Task 1.3: Orchestrate data pipelines
Knowledge of:
How to integrate various services to create ETL pipelines
Event-driven architecture
How to configure services for data pipelines based on schedules or dependencies
Serverless workflows
Skills in:
Using orchestration services to build workflows for data ETL pipelines (for example, Lambda, EventBridge, Amazon Managed Workflows for Apache Airflow [Amazon MWAA], AWS Step Functions, AWS Glue workflows); see the sketch after this list
Building data pipelines for performance, availability, scalability, resiliency, and fault tolerance
Implementing and maintaining serverless workflows
Using notification services to send alerts (for example, Amazon Simple Notification Service [Amazon SNS], Amazon Simple Queue Service [Amazon SQS])
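As a sketch of the orchestration skills above, the following Airflow DAG (in the style used on Amazon MWAA) runs an existing AWS Glue job on a daily schedule. The DAG id, Glue job name, and Region are placeholders, and it assumes the Amazon provider package that ships with Amazon MWAA.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

with DAG(
    dag_id="daily_orders_etl",       # placeholder DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",      # time-based schedule
    catchup=False,
) as dag:
    # Run an existing AWS Glue job as the pipeline's transform step.
    transform = GlueJobOperator(
        task_id="transform_orders",
        job_name="transform-orders", # placeholder Glue job name
        region_name="us-east-1",     # placeholder Region
    )
```

Failure alerting for such a workflow would typically go through Amazon SNS or Amazon SQS, as the notification item above describes.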
Task 1.4: Apply programming concepts
Knowledge of:
Continuous integration and continuous delivery (CI/CD) (implementation, testing, and deployment of data pipelines)
SQL queries (for data source queries and data transformations)
Infrastructure as code (IaC) for repeatable deployments (for example, AWS Cloud Development Kit [AWS CDK], AWS CloudFormation)
Distributed computing
Data structures and algorithms (for example, graph data structures and tree data structures)
SQL query optimization
Skills in:
Optimizing code to reduce runtime for data ingestion and transformation
Configuring Lambda functions to meet concurrency and performance needs (see the sketch after this list)
Performing SQL queries to transform data (for example, Amazon Redshift stored procedures)
Structuring SQL queries to meet data pipeline requirements
Using Git commands to perform actions such as creating, updating, cloning, and branching repositories
Using the AWS Serverless Application Model (AWS SAM) to package and deploy serverless data pipelines (for example, Lambda functions, Step Functions, DynamoDB tables)
Using and mounting storage volumes from within Lambda functions
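The Lambda tuning item above can be made concrete with boto3: the following sketch caps reserved concurrency and raises memory and timeout. The function name and the specific values are illustrative only.

```python
import boto3

lambda_client = boto3.client("lambda")

# Cap the function at 50 concurrent executions so a bursty ingest
# stream cannot exhaust the account-level concurrency pool.
lambda_client.put_function_concurrency(
    FunctionName="ingest-transform",   # placeholder function name
    ReservedConcurrentExecutions=50,
)

# Raise memory (which also scales CPU proportionally) and the timeout
# so larger batches can finish within a single invocation.
lambda_client.update_function_configuration(
    FunctionName="ingest-transform",
    MemorySize=1024,  # MB
    Timeout=120,      # seconds
)
```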