Reference Architecture

Figure 3: High-Level Batch Data Processing Architecture

  1. Batch data processing systems typically require a persistent data store for source data, which is central to the reliability of your system. Persisting your source datasets in durable storage enables you to retry processing jobs after a failure and to unlock new value streams in the future (see the S3 sketch after this list). On AWS, you have an extensive set of options for source data storage. Amazon S3, Amazon RDS, Amazon DynamoDB, Amazon EFS, Amazon Elasticsearch Service, Amazon Redshift, and Amazon Neptune are managed database and storage services you can use as source data stores. You also have the option of using Amazon EC2 and Amazon EBS to run your own database or storage solutions. See the Data Lake and Building an Efficient Storage Layer for Analytics scenarios for deeper dives into these options.

  2. Batch data processing systems should be automated and scheduled to improve reliability, performance efficiency, and cost optimization. You can use Amazon CloudWatch Events to trigger downstream jobs on a schedule (for example, once a day) or in response to events (for example, when new files are uploaded); see the scheduling sketch after this list.

  3. It’s common for batch data processing jobs to have multiple steps, some of which run in sequence and some in parallel. An orchestration service such as AWS Step Functions lets you implement automated workflows for both simple and complex processing jobs. With AWS Step Functions, you can build distributed data processing applications using visual workflows. Within a Step Functions workflow, you can use Lambda functions and native service integrations to trigger Amazon EMR steps, AWS Glue ETL jobs, AWS Batch jobs, Amazon SageMaker jobs, and custom jobs on Amazon EC2 or on premises (see the Step Functions sketch after this list).

  4. AWS Batch, AWS Glue, and Amazon EMR are managed services and frameworks for batch job execution, each suited to a different class of workload. For simple jobs that can run in Docker containers, such as video media processing, machine learning training, and file compression, AWS Batch provides a convenient way to submit jobs as Docker containers to a container compute infrastructure on Amazon EC2. For Apache Spark jobs written in PySpark or Scala, you can use AWS Glue, which runs them in a fully managed Spark environment. For other massively parallel processing jobs, Amazon EMR provides frameworks such as Spark, MapReduce, Hive, Presto, Flink, and Tez that run on Amazon EC2 instances in your VPC (see the job-submission sketch after this list).

  5. Similar to the source data stores, batch jobs require reliable storage for their outputs or results. You can use the AWS SDK to interact with Amazon S3 and DynamoDB, and common file protocols or JDBC connections to store results in file systems or databases (see the results-storage sketch after this list).

  6. Result datasets are commonly persisted and later accessed by visualization tools such as Amazon QuickSight, by APIs, and by search-based queries. Based on the access pattern, choose the data stores that best fit your use case. See the Data Lake and Building an Efficient Storage Layer for Analytics scenarios for deeper dives into these storage options.
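
The following is a minimal sketch of step 1: persisting a raw source dataset to Amazon S3 with the AWS SDK for Python (Boto3) so that downstream jobs can be retried against durable data. The bucket, key, and file names are hypothetical placeholders.

```python
# Persist a raw source dataset to durable storage (Amazon S3) before any
# processing runs. Bucket, key, and file names are hypothetical.
import boto3

s3 = boto3.client("s3")

with open("orders-2021-06-01.csv", "rb") as source_file:   # hypothetical local extract
    s3.put_object(
        Bucket="example-raw-source-data",                   # hypothetical bucket
        Key="sales/orders/dt=2021-06-01/orders.csv",
        Body=source_file,
    )
```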
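
For step 2, this scheduling sketch creates an Amazon CloudWatch Events rule that starts a downstream workflow once a day. The rule name, schedule, and ARNs are assumptions for illustration; the IAM role referenced in the target must allow starting the state machine.

```python
# Create a daily schedule with Amazon CloudWatch Events (EventBridge) that
# starts a Step Functions workflow (see step 3). All ARNs are placeholders.
import boto3

events = boto3.client("events")

events.put_rule(
    Name="nightly-batch-trigger",
    ScheduleExpression="cron(0 2 * * ? *)",   # 02:00 UTC every day
    State="ENABLED",
)

events.put_targets(
    Rule="nightly-batch-trigger",
    Targets=[{
        "Id": "batch-workflow",
        "Arn": "arn:aws:states:us-east-1:123456789012:stateMachine:nightly-batch",
        "RoleArn": "arn:aws:iam::123456789012:role/events-invoke-stepfunctions",
    }],
)
```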
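
For step 3, this Step Functions sketch defines a two-state workflow that runs an AWS Glue ETL job and then an AWS Batch job through native service integrations. The state machine name, job names, job queue, and role ARN are hypothetical.

```python
# Define and create a Step Functions state machine that chains a Glue ETL job
# and an AWS Batch job using native service integrations.
import json
import boto3

definition = {
    "StartAt": "RunGlueEtl",
    "States": {
        "RunGlueEtl": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "clean-orders-etl"},
            "Next": "RunBatchAggregation",
        },
        "RunBatchAggregation": {
            "Type": "Task",
            "Resource": "arn:aws:states:::batch:submitJob.sync",
            "Parameters": {
                "JobName": "aggregate-orders",
                "JobDefinition": "aggregate-orders-jobdef",
                "JobQueue": "batch-processing-queue",
            },
            "End": True,
        },
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="nightly-batch",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/stepfunctions-batch-workflow",
)
```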
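
For step 4, this job-submission sketch submits work directly through the AWS SDK: a containerized job to AWS Batch and a PySpark job run to AWS Glue. The job, queue, job definition, and argument names are assumptions.

```python
# Submit a containerized job to AWS Batch and start a Glue Spark job run.
# Job, queue, and argument names are hypothetical.
import boto3

batch = boto3.client("batch")
batch.submit_job(
    jobName="compress-video-assets",
    jobQueue="batch-processing-queue",
    jobDefinition="video-compression-jobdef",   # points at a Docker image
)

glue = boto3.client("glue")
glue.start_job_run(
    JobName="clean-orders-etl",
    Arguments={"--input_path": "s3://example-raw-source-data/sales/orders/"},
)
```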
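
For step 5, this results-storage sketch persists job outputs with the AWS SDK: a result file to Amazon S3 and a summary record to Amazon DynamoDB. The bucket, table, and attribute names are hypothetical.

```python
# Store job results: a Parquet file in Amazon S3 and a summary item in
# Amazon DynamoDB. Bucket, table, and attribute names are placeholders.
import boto3

s3 = boto3.client("s3")
with open("aggregates.parquet", "rb") as result_file:
    s3.put_object(
        Bucket="example-batch-results",
        Key="sales/daily-aggregates/dt=2021-06-01/aggregates.parquet",
        Body=result_file,
    )

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("daily-job-summary")
table.put_item(
    Item={
        "job_date": "2021-06-01",
        "records_processed": 182304,
        "status": "SUCCEEDED",
    }
)
```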