Configuration Notes - Analytics Lens

  1. Use batch processing jobs to prepare large, bulk datasets for downstream analytics. For large, complex datasets, you may need to present the data to end users and analysts in a way that simplifies their queries, because these users may find it difficult to derive even simple aggregations from the raw data. For example, you might preprocess a daily sales summary view covering the previous day’s sales. This gives users a table with fewer rows and columns, making it easier and faster for business users to query the data.
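As a minimal sketch of the idea, the following stdlib-only Python example rolls raw sales transactions up into a one-row-per-product daily summary (the record fields and amounts are illustrative, not from any specific schema):

```python
from collections import defaultdict

# Raw transaction records, as they might land in a data lake (illustrative schema;
# amounts are kept in integer cents to avoid floating-point rounding).
raw_sales = [
    {"order_id": 1, "product": "widget", "qty": 2, "amount_cents": 1998},
    {"order_id": 2, "product": "gadget", "qty": 1, "amount_cents": 499},
    {"order_id": 3, "product": "widget", "qty": 1, "amount_cents": 999},
]

def summarize_daily_sales(records):
    """Aggregate raw transactions into one summary row per product."""
    totals = defaultdict(lambda: {"units_sold": 0, "revenue_cents": 0})
    for rec in records:
        row = totals[rec["product"]]
        row["units_sold"] += rec["qty"]
        row["revenue_cents"] += rec["amount_cents"]
    # Fewer rows and columns than the raw table: easier for analysts to query.
    return dict(totals)

summary = summarize_daily_sales(raw_sales)
print(summary["widget"])  # {'units_sold': 3, 'revenue_cents': 2997}
```

In a real pipeline this aggregation would run as a scheduled batch job (for example, in AWS Glue or Amazon EMR) and write the summary table back to storage for analysts to query.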

  2. Avoid lifting and shifting batch processing to AWS. By lifting and shifting traditional batch processing systems into AWS, you risk running over-provisioned resources on Amazon EC2. For example, traditional Hadoop clusters are often over-provisioned and idle in an on-premises setting. Use AWS managed services, such as AWS Glue, Amazon EMR, and AWS Batch, to simplify your architecture and remove the undifferentiated heavy lifting of managing clustered and distributed environments.

    By effectively leveraging these services with modern batch processing architectures that separate storage and compute, you create opportunities to reduce costs by eliminating idle compute resources and underutilized disk storage. You can also improve performance by using EC2 instance types that are optimized for your specific batch processing tasks, rather than multi-purpose persistent clusters.
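As one hedged illustration of sizing compute to the workload, the sketch below builds the request parameters for a managed AWS Batch compute environment that scales to zero vCPUs when idle and uses instance types chosen for the job (the environment name, subnet, and role ARNs are placeholder assumptions). The resulting dict is what you would pass to boto3's `batch.create_compute_environment`:

```python
def build_batch_compute_environment(name, instance_types, max_vcpus):
    """Build create_compute_environment parameters for a managed AWS Batch
    environment that scales to zero when idle (ARNs/subnets are placeholders)."""
    return {
        "computeEnvironmentName": name,
        "type": "MANAGED",
        "computeResources": {
            "type": "EC2",
            "minvCpus": 0,            # scale to zero: no idle instances between jobs
            "maxvCpus": max_vcpus,
            "instanceTypes": instance_types,  # types optimized for this workload
            "subnets": ["subnet-PLACEHOLDER"],
            "instanceRole": "arn:aws:iam::123456789012:instance-profile/PLACEHOLDER",
        },
        "serviceRole": "arn:aws:iam::123456789012:role/PLACEHOLDER",
    }

params = build_batch_compute_environment("nightly-etl", ["c5.2xlarge"], 256)
# boto3.client("batch").create_compute_environment(**params)  # real call, not run here
```

Setting `minvCpus` to 0 is what removes the idle-capacity cost that a lifted-and-shifted persistent cluster would carry.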

  3. Automate and orchestrate everywhere. In a traditional batch data processing environment, it’s a best practice to automate and schedule jobs within the system. In AWS, pair that automation and orchestration with the AWS APIs to spin up and tear down entire compute environments as well, so that you are charged only while the compute services are in use. For example, when a job is scheduled, a workflow service such as AWS Step Functions can use the AWS SDK to provision a new EMR cluster, submit the work, and terminate the cluster after the job is complete.
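As a sketch of this spin-up/tear-down pattern (the release label, instance types, step script path, and roles are placeholder assumptions), the parameters below describe a transient EMR cluster that terminates itself once its steps finish. They are what a Step Functions task or orchestration script would pass to boto3's `emr.run_job_flow`:

```python
def build_transient_emr_cluster(name, step_args):
    """run_job_flow parameters for a cluster that exists only while the job runs."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # placeholder release label
        "Instances": {
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            # False => the cluster terminates itself after the last step completes,
            # so you are charged only while the job is actually running.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "batch-job",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {"Jar": "command-runner.jar", "Args": step_args},
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

params = build_transient_emr_cluster("nightly-batch",
                                     ["spark-submit", "s3://my-bucket/job.py"])
# boto3.client("emr").run_job_flow(**params)  # provision, run, auto-terminate
```

Because the data lives in Amazon S3 rather than on the cluster, nothing is lost when the cluster goes away.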

  4. Use Spot Instances to save on flexible batch processing jobs. Leverage Spot Instances when you have flexible job schedules, can retry jobs, and can decouple the data from the compute. Use Spot Fleet, EC2 Fleet, and Spot Instance features in EMR and AWS Batch to manage Spot Instances.
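As one hedged sketch of the EMR Spot features mentioned above, the fragment below builds an EMR instance-fleet definition that targets Spot capacity across several instance types and falls back to On-Demand if Spot cannot be fulfilled (the instance types and timeout are illustrative choices):

```python
def build_spot_core_fleet(instance_types, target_spot_capacity):
    """EMR InstanceFleet config favoring Spot, for flexible, retryable batch jobs."""
    return {
        "InstanceFleetType": "CORE",
        "TargetSpotCapacity": target_spot_capacity,
        "TargetOnDemandCapacity": 0,
        # Diversify across instance types to improve the odds of obtaining Spot capacity.
        "InstanceTypeConfigs": [
            {"InstanceType": t, "WeightedCapacity": 1} for t in instance_types
        ],
        "LaunchSpecifications": {
            "SpotSpecification": {
                "TimeoutDurationMinutes": 20,
                # Fall back to On-Demand rather than fail if Spot can't be filled in time.
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
            }
        },
    }

fleet = build_spot_core_fleet(["m5.xlarge", "m5a.xlarge", "r5.xlarge"], 10)
# Passed inside run_job_flow's Instances={"InstanceFleets": [fleet, ...]}
```

Decoupling the data (in Amazon S3) from the compute is what makes Spot interruptions tolerable: an interrupted job can simply be retried on new capacity.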

  5. Continuously monitor and improve batch processing. Batch processing systems evolve rapidly as data source volumes increase, new batch processing jobs are authored, and new batch processing frameworks are launched. Instrument your jobs with metrics, timeouts, and alarms so that you have the insight needed to make informed decisions about changes to your batch data processing system.
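As a minimal, stdlib-only sketch of such instrumentation (the metric name, namespace, and timeout threshold are assumptions), the wrapper below times a batch job and produces a metric datum in the shape expected by CloudWatch's `put_metric_data`, flagging runs that exceed a timeout:

```python
import time

def run_instrumented(job_name, job_fn, timeout_seconds):
    """Run a batch job, record its duration, and flag timeout breaches.

    Returns a (metric_datum, timed_out) pair; the datum matches the shape
    CloudWatch's put_metric_data expects (metric/namespace names are assumptions).
    """
    start = time.monotonic()
    job_fn()
    duration = time.monotonic() - start
    datum = {
        "MetricName": "JobDurationSeconds",
        "Dimensions": [{"Name": "JobName", "Value": job_name}],
        "Value": duration,
        "Unit": "Seconds",
    }
    timed_out = duration > timeout_seconds
    # Real emission (not run here):
    # boto3.client("cloudwatch").put_metric_data(Namespace="BatchJobs", MetricData=[datum])
    return datum, timed_out

datum, timed_out = run_instrumented("daily-summary",
                                    lambda: time.sleep(0.01),
                                    timeout_seconds=60)
print(timed_out)  # False
```

With the duration published as a metric, a CloudWatch alarm on `JobDurationSeconds` can alert you when job runtimes drift as data volumes grow.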