Configuration Notes

  • Decide on a location for data lake ingestion (that is, an Amazon S3 bucket). Select an ingestion frequency and isolation mechanism that meet your business needs.

  • For Tier 2 Data, partition the data with keys that align to common query filters. This enables partition pruning by common analytics tools that work on raw data files and improves query performance (see the first sketch following this list).

  • Choose optimal file sizes to reduce Amazon S3 round trips during compute environment ingestion:

    • Recommended: 512 MB – 1 GB files in a columnar format (ORC/Parquet) per partition.

  • Perform frequent scheduled compactions that align to the optimal file sizes noted previously (see the second sketch following this list).

    • For example, compact into daily partitions if hourly files are too small.

  • For data with frequent updates or deletes (that is, mutable data):

    • Temporarily store replicated data in a database such as Amazon Redshift, Apache Hive, or Amazon RDS until the data becomes static, and then offload it to Amazon S3, or

    • Append changed records to delta files per partition and compact them on a scheduled basis using AWS Glue or Apache Spark on Amazon EMR (see the third sketch following this list).

  • For Tier 2 and Tier 3 Data stored in Amazon S3:

    • Partition the data using a high-cardinality key. Presto, Apache Hive, and Apache Spark honor this partitioning, which improves the performance of query filters on that key.

    • Sort the data in each partition by a secondary key that aligns to common query filters. This allows query engines to skip files and reach the requested data faster (see the fourth sketch following this list).
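
The sketches below illustrate several of these notes with PySpark. All bucket names, paths, and column names are assumptions for illustration, not prescribed values.

First sketch: writing Tier 2 data partitioned by keys that match common query filters, so that engines can prune partitions at query time.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("partitioned-ingest").getOrCreate()

  # Hypothetical raw input; replace the path with your ingestion location.
  events = spark.read.json("s3://example-bucket/raw/events/")

  # Partition by the keys most often used in query filters (assumed here
  # to be year/month/day) so engines can prune partitions at query time.
  (events
      .write
      .partitionBy("year", "month", "day")
      .mode("append")
      .parquet("s3://example-bucket/curated/events/"))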
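
Second sketch: a scheduled compaction job that rewrites one day's small hourly files into files near the 512 MB – 1 GB target. The 768 MB target and the prefix layout are assumptions.

  import boto3
  from pyspark.sql import SparkSession

  TARGET_FILE_BYTES = 768 * 1024 * 1024  # midpoint of the 512 MB - 1 GB guidance

  spark = SparkSession.builder.appName("daily-compaction").getOrCreate()
  s3 = boto3.client("s3")

  # Hypothetical partition that accumulated many small hourly files.
  bucket = "example-bucket"
  prefix = "curated/events/year=2021/month=06/day=01/"

  # Sum the object sizes to derive how many output files to produce.
  total_bytes = 0
  for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
      total_bytes += sum(obj["Size"] for obj in page.get("Contents", []))
  num_files = max(1, total_bytes // TARGET_FILE_BYTES)

  # Rewrite the partition as a few near-optimally sized columnar files.
  df = spark.read.parquet(f"s3://{bucket}/{prefix}")
  (df.coalesce(num_files)
     .write
     .mode("overwrite")
     .parquet(f"s3://{bucket}/compacted/events/year=2021/month=06/day=01/"))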
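
Third sketch: handling mutable data by appending changes to a delta prefix and periodically merging them into the static base, keeping only the latest version of each record. The paths and the order_id/updated_at columns are assumptions.

  from pyspark.sql import SparkSession, Window, functions as F

  spark = SparkSession.builder.appName("delta-merge").getOrCreate()

  # Hypothetical layout: a static base plus an append-only delta per partition.
  base = spark.read.parquet("s3://example-bucket/curated/orders/day=2021-06-01/")
  delta = spark.read.parquet("s3://example-bucket/deltas/orders/day=2021-06-01/")

  # Keep the most recent version of each record, assuming every row carries
  # a primary key ("order_id") and a modification timestamp ("updated_at").
  w = Window.partitionBy("order_id").orderBy(F.col("updated_at").desc())
  merged = (base.unionByName(delta)
            .withColumn("rn", F.row_number().over(w))
            .filter("rn = 1")
            .drop("rn"))

  # Write to a staging prefix and swap it in afterward so that readers
  # never see a half-written partition.
  merged.write.mode("overwrite").parquet(
      "s3://example-bucket/staging/orders/day=2021-06-01/")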
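
Fourth sketch: writing data partitioned on one key and sorted within each partition by a secondary filter key, so that columnar file statistics let engines skip files. The event_date and user_id columns are assumptions.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("sorted-write").getOrCreate()

  df = spark.read.parquet("s3://example-bucket/staging/clicks/")

  # Repartition on the partition key, then sort each partition by a secondary
  # key used in common filters so min/max statistics allow file skipping.
  (df.repartition("event_date")
     .sortWithinPartitions("user_id")
     .write
     .partitionBy("event_date")
     .mode("append")
     .parquet("s3://example-bucket/curated/clicks/"))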