Central Storage Layer - Analytics Lens
The central storage layer manages the storage of data as it’s ingested from a variety of producers and makes it available to downstream applications. This layer is at the core of a data lake and should support housing of all types of data: unstructured, semi-structured, and structured data. As data grows over time, this layer should scale elastically in a secure and cost-effective manner.

In data processing pipelines, data might be stored at intermediate stages of processing, both to avoid needless duplication of work up to that point in the pipeline, as well as to make intermediate data available to multiple downstream consumers. Intermediate data might be frequently updated, stored temporarily, or stored long term, depending on the use case.
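As a sketch of the staging idea above, a pipeline might key its intermediate outputs under stage-specific S3 prefixes so several downstream consumers can read the same intermediate result. The prefix names and date-partitioning scheme below are illustrative assumptions, not a prescribed layout:

```python
from datetime import date

def stage_key(stage: str, dataset: str, run_date: date, filename: str) -> str:
    """Build an S3 object key that separates pipeline stages
    (e.g. raw / intermediate / curated), so intermediate data can be
    reused by multiple downstream jobs instead of being recomputed.

    The layout here is illustrative, not an AWS convention.
    """
    return (
        f"{stage}/{dataset}/"
        f"year={run_date.year}/month={run_date.month:02d}/day={run_date.day:02d}/"
        f"{filename}"
    )

# Intermediate output shared by several downstream consumers:
key = stage_key("intermediate", "clickstream", date(2023, 1, 15), "part-0000.parquet")
# -> "intermediate/clickstream/year=2023/month=01/day=15/part-0000.parquet"
```

Whether such intermediate objects are expired quickly or retained long term can then be controlled per prefix, matching the temporary-versus-long-term distinction above.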

Amazon S3 provides an optimal foundation for central storage because of its virtually unlimited scalability, 99.999999999% (11 nines) of durability, native encryption, and access control capabilities. As data storage requirements increase over time, data can be transitioned to lower-cost tiers, such as S3 Standard-Infrequent Access or Amazon S3 Glacier, through lifecycle policies to save on storage costs while still preserving the original, raw data. You can also use S3 Intelligent-Tiering, which optimizes storage costs automatically when data access patterns change, without performance impact or operational overhead.
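A lifecycle policy of the kind described above is expressed as a rules document and applied to a bucket with the S3 `PutBucketLifecycleConfiguration` API. The transition thresholds, rule ID, prefix, and bucket name below are illustrative assumptions:

```python
# Illustrative lifecycle rules: transition raw data to the
# Standard-Infrequent Access tier after 90 days and to Amazon S3
# Glacier after 365 days. There is no Expiration action, so the
# original raw objects are preserved.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "tier-down-raw-data",      # example rule name
            "Filter": {"Prefix": "raw/"},    # illustrative prefix
            "Status": "Enabled",
            "Transitions": [
                {"Days": 90, "StorageClass": "STANDARD_IA"},
                {"Days": 365, "StorageClass": "GLACIER"},
            ],
        }
    ]
}

# Applied with boto3 (requires credentials and an existing bucket):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-data-lake-bucket",  # hypothetical bucket name
#     LifecycleConfiguration=lifecycle_configuration,
# )
```

Because transitions are driven by object age per rule, different prefixes (for example, raw versus curated data) can follow different cost-optimization schedules within the same bucket.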

Amazon S3 makes it easy to build a multi-tenant environment, where many users can bring their own data analytics tools to a common set of data. This improves both cost and data governance over that of traditional solutions, which commonly require multiple, distributed copies of the data. To enable easy access, Amazon S3 provides RESTful APIs that are simple and supported by Apache Hadoop as well as most major third-party independent software vendors (ISVs) and analytics tool vendors.
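The simplicity of those RESTful APIs can be seen in how an object read maps to a plain HTTPS GET against a predictable address. The sketch below builds the virtual-hosted-style URL for an object; bucket, key, and region are hypothetical, and a real authenticated request would also carry a Signature Version 4 `Authorization` header, which is omitted here:

```python
from urllib.parse import quote

def object_url(bucket: str, key: str, region: str = "us-east-1") -> str:
    """Virtual-hosted-style S3 URL: reading the object is an HTTPS GET
    against this address, which is why Hadoop and third-party analytics
    tools can all point at the same shared data set."""
    return f"https://{bucket}.s3.{region}.amazonaws.com/{quote(key)}"

# Hypothetical shared bucket that multiple tenants' tools read from:
url = object_url("shared-data-lake", "curated/sales/2023/report.parquet")
```

Each tenant's tool issues the same style of request; access control (bucket policies, IAM) decides who may actually read the object, so one copy of the data serves everyone.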

With Amazon S3, your data lake can decouple storage from compute and data processing. In traditional Hadoop and data warehouse solutions, storage and compute are tightly coupled, making it difficult to optimize costs and data processing workflows. Amazon S3 allows you to store all data types in their native formats and use as many or as few virtual servers as you want to process the data. You can also integrate with serverless solutions, such as AWS Lambda, Amazon Athena, Amazon Redshift Spectrum, Amazon Rekognition, and AWS Glue, that allow you to process data without provisioning or managing servers.
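To make the serverless point concrete, the sketch below assembles parameters for Athena's `StartQueryExecution` API, which runs SQL directly against data in S3 with no servers to provision. The database, table, query, and result bucket are hypothetical:

```python
# Illustrative StartQueryExecution parameters for Amazon Athena.
# Athena scans the data in place on S3; only the query results are
# written to the configured output location.
query_params = {
    "QueryString": (
        "SELECT event_type, COUNT(*) AS events "
        "FROM clickstream "            # hypothetical table over S3 data
        "GROUP BY event_type"
    ),
    "QueryExecutionContext": {"Database": "data_lake"},  # example database
    "ResultConfiguration": {
        "OutputLocation": "s3://my-athena-results/queries/"  # example bucket
    },
}

# Run with boto3 (requires credentials and a populated data catalog):
# import boto3
# response = boto3.client("athena").start_query_execution(**query_params)
```

Because the compute is fully managed, scaling the query workload up or down requires no change to the stored data or to any cluster, which is the decoupling described above.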

Amazon Elastic Block Store (Amazon EBS) provides persistent block storage volumes for use with Amazon EC2 instances in the AWS Cloud. Each Amazon EBS volume is automatically replicated within its Availability Zone to protect you from component failure, which provides high availability and durability. For analytics workloads, you can use EBS with big data analytics engines (such as the Hadoop/HDFS ecosystem or Amazon EMR clusters), relational and NoSQL databases (such as Microsoft SQL Server and MySQL, or Cassandra and MongoDB), stream and log processing applications (such as Kafka and Splunk), and data warehousing applications (such as Vertica and Teradata) running on EC2 instances.
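As a minimal sketch of provisioning such a volume, the parameters below target the EC2 `CreateVolume` API; the Availability Zone, size, and performance figures are example values, not recommendations:

```python
# Illustrative CreateVolume parameters for a gp3 EBS volume backing an
# analytics engine on EC2. EBS replicates the volume within this single
# Availability Zone, so it must be the same AZ as the instance that
# will attach it.
volume_params = {
    "AvailabilityZone": "us-east-1a",  # example AZ; match the instance's AZ
    "Size": 500,          # GiB; example capacity
    "VolumeType": "gp3",
    "Iops": 6000,         # gp3 provisions IOPS independently of size
    "Throughput": 250,    # MiB/s
    "Encrypted": True,
}

# Created and attached with boto3 (requires credentials):
# import boto3
# ec2 = boto3.client("ec2")
# volume = ec2.create_volume(**volume_params)
```

For databases and stream processors like those listed above, tuning IOPS and throughput separately from capacity is often the main reason to pick gp3 or a provisioned-IOPS volume type.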