Amazon Redshift deep dive - Data Warehousing on AWS

As a columnar MPP technology, Amazon Redshift offers key benefits for performant, cost-effective data warehousing, including efficient compression, reduced I/O, and lower storage requirements. It is based on ANSI SQL, so you can run existing queries with little or no modification. As a result, it is a popular choice for enterprise data warehouses.

Amazon Redshift delivers fast query and I/O performance for virtually any data size by using columnar storage, and by parallelizing and distributing queries across multiple nodes. It automates most of the common administrative tasks associated with provisioning, configuring, monitoring, backing up, and securing a data warehouse, making it easy and inexpensive to manage. Using this automation, you can build petabyte-scale data warehouses in minutes instead of the weeks or months taken by traditional on-premises implementations. You can also run exabyte-scale queries by storing data on S3 and querying it using Amazon Redshift Spectrum. Amazon Redshift also enables you to scale compute and storage separately using Amazon Redshift RA3 nodes. RA3 nodes come with Redshift Managed Storage (RMS), which leverages your workload patterns and advanced data management techniques, such as automatic fine-grained data eviction and intelligent data pre-fetching. You can size your cluster based on your compute needs only, and pay only for the storage used.

Integration with data lake

Amazon Redshift provides a feature called Redshift Spectrum that makes it easier to both query data and write data back to your data lake in open file formats. With Spectrum, you can query open file formats such as Parquet, ORC, JSON, Avro, CSV, and more directly in S3 using familiar ANSI SQL. To export data to your data lake, you use the Redshift UNLOAD command in your SQL code and specify Parquet as the file format, and Redshift automatically takes care of data formatting and data movement into S3. To query data in S3, you create an external schema if the S3 object is already cataloged, or create an external table. You can write data to external tables by running CREATE EXTERNAL TABLE AS SELECT or INSERT INTO an external table. This gives you the flexibility to store highly structured, frequently accessed data in a Redshift data warehouse, while also keeping up to exabytes of structured, semi-structured, and unstructured data in S3. Exporting data from Amazon Redshift back to your data lake enables you to analyze the data further with AWS services like Amazon Athena, Amazon EMR, and Amazon SageMaker.
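
As a rough illustration of the export and query steps described above, the following Python sketch assembles an UNLOAD statement that writes Parquet files to S3 and a CREATE EXTERNAL SCHEMA statement that points at an existing catalog database. The helper names are hypothetical and are not part of any AWS library. The bucket, schema, database, and IAM role values in the usage note are placeholders. The sketch only builds SQL text and does not connect to a cluster.

```python
def sql_literal(value: str) -> str:
    """Wrap a string in single quotes, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_unload(select_sql: str, s3_prefix: str, iam_role_arn: str) -> str:
    """Build an UNLOAD statement that exports query results to S3 as Parquet.

    The SELECT text is passed to UNLOAD as a quoted string, so quotes
    inside it are doubled by sql_literal.
    """
    return (
        f"UNLOAD ({sql_literal(select_sql)})\n"
        f"TO {sql_literal(s3_prefix)}\n"
        f"IAM_ROLE {sql_literal(iam_role_arn)}\n"
        "FORMAT AS PARQUET;"
    )


def build_external_schema(schema: str, catalog_db: str, iam_role_arn: str) -> str:
    """Build a CREATE EXTERNAL SCHEMA statement for an already cataloged database.

    `schema` is inserted as an identifier without quoting, so it must come
    from a trusted source in this sketch.
    """
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema}\n"
        f"FROM DATA CATALOG DATABASE {sql_literal(catalog_db)}\n"
        f"IAM_ROLE {sql_literal(iam_role_arn)};"
    )
```

For example, `build_unload("SELECT * FROM sales WHERE region = 'EU'", "s3://example-bucket/sales/", "arn:aws:iam::123456789012:role/ExampleRole")` returns a statement you could submit through any SQL client connected to the cluster. All three argument values here are made up for illustration.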

Performance

Amazon Redshift offers fast, industry-leading performance with flexibility. Amazon Redshift offers multiple features to achieve this superior performance, including:

  • High performing hardware — The Amazon Redshift Service offers multiple node types to choose from based on your requirements. The latest generation RA3 instances are built on the AWS Nitro System and feature high bandwidth networking, and performance indistinguishable from bare metal. These Amazon Redshift instances maximize speed for performance-intensive workloads that require large amounts of compute capacity, with the flexibility to pay by usage for storage, and pay separately for compute by specifying the number of instances you need.

  • AQUA (preview) — AQUA (Advanced Query Accelerator) is a distributed and hardware-accelerated cache that enables Amazon Redshift to run up to ten times faster than any other cloud data warehouse. AQUA accelerates Amazon Redshift queries by running data intensive tasks such as filtering and aggregation closer to the storage layer. This avoids networking bandwidth limitations by eliminating unnecessary data movement between where data is stored and compute clusters. AQUA uses AWS-designed processors to accelerate queries. This includes AWS Nitro chips adapted to speed up data encryption and compression, and custom analytics processors, implemented in field-programmable gate arrays (FPGAs), to accelerate operations such as filtering and aggregation. AQUA can process large amounts of data in parallel across multiple nodes, and automatically scales out to add more capacity as your storage needs grow over time.

  • Efficient storage and high-performance query processing — Amazon Redshift delivers fast query performance on datasets ranging in size from gigabytes to petabytes. Columnar storage, data compression, and zone maps reduce the amount of I/O needed to perform queries. Along with the industry standard encodings such as LZO and Zstandard, Amazon Redshift also offers purpose-built compression encoding, AZ64, for numeric and date/time types to provide both storage savings and optimized query performance.

  • Materialized views — Amazon Redshift materialized views enable you to achieve significantly faster query performance for analytical workloads such as dashboarding, queries from BI tools, and ELT data processing jobs. You can use materialized views to store frequently used precomputations to speed up slow-running queries. Amazon Redshift can efficiently maintain the materialized views incrementally to speed up ELT, and provide low latency performance benefits. For more information, see Creating materialized views in Amazon Redshift.

  • Auto workload management to maximize throughput and performance — Amazon Redshift uses machine learning to tune configuration to achieve high throughput and performance, even with varying workloads or concurrent user activity. Amazon Redshift utilizes sophisticated algorithms to predict and classify incoming queries based on their run times and resource requirements to dynamically manage resources and concurrency while also enabling you to prioritize your business-critical workloads. Short query acceleration (SQA) sends short queries to an express queue for immediate processing rather than waiting behind long running queries. You can set the priority of your most important queries, even when hundreds of queries are being submitted.

    Amazon Redshift is also a self-learning system that observes the user workload continuously, detecting opportunities to improve performance as the usage grows, applying optimizations seamlessly, and making recommendations via Redshift Advisor when an explicit user action is needed to further turbocharge Amazon Redshift performance. 

  • Result caching — Amazon Redshift uses result caching to deliver sub-second response times for repeated queries. Dashboard, visualization, and business intelligence tools that execute repeated queries experience a significant performance boost. When a query executes, Amazon Redshift searches the cache to see if there is a cached result from a prior run. If a cached result is found and the data has not changed, the cached result is returned immediately instead of re-running the query.
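
The result caching behavior described in the last item can be pictured with a toy model. The sketch below is not how Amazon Redshift implements its cache. It assumes a hypothetical integer `data_version` that changes whenever the underlying data changes, and it keys the cache on the exact query text.

```python
from typing import Callable, Dict, Tuple


class ResultCache:
    """Toy result cache keyed by query text and validated by a data version."""

    def __init__(self) -> None:
        # Maps query text to (data version at caching time, cached result).
        self._entries: Dict[str, Tuple[int, object]] = {}
        self.hits = 0
        self.misses = 0

    def run(self, query: str, data_version: int,
            execute: Callable[[str], object]) -> object:
        """Return a cached result if the data is unchanged, else run the query."""
        cached = self._entries.get(query)
        if cached is not None and cached[0] == data_version:
            self.hits += 1
            return cached[1]
        # Either no prior run exists or the data changed: execute and store.
        self.misses += 1
        result = execute(query)
        self._entries[query] = (data_version, result)
        return result
```

A repeated dashboard query with the same `data_version` is served from the dictionary without calling `execute` again. A new `data_version` forces a re-run, which mirrors the "data has not changed" condition in the text.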

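The list above credits zone maps with reducing query I/O. As a simplified sketch of that idea, and not the actual Redshift storage format, the code below keeps a minimum and maximum value for each block of a column and skips any block whose range cannot overlap the filter. The `Block` class and `scan_range` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Block:
    """A block of column values with its min/max recorded when it is built."""
    values: List[int]
    min_value: int = field(init=False)
    max_value: int = field(init=False)

    def __post_init__(self) -> None:
        self.min_value = min(self.values)
        self.max_value = max(self.values)


def scan_range(blocks: List[Block], low: int, high: int) -> Tuple[List[int], int]:
    """Return values in [low, high] and the number of blocks actually read."""
    matches: List[int] = []
    blocks_read = 0
    for block in blocks:
        # The zone map check uses only the stored min/max, not the values.
        if block.max_value < low or block.min_value > high:
            continue
        blocks_read += 1
        matches.extend(v for v in block.values if low <= v <= high)
    return matches, blocks_read
```

With three blocks covering 1 to 9, 10 to 19, and 20 to 29, a filter on the range 12 to 21 reads only the second and third blocks. The benefit depends on the data being reasonably well ordered on the filtered column.
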
Durability and availability

To provide the best possible data durability and availability, Amazon Redshift automatically detects and replaces any failed node in your data warehouse cluster. It makes your replacement node available immediately, and loads your most frequently accessed data first so you can resume querying your data as quickly as possible. Amazon Redshift attempts to maintain at least three copies of data: the original and replica on the compute nodes, and a backup in S3. The cluster is in read-only mode until a replacement node is provisioned and added to the cluster, which typically takes only a few minutes.

Amazon Redshift clusters reside within one Availability Zone. However, if you want a Multi-AZ setup for Amazon Redshift, you can create a mirror and then self-manage replication and failover.

With just a few clicks in the Amazon Redshift Management Console, you can set up a robust disaster recovery (DR) environment with Amazon Redshift. Amazon Redshift automatically takes incremental snapshots (backups) of your data every eight hours or after every 5 GB per node of data changes, whichever comes first. You can get more information about a snapshot and more control over it, including the ability to change the automatic snapshot schedule.
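
To make the automatic snapshot rule concrete, here is a small sketch of the trigger condition. It assumes a snapshot is taken as soon as either threshold is reached, and it interprets "per node" as the total changed data divided by the node count. Both are simplifying assumptions for illustration, and `snapshot_due` is a hypothetical helper, not an AWS API.

```python
from datetime import timedelta

SNAPSHOT_INTERVAL = timedelta(hours=8)   # time-based threshold from the text
CHANGE_THRESHOLD_GB_PER_NODE = 5.0       # change-based threshold from the text


def snapshot_due(elapsed: timedelta, changed_gb: float, node_count: int) -> bool:
    """Return True when either the time or the data-change threshold is met."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    # Assumption: changed data is spread evenly across the nodes.
    changed_per_node = changed_gb / node_count
    return (elapsed >= SNAPSHOT_INTERVAL
            or changed_per_node >= CHANGE_THRESHOLD_GB_PER_NODE)
```

Under these assumptions, a two-node cluster with 10 GB of changes one hour after the last snapshot is due for a new one, while the same cluster with 9.9 GB of changes is not.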

You can keep copies of your backups in multiple AWS Regions. In case of a service interruption in one AWS Region, you can restore your cluster from the backup in a different AWS Region. You can gain read/write access to your cluster within a few minutes of initiating the restore operation.

Elasticity and scalability

With Amazon Redshift, you get the elasticity and scalability you need for your data warehousing workloads. You can scale compute and storage independently, and pay only for what you use. With the elasticity and scalability that Amazon Redshift offers, you can easily run non-uniform and unpredictable data warehousing workloads. Amazon Redshift provides two forms of compute elasticity:

  • Elastic resize — With the elastic resize feature, you can quickly resize your Amazon Redshift cluster by adding nodes to get the resources needed for demanding workloads, and remove nodes when the job is complete to save cost. Additional nodes are added or removed in minutes with minimal disruption to on-going read and write queries. Elastic resize can be automated using a schedule you define to accommodate changes in workload that occur on a regular basis. Resize can be scheduled with a few clicks in the console or programmatically using the AWS command line interface (AWS CLI), or an API call.

  • Concurrency Scaling — With the Concurrency Scaling feature, you can support virtually unlimited concurrent users and concurrent queries, with consistently fast query performance. When concurrency scaling is enabled, Amazon Redshift automatically adds additional compute capacity when you need it to process an increase in concurrent read queries. Write operations continue as normal on your main cluster. Users always see the most current data, whether the queries run on the main cluster or on a concurrency scaling cluster.

    Amazon Redshift enables you to start with as little as a single 160 GB node and scale up all the way to multiple petabytes of compressed user data using many nodes. For more information, see About Clusters and Nodes in the Amazon Redshift Cluster Management Guide.
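
The programmatic resize path mentioned under elastic resize could look roughly like the following. The sketch assumes a client object that exposes the Redshift `resize_cluster` API operation, such as the one returned by `boto3.client("redshift")`. The wrapper name `elastic_resize` is hypothetical, as is the cluster identifier in the usage note.

```python
from typing import Any, Dict


def elastic_resize(client: Any, cluster_id: str, node_count: int) -> Dict[str, Any]:
    """Request an elastic resize of a cluster to `node_count` nodes.

    `client` is expected to expose resize_cluster(**kwargs), as the boto3
    Redshift client does.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    # Classic=False requests an elastic resize rather than a classic resize.
    return client.resize_cluster(
        ClusterIdentifier=cluster_id,
        NumberOfNodes=node_count,
        Classic=False,
    )
```

With boto3 installed and credentials configured, a call such as `elastic_resize(boto3.client("redshift"), "example-cluster", 4)` would submit the request. Passing the client in as an argument keeps the function easy to exercise with a stub.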

Amazon Redshift managed storage

Amazon Redshift managed storage enables you to scale and pay for compute and storage independently so you can size your cluster based only on your compute needs. It automatically uses high-performance solid-state drive (SSD)-based local storage as tier-1 cache, and takes advantage of optimizations such as data block temperature, data block age, and workload patterns to deliver high performance while scaling storage automatically when needed without requiring any action.