Data warehouse architecture components - AWS Prescriptive Guidance

Data warehouse architecture components

We recommend that you have a basic understanding of the core architecture components in an Amazon Redshift data warehouse. This knowledge can help you better understand how to design your queries and tables for optimal performance.

A data warehouse in Amazon Redshift consists of the following core architecture components:

  • Clusters – A cluster, which is composed of one or more compute nodes, is the core infrastructure component of an Amazon Redshift data warehouse. A typical cluster has two or more compute nodes, which are coordinated through a leader node. Compute nodes are transparent to external applications: your client application interacts directly with the leader node only.

  • Leader node – A leader node manages communication with client programs and with all compute nodes. Whenever a query is submitted to a cluster, the leader node also prepares the plans for running that query. When the plans are ready, the leader node compiles code, distributes the compiled code to the compute nodes, and then assigns a portion of the data to each compute node to process.

  • Compute node – A compute node runs a query. The leader node compiles code for individual elements of the plan to run the query and assigns the code to individual compute nodes. The compute nodes run the compiled code and send intermediate results back to the leader node for final aggregation. Each compute node has its own dedicated CPU, memory, and attached disk storage. As your workload grows, you can increase the compute capacity and storage capacity of a cluster by increasing the number of nodes, upgrading the node type, or both.

  • Node slice – A compute node is partitioned into units called slices. Every slice in a compute node is allocated a portion of the node's memory and disk space where it processes a portion of the workload assigned to the node. The slices then work in parallel to complete the operation. Data is distributed among slices on the basis of the distribution style and distribution key of a particular table. An even distribution of data makes it possible for Amazon Redshift to evenly assign workloads to slices and maximizes the benefit of parallel processing. The number of slices per compute node is decided on the basis of the type of node. For more information, see Clusters and nodes in Amazon Redshift in the Amazon Redshift documentation.

  • Massively parallel processing (MPP) – Amazon Redshift uses an MPP architecture to quickly run even complex queries against vast amounts of data. Multiple compute nodes run the same query code on different portions of the data to maximize parallel processing.

  • Advanced Query Accelerator (AQUA) – AQUA is a distributed hardware-accelerated cache that makes Amazon Redshift run up to 10 times faster than other enterprise cloud data warehouses by automatically boosting certain types of queries. AQUA also performs a substantial share of data processing in place, and uses AWS-designed processors and a scale-out architecture to accelerate data processing beyond anything traditional CPUs can do today. For more information, see the AQUA (Advanced Query Accelerator) – A Speed Boost for Your Amazon Redshift Queries post in the AWS News Blog.

  • Client application – Amazon Redshift integrates with various data loading and extract, transform, and load (ETL) tools, as well as business intelligence (BI) reporting, data mining, and analytics tools. All client applications communicate with the cluster through the leader node only.
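To make the effect of data distribution on node slices more concrete, the following Python sketch simulates how rows might be placed on slices. It is an illustration under stated assumptions, not how Amazon Redshift is implemented: the CRC32 hash, the four-slice layout, the `orders` table, and every function name here are hypothetical. It compares a key-based placement on a skewed column with a round-robin (even) placement.

```python
import zlib


def slice_for_key(key_value, num_slices):
    """Pick a slice for a distribution-key value using a stable hash.

    CRC32 is an assumption made for this sketch; it is not the hash
    function that Amazon Redshift uses internally.
    """
    return zlib.crc32(str(key_value).encode("utf-8")) % num_slices


def distribute_by_key(rows, key_column, num_slices):
    """Key-based placement: rows with the same key value share a slice."""
    slices = [[] for _ in range(num_slices)]
    for row in rows:
        slices[slice_for_key(row[key_column], num_slices)].append(row)
    return slices


def distribute_evenly(rows, num_slices):
    """Even placement: rows are dealt out round-robin."""
    slices = [[] for _ in range(num_slices)]
    for index, row in enumerate(rows):
        slices[index % num_slices].append(row)
    return slices


def skew_ratio(slices):
    """Largest slice size divided by the average slice size (1.0 is ideal)."""
    sizes = [len(rows) for rows in slices]
    average = sum(sizes) / len(sizes)
    return max(sizes) / average if average else 0.0


# Hypothetical table: 900 of 1,000 orders belong to a single customer.
orders = [{"order_id": i, "customer_id": 1 if i < 900 else i} for i in range(1000)]

by_customer = distribute_by_key(orders, "customer_id", num_slices=4)
round_robin = distribute_evenly(orders, num_slices=4)

print("skew with customer_id as the key:", skew_ratio(by_customer))
print("skew with even distribution:", skew_ratio(round_robin))
```

In this sketch, the slice that holds the dominant customer ends up with most of the rows, so it would finish long after the others in a parallel operation. That is the imbalance that an even distribution of data avoids.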

The following diagram shows how the architecture components of an Amazon Redshift data warehouse work together to accelerate queries:

[Diagram: Data warehouse architecture components]

There are six stages of the query lifecycle:

  1. The Amazon Redshift leader node receives a query and parses the SQL code.

  2. The leader node builds and optimizes a query execution plan that breaks the query down into a sequence of steps, and then determines whether any of these steps must be sent to the AQUA layer.

  3. The leader node distributes the work of executing the steps in parallel across the compute nodes. The compute nodes send the AQUA subqueries (if there are any) to the AQUA fleet and then wait for the results.

  4. If AQUA receives a subquery, then AQUA processes the subquery and sends the results back to the compute nodes.

  5. The compute nodes execute all the steps and send the results back to the leader node.

  6. The leader node addresses any final sorting or aggregation, and then returns the results to the client.
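The scatter-and-gather pattern in these stages can be sketched in Python. This is a simplified model under stated assumptions, not Amazon Redshift code: threads stand in for slices, the AQUA stages are omitted, and the function names and the `region`/`amount` columns are hypothetical. Each slice computes a partial aggregate over its own rows, and the leader merges and sorts the partial results.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def compute_partial(slice_rows):
    """Work done on one slice: partial SUM(amount) grouped by region."""
    partial = defaultdict(int)
    for row in slice_rows:
        partial[row["region"]] += row["amount"]
    return dict(partial)


def leader_run_query(slices):
    """Leader role: fan the step out to the slices, then finish the query."""
    # Stages 3 and 5: run the same step on every slice in parallel.
    with ThreadPoolExecutor(max_workers=max(1, len(slices))) as pool:
        partials = list(pool.map(compute_partial, slices))

    # Stage 6: final aggregation and sorting happen on the leader.
    totals = defaultdict(int)
    for partial in partials:
        for region, subtotal in partial.items():
            totals[region] += subtotal
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data that is already distributed across three slices.
example_slices = [
    [{"region": "east", "amount": 10}, {"region": "west", "amount": 5}],
    [{"region": "east", "amount": 7}],
    [{"region": "west", "amount": 1}, {"region": "north", "amount": 20}],
]

for region, total in leader_run_query(example_slices):
    print(region, total)
```

Because each slice returns only one subtotal per region, the leader merges a handful of small dictionaries instead of scanning every row, which is the point of pushing work down to the compute nodes.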

For more information about these architecture components, see Data warehouse system architecture in the Amazon Redshift Database Developer Guide.