Execute Amazon Redshift SQL queries by using Terraform - AWS Prescriptive Guidance

Execute Amazon Redshift SQL queries by using Terraform

Created by Sylvia Qi (AWS) and Aditya Ambati (AWS)

Code repository: terraform-execute-redshift-sql

Environment: PoC or pilot

Technologies: DevOps; Analytics; Big data; Data lakes

Workload: Open-source

AWS services: Amazon Redshift; AWS CLI

Summary

Using infrastructure as code (IaC) for the deployment and management of Amazon Redshift is a prevalent practice within DevOps. IaC facilitates the deployment and configuration of various Amazon Redshift resources, such as clusters, snapshots, and parameter groups. However, IaC doesn’t extend to the management of database resources like tables, schemas, views, and stored procedures. These database elements are managed through SQL queries and are not directly supported by IaC tools. Although solutions and tools exist for managing these resources, you might prefer not to introduce additional tools into your technology stack.

This pattern outlines a methodology that uses Terraform to deploy Amazon Redshift database resources, including tables, schemas, views, and stored procedures. The pattern distinguishes between two types of SQL queries:

  • Nonrepeatable queries – These queries are executed once during the initial Amazon Redshift deployment to establish the essential database components.

  • Repeatable queries – These queries are immutable and can be rerun without impacting the database. The solution uses Terraform to monitor changes in repeatable queries and apply them accordingly.

For more details, see Solution walkthrough in Additional information.

Prerequisites and limitations

Prerequisites

You must have an active AWS account and install the following on your deployment machine:

Limitations

  • This solution supports a single Amazon Redshift database because Terraform only allows for the creation of one database during cluster creation.

  • This pattern doesn’t include tests to validate changes to the repeatable queries before applying them. We recommend that you incorporate such tests for enhanced reliability.

  • To illustrate the solution, this pattern provides a sample redshift.tf file that uses a local Terraform state file. However, for production environments, we strongly recommend that you employ a remote state file with a locking mechanism for enhanced stability and collaboration.

  • Some AWS services aren’t available in all AWS Regions. For Region availability, see AWS services by Region. For specific endpoints, see Service endpoints and quotas, and choose the link for the service.

Product versions

This solution is developed and tested on Amazon Redshift patch 179.

Code repository

The code for this pattern is available in the GitHub amazon-redshift-sql-deploy-terraform repository.

Architecture

The following diagram illustrates how Terraform manages the Amazon Redshift database resources by handling both nonrepeatable and repeatable SQL queries.

Process for Terraform to manage Amazon Redshift database resources using SQL queries.

The diagram shows the following steps:

  1. Terraform applies nonrepeatable SQL queries during the initial Amazon Redshift cluster deployment.

  2. The developer commits changes to the repeatable SQL queries.

  3. Terraform monitors changes in the repeatable SQL queries.

  4. Terraform applies repeatable SQL queries to the Amazon Redshift database.

The solution provided by this pattern is built based on the Terraform module for Amazon Redshift. The Terraform module provisions an Amazon Redshift cluster and database. To enhance the module, we used terraform_data resources, which invokes a custom Python script to execute SQL queries using the Amazon Redshift ExecuteStatement API operation. As a result, the module can do the following:

  • Deploy any number of database resources by using SQL queries after the database is provisioned.

  • Monitor continuously for changes in the repeatable SQL queries and apply those changes using Terraform.

For more details, see Solution walkthrough in Additional information.

Tools

AWS services

  • Amazon Redshift is a fully managed petabyte-scale data warehouse service in the AWS Cloud.

Other tools

  • Terraform is an infrastructure as code (IaC) tool from HashiCorp that helps you create and manage cloud and on-premises resources.

  • Python is a general-purpose programming language that’s used in this pattern to execute SQL queries.

Best practices

Epics

TaskDescriptionSkills required

Clone the repository.

To clone the Git repository containing the Terraform code for provisioning an Amazon Redshift cluster, use the following command.

git clone https://github.com/aws-samples/amazon-redshift-sql-deploy-terraform.git
DevOps engineer

Update the Terraform variables.

To customize the Amazon Redshift cluster deployment according to your specific requirements, update the following parameters in the terraform.tfvars file.

region = "<AWS_REGION>" cluster_identifier = "<REDSHIFT_CLUSTER_IDENTIFIER>" node_type = "<REDSHIFT_NODE_TYPE>" number_of_nodes = "<REDSHIFT_NODE_COUNT>" database_name = "<REDSHIFT_DB_NAME>" subnet_ids = "<REDSHIFT_SUBNET_IDS>" vpc_security_group_ids = "<REDSHIFT_SECURITY_GROUP_IDS>" run_nonrepeatable_queries = true run_repeatable_queries = true sql_path_bootstrap = "<BOOTSTRAP_SQLS_PATH>" sql_path_nonrepeatable = "<NON-REPEATABLE_SQLS_PATH>" sql_path_repeatable = "<REPEATABLE_SQLS_PATH>" sql_path_finalize = "<FINALIZE_SQLS_PATH>" create_random_password = false master_username = "<REDSHIFT_MASTER_USERNAME>"
DevOps engineer

Deploy the resources using Terraform.

  1. To prepare for the deployment process, use the following command to initialize Terraform within the cloned repository.

    terraform init
  2. To preview the changes that Terraform will apply to the infrastructure, use the following command to create an execution plan.

    terraform plan -var-file terraform.tfvars
  3. To provision the Amazon Redshift cluster and associated resources, use the following command to apply the Terraform execution plan.

    terraform apply -var-file terraform.tfvars
DevOps engineer

(Optional) Execute additional SQL queries.

The sample repository provides several SQL queries for demo purposes. To execute your own SQL queries, add them to the following folders:

/bootstrap

/nonrepeatable

/repeatable

/finalize

TaskDescriptionSkills required

Monitor the deployment of SQL statements.

You can monitor the results of the SQL executions to an Amazon Redshift cluster. For examples of output that show a failed and a successful SQL execution, see Example SQL statements in Additional information.

DBA, DevOps engineer

Clean up resources.

To delete all the resources deployed by Terraform, run the following command.

terraform destroy
DevOps engineer
TaskDescriptionSkills required

Validate the data in the Amazon Redshift cluster.

  1. Sign in to the AWS Management Console, and open the Amazon Redshift console.

  2. On the navigation menu, choose Clusters. Choose the relevant cluster name in the list.

  3. Follow the instructions in Querying a database using the Amazon Redshift query editor v2 in the Amazon Redshift documentation.

DBA, AWS DevOps

Related resources

AWS documentation

Other resources

Additional information

Solution walkthrough

To use the solution, you must organize your Amazon Redshift SQL queries in a specific way. All SQL queries must be stored in files with a .sql extension.

In the code example provided with this pattern, the SQL queries are organized in the following folder structure. You can modify the code (sql-queries.tf and sql-queries.py) to work with any structure that fits your unique use case.

/bootstrap |- Any # of files |- Any # of sub-folders /nonrepeatable |- Any # of files |- Any # of sub-folders /repeatable /udf |- Any # of files |- Any # of sub-folders /table |- Any # of files |- Any # of sub-folders /view |- Any # of files |- Any # of sub-folders /stored-procedure |- Any # of files |- Any # of sub-folders /finalize |- Any # of files |- Any # of sub-folders

Given the preceding folder structure, during Amazon Redshift cluster deployment, Terraform executes the queries in the following order:

  1. /bootstrap

  2. /nonrepeatable

  3. /repeatable

  4. /finalize

The /repeatable folder contains four subfolders: /udf, /table, /view, and /stored-procedure. These subfolders indicate the order in which Terraform executes the SQL queries.

The Python script that executes the SQL queries is sql-queries.py. First, the script reads all the files and subfolders of a specific source directory, for example, the sql_path_bootstrap parameter. Then the script executes the queries by calling the Amazon Redshift ExecuteStatement API operation. You might have one or more SQL queries in a file. The following code snippet shows the Python function that executes SQL statements stored in a file against an Amazon Redshift cluster.

def execute_sql_statement(filename, cluster_id, db_name, secret_arn, aws_region): """Execute SQL statements in a file""" redshift_client = boto3.client( 'redshift-data', region_name=aws_region) contents = get_contents_from_file(filename), response = redshift_client.execute_statement( Sql=contents[0], ClusterIdentifier=cluster_id, Database=db_name, WithEvent=True, StatementName=filename, SecretArn=secret_arn ) ...

The Terraform script sql-queries.tf creates the terraform_data resources that invoke the sql-queries.py script. There is a terraform_data resource for each of the four folders: /bootstrap, /nonrepeatable, /repeatable, and /finalize. The following code snippet shows the terraform_data resource that execute the SQL queries in the /bootstrap folder.

locals { program = "${path.module}/sql-queries.py" redshift_cluster_name = try(aws_redshift_cluster.this[0].id, null) } resource "terraform_data" "run_bootstrap_queries" { count = var.create && var.run_nonrepeatable_queries && (var.sql_path_bootstrap != "") && (var.snapshot_identifier == null) ? 1 : 0 depends_on = [aws_redshift_cluster.this[0]] provisioner "local-exec" { command = "python3 ${local.program} ${var.sql_path_bootstrap} ${local.redshift_cluster_name} ${var.database_name} ${var.redshift_secret_arn} ${local.aws_region}" } }

You can control whether to run these queries by using the following variables. If you don’t want to run queries in sql_path_bootstrap, sql_path_nonrepeatable, sql_path_repeatable, or sql_path_finalize, set their values to "".

run_nonrepeatable_queries = true run_repeatable_queries = true sql_path_bootstrap = "src/redshift/bootstrap" sql_path_nonrepeatable = "src/redshift/nonrepeatable" sql_path_repeatable = "src/redshift/repeatable" sql_path_finalize = "src/redshift/finalize"

When you run terraform apply, Terraform considers the terraform_data resource added after the script is completed, regardless of the results of the script. If some SQL queries failed, and you want to rerun them, you can manually remove the resource from the Terraform state, and run terraform apply again. For example, the following command removes the run_bootstrap_queries resource from the Terraform state.

terraform state rm module.redshift.terraform_data.run_bootstrap_queries[0]

The following code example shows how the run_repeatable_queries resource monitors changes in the repeatable folder by using the sha256 hash. If any file within the folder is updated, Terraform marks the entire directory for an update. Then, Terraform runs the queries in the directory again during the next terraform apply.

resource "terraform_data" "run_repeatable_queries" { count = var.create_redshift && var.run_repeatable_queries && (var.sql_path_repeatable != "") ? 1 : 0 depends_on = [terraform_data.run_nonrepeatable_queries] # Continuously monitor and apply changes in the repeatable folder triggers_replace = { dir_sha256 = sha256(join("", [for f in fileset("${var.sql_path_repeatable}", "**") : filesha256("${var.sql_path_repeatable}/${f}")])) } provisioner "local-exec" { command = "python3 ${local.sql_queries} ${var.sql_path_repeatable} ${local.redshift_cluster_name} ${var.database_name} ${var.redshift_secret_arn}" } }

To refine the code, you can implement a mechanism to detect and apply changes only to the files that have been updated within the repeatable folder, rather than applying changes to all files indiscriminately.

Example SQL statements

The following output shows a failed SQL execution, along with an error message.

module.redshift.terraform_data.run_nonrepeatable_queries[0] (local-exec): Executing: ["/bin/sh" "-c" "python3 modules/redshift/sql-queries.py src/redshift/nonrepeatable testcluster-1 db1 arn:aws:secretsmanager:us-east-1:XXXXXXXXXXXX:secret:/redshift/master_user/password-8RapGH us-east-1"] module.redshift.terraform_data.run_nonrepeatable_queries[0] (local-exec): ------------------------------------------------------------------- module.redshift.terraform_data.run_nonrepeatable_queries[0] (local-exec): src/redshift/nonrepeatable/table/admin/admin.application_family.sql module.redshift.terraform_data.run_nonrepeatable_queries[0] (local-exec): ------------------------------------------------------------------- module.redshift.terraform_data.run_nonrepeatable_queries[0] (local-exec): Status: FAILED module.redshift.terraform_data.run_nonrepeatable_queries[0] (local-exec): SQL execution failed. module.redshift.terraform_data.run_nonrepeatable_queries[0] (local-exec): Error message: ERROR: syntax error at or near ")" module.redshift.terraform_data.run_nonrepeatable_queries[0] (local-exec): Position: 244 module.redshift.terraform_data.run_nonrepeatable_queries[0]: Creation complete after 3s [id=ee50ba6c-11ae-5b64-7e2f-86fd8caa8b76]

The following output shows a successful SQL execution.

module.redshift.terraform_data.run_bootstrap_queries[0]: Provisioning with 'local-exec'... module.redshift.terraform_data.run_bootstrap_queries[0] (local-exec): Executing: ["/bin/sh" "-c" "python3 modules/redshift/sql-queries.py src/redshift/bootstrap testcluster-1 db1 arn:aws:secretsmanager:us-east-1:XXXXXXXXXXXX:secret:/redshift/master_user/password-8RapGH us-east-1"] module.redshift.terraform_data.run_bootstrap_queries[0] (local-exec): ------------------------------------------------------------------- module.redshift.terraform_data.run_bootstrap_queries[0] (local-exec): src/redshift/bootstrap/db.sql module.redshift.terraform_data.run_bootstrap_queries[0] (local-exec): ------------------------------------------------------------------- module.redshift.terraform_data.run_bootstrap_queries[0] (local-exec): Status: FINISHED module.redshift.terraform_data.run_bootstrap_queries[0] (local-exec): SQL execution successful. module.redshift.terraform_data.run_bootstrap_queries[0]: Creation complete after 2s [id=d565ef6d-be86-8afd-8e90-111e5ea4a1be]