Unload data from an Amazon Redshift cluster across accounts to Amazon S3 - AWS Prescriptive Guidance

Unload data from an Amazon Redshift cluster across accounts to Amazon S3

Created by Andrew Kamel (AWS)

Code repository: aws-unload-redshift-to-s3-python

Environment: Production

Technologies: Databases; Analytics; Serverless

Workload: Open-source

AWS services: AWS Lambda; Amazon Redshift; Amazon S3; AWS Secrets Manager

Summary

When you test applications, it's helpful to have production data in your test environment. Using production data can give you a more accurate assessment of the application that you're developing.

This pattern extracts data from an Amazon Redshift cluster in a production environment to an Amazon Simple Storage Service (Amazon S3) bucket in a development environment on Amazon Web Services (AWS).

The pattern steps through the setup of both DEV and PROD accounts, including the following:

  • Required resources

  • AWS Identity and Access Management (IAM) roles

  • Network adjustments to subnets, security groups, and the virtual private cloud (VPC) to support the Amazon Redshift connection

  • An example AWS Lambda function with a Python runtime for testing the architecture

To grant access to the Amazon Redshift cluster, the pattern uses AWS Secrets Manager to store the relevant credentials. The benefit is that your code has all the information it needs to connect directly to the Amazon Redshift cluster, without needing to know where the cluster resides. Additionally, you can monitor use of the secret.

The secret saved in Secrets Manager includes the Amazon Redshift cluster's host, database name, port, and relevant credentials.

For information about security considerations when using this pattern, see the Best practices section.

Prerequisites and limitations

Prerequisites 

Limitations 

  • Depending on the amount of data that you want to query, the Lambda function might time out.

    If your run takes more time than the maximum Lambda timeout (15 minutes), use an asynchronous approach for your Lambda code. The code example for this pattern uses the psycopg2 library for Python, which doesn't currently support asynchronous processing. One possible alternative is sketched after this list.

  • Some AWS services aren’t available in all AWS Regions. For Region availability, see AWS services by Region. For specific endpoints, see the Service endpoints and quotas page, and choose the link for the service.
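If you need an asynchronous approach, one option (not used by this pattern's code example) is the Amazon Redshift Data API, which submits the statement and returns immediately instead of holding a database connection open. The following is a minimal sketch, assuming a hypothetical cluster identifier and a placeholder secret ARN:

import boto3

# The Redshift Data API runs the statement asynchronously; the call returns immediately.
client = boto3.client('redshift-data', region_name='eu-west-1')

response = client.execute_statement(
    ClusterIdentifier='example-redshift-cluster',   # hypothetical cluster identifier
    Database='dev',
    SecretArn='arn:aws:secretsmanager:eu-west-1:111111111111:secret:Redshift-Creds-Secret',  # placeholder ARN
    Sql="UNLOAD ('select * from category') TO 's3://mybucket/' IAM_ROLE '<chained-role-arns>' DELIMITER '|';"
)

# Check the status later (for example, from a second invocation or an AWS Step Functions state).
print(client.describe_statement(Id=response['Id'])['Status'])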

Architecture

The following diagram shows the target architecture, with DEV and PROD accounts.

The Lambda VPC in the DEV account and the Amazon Redshift VPC in the PROD account.

The diagram shows the following workflow:

  1. The Lambda function in the DEV account assumes the IAM role that's required to access the Amazon Redshift credentials in Secrets Manager in the PROD account.

    The Lambda function then retrieves the Amazon Redshift cluster secret.

  2. The Lambda function in the DEV account uses the information to connect to the Amazon Redshift cluster in the PROD account through the peered VPCs.

    The Lambda function then sends an unload command to query the Amazon Redshift cluster in the PROD account.

  3. The Amazon Redshift cluster in the PROD account assumes the relevant IAM role to access the S3 bucket in the DEV account.

    The Amazon Redshift cluster unloads the queried data to the S3 bucket in the DEV account.

Querying data from Amazon Redshift

The following diagram shows the roles that are used to retrieve the Amazon Redshift credentials and connect to the Amazon Redshift cluster. The workflow is initiated by the Lambda function.

The three-step process for assuming roles across accounts.

The diagram shows the following workflow:

  1. The CrossAccount-SM-Read-Role in the DEV account assumes the SM-Read-Role in the PROD account.

  2. The SM-Read-Role role uses the attached policy to retrieve the secret from Secrets Manager.

  3. The credentials are used to access the Amazon Redshift cluster.

Uploading data to Amazon S3

The following diagram shows the cross-account read-write process for extracting data and uploading it to Amazon S3. The workflow is initiated by the Lambda function. The pattern chains IAM roles in Amazon Redshift. The unload command that comes from the Amazon Redshift cluster assumes the CrossAccount-S3-Write-Role, and then assumes the S3-Write-Role. This role chaining gives Amazon Redshift access to Amazon S3.

The roles that get credentials, access Amazon Redshift, and upload data to Amazon S3.

The workflow includes the following steps:

  1. The CrossAccount-SM-Read-Role in the DEV account assumes the SM-Read-Role in the PROD account.

  2. The SM-Read-Role retrieves the Amazon Redshift credentials from Secrets Manager.

  3. The Lambda function connects to the Amazon Redshift cluster and sends a query.

  4. The Amazon Redshift cluster assumes the CrossAccount-S3-Write-Role.

  5. The CrossAccount-S3-Write-Role assumes the S3-Write-Role in the DEV account.

  6. The query results are unloaded to the S3 bucket in the DEV account.
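The role chaining is expressed in the UNLOAD command itself: the IAM_ROLE clause receives both role Amazon Resource Names (ARNs), comma separated, and Amazon Redshift assumes them in order. The following is a minimal sketch with hypothetical account IDs; the full Lambda code appears in the Epics section:

# Hypothetical account IDs; Amazon Redshift assumes the roles from left to right.
redshift_s3_write_role = "arn:aws:iam::111111111111:role/CrossAccount-S3-Write-Role"  # PROD account
s3_write_role = "arn:aws:iam::222222222222:role/S3-Write-Role"                        # DEV account

unload_command = (
    "UNLOAD ('select * from category') "
    "TO 's3://mybucket/category_' "
    "IAM_ROLE '{},{}' "
    "DELIMITER '|' REGION 'eu-west-1';"
).format(redshift_s3_write_role, s3_write_role)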

Tools

AWS services

  • AWS Key Management Service (AWS KMS) helps you create and control cryptographic keys to help protect your data.

  • AWS Lambda is a compute service that helps you run code without needing to provision or manage servers. It runs your code only when needed and scales automatically, so you pay only for the compute time that you use.

  • Amazon Redshift is a managed petabyte-scale data warehouse service in the AWS Cloud.

  • AWS Secrets Manager helps you replace hardcoded credentials in your code, including passwords, with an API call to Secrets Manager to retrieve the secret programmatically.

  • Amazon Simple Storage Service (Amazon S3) is a cloud-based object storage service that helps you store, protect, and retrieve any amount of data.

Code repository

The code for this pattern is available in the GitHub unload-redshift-to-s3-python repository.

Best practices

Security disclaimer

Before you implement this solution, consider the following important security recommendations:

  • Remember that connecting development and production accounts can increase the scope of impact and lower your overall security posture. We recommend that you deploy this solution only temporarily, extract the required portion of data, and then immediately destroy the deployed resources. To destroy the resources, delete the Lambda function, remove any IAM roles and policies created for this solution, and revoke any network access that was granted between the accounts.

  • Consult your security and compliance teams before copying any data from production to development environments. Personally identifiable information (PII), protected health information (PHI), and other confidential or regulated data generally should not be copied in this manner. Copy only publicly available, non-confidential information (for example, public stock data from a shop frontend). Whenever possible, consider tokenizing or anonymizing data, or generating synthetic test data, instead of using production data. One of the AWS security principles is to keep people away from data. In other words, developers should not perform operations in the production account.

  • Restrict access to the Lambda function in the development account because it can read data from the Amazon Redshift cluster in the production environment.

  • To avoid disrupting the production environment, implement the following recommendations:

    • Use a separate, dedicated development account for testing and development activities.

    • Implement strict network access controls and limit traffic between accounts to only what is necessary.

    • Monitor and audit access to the production environment and data sources.

    • Implement least-privilege access controls for all resources and services involved.

    • Regularly review and rotate credentials, such as AWS Secrets Manager secrets and IAM role access keys.

  • Review the security documentation for each AWS service that's used in this pattern.

Security is a top priority when accessing production data and resources. Always follow best practices, implement least-privilege access controls, and regularly review and update your security measures.

Epics

Task | Description | Skills required

Create a secret for the Amazon Redshift cluster.

To create the secret for the Amazon Redshift cluster, do the following:

  1. In the PROD account, sign in to the AWS Management Console, and open the Secrets Manager console at https://console.aws.amazon.com/secretsmanager/.

  2. Choose Store a new secret.

  3. Select Credentials for Amazon Redshift data warehouse.

  4. For User name and Password, enter values for your instance, and confirm or choose a value for Encryption key.

  5. Choose the Amazon Redshift data warehouse that your secret will access.

  6. Enter Redshift-Creds-Secret for the secret name.

  7. Complete the remaining creation steps with the default choices, and then choose Store.

  8. View your secret, and note the Secret ARN value that was generated to identify the secret.
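The console flow stores the credentials as a JSON secret. The Lambda code later in this pattern reads the host, port, username, and password keys from that secret. The following is a minimal, optional sketch of creating an equivalent secret programmatically with boto3, using placeholder values:

import json
import boto3

# Run with credentials for the PROD account.
sm_client = boto3.client('secretsmanager', region_name='eu-west-1')

# Placeholder values; the Lambda code expects the host, port, username, and password keys.
sm_client.create_secret(
    Name='Redshift-Creds-Secret',
    SecretString=json.dumps({
        "username": "awsuser",
        "password": "example-password",
        "host": "example-cluster.abc123xyz789.eu-west-1.redshift.amazonaws.com",
        "port": 5439
    })
)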

DevOps engineer

Create a role to access Secrets Manager.

To create the role, do the following:

  1. In the PROD account, open the IAM console at https://console.aws.amazon.com/iam/.

  2. Choose Policies.

  3. Choose Create policy.

  4. Choose the JSON tab, and then enter an IAM policy such as the following:

    { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "secretsmanager:GetResourcePolicy", "secretsmanager:GetSecretValue", "secretsmanager:DescribeSecret", "secretsmanager:ListSecretVersionIds" ], "Resource": [ "<Redshift-Creds-Secret-ARN>" ] }, { "Effect": "Allow", "Action": "secretsmanager:ListSecrets", "Resource": "*" } ] }

    Replace Redshift-Creds-Secret-ARN with the Amazon Resource Name (ARN) of the Secrets Manager secret that contains the information and credentials for the Amazon Redshift cluster.

  5. Choose Review policy, enter a name for the policy (for example, SM-Read-Policy), and then choose Create policy.

  6. In the navigation pane, choose Roles, and then choose Create role.

  7. For the trusted entity type, choose Custom trust policy, and enter a trust policy that allows the Lambda role in the DEV account to assume this role (a sketch follows these steps).

  8. Choose Next: Permissions, and then select the policy that you created.

  9. Enter SM-Read-Role as the role name, and then choose Create role.
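The custom trust policy for SM-Read-Role must allow the Lambda role in the DEV account to assume it. The following is a minimal sketch of creating the role programmatically with boto3, using the account number placeholders from the Lambda code example, the same external ID placeholder, and the example policy name from step 5:

import json
import boto3

# Run with credentials for the PROD account.
iam = boto3.client('iam')

# Trust policy: only the Lambda role in the DEV account can assume SM-Read-Role,
# and only when it supplies the expected external ID.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::DEV_ACCOUNT_NUMBER:role/CrossAccount-SM-Read-Role"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {"sts:ExternalId": "YOUR_EXTERNAL_ID"}
            }
        }
    ]
}

iam.create_role(
    RoleName="SM-Read-Role",
    AssumeRolePolicyDocument=json.dumps(trust_policy)
)

# Attach the Secrets Manager read policy that you created in the previous steps (example name).
iam.attach_role_policy(
    RoleName="SM-Read-Role",
    PolicyArn="arn:aws:iam::PROD_ACCOUNT_NUMBER:policy/SM-Read-Policy"
)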

DevOps engineer
Task | Description | Skills required

Create a role to access the S3 bucket.

To create the role for accessing the S3 bucket, do the following:

  1. In the DEV account, open the IAM console.

  2. Choose Policies.

  3. Choose Create policy.

  4. Choose the JSON tab, and then enter an IAM policy such as the following:

    { "Version": "2012-10-17", "Statement": [ { "Sid": "kmsstmt", "Effect": "Allow", "Action": [ "kms:Decrypt", "kms:Encrypt", "kms:GenerateDataKey" ], "Resource": [ "<kms-key-arn>" ] }, { "Sid": "s3stmt", "Effect": "Allow", "Action": [ "s3:PutObject", "s3:Get*", "s3:List*" ], "Resource": [ "arn:aws:s3:::mybucket", "arn:aws:s3:::mybucket/*" ] } ] }

    Replace mybucket with the name of the S3 bucket that you want to access. Also, if the S3 bucket is encrypted, replace kms-key-arn with the ARN of the AWS Key Management Service (AWS KMS) key used to encrypt the S3 bucket. Otherwise, you don't need the AWS KMS section in the policy.

  5. Choose Review policy, enter S3-Write-Policy as the policy name, and then choose Create policy.

  6. In the navigation pane, choose Roles.

  7. Choose Create role.

  8. For the trusted entity type, choose Custom trust policy, and enter a trust policy that allows the CrossAccount-S3-Write-Role in the PROD account to assume this role (a sketch follows these steps).

  9. Choose Next: Permissions, and then select the S3-Write-Policy policy that you created.

  10. Enter S3-Write-Role as the role name, and then choose Create role.
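In step 8, the custom trust policy for S3-Write-Role must allow the CrossAccount-S3-Write-Role in the PROD account (which you create in the next task) to assume it. The following is a minimal sketch of that trust relationship using boto3, with the same account number placeholders as the Lambda code example:

import json
import boto3

# Run with credentials for the DEV account.
iam = boto3.client('iam')

# Trust policy: only the CrossAccount-S3-Write-Role in the PROD account
# (associated with the Amazon Redshift cluster) can assume S3-Write-Role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::PROD_ACCOUNT_NUMBER:role/CrossAccount-S3-Write-Role"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

iam.create_role(
    RoleName="S3-Write-Role",
    AssumeRolePolicyDocument=json.dumps(trust_policy)
)

iam.attach_role_policy(
    RoleName="S3-Write-Role",
    PolicyArn="arn:aws:iam::DEV_ACCOUNT_NUMBER:policy/S3-Write-Policy"
)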

DevOps engineer

Create the Amazon Redshift role.

To create the Amazon Redshift role, do the following:

  1. In the PROD account, open the IAM console.

  2. Choose Policies.

  3. Choose Create policy.

  4. Choose the JSON tab, and then enter an IAM policy such as the following:

    { "Version": "2012-10-17", "Statement": [ { "Sid": "CrossAccountPolicy", "Effect": "Allow", "Action": "sts:AssumeRole", "Resource": "S3-Write-Role-ARN" } ] }

    Replace S3-Write-Role-ARN with the ARN for the S3-Write-Role in the DEV account.

  5. Choose Review policy, enter S3-Write-Role-Assume-Policy as the policy name, and then choose Create policy.

  6. In the navigation pane, choose Roles, and then choose Create role.

  7. Choose AWS service as your trusted entity type, and then choose Redshift, Redshift - Customizable.

  8. Choose Next: Permissions, and then select the S3-Write-Role-Assume-Policy policy that you created.

  9. Enter CrossAccount-S3-Write-Role as the role name, and then choose Create role.

  10. Associate the IAM role with your Amazon Redshift cluster.
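For step 10, you can associate the role by using the Amazon Redshift console or programmatically. The following is a minimal sketch using boto3, assuming a hypothetical cluster identifier:

import boto3

# Run with credentials for the PROD account.
redshift = boto3.client('redshift', region_name='eu-west-1')

# Attach CrossAccount-S3-Write-Role to the cluster so that the UNLOAD command can assume it.
redshift.modify_cluster_iam_roles(
    ClusterIdentifier='example-redshift-cluster',  # hypothetical cluster identifier
    AddIamRoles=[
        'arn:aws:iam::PROD_ACCOUNT_NUMBER:role/CrossAccount-S3-Write-Role'
    ]
)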

DevOps engineer
Task | Description | Skills required

Deploy the Lambda function.

To deploy a Lambda function in the peered VPC, do the following:

  1. Open the Lambda console at https://console.aws.amazon.com/lambda/.

  2. Choose Functions.

  3. Choose Create function.

  4. Under Basic information, for Function name, enter a name for your function.

  5. For Runtime, choose Python 3.8.

  6. Expand Change default execution role, and then do the following:

    1. Choose Use an existing role.

    2. For Existing role, select the CrossAccount-SM-Read-Role Lambda role that you created previously.

  7. Expand Advanced settings, and do the following:

    1. Select the Enable VPC check box.

    2. For VPC, select the peered VPC in the DEV account.

    3. For Subnets, select the private subnet.

    4. For Security groups, select the default security group.

  8. Choose Create function.

  9. Add the psycopg2 library as a layer to the Lambda function.

    Note: You can use an already deployed layer from the psycopg2-lambda-layer repository. Make sure to use the URL based on your AWS Region and Python runtime.
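If you prefer to attach the layer programmatically, the following is a minimal sketch using boto3; the function name and layer ARN are placeholders that depend on your setup, AWS Region, and Python runtime:

import boto3

# Run with credentials for the DEV account.
lambda_client = boto3.client('lambda', region_name='eu-west-1')

# Attach a prebuilt psycopg2 layer to the function (placeholder ARN).
lambda_client.update_function_configuration(
    FunctionName='example-unload-function',  # the function name that you chose
    Layers=['arn:aws:lambda:eu-west-1:111111111111:layer:psycopg2:1']  # placeholder layer ARN
)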

DevOps engineer
Task | Description | Skills required

Import the required libraries.

To import the required libraries, add the following import statements to your Lambda code:

import ast
import base64
import datetime  # used to timestamp the unload file path

import boto3
import psycopg2
from botocore.exceptions import ClientError
App developer

Run the Lambda handler function.

The Lambda function uses AWS Security Token Service (AWS STS) for cross-account access and temporary credential management. The function uses the AssumeRole API operation to temporarily assume the permissions of the sm_read_role IAM role.

To run the Lambda function, use the following example code:

def lambda_handler(event, context):
    sts_client = boto3.client('sts')

    # Secrets Manager configurations
    secret_name = "redshift_creds"  # name of the secret stored in Secrets Manager
    sm_region = "eu-west-1"
    sm_read_role = "arn:aws:iam::PROD_ACCOUNT_NUMBER:role/SM-Read-Role"

    # S3 bucket configurations
    s3_bucket_path = "s3://mybucket/"
    s3_bucket_region = "eu-west-1"
    s3_write_role = "arn:aws:iam::DEV_ACCOUNT_NUMBER:role/S3-Write-Role"

    # Amazon Redshift configurations
    sql_query = "select * from category"
    redshift_db = "dev"
    redshift_s3_write_role = "arn:aws:iam::PROD_ACCOUNT_NUMBER:role/CrossAccount-S3-Write-Role"
    chained_s3_write_role = "%s,%s" % (redshift_s3_write_role, s3_write_role)

    # Assume the Secrets Manager read role in the PROD account
    assumed_role_object = sts_client.assume_role(
        RoleArn=sm_read_role,
        RoleSessionName="CrossAccountRoleAssumption",
        ExternalId="YOUR_EXTERNAL_ID",  # must match the external ID in the role's trust policy
    )
    credentials = assumed_role_object['Credentials']

    # Retrieve and parse the Amazon Redshift secret
    secret_dict = ast.literal_eval(get_secret(credentials, secret_name, sm_region))

    # Unload the query results to the S3 bucket in the DEV account
    execute_query(secret_dict, sql_query, s3_bucket_path, chained_s3_write_role,
                  s3_bucket_region, redshift_db)

    return {
        'statusCode': 200
    }
App developer

Get the secret.

To get the Amazon Redshift secret, use the following example code:

def get_secret(credentials, secret_name, sm_region):
    # Create a Secrets Manager client
    session = boto3.session.Session()
    sm_client = session.client(
        service_name='secretsmanager',
        aws_access_key_id=credentials['AccessKeyId'],
        aws_secret_access_key=credentials['SecretAccessKey'],
        aws_session_token=credentials['SessionToken'],
        region_name=sm_region
    )

    try:
        get_secret_value_response = sm_client.get_secret_value(
            SecretId=secret_name
        )
    except ClientError as e:
        print(e)
        raise e
    else:
        if 'SecretString' in get_secret_value_response:
            return get_secret_value_response['SecretString']
        else:
            return base64.b64decode(get_secret_value_response['SecretBinary'])
App developer

Run the unload command.

To unload the data to the S3 bucket, use the following example code:

def execute_query(secret_dict, sql_query, s3_bucket_path, chained_s3_write_role,
                  s3_bucket_region, redshift_db):
    # Connect to the Amazon Redshift cluster with the retrieved credentials
    conn_string = "dbname='%s' port='%s' user='%s' password='%s' host='%s'" \
                  % (redshift_db, secret_dict["port"], secret_dict["username"],
                     secret_dict["password"], secret_dict["host"])
    con = psycopg2.connect(conn_string)

    # Build the UNLOAD command with the chained IAM roles
    unload_command = "UNLOAD ('{}') TO '{}' IAM_ROLE '{}' DELIMITER '|' REGION '{}';" \
        .format(sql_query, s3_bucket_path + str(datetime.datetime.now()) + ".csv",
                chained_s3_write_role, s3_bucket_region)

    # Open a cursor and run the query
    cur = con.cursor()
    cur.execute(unload_command)

    # UNLOAD doesn't return a result set, so there are no rows to fetch
    print("UNLOAD command completed")

    cur.close()
    con.close()
App developer
Task | Description | Skills required

Delete the Lambda function.

To avoid incurring unplanned costs, remove the resources and the connection between the DEV and PROD accounts.

To remove the Lambda function, do the following:

  1. Open the AWS Lambda console at https://console.aws.amazon.com/lambda/.

  2. Locate and select the Lambda function you created.

  3. Choose Actions, and then choose Delete.

  4. Confirm the deletion.
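If you prefer to script the cleanup, the following is a minimal sketch using boto3, assuming a placeholder function name:

import boto3

# Run with credentials for the DEV account.
lambda_client = boto3.client('lambda', region_name='eu-west-1')

# Delete the Lambda function that was created for this pattern (placeholder name).
lambda_client.delete_function(FunctionName='example-unload-function')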

DevOps engineer

Remove the IAM roles and policies.

Remove the IAM roles and policies from the DEV and PROD accounts.

In the DEV account, do the following:

  1. Open the IAM console.

  2. Delete the following roles:

    • S3-Write-Role

    • CrossAccount-SM-Read-Role (Lambda role)

  3. Delete the associated policies:

    • S3-Write-Policy

    • The CrossAccount policy for assuming PROD account roles

In the PROD account, do the following:

  1. Open the IAM console.

  2. Delete the following roles:

    • SM-Read-Role

    • CrossAccount-S3-Write-Role

  3. Delete the associated policies:

    • The CrossAccount policy for accessing Secrets Manager

    • S3-Write-Role-Assume-Policy

DevOps engineer

Delete the secret in Secrets Manager.

To delete the secret, do the following:

  1. In the PROD account, open the Secrets Manager console.

  2. Locate and select the secret named Redshift-Creds-Secret.

  3. Choose Actions, and then choose Delete secret.

  4. Confirm the deletion.
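You can also schedule the secret for deletion programmatically. The following is a minimal sketch; Secrets Manager applies a recovery window (7–30 days) before the secret is permanently deleted:

import boto3

# Run with credentials for the PROD account.
sm_client = boto3.client('secretsmanager', region_name='eu-west-1')

# Schedule the secret for deletion with the minimum recovery window.
sm_client.delete_secret(
    SecretId='Redshift-Creds-Secret',
    RecoveryWindowInDays=7
)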

DevOps engineer

Remove VPC peering and security group rules.

To remove VPC peering and security group rules, do the following:

  1. In the PROD account, open the Amazon EC2 console at https://console.aws.amazon.com/ec2/.

  2. Navigate to Security Groups.

  3. Find the security group used by the Amazon Redshift cluster.

  4. Edit the inbound rules, and remove the rule that allows connections from the DEV account's Lambda VPC.

  5. Navigate to VPC Peering Connections, and delete the peering connection.

DevOps engineer

Remove data from the S3 bucket.

To remove the data from Amazon S3, do the following:

  1. In the DEV account, open the Amazon S3 console at https://console.aws.amazon.com/s3/.

  2. Locate the bucket you used for data storage.

  3. Delete the objects within the bucket, or delete the entire bucket if it's no longer needed.

DevOps engineer

Clean up AWS KMS keys.

If you created any custom AWS KMS keys for encryption, do the following:

  1. Open the AWS KMS console at https://console.aws.amazon.com/kms/.

  2. Locate any keys created for this pattern.

  3. Schedule the keys for deletion. (There is a mandatory waiting period for key deletion).

DevOps engineer

Review and delete Amazon CloudWatch logs.

To delete the CloudWatch logs, do the following:

  1. Open the CloudWatch console at https://console.aws.amazon.com/cloudwatch/.

  2. Check for any log groups created by your Lambda function or Amazon Redshift cluster.

  3. Delete these log groups if they are no longer needed.

DevOps engineer

Related resources

Additional information

After you unload the data from Amazon Redshift to Amazon S3, you can analyze it by using Amazon Athena.

Amazon Athena is an interactive query service that's helpful when you need to analyze large volumes of data. You can use Athena without having to provision servers or databases. Athena supports complex queries, and you can run them against objects that are stored in Amazon S3.

The main benefit of using Athena is the flexibility it provides in how you run queries, without added operational complexity. When you use Athena, you can query data in different formats, such as CSV and JSON, in Amazon S3 without transforming the data. You can query data from various sources, including sources outside AWS. Athena reduces complexity because you don't have to manage servers. Athena reads data directly from Amazon S3 without loading or changing the data before you run the query.
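For example, after the unloaded files are in the S3 bucket, you can define an external table over them (for example, through AWS Glue or a CREATE EXTERNAL TABLE statement) and query it. The following is a minimal sketch using boto3, with hypothetical database, table, and output locations:

import boto3

# Run with credentials for the DEV account.
athena = boto3.client('athena', region_name='eu-west-1')

# Query the unloaded, pipe-delimited files (hypothetical database and table names).
response = athena.start_query_execution(
    QueryString="SELECT * FROM category_unload LIMIT 10;",
    QueryExecutionContext={'Database': 'unload_db'},
    ResultConfiguration={'OutputLocation': 's3://mybucket/athena-results/'}
)

print(response['QueryExecutionId'])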