Getting started with Aurora zero-ETL integrations with Amazon Redshift
Before you create a zero-ETL integration with Amazon Redshift, configure your Aurora DB cluster and your Amazon Redshift data warehouse with the required parameters and permissions. During setup, you'll complete the following steps:
After you complete these tasks, continue to Creating Aurora zero-ETL integrations with Amazon Redshift.
You can use the AWS SDKs to automate the setup process for you. For more information, see Set up an integration using the AWS SDKs (Aurora MySQL only).
Step 1: Create a custom DB cluster parameter group
Aurora zero-ETL integrations with Amazon Redshift require specific values for the
DB cluster parameters that control replication. Specifically, Aurora MySQL requires
enhanced binlog (aurora_enhanced_binlog
), and
Aurora PostgreSQL requires enhanced logical replication
(aurora.enhanced_logical_replication
).
To configure binary logging or logical replication, you must first create a custom DB cluster parameter group, and then associate it with the source DB cluster.
Create a custom DB cluster parameter group with the following settings depending on your source DB engine. For instructions to create a parameter group, see DB cluster parameter groups for Amazon Aurora DB clusters.
Aurora MySQL (aurora-mysql8.0 family):
-
aurora_enhanced_binlog=1
-
binlog_backup=0
-
binlog_format=ROW
-
binlog_replication_globaldb=0
-
binlog_row_image=full
-
binlog_row_metadata=full
In addition, make sure that the binlog_transaction_compression
parameter is not set to ON
, and that the
binlog_row_value_options
parameter is not set
to PARTIAL_JSON
.
For more information about Aurora MySQL enhanced binlog, see Setting up enhanced binlog for Aurora MySQL.
Aurora PostgreSQL (aurora-postgresql15 family):
Note
For Aurora PostgreSQL DB clusters, you must create the custom parameter group within the
Amazon RDS Database Preview
Environment
-
rds.logical_replication=1
-
aurora.enhanced_logical_replication=1
-
aurora.logical_replication_backup=0
-
aurora.logical_replication_globaldb=0
Enabling enhanced logical replication
(aurora.enhanced_logical_replication
) automatically sets the
REPLICA IDENTITY
parameter to FULL
, which means that
all column values are written to the write ahead log (WAL). This will increase the
IOPS for your source DB cluster.
Step 2: Select or create a source DB cluster
After you create a custom DB cluster parameter group, choose or create an Aurora MySQL or Aurora PostgreSQL DB cluster. This cluster will be the source of data replication to Amazon Redshift. For instructions to create a DB cluster, see Creating an Amazon Aurora DB cluster.
The database must be running a supported DB engine version. For a list of supported versions, see Supported Regions and Aurora DB engines for zero-ETL integrations with Amazon Redshift.
Note
You must create Aurora PostgreSQL DB clusters within the Amazon RDS Database Preview
Environment
When you create the database, under Additional configuration, change the default DB cluster parameter group to the custom parameter group that you created in the previous step.
Note
For Aurora MySQL, if you associate the parameter group with the DB cluster after the cluster is already created, you must reboot the primary DB instance in the cluster to apply the changes before you can create a zero-ETL integration. For instructions, see Rebooting an Amazon Aurora DB cluster or Amazon Aurora DB instance.
During the preview release of Aurora PostgreSQL zero-ETL integrations with Amazon Redshift, you must associate the cluster with the custom DB cluster parameter group while creating the cluster. You can't perform this action after the source DB cluster is already created, otherwise you need to delete and recreate the cluster.
Step 3: Create a target Amazon Redshift data warehouse
After you create your source DB cluster, you must create and configure a target data warehouse in Amazon Redshift. The data warehouse must meet the following requirements:
-
Created in preview (for Aurora PostgreSQL sources only). For Aurora MySQL sources, you must create production clusters and workgroups.
-
To create a provisioned cluster in preview, choose Create preview cluster from the banner on the provisioned clusters dashboard. For more information, see Creating a preview cluster.
When creating the cluster, set the Preview track to
preview_2023
. -
To create a Redshift Serverless workgroup in preview, choose Create preview workgroup from the banner on the Serverless dashboard. For more information, see Creating a preview workgroup.
-
-
Using an RA3 node type , or Redshift Serverless.
-
Encrypted (if using a provisioned cluster). For more information, see Amazon Redshift database encryption.
For instructions to create a data warehouse, see Creating a cluster for provisioned clusters, or Creating a workgroup with a namespace for Redshift Serverless.
Enable case sensitivity on the data warehouse
For the integration to be successful, the case sensitivity parameter (enable_case_sensitive_identifier
) must be enabled for
the data warehouse. By default, case sensitivity is disabled on all provisioned
clusters and Redshift Serverless workgroups.
To enable case sensitivity, perform the following steps depending on your data warehouse type:
-
Provisioned cluster – To enable case sensitivity on a provisioned cluster, create a custom parameter group with the
enable_case_sensitive_identifier
parameter enabled. Then, associate the parameter group with the cluster. For instructions, see Managing parameter groups using the console or Configuring parameter values using the AWS CLI.Note
Remember to reboot the cluster after you associate the custom parameter group with it.
-
Serverless workgroup – To enable case sensitivity on a Redshift Serverless workgroup, you must use the AWS CLI. The Amazon Redshift console doesn't currently support modifying Redshift Serverless parameter values. Send the following update-workgroup request:
aws redshift-serverless update-workgroup \ --workgroup-name
target-workgroup
\ --config-parameters parameterKey=enable_case_sensitive_identifier,parameterValue=trueYou don't need to reboot a workgroup after you modify its parameter values.
Configure authorization for the data warehouse
After you create a data warehouse, you must configure the source Aurora DB cluster as an authorized integration source. For instructions, see Configure authorization for your Amazon Redshift data warehouse.
Set up an integration using the AWS SDKs (Aurora MySQL only)
Rather than setting up each resource manually, you can run the following Python script
to automatically set up the required resources for you. The code example uses the
AWS SDK for Python (Boto3)
To install the required dependencies, run the following commands:
pip install boto3 pip install time
Within the script, optionally modify the names of the source, target, and parameter
groups. The final function creates an integration named my-integration
after the resources are set up.
import boto3 import time # Build the client using the default credential configuration. # You can use the CLI and run 'aws configure' to set access key, secret # key, and default Region. rds = boto3.client('rds') redshift = boto3.client('redshift') sts = boto3.client('sts') source_cluster_name = 'my-source-cluster' # A name for the source cluster source_param_group_name = 'my-source-param-group' # A name for the source parameter group target_cluster_name = 'my-target-cluster' # A name for the target cluster target_param_group_name = 'my-target-param-group' # A name for the target parameter group def create_source_cluster(*args): """Creates a source Aurora MySQL DB cluster""" response = rds.create_db_cluster_parameter_group( DBClusterParameterGroupName=source_param_group_name, DBParameterGroupFamily='aurora-mysql8.0', Description='For Aurora MySQL zero-ETL integrations' ) print('Created source parameter group: ' + response['DBClusterParameterGroup']['DBClusterParameterGroupName']) response = rds.modify_db_cluster_parameter_group( DBClusterParameterGroupName=source_param_group_name, Parameters=[ { 'ParameterName': 'aurora_enhanced_binlog', 'ParameterValue': '1', 'ApplyMethod': 'pending-reboot' }, { 'ParameterName': 'binlog_backup', 'ParameterValue': '0', 'ApplyMethod': 'pending-reboot' }, { 'ParameterName': 'binlog_format', 'ParameterValue': 'ROW', 'ApplyMethod': 'pending-reboot' }, { 'ParameterName': 'binlog_replication_globaldb', 'ParameterValue': '0', 'ApplyMethod': 'pending-reboot' }, { 'ParameterName': 'binlog_row_image', 'ParameterValue': 'full', 'ApplyMethod': 'pending-reboot' }, { 'ParameterName': 'binlog_row_metadata', 'ParameterValue': 'full', 'ApplyMethod': 'pending-reboot' } ] ) print('Modified source parameter group: ' + response['DBClusterParameterGroupName']) response = rds.create_db_cluster( DBClusterIdentifier=source_cluster_name, DBClusterParameterGroupName=source_param_group_name, Engine='aurora-mysql', EngineVersion='8.0.mysql_aurora.3.05.2', DatabaseName='myauroradb', MasterUsername='
username
', MasterUserPassword='Password01**
' ) print('Creating source cluster: ' + response['DBCluster']['DBClusterIdentifier']) source_arn = (response['DBCluster']['DBClusterArn']) create_target_cluster(target_cluster_name, source_arn, target_param_group_name) response = rds.create_db_instance( DBInstanceClass='db.r6g.2xlarge', DBClusterIdentifier=source_cluster_name, DBInstanceIdentifier=source_cluster_name + '-instance', Engine='aurora-mysql' ) return(response) def create_target_cluster(target_cluster_name, source_arn, target_param_group_name): """Creates a target Redshift cluster""" response = redshift.create_cluster_parameter_group( ParameterGroupName=target_param_group_name, ParameterGroupFamily='redshift-1.0', Description='For Aurora MySQL zero-ETL integrations' ) print('Created target parameter group: ' + response['ClusterParameterGroup']['ParameterGroupName']) response = redshift.modify_cluster_parameter_group( ParameterGroupName=target_param_group_name, Parameters=[ { 'ParameterName': 'enable_case_sensitive_identifier', 'ParameterValue': 'true' } ] ) print('Modified target parameter group: ' + response['ParameterGroupName']) response = redshift.create_cluster( ClusterIdentifier=target_cluster_name, NodeType='ra3.4xlarge', NumberOfNodes=2, Encrypted=True, MasterUsername='username
', MasterUserPassword='Password01**
', ClusterParameterGroupName=target_param_group_name ) print('Creating target cluster: ' + response['Cluster']['ClusterIdentifier']) # Retrieve the target cluster ARN response = redshift.describe_clusters( ClusterIdentifier=target_cluster_name ) target_arn = response['Clusters'][0]['ClusterNamespaceArn'] # Retrieve the current user's account ID response = sts.get_caller_identity() account_id = response['Account'] # Create a resource policy specifying cluster ARN and account ID response = redshift.put_resource_policy( ResourceArn=target_arn, Policy=''' { \"Version\":\"2012-10-17\", \"Statement\":[ {\"Effect\":\"Allow\", \"Principal\":{ \"Service\":\"redshift.amazonaws.com\" }, \"Action\":[\"redshift:AuthorizeInboundIntegration\"], \"Condition\":{ \"StringEquals\":{ \"aws:SourceArn\":\"%s\"} } }, {\"Effect\":\"Allow\", \"Principal\":{ \"AWS\":\"arn:aws:iam::%s:root\"}, \"Action\":\"redshift:CreateInboundIntegration\"} ] } ''' % (source_arn, account_id) ) return(response) def wait_for_cluster_availability(*args): """Waits for both clusters to be available""" print('Waiting for clusters to be available...') response = rds.describe_db_clusters( DBClusterIdentifier=source_cluster_name ) source_status = response['DBClusters'][0]['Status'] source_arn = response['DBClusters'][0]['DBClusterArn'] response = rds.describe_db_instances( DBInstanceIdentifier=source_cluster_name + '-instance' ) source_instance_status = response['DBInstances'][0]['DBInstanceStatus'] response = redshift.describe_clusters( ClusterIdentifier=target_cluster_name ) target_status = response['Clusters'][0]['ClusterStatus'] target_arn = response['Clusters'][0]['ClusterNamespaceArn'] # Every 60 seconds, check whether the clusters are available. if source_status != 'available' or target_status != 'available' or source_instance_status != 'available': time.sleep(60) response = wait_for_cluster_availability( source_cluster_name, target_cluster_name) else: print('Clusters available. Ready to create zero-ETL integration.') create_integration(source_arn, target_arn) return def create_integration(source_arn, target_arn): """Creates a zero-ETL integration using the source and target clusters""" response = rds.create_integration( SourceArn=source_arn, TargetArn=target_arn, IntegrationName='my-integration
' ) print('Creating integration: ' + response['IntegrationName']) def main(): """main function""" create_source_cluster(source_cluster_name, source_param_group_name) wait_for_cluster_availability(source_cluster_name, target_cluster_name) if __name__ == "__main__": main()
Next steps
With a source Aurora DB cluster and an Amazon Redshift target data warehouse, you can now create a zero-ETL integration and replicate data. For instructions, see Creating Aurora zero-ETL integrations with Amazon Redshift.