Converting ETL processes to AWS Glue in AWS Schema Conversion Tool - AWS Schema Conversion Tool

Converting ETL processes to AWS Glue in AWS Schema Conversion Tool

In the following sections, you can find a description of a conversion that calls AWS Glue API operations in Python. For more information, see Program AWS Glue ETL scripts in Python in the AWS Glue Developer Guide.

Step 1: Create a database

The first step is to create a new database in an AWS Glue Data Catalog by using the AWS SDK API. When you define a table in the Data Catalog, you add it to a database. A database is used to organize the tables in AWS Glue.

The following example demonstrates the create_database method of the Python API for AWS Glue.

response = client.create_database( DatabaseInput={ 'Name': 'database_name’, 'Description': 'description', 'LocationUri': 'string', 'Parameters': { 'parameter-name': 'parameter value' } } )

If you are using Amazon Redshift, the database name is formed as follows.

{redshift_cluster_name}_{redshift_database_name}_{redshift_schema_name}

The full name of Amazon Redshift cluster for this example is as follows.

rsdbb03.apq1mpqso.us-west-2.redshift.amazonaws.com

The following shows an example of a well-formed database name. In this case rsdbb03 is the name, which is the first part of the full name of the cluster endpoint. The database is named dev and the schema is ora_glue.

rsdbb03_dev_ora_glue

Step 2: Create a connection

Create a new connection in a Data Catalog by using the AWS SDK API.

The following example demonstrates using the create_connection method of the Python API for AWS Glue.

response = client.create_connection( ConnectionInput={ 'Name': 'Redshift_abcde03.aabbcc112233.us-west-2.redshift.amazonaws.com_dev', 'Description': 'Created from SCT', 'ConnectionType': 'JDBC', 'ConnectionProperties': { 'JDBC_CONNECTION_URL': 'jdbc:redshift://aabbcc03.aabbcc112233.us-west-2.redshift.amazonaws.com:5439/dev', 'USERNAME': 'user_name', 'PASSWORD': 'password' }, 'PhysicalConnectionRequirements': { 'AvailabilityZone': 'us-west-2c', 'SubnetId': 'subnet-a1b23c45', 'SecurityGroupIdList': [ 'sg-000a2b3c', 'sg-1a230b4c', 'sg-aba12c3d', 'sg-1abb2345' ] } } )

The parameters used in create_connection are as follows:

  • Name (UTF-8 string) – required. For Amazon Redshift, the connection name is formed as follows: Redshift_<Endpoint-name>_<redshift-database-name>, for example: Redshift_abcde03_dev

  • Description (UTF-8 string) – Your description of the connection.

  • ConnectionType (UTF-8 string) – Required. The type of connection. Currently, only JDBC is supported; SFTP is not supported.

  • ConnectionProperties (dict) – Required. A list of key-value pairs used as parameters for this connection, including the JDBC connection URL, the user name, and the password.

  • PhysicalConnectionRequirements (dict) – Physical connection requirements, which include the following:

    • SubnetId (UTF-8 string) – The ID of the subnet used by the connection.

    • SecurityGroupIdList (list) – The security group ID list used by the connection.

    • AvailabilityZone (UTF-8 string) – Required. The Availability Zone that contains the endpoint. This parameter is deprecated.

Step 3: Create an AWS Glue crawler

Next, you create an AWS Glue crawler to populate the AWS Glue catalog. For more information, see Cataloging tables with a crawler in the AWS Glue Developer Guide.

The first step in adding a crawler is to create a new database in a Data Catalog by using the AWS SDK API. Before you begin, make sure to first delete any previous version of it by using the delete_crawler operation.

When you create your crawler, a few considerations apply:

  • For the crawler name, use the format <redshift_node_name>_<redshift_database_name>_<redshift_shema_name>, for example: abcde03_dev_ora_glue

  • Use an IAM role that already exists. For more information on creating IAM roles, see Creating IAM roles in the IAM User Guide.

  • Use the name of the database that you created in the previous steps.

  • Use the ConnectionName parameter, which is required.

  • For the path parameter, use the path to the JDBC target, for example: dev/ora_glue/%

The following example deletes an existing crawler and then creates a new one by using the Python API for AWS Glue.

response = client.delete_crawler( Name='crawler_name' ) response = client.create_crawler( Name='crawler_name', Role= ‘IAM_role’, DatabaseName='database_name’, Description='string', Targets={ 'S3Targets': [ { 'Path': 'string', 'Exclusions': [ 'string', ] }, ], 'JdbcTargets': [ { 'ConnectionName': ‘ConnectionName’, 'Path': ‘Include_path’, 'Exclusions': [ 'string', ] }, ] }, Schedule='string', Classifiers=[ 'string', ], TablePrefix='string', SchemaChangePolicy={ 'UpdateBehavior': 'LOG'|'UPDATE_IN_DATABASE', 'DeleteBehavior': 'LOG'|'DELETE_FROM_DATABASE'|'DEPRECATE_IN_DATABASE' }, Configuration='string' )

Create and then run a crawler that connects to one or more data stores, determines the data structures, and writes tables into the Data Catalog. You can run your crawler on a schedule, as shown following.

response = client.start_crawler( Name='string' )

This example uses Amazon Redshift as the target. Amazon Redshift data types map to AWS Glue data types in the following way after the crawler runs.

Amazon Redshift data type AWS Glue data type
smallint smallint
integer int
bigint bigint
decimal decimal(18,0)
decimal(p,s) decimal(p,s)
real double
double precision double
boolean boolean
char string
varchar string
varchar(n) string
date date
timestamp timestamp
timestamptz timestamp