Converting ETL processes to AWS Glue in AWS Schema Conversion Tool
In the following sections, you can find a description of a conversion that calls AWS Glue API operations in Python. For more information, see Program AWS Glue ETL scripts in Python in the AWS Glue Developer Guide.
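The examples in the following steps call operations on an AWS Glue client object named client, which the snippets assume has already been created. A minimal sketch of that setup with boto3 follows; the Region shown is an assumption and should match the Region of your Data Catalog.

import boto3

# Assumed setup for the examples that follow: a low-level AWS Glue
# client. Replace the Region with the one that holds your Data Catalog.
client = boto3.client('glue', region_name='us-west-2')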
Step 1: Create a database
The first step is to create a new database in an AWS Glue Data Catalog by using the AWS SDK API. When you define a table in the Data Catalog, you add it to a database. A database is used to organize the tables in AWS Glue.
The following example demonstrates the create_database method of the Python API for AWS Glue.

response = client.create_database(
    DatabaseInput={
        'Name': 'database_name',
        'Description': 'description',
        'LocationUri': 'string',
        'Parameters': {
            'parameter-name': 'parameter value'
        }
    }
)
If you are using Amazon Redshift, the database name is formed as follows.
{redshift_cluster_name}_{redshift_database_name}_{redshift_schema_name}
The full name of the Amazon Redshift cluster for this example is as follows.
rsdbb03.apq1mpqso.us-west-2.redshift.amazonaws.com
The following shows an example of a well-formed database name. In this case, rsdbb03 is the cluster name, which is the first part of the full name of the cluster endpoint. The database is named dev and the schema is ora_glue.

rsdbb03_dev_ora_glue
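As an illustration of this naming rule (the helper function is hypothetical, not part of the AWS Glue or Amazon Redshift APIs), you could derive the database name in Python as follows.

# Hypothetical helper: build the Data Catalog database name from the
# Amazon Redshift cluster endpoint, database name, and schema name.
def glue_database_name(cluster_endpoint, database, schema):
    # The cluster name is the first label of the endpoint host name.
    cluster_name = cluster_endpoint.split('.')[0]
    return '{}_{}_{}'.format(cluster_name, database, schema)

# Returns 'rsdbb03_dev_ora_glue'
name = glue_database_name(
    'rsdbb03.apq1mpqso.us-west-2.redshift.amazonaws.com', 'dev', 'ora_glue')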
Step 2: Create a connection
Create a new connection in a Data Catalog by using the AWS SDK API.
The following example demonstrates using the create_connection method of the Python API for AWS Glue.

response = client.create_connection(
    ConnectionInput={
        'Name': 'Redshift_abcde03.aabbcc112233.us-west-2.redshift.amazonaws.com_dev',
        'Description': 'Created from SCT',
        'ConnectionType': 'JDBC',
        'ConnectionProperties': {
            'JDBC_CONNECTION_URL': 'jdbc:redshift://aabbcc03.aabbcc112233.us-west-2.redshift.amazonaws.com:5439/dev',
            'USERNAME': 'user_name',
            'PASSWORD': 'password'
        },
        'PhysicalConnectionRequirements': {
            'AvailabilityZone': 'us-west-2c',
            'SubnetId': 'subnet-a1b23c45',
            'SecurityGroupIdList': [
                'sg-000a2b3c',
                'sg-1a230b4c',
                'sg-aba12c3d',
                'sg-1abb2345'
            ]
        }
    }
)
The parameters used in create_connection are as follows:

- Name (UTF-8 string) – Required. For Amazon Redshift, the connection name is formed as follows: Redshift_<Endpoint-name>_<redshift-database-name>, for example: Redshift_abcde03_dev
- Description (UTF-8 string) – Your description of the connection.
- ConnectionType (UTF-8 string) – Required. The type of connection. Currently, only JDBC is supported; SFTP is not supported.
- ConnectionProperties (dict) – Required. A list of key-value pairs used as parameters for this connection, including the JDBC connection URL, the user name, and the password.
- PhysicalConnectionRequirements (dict) – Physical connection requirements, which include the following:
  - SubnetId (UTF-8 string) – The ID of the subnet used by the connection.
  - SecurityGroupIdList (list) – The security group ID list used by the connection.
  - AvailabilityZone (UTF-8 string) – Required. The Availability Zone that contains the endpoint. This parameter is deprecated.
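To confirm that the connection was stored in the Data Catalog, you can retrieve it with the get_connection method. The following is a minimal sketch that reuses the example connection name from above.

# Retrieve the connection definition to confirm it was created.
response = client.get_connection(
    Name='Redshift_abcde03.aabbcc112233.us-west-2.redshift.amazonaws.com_dev'
)
print(response['Connection']['ConnectionProperties']['JDBC_CONNECTION_URL'])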
Step 3: Create an AWS Glue crawler
Next, you create an AWS Glue crawler to populate the AWS Glue catalog. For more information, see Cataloging tables with a crawler in the AWS Glue Developer Guide.
The first step in adding a crawler is to create a new database in a Data Catalog by using the AWS SDK API. Before you begin, make sure to first delete any previous version of the crawler by using the delete_crawler operation.
When you create your crawler, a few considerations apply:

- For the crawler name, use the format <redshift_node_name>_<redshift_database_name>_<redshift_schema_name>, for example: abcde03_dev_ora_glue
- Use an IAM role that already exists. For more information on creating IAM roles, see Creating IAM roles in the IAM User Guide.
- Use the name of the database that you created in the previous steps.
- Use the ConnectionName parameter, which is required.
- For the path parameter, use the path to the JDBC target, for example: dev/ora_glue/%
The following example deletes an existing crawler and then creates a new one by using the Python API for AWS Glue.

response = client.delete_crawler(
    Name='crawler_name'
)

response = client.create_crawler(
    Name='crawler_name',
    Role='IAM_role',
    DatabaseName='database_name',
    Description='string',
    Targets={
        'S3Targets': [
            {
                'Path': 'string',
                'Exclusions': [
                    'string',
                ]
            },
        ],
        'JdbcTargets': [
            {
                'ConnectionName': 'ConnectionName',
                'Path': 'Include_path',
                'Exclusions': [
                    'string',
                ]
            },
        ]
    },
    Schedule='string',
    Classifiers=[
        'string',
    ],
    TablePrefix='string',
    SchemaChangePolicy={
        'UpdateBehavior': 'LOG'|'UPDATE_IN_DATABASE',
        'DeleteBehavior': 'LOG'|'DELETE_FROM_DATABASE'|'DEPRECATE_IN_DATABASE'
    },
    Configuration='string'
)
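Note that delete_crawler raises an EntityNotFoundException when no crawler with the given name exists, so the first run of the example above would fail on the delete call. A minimal sketch that tolerates a missing crawler (the crawler name is a placeholder) follows.

# Delete the previous version of the crawler, ignoring the error
# raised when no crawler with that name exists yet.
try:
    client.delete_crawler(Name='crawler_name')
except client.exceptions.EntityNotFoundException:
    pass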
Create and then run a crawler that connects to one or more data stores, determines the data structures, and writes tables into the Data Catalog. You can run your crawler on the schedule that you specified in the Schedule parameter, or start it on demand, as shown following.
response = client.start_crawler( Name='string' )
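A crawler run is asynchronous, so start_crawler returns before the Data Catalog is populated. As a minimal sketch (the crawler name is a placeholder), you can poll the crawler with the get_crawler method until its state returns to READY.

import time

# Wait for the crawler run to finish; the state moves from RUNNING
# through STOPPING and back to READY when the run is complete.
while client.get_crawler(Name='crawler_name')['Crawler']['State'] != 'READY':
    time.sleep(10)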
This example uses Amazon Redshift as the target. Amazon Redshift data types map to AWS Glue data types in the following way after the crawler runs.
| Amazon Redshift data type | AWS Glue data type |
| --- | --- |
| smallint | smallint |
| integer | int |
| bigint | bigint |
| decimal | decimal(18,0) |
| decimal(p,s) | decimal(p,s) |
| real | double |
| double precision | double |
| boolean | boolean |
| char | string |
| varchar | string |
| varchar(n) | string |
| date | date |
| timestamp | timestamp |
| timestamptz | timestamp |
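After the crawler finishes, you can inspect the AWS Glue data types it assigned by reading a table definition from the Data Catalog with the get_table method. In the following sketch, the table name is a placeholder and the database name is the example from Step 1.

# List each column and the AWS Glue data type the crawler assigned to it.
response = client.get_table(
    DatabaseName='rsdbb03_dev_ora_glue',
    Name='table_name'
)
for column in response['Table']['StorageDescriptor']['Columns']:
    print(column['Name'], column['Type'])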