AWS Schema Conversion Tool
User Guide (Version 1.0)


Converting ETL Processes to AWS Glue

In addition to migrating your schema and data using AWS SCT, you can also migrate extraction, transformation, and load (ETL) processes. This type of migration includes the conversion of ETL-related business logic located either inside the source data warehouses or in external scripts that are run separately. After migration, the ETL processes run in AWS Glue. You run the ETL to AWS Glue migration as a separate project from converting your data definition language (DDL) statements and data.


[Image: A diagram showing the conversion of databases and ETL.]

Currently, only Oracle ETL and Teradata BTEQ conversions to AWS Glue are supported.

Prerequisites

Before you begin, do the following:

  • Migrate any source databases that you intend to migrate to AWS.

  • Migrate the target data warehouses to AWS.

  • Collect a list of all the code involved in your ETL process.

  • Collect a list of all the necessary connection information for each database.

Understanding the AWS Glue Data Catalog

As part of the process of conversion, AWS Glue loads information regarding the source and target databases. It organizes this information into categories, in a structure called a tree. The structure includes the following:

  • Connections – connection parameters

  • Crawlers – a list of crawlers, one crawler for each schema

  • Databases – containers that hold tables

  • Tables – metadata definitions that represent the data in the tables

  • ETL jobs – business logic that performs the ETL work

  • Triggers – logic that controls when an ETL job runs in AWS Glue (whether on-demand, by schedule, or triggered by job events)

The AWS Glue Data Catalog is an index to the location, schema, and runtime metrics of your data. When you work with AWS Glue and AWS SCT, the AWS Glue Data Catalog contains references to data that is used as sources and targets of your ETL jobs in AWS Glue. To create your data warehouse, you must catalog this data.

You use the information in the Data Catalog to create and monitor your ETL jobs. Typically, you run a crawler to take inventory of the data in your data stores, but there are other ways to add metadata tables into your Data Catalog.

When you define a table in your Data Catalog, you add it to a database. A database is used to organize tables in AWS Glue.
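
If you want to inspect the Data Catalog programmatically rather than through the console or AWS SCT, the following is a minimal sketch using the AWS SDK for Python (boto3). The database name used here is only an example.

import boto3

glue = boto3.client('glue', region_name='us-west-2')

# List the databases in the AWS Glue Data Catalog.
for database in glue.get_databases()['DatabaseList']:
    print(database['Name'])

# List the tables that one database contains.
for table in glue.get_tables(DatabaseName='rsdbb03_dev_ora_glue')['TableList']:
    print(table['Name'])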

Limitations for Converting with AWS Glue

The following limitations apply when converting using AWS SCT with AWS Glue.

  • Number of databases for each account – 10,000

  • Number of tables for each database – 100,000

  • Number of partitions for each table – 1,000,000

  • Number of table versions for each table – 100,000

  • Number of tables for each account – 1,000,000

  • Number of partitions for each account – 10,000,000

  • Number of table versions for each account – 1,000,000

  • Number of connections for each account – 1,000

  • Number of crawlers for each account – 25

  • Number of jobs for each account – 25

  • Number of triggers for each account – 25

  • Number of concurrent job runs for each account – 30

  • Number of concurrent job runs for each job – 3

  • Number of jobs for each trigger – 10

  • Number of development endpoints for each account – 5

  • Maximum data processing units (DPUs) used by a development endpoint at one time – 5

  • Maximum DPUs used by a role at one time – 100

  • Database name length – unlimited. For compatibility with other metadata stores, such as Apache Hive, the name is changed to use lowercase characters.

  • Connection name length – unlimited

  • Crawler name length – unlimited

Note

If you plan to access the database from Athena, provide a name with only alphanumeric and underscore characters.

Converting Using AWS Glue in the AWS SCT UI

In the following sections, you can find an outline of the process to follow to convert ETL using AWS Glue with AWS SCT. For this example, we convert an Oracle database to Amazon Redshift, along with the ETL processes used with the source databases and data warehouses.

Step 1: Create a New Project

First, start a new project. To do this, start AWS SCT and choose File, New Project. Provide a name and a location to save your new project.


[Image: The new project dialog box.]

Choose Data warehouse (OLAP), and then choose your source and target database engines.

Make sure that you have enabled Use AWS Glue in the project settings. To view project settings, choose Settings, Project Settings.


[Image: The project settings dialog box, showing the Use AWS Glue check box.]

To finish preparing to import your ETL, establish connections to your source and target database engines. Choose the menu items that begin with the words Connect to. These labels are customized to show the database engines that you chose when you created the project. For example, if you chose Oracle and Amazon Redshift, the menu displays the options Connect to Oracle and Connect to Amazon Redshift.

After you choose the option to connect to your source database, enter your connection information. The connection is complete when AWS SCT displays a list of your database objects.

After you choose the option to connect to your target database, enter your connection information. Next, choose the AWS Glue tab that appears. For Copy from AWS profile, choose the profile that you want to use. The profile should automatically fill in the AWS access key, secret key, and Amazon S3 bucket folder. If it doesn't, enter this information yourself. After you choose OK, AWS Glue analyzes the objects and loads metadata into the AWS Glue Data Catalog. The following screenshot shows an example of what this screen looks like. For the example, we're using Amazon Redshift as our target database.


[Image: The AWS Glue connection tab on the target database connection dialog box.]

Note

Depending on your security settings, you might get a warning message that says your account doesn't have sufficient privileges for some of the schemas on the server. If you have access to the schemas that you're using, you can safely ignore this message.

AWS Glue creates a database on the source database server and also on the target database server to help with the ETL conversion. The database on the target server contains the AWS Glue Data Catalog. To locate specific objects, use search on the source or target panels.

To see how a specific object converts, locate an item you want to convert, and choose Convert schema from its context (right-click) menu. AWS SCT transforms it into a script.


[Image: The context menu of an object in the source database.]

When the transformation is complete, the script is stored on the target server in the Scripts folder of the AWS Glue Data Catalog. You can view it in the target database pane on the right side of the screen. To upload the script to Amazon S3, choose the script, and then choose Save to S3 from its context (right-click) menu.
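
If you prefer to copy a generated script to Amazon S3 outside of AWS SCT, a minimal boto3 sketch follows. The local file name, bucket, and key are placeholders, not values produced by AWS SCT.

import boto3

s3 = boto3.client('s3')

# Upload a locally saved script so that an AWS Glue job can reference it.
s3.upload_file(
    Filename='extract_orders.py',        # locally saved script (placeholder name)
    Bucket='my-glue-scripts-bucket',     # your Amazon S3 bucket (placeholder)
    Key='scripts/extract_orders.py'      # key that the AWS Glue job script path points to
)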

Step 2: Create an AWS Glue Job

After you save the script to Amazon S3, you can choose it and then choose Configure AWS Glue Job to open a wizard that guides you through configuring the AWS Glue job.

On the first tab, Design Data Flow, you can choose an execution strategy and the list of scripts you want to include in this one job. You can choose parameters for each script. You can also rearrange the scripts so that they run in the correct order.

On the second step, or tab, of the wizard, you can name your job and directly configure the following settings for AWS Glue:

  • IAM role

  • Script file names and file paths

  • Encrypt the script using server-side encryption with Amazon S3–managed keys (SSE-S3)

  • Temporary directory

  • Generated Python library path

  • User Python library path

  • Dependent .jars path (path for the dependent .jar files)

  • Referenced files path

  • Concurrent DPUs for each job run

  • Maximum concurrency

  • Job timeout (in minutes)

  • Delay notification threshold (in minutes)

  • Number of retries

  • Security configuration

  • Server-side encryption

On the third step, or tab, you choose the configured connection to the target endpoint.

After you finish configuring the job, it displays under the ETL jobs in the AWS Glue Data Catalog. If you choose the job, the settings display so you can review or edit them. To create a new job in AWS Glue, choose Create AWS Glue Job from the context (right-click) menu for the job. Doing this applies the schema definition. To refresh the display, choose Refresh from database from the context (right-click) menu.
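
AWS SCT performs the job creation for you. If you want to see roughly how the wizard's settings map to the AWS Glue API, the following is a minimal sketch that uses the create_job operation in boto3. The job name, IAM role, script location, and connection name are placeholders, not values generated by AWS SCT.

import boto3

glue = boto3.client('glue', region_name='us-west-2')

response = glue.create_job(
    Name='ora_glue_etl_job',                                # job name (placeholder)
    Role='GlueServiceRole',                                 # IAM role (placeholder)
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://my-glue-scripts-bucket/scripts/extract_orders.py'
    },
    DefaultArguments={
        '--TempDir': 's3://my-glue-scripts-bucket/temp/'    # temporary directory
    },
    Connections={'Connections': ['Redshift_abcde03_dev']},  # connection to the target endpoint
    MaxRetries=1,                                           # number of retries
    Timeout=60,                                             # job timeout, in minutes
    NotificationProperty={'NotifyDelayAfter': 10},          # delay notification threshold, in minutes
    ExecutionProperty={'MaxConcurrentRuns': 1}              # maximum concurrency
)
print(response['Name'])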

At this point, you can view your job in the AWS Glue console. To do so, sign in to the AWS Management Console and open the AWS Glue console at https://console.aws.amazon.com/glue/.

You can test the new job to make sure that it's working correctly. To do so, first check the data in your source table, then verify that the target table is empty. Run the job, and check again. You can view error logs from the AWS Glue console.
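
You can also script this test. The following is a minimal sketch that starts the job with boto3 and polls the run until it finishes; the job name is a placeholder.

import time
import boto3

glue = boto3.client('glue', region_name='us-west-2')

# Start the job and remember the run ID.
run = glue.start_job_run(JobName='ora_glue_etl_job')   # job name (placeholder)

# Poll until the run reaches a terminal state.
while True:
    job_run = glue.get_job_run(JobName='ora_glue_etl_job', RunId=run['JobRunId'])
    state = job_run['JobRun']['JobRunState']
    if state in ('SUCCEEDED', 'FAILED', 'STOPPED', 'TIMEOUT'):
        print('Job run finished with state:', state)
        break
    time.sleep(30)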

Converting Using the Python API for AWS Glue

In the following sections, you can find a description of a conversion that calls AWS Glue API operations in Python. For more information, see Program AWS Glue ETL Scripts in Python in the AWS Glue Developer Guide.

Step 1: Create a Database

The first step is to create a new database in an AWS Glue Data Catalog by using the AWS SDK API. When you define a table in the Data Catalog, you add it to a database. A database is used to organize the tables in AWS Glue.

The following example demonstrates the create_database method of the Python API for AWS Glue.

import boto3

# Create the AWS Glue client that this example and the following examples use.
client = boto3.client('glue')

response = client.create_database(
    DatabaseInput={
        'Name': 'database_name',
        'Description': 'description',
        'LocationUri': 'string',
        'Parameters': {
            'parameter-name': 'parameter value'
        }
    }
)

If you are using Amazon Redshift, the database name is formed as follows.

{redshift_cluster_name}_{redshift_database_name}_{redshift_schema_name}

The full name of the Amazon Redshift cluster endpoint for this example is as follows:

rsdbb03.apq1mpqso.us-west-2.redshift.amazonaws.com

The following shows an example of a well-formed database name. In this case, rsdbb03 is the cluster name, which is the first part of the full name of the cluster endpoint. The database is named dev, and the schema is ora_glue.

rsdbb03_dev_ora_glue
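
The following minimal sketch shows how this name can be assembled in Python from the cluster endpoint, database, and schema shown above.

cluster_endpoint = 'rsdbb03.apq1mpqso.us-west-2.redshift.amazonaws.com'
redshift_database = 'dev'
redshift_schema = 'ora_glue'

# The cluster name is the first label of the endpoint: rsdbb03.
cluster_name = cluster_endpoint.split('.')[0]

database_name = '{}_{}_{}'.format(cluster_name, redshift_database, redshift_schema)
print(database_name)   # rsdbb03_dev_ora_glue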

Step 2: Create a Connection

Create a new connection in a Data Catalog by using the AWS SDK API.

The following example demonstrates using the create_connection method of the Python API for AWS Glue.

response = client.create_connection(
    ConnectionInput={
        'Name': 'Redshift_abcde03.aabbcc112233.us-west-2.redshift.amazonaws.com_dev',
        'Description': 'Created from SCT',
        'ConnectionType': 'JDBC',
        'ConnectionProperties': {
            'JDBC_CONNECTION_URL': 'jdbc:redshift://aabbcc03.aabbcc112233.us-west-2.redshift.amazonaws.com:5439/dev',
            'USERNAME': 'user_name',
            'PASSWORD': 'password'
        },
        'PhysicalConnectionRequirements': {
            'AvailabilityZone': 'us-west-2c',
            'SubnetId': 'subnet-a1b23c45',
            'SecurityGroupIdList': [
                'sg-000a2b3c',
                'sg-1a230b4c',
                'sg-aba12c3d',
                'sg-1abb2345'
            ]
        }
    }
)

The parameters used in create_connection are as follows:

  • Name (UTF-8 string) – required. For Amazon Redshift, the connection name is formed as follows: Redshift_{Endpoint-name}_{redshift-database-name}, for example: Redshift_abcde03_dev

  • Description (UTF-8 string) – your description of the connection.

  • ConnectionType (UTF-8 string) – required; the type of connection. Currently, only JDBC is supported; SFTP is not supported.

  • ConnectionProperties (dict) – required; a list of key-value pairs used as parameters for this connection, including the JDBC connection URL, the user name, and the password.

  • PhysicalConnectionRequirements (dict) – physical connection requirements, which include the following:

    • SubnetId (UTF-8 string) – the ID of the subnet used by the connection.

    • SecurityGroupIdList (list) – the security group ID list used by the connection.

    • AvailabilityZone (UTF-8 string) – required; the Availability Zone that contains the endpoint. This parameter is deprecated.
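
To confirm that the connection was created, you can read it back from the Data Catalog. The following is a minimal sketch that uses the get_connection operation and reuses the client object created in Step 1; the connection name matches the example above.

response = client.get_connection(
    Name='Redshift_abcde03.aabbcc112233.us-west-2.redshift.amazonaws.com_dev'
)
print(response['Connection']['ConnectionProperties']['JDBC_CONNECTION_URL'])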

Step 3: Create an AWS Glue Crawler

Next, you create an AWS Glue crawler to populate the AWS Glue Data Catalog. For more information, see Cataloging Tables with a Crawler in the AWS Glue Developer Guide. The crawler writes its results to the database that you created in the Data Catalog by using the AWS SDK API. Before you create the crawler, first delete any previous version of it by using the delete_crawler operation.

When you create your crawler, a few considerations apply:

  • For the crawler name, use the format <redshift_node_name>_<redshift_database_name>_<redshift_schema_name>, for example: abcde03_dev_ora_glue

  • Use an IAM role that already exists. For more information on creating IAM roles, see Creating IAM Roles in the IAM User Guide.

  • Use the name of the database that you created in the previous steps.

  • Use the ConnectionName parameter, which is required.

  • For the path parameter, use the path to the JDBC target, for example: dev/ora_glue/%

The following example deletes an existing crawler and then creates a new one by using the Python API for AWS Glue.

response = client.delete_crawler(
    Name='crawler_name'
)

response = client.create_crawler(
    Name='crawler_name',
    Role='IAM_role',
    DatabaseName='database_name',
    Description='string',
    Targets={
        'S3Targets': [
            {
                'Path': 'string',
                'Exclusions': [
                    'string',
                ]
            },
        ],
        'JdbcTargets': [
            {
                'ConnectionName': 'ConnectionName',
                'Path': 'Include_path',
                'Exclusions': [
                    'string',
                ]
            },
        ]
    },
    Schedule='string',
    Classifiers=[
        'string',
    ],
    TablePrefix='string',
    SchemaChangePolicy={
        'UpdateBehavior': 'LOG'|'UPDATE_IN_DATABASE',
        'DeleteBehavior': 'LOG'|'DELETE_FROM_DATABASE'|'DEPRECATE_IN_DATABASE'
    },
    Configuration='string'
)

Create and then run a crawler that connects to one or more data stores, determines the data structures, and writes tables into the Data Catalog. You can start your crawler on demand, as shown following, or run it on a schedule by setting the Schedule parameter when you create the crawler.

response = client.start_crawler(Name='string')
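
To find out when a crawler run finishes, you can poll the get_crawler operation; the crawler returns to the READY state when its run is complete. A minimal sketch follows, reusing the client object from the earlier examples and the crawler name format described above.

import time

# Wait until the crawler returns to the READY state, which means the run has finished.
while client.get_crawler(Name='abcde03_dev_ora_glue')['Crawler']['State'] != 'READY':
    time.sleep(30)
print('Crawler run finished')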

Because we're using Amazon Redshift as our target for this example, Amazon Redshift data types map to AWS Glue data types in the following way after the crawler runs.

Amazon Redshift Data Type    AWS Glue Data Type
smallint                     smallint
integer                      int
bigint                       bigint
decimal                      decimal(18,0)
decimal(p,s)                 decimal(p,s)
real                         double
double precision             double
boolean                      boolean
char                         string
varchar                      string
varchar(n)                   string
date                         date
timestamp                    timestamp
timestamptz                  timestamp
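
To see the AWS Glue types that the crawler actually recorded, you can read a table definition back from the Data Catalog. The following is a minimal sketch that reuses the client object from the earlier examples; the database name follows the naming format described earlier, and the table name is a placeholder.

response = client.get_table(
    DatabaseName='rsdbb03_dev_ora_glue',
    Name='dev_ora_glue_orders'   # table created by the crawler (placeholder name)
)
for column in response['Table']['StorageDescriptor']['Columns']:
    print(column['Name'], column['Type'])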