AWS Schema Conversion Tool
User Guide (Version 1.0)

Converting ETL to AWS Glue

In addition to migrating your schema and data, you can also migrate extraction, transformation, and load (ETL) processes. This includes the conversion of ETL-related business logic located either inside the source data warehouses or in external scripts that are executed separately. After migration, the ETL processes run in AWS Glue. You run the ETL to AWS Glue migration as a separate project from converting your DDL and data.


Image: A chart showing the conversion of databases and ETL.

Currently, only Oracle ETL conversions to AWS Glue are supported.

Prerequisites

Before you begin:

  • Migrate any source databases that you intend to migrate to AWS.

  • Migrate the target data warehouse databases to AWS.

  • Collect a list of all the code involved in your ETL process.

  • Collect a list of all the necessary connection information for each database.

Understanding the AWS Glue Data Catalog

As part of the process of conversion, AWS Glue loads information regarding the source and target databases. It organizes this information into categories, in a structure called a tree. The structure includes the following:

  • Connections – connection parameters for the current Redshift cluster

  • Crawlers – list of crawlers, one crawler per Redshift schema

  • Databases – containers that hold tables

  • Tables – metadata definitions that represent the data in the Redshift tables

  • ETL jobs – business logic that performs the ETL work

  • Triggers – logic that controls when an ETL job runs in AWS Glue (on-demand, schedule, job events)

The AWS Glue Data Catalog contains references to data that is used as sources and targets of your extract, transform, and load (ETL) jobs in AWS Glue. To create your data warehouse, you must catalog this data. The AWS Glue Data Catalog is an index to the location, schema, and runtime metrics of your data. You use the information in the Data Catalog to create and monitor your ETL jobs. Typically, you run a crawler to take inventory of the data in your data stores, but there are other ways to add metadata tables into your Data Catalog.

When you define a table in the AWS Glue Data Catalog, you add it to a database. A database is used to organize tables in AWS Glue.
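
To browse what AWS Glue has cataloged, you can list the databases and their tables programmatically. The following sketch assumes the Python boto3 library and AWS credentials with AWS Glue permissions; the printed output depends on your own catalog.

import boto3

glue = boto3.client('glue')

# Page through all databases in the AWS Glue Data Catalog.
paginator = glue.get_paginator('get_databases')
for page in paginator.paginate():
    for database in page['DatabaseList']:
        print('Database:', database['Name'])

        # List the tables that belong to this database.
        tables = glue.get_tables(DatabaseName=database['Name'])
        for table in tables['TableList']:
            print('  Table:', table['Name'])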

Limitations for Converting with AWS Glue

The following limitations apply when you use AWS SCT to convert with AWS Glue:

Resource – Default limit
Number of databases per account – 10,000
Number of tables per database – 100,000
Number of partitions per table – 1,000,000
Number of table versions per table – 100,000
Number of tables per account – 1,000,000
Number of partitions per account – 10,000,000
Number of table versions per account – 1,000,000
Number of connections per account – 1,000
Number of crawlers per account – 25
Number of jobs per account – 25
Number of triggers per account – 25
Number of concurrent job runs per account – 30
Number of concurrent job runs per job – 3
Number of jobs per trigger – 10
Number of development endpoints per account – 5
Maximum DPUs used by a development endpoint at one time – 5
Maximum data processing units (DPUs) used by a role at one time – 100
Database name length – 1…unlimited. For compatibility with other metadata stores, such as Apache Hive, the name is folded to lowercase characters.

Note

If you plan to access the database from Amazon Athena, then provide a name with only alphanumeric and underscore characters.

Connection name length – 1…unlimited
Crawler name length – 1…unlimited

Converting using AWS Glue in the SCT UI

The following sections outline the process that you can follow to convert ETL using AWS Glue with AWS SCT. For this example, we convert an Oracle database to Amazon Redshift, along with the ETL processes used with the source databases and data warehouses.

Step 1: Create a new project

First, start a new project. To do this, choose File, then choose New Project. You can provide a name and a location to save your new project.


Image: The new project dialog box.

Choose Data warehouse (OLAP). Then choose your source and target database engines.

Make sure that you have enabled Use AWS Glue in the project settings. To view project settings, choose Settings, then choose Project Settings.


Image: The project settings dialog box, showing the Use AWS Glue check box.

To finish preparing to import your ETL, establish connections to your source and target database engines. Choose the menu items that begin with the words Connect to. These labels are customized to show the database engines you chose when you created the project. For example, if you chose Oracle and Amazon Redshift, the menu displays the options Connect to Oracle and Connect to Amazon Redshift.

After you choose the option to connect to your source database, enter your connection information. A list of objects displays.

After you choose the option to connect to your target database, enter your connection information. Next, choose the AWS Glue tab. In the Copy from AWS profile list, choose the profile that you want to use. The profile should automatically fill in the AWS access key, secret key, and S3 bucket folder. If it doesn't, enter this information yourself. After you choose OK, AWS Glue analyzes the objects and loads metadata into the AWS Glue Data Catalog.


Image: The AWS Glue connection tab on the target database connection dialog box.

Note

Depending on your security settings, you might get a warning message that says your account doesn't have sufficient privileges for some of the schemas on the server. If you have access to the schemas you are using, you can safely ignore this message.

AWS Glue creates a database on the source database server and also on the target database server to help with the ETL conversion. The database on the target server contains the AWS Glue data catalog. To locate specific objects, use the search button on the source or target panels.

To see how a specific object converts, locate an item you want to convert, and choose Convert schema from its context menu. AWS SCT transforms the selected item into a script.


Image: The context menu of an object in the source database.

When the transformation is complete, the script is stored on the target server, in the Scripts folder of the AWS Glue Data Catalog. You can view it in the target database pane on the right side of the screen. To upload the script to Amazon S3, choose the script, then choose Save to S3 from its context menu.

Step 2: Create an AWS Glue Job

After you save the script to S3, you can select it and choose Configure AWS Glue Job to open the wizard to configure the AWS Glue job. The wizard makes it easier to set this up.

The first tab, Design Data Flow, allows you to choose an execution strategy and the list of scripts that you want to include in this one job. You can choose parameters for each script. You can also rearrange the scripts so that they run in the correct order.

On the second step, or tab, of the wizard, you can name your job, and directly configure settings for AWS Glue. On this screen, you can configure the following settings:

  • IAM role

  • Script file names and file paths

  • Encrypt script using SSE-S3

  • Temporary directory

  • Generated Python library path

  • User Python library path

  • Dependent jars path

  • Referenced files path

  • Concurrent DPUs per job run

  • Max concurrency

  • Job timeout (minutes)

  • Delay notification threshold (minutes)

  • Number of retries

  • Security configuration

  • Server-side encryption

On the third step, or tab, you choose the configured connection to the target endpoint.
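
For reference, the settings in the preceding list correspond roughly to parameters of the AWS Glue create_job API. The following boto3 sketch is illustrative only; the job name, IAM role, bucket, and script path are placeholder values, not values produced by AWS SCT.

import boto3

glue = boto3.client('glue')

response = glue.create_job(
    Name='example_etl_job',                                  # job name from the wizard
    Role='ExampleGlueServiceRole',                           # IAM role
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://example-bucket/scripts/example_etl_job.py',
    },
    DefaultArguments={
        '--TempDir': 's3://example-bucket/temp/',            # temporary directory
        '--extra-py-files': 's3://example-bucket/libs/helpers.py',  # Python library path
    },
    ExecutionProperty={'MaxConcurrentRuns': 3},              # max concurrency
    MaxRetries=0,                                            # number of retries
    Timeout=2880,                                            # job timeout (minutes)
    NotificationProperty={'NotifyDelayAfter': 10},           # delay notification threshold (minutes)
    SecurityConfiguration='example-security-configuration',  # security configuration
    AllocatedCapacity=10,                                    # concurrent DPUs per job run
)
print(response['Name'])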

After you finish configuring the job, it displays under ETL jobs in the AWS Glue Data Catalog. If you choose the job, the settings display so that you can review or edit them. To create the job in AWS Glue, choose Create AWS Glue Job from the context menu for the job. Doing this applies the schema definition. To refresh the display, choose Refresh from database from the context menu. At this point, you can view your job in the AWS Glue console.

Sign in to the AWS Management Console and open the AWS Glue console at https://console.aws.amazon.com/glue/.

You can test the new job to make sure it is working correctly. Check the data in your source table, then verify the target table is empty. Run the job, and check again. You can view error logs from the AWS Glue console.
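
If you prefer to test from code instead of the console, the following sketch starts the job with boto3 and polls until the run finishes. The job name is a placeholder; use the name shown under ETL jobs.

import time
import boto3

glue = boto3.client('glue')

run = glue.start_job_run(JobName='example_etl_job')
run_id = run['JobRunId']

# Poll until the run reaches a terminal state, then report the result.
while True:
    status = glue.get_job_run(JobName='example_etl_job', RunId=run_id)
    state = status['JobRun']['JobRunState']
    if state in ('SUCCEEDED', 'FAILED', 'STOPPED', 'TIMEOUT'):
        print('Job run finished with state:', state)
        break
    time.sleep(30)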

Converting using the Python Glue API

The following sections describe a conversion that calls AWS Glue APIs in Python. For more information, see Program AWS Glue ETL Scripts in Python.
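
The examples in the following steps call methods on a client object. A minimal setup, assuming that you use the boto3 library with credentials that have AWS Glue permissions, looks like the following; the region shown is only an example.

import boto3

client = boto3.client('glue', region_name='us-west-2')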

Step 1: Create a Database

The first step is to create a new database in a data catalog by using the AWS SDK API. When you define a table in the AWS Glue data catalog, you add it to a database. A database is used to organize the tables in AWS Glue.

The following example demonstrates the create_database method of the Python Glue API:

response = client.create_database(
    DatabaseInput={
        'Name': 'database_name',
        'Description': 'description',
        'LocationUri': 'string',
        'Parameters': {
            'parameter-name': 'parameter value'
        }
    }
)

The database name is formed as follows:

{redshift_cluster_name}_{redshift_database_name}_{redshift_schema_name}

For example:

rsdbb03_dev_ora_glue

For example, the full name of an Amazon Redshift cluster is formed as follows:

aabbcc03.apq1mpqso.us-west-2.redshift.amazonaws.com
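
To illustrate the naming convention, the following sketch builds the database name from the example values above. The values and the helper logic are hypothetical and are not generated by AWS SCT.

# The cluster name is the first label of the endpoint host name.
endpoint = 'aabbcc03.apq1mpqso.us-west-2.redshift.amazonaws.com'
redshift_cluster_name = endpoint.split('.')[0]

redshift_database_name = 'dev'
redshift_schema_name = 'ora_glue'

database_name = '{}_{}_{}'.format(
    redshift_cluster_name, redshift_database_name, redshift_schema_name)
print(database_name)  # aabbcc03_dev_ora_glue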

Step 2: Create a Connection

Create a new connection in a data catalog by using the AWS SDK API.

The following example demonstrates using the create_connection method of the Python Glue API:

response = client.create_connection(
    ConnectionInput={
        'Name': 'Redshift_abcde03.aabbcc112233.us-west-2.redshift.amazonaws.com_dev',
        'Description': 'Created from SCT',
        'ConnectionType': 'JDBC',
        'ConnectionProperties': {
            'JDBC_CONNECTION_URL': 'jdbc:redshift://aabbcc03.aabbcc112233.us-west-2.redshift.amazonaws.com:5439/dev',
            'USERNAME': 'user_name',
            'PASSWORD': 'password'
        },
        'PhysicalConnectionRequirements': {
            'AvailabilityZone': 'us-west-2c',
            'SubnetId': 'subnet-a1b23c45',
            'SecurityGroupIdList': [
                'sg-000a2b3c',
                'sg-1a230b4c',
                'sg-aba12c3d',
                'sg-1abb2345'
            ]
        }
    }
)

The parameters used in create_connection are as follows:

  • Name (UTF-8 string) [required] – the connection name is formed as follows: Redshift_{Endpoint-name}_{redshift-database-name}. For example: Redshift_abcde03_dev

  • Description (UTF-8 string) – your description of the connection.

  • ConnectionType (UTF-8 string) [required] – the type of connection. Currently, only JDBC is supported; SFTP is not supported.

  • ConnectionProperties (dict) [required] – a list of key-value pairs used as parameters for this connection.

  • PhysicalConnectionRequirements (dict) – the physical connection requirements, which include the following:

      • SubnetId (UTF-8 string) – the subnet ID used by the connection.

      • SecurityGroupIdList (list) – the security group ID list used by the connection.

      • AvailabilityZone (UTF-8 string) [required] – the Availability Zone that contains the endpoint. This field is deprecated.
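
As an optional check, you can read the connection back from the Data Catalog by using the get_connection method. This sketch assumes the example connection name used in the create_connection call above.

# Retrieve the connection that was just created and print its JDBC URL.
response = client.get_connection(
    Name='Redshift_abcde03.aabbcc112233.us-west-2.redshift.amazonaws.com_dev'
)
print(response['Connection']['ConnectionProperties']['JDBC_CONNECTION_URL'])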

Step 3: Create an AWS Glue Crawler

Next, create an AWS Glue crawler by using the AWS SDK API. Before you create a crawler, you must first delete any previous version of it by using delete_crawler.

The following example deletes any existing crawler and then creates a new one by using the create_crawler method of the Python Glue API:

response = client.delete_crawler(
    Name='crawler_name'
)

response = client.create_crawler(
    Name='crawler_name',
    Role='IAM_role',
    DatabaseName='database_name',
    Description='string',
    Targets={
        'S3Targets': [
            {
                'Path': 'string',
                'Exclusions': [
                    'string',
                ]
            },
        ],
        'JdbcTargets': [
            {
                'ConnectionName': 'ConnectionName',
                'Path': 'Include_path',
                'Exclusions': [
                    'string',
                ]
            },
        ]
    },
    Schedule='string',
    Classifiers=[
        'string',
    ],
    TablePrefix='string',
    SchemaChangePolicy={
        'UpdateBehavior': 'LOG'|'UPDATE_IN_DATABASE',
        'DeleteBehavior': 'LOG'|'DELETE_FROM_DATABASE'|'DEPRECATE_IN_DATABASE'
    },
    Configuration='string'
)

Note

The crawler name is formed as follows: {redshift_node_name}_{redshift_database_name}_{redshift_schema_name}, for example: abcde03_dev_ora_glue.

The IAM role must already exist.

Use the database name you created in the previous steps.

ConnectionName is required.

Path is the include path for the JDBC target. For example: dev/ora_glue/%.

After you create the crawler, run it. The crawler connects to one or more data stores, determines the data structures, and writes tables into the Data Catalog. You can also run your crawler on a schedule.

response = client.start_crawler(
    Name='crawler_name'
)
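
If you want to wait for the crawl to finish before reading the new tables, you can poll the crawler state by using the get_crawler method. The following sketch is illustrative; the crawler name and polling interval are placeholders.

import time

# Immediately after start_crawler, the state is RUNNING; it returns to READY
# when the crawl and any post-processing are complete.
while True:
    crawler = client.get_crawler(Name='crawler_name')
    if crawler['Crawler']['State'] == 'READY':
        print('Crawler finished; tables are available in the Data Catalog.')
        break
    time.sleep(30)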

After the crawler runs, Amazon Redshift data types map to AWS Glue data types in the following way:

Redshift data type – AWS Glue data type
smallint – smallint
integer – int
bigint – bigint
decimal – decimal(18,0)
decimal(p,s) – decimal(p,s)
real – double
double precision – double
boolean – boolean
char – string
varchar – string
varchar(n) – string
date – date
timestamp – timestamp
timestamptz – timestamp
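
To confirm how your columns were mapped, you can read a cataloged table back with the get_table method and print the AWS Glue type of each column. The database and table names below are placeholders for objects created by your crawler.

response = client.get_table(
    DatabaseName='abcde03_dev_ora_glue',
    Name='example_table'
)
for column in response['Table']['StorageDescriptor']['Columns']:
    print(column['Name'], column['Type'])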