
Tutorial: Getting started with AWS Glue Studio

You can use AWS Glue Studio to create jobs that extract structured or semi-structured data from a data source, perform a transformation of that data, and save the result set in a data target.

In this tutorial, you will create a job in AWS Glue Studio using Amazon S3 as both the source and the target. By completing these steps, you will learn how visual jobs are created and how to edit nodes, the component building blocks of the visual job editor.

You will learn how to:

  • Configure the data source node. In this tutorial, you will set the data source to Amazon S3.

  • Apply and edit a transform node. In this tutorial, you will apply the ApplyMapping transform to the job.

  • Configure the data target node. In this tutorial, you will set the data target to Amazon S3.

  • View and edit the job script.

  • Run the job and view run details for the job.

Prerequisites

This tutorial has the following prerequisites:

  • You have an AWS account.

  • You have access to AWS Glue Studio.

  • Your account has all the necessary permissions for creating and running a job for an Amazon S3 data source and data target. For more information, see Setting up for AWS Glue Studio.

Launch the AWS CloudFormation stack

The AWS CloudFormation stack has all the resources you need to complete this tutorial.

  1. Launch the following AWS CloudFormation stack to create the resources for this tutorial, then follow the steps to complete the process.

  2. Name the AWS CloudFormation stack CreateJob-Tutorial.

  3. Then, select the I acknowledge that AWS CloudFormation might create IAM resources with custom names option.

  4. Choose Create stack.

Launching this stack creates AWS resources. The following resources, shown in the AWS CloudFormation Outputs, are the ones you need in the next steps:

  • AWSGlueStudioRole – IAM role to run AWS Glue jobs

  • AWSGlueStudioS3Bucket – Name of the Amazon S3 bucket to store blog-related files

  • AWSGlueStudioTicketsYYZDB – AWS Glue Data Catalog database

  • AWSGlueStudioTableTickets – Data Catalog table to use as a source

  • AWSGlueStudioTableTrials – Data Catalog table to use as a source

  • AWSGlueStudioParkingTicketCount – Data Catalog table to use as the destination
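If you prefer to read these stack outputs programmatically rather than from the console, CloudFormation returns them as a list of key/value entries. A minimal sketch of indexing them follows; the output values here are illustrative placeholders (except the database name, which this tutorial uses later), and in practice you would fetch the list with the AWS CLI or an SDK call such as `describe-stacks`:

```python
# Turn a CloudFormation-style Outputs list into a simple dict.
# The sample entries mirror this tutorial's stack; the values are placeholders.
outputs = [
    {"OutputKey": "AWSGlueStudioRole", "OutputValue": "CreateJob-Tutorial-Role"},
    {"OutputKey": "AWSGlueStudioS3Bucket", "OutputValue": "createjob-tutorial-bucket"},
    {"OutputKey": "AWSGlueStudioTicketsYYZDB", "OutputValue": "yyz-tickets"},
]

def outputs_to_dict(outputs):
    """Index stack outputs by their OutputKey for easy lookup."""
    return {o["OutputKey"]: o["OutputValue"] for o in outputs}

stack = outputs_to_dict(outputs)
print(stack["AWSGlueStudioTicketsYYZDB"])  # the Data Catalog database name
```

This keeps the key names from the table above as the lookup keys, which matches how CloudFormation exposes them in its `Outputs` field.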

Step 1: Start the job creation process

In this task, you start the job creation process by using a template.

To create a job, starting with a template

  1. Sign in to the AWS Management Console and open the AWS Glue Studio console at https://console.aws.amazon.com/gluestudio/.

  2. On the AWS Glue Studio landing page, choose View jobs under the heading Create and manage jobs.

            The screenshot shows the AWS Glue Studio landing page with the Create and manage jobs section highlighted.
  3. On the Jobs page, under the heading Create job, the following options will be selected by default:

    • Visual with a source and target

    • For the Source: Amazon Simple Storage Service

    • For the Target: Amazon Simple Storage Service

  4. Choose the Create button to start the job creation process.

The job editing page opens with a simple three-node job diagram displayed.


        The screenshot shows the job editing page with various components.
  • A - the visual job editor canvas. This is where you add nodes to create a job.

  • B - a visual job is represented by nodes on the canvas. A selected node is highlighted with a blue outline.

  • C - the node panel contains several tabs: Node properties, Output schema, and Data preview. When a node is selected, the node panel is displayed, along with an additional tab unique to that node for further configuration. For more information, see Job editor features.

  • D - the job editor tab ribbon. By default, Visual is selected. You can also choose Script, Job details, Runs, and Schedules. Runs and Schedules become available after the job has been run. For more information, see Editing ETL jobs in AWS Glue Studio.

  • E - the node toolbar provides actions to add Source, Transform, and Target nodes, undo and redo actions, remove nodes, and zoom in and out across the job editing canvas. For more information, see Editing ETL jobs in AWS Glue Studio.

  • F - by default, the job is named 'Untitled job'. Choose the text box to change the job name to a unique name.

  • G - the job editor action menus allow you to save, run, and delete the job. The Actions drop-down menu provides additional options when running the job.

Step 2: Edit the data source node in the job diagram

Choose the Data source - S3 bucket node in the job diagram to edit the data source properties.

To edit the data source node

  1. By default, the Data source properties - Amazon S3 tab is displayed.

    
              The screenshot shows the Data source properties - Amazon S3 tab and fields.
  2. By default, the Data Catalog table option for the Amazon S3 source type is already selected. The source type is determined by the Node type setting on the Node properties tab; by default, the node type is Amazon S3.

  3. For Database, choose the yyz-tickets database from the list of available databases in your AWS Glue Data Catalog. This database was already created for you when you launched the AWS CloudFormation stack earlier in this tutorial.

  4. For Table, choose the tickets table from the drop-down menu of tables in your AWS Glue Data Catalog. This table was already created for you when you launched the AWS CloudFormation stack earlier in this tutorial.

    After you have provided the required information for the data source node, a green check mark appears on the node in the job diagram.

  5. (Optional) Choose the Output schema tab in the node details panel to view the data schema.

  6. (Optional) On the Node properties tab in the node details panel, for Name, enter a name that is unique for this job.

    
            The screenshot shows the Node properties tab.

    The value you enter is used as the label for the data source node in the job diagram. If you use unique names for the nodes in your job, then it's easier to identify each node in the job diagram, and also to select parent nodes.

    You can also set the node type. Changing the node type will change the fields in the data source properties tab.

Step 3: Edit the transform node of the job

The transform node is where you specify how you want to modify the data from its original format. An ApplyMapping transform enables you to rename data property keys, change the data types, and drop columns from the dataset.

When you edit the Transform - ApplyMapping node, the original schema for your data is shown in the Source key column in the node details panel. This is the data property key name (column name) that is obtained from the source data and stored in the table in the AWS Glue Data Catalog.

The Target key column shows the key name that will appear in the data target. You can use this field to change the data property key name in the output. The Data type column shows the data type of the key and allows you to change it to a different data type for the target. The Drop column contains a check box; select it to drop a field from the target schema.
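The transform's behavior can be pictured as a list of (source key, source type, target key, target type) tuples, where any key not listed is dropped. The following is a minimal Python sketch of that semantics using field names from this tutorial; it is an illustration only, not the AWS Glue ApplyMapping API:

```python
# Illustrative only: simulate an ApplyMapping-style transform on one record.
# Each mapping is (source_key, source_type, target_key, target_type);
# keys absent from the mapping list are dropped from the output.
MAPPINGS = [
    ("ticket_number", "string", "ticket_number", "float"),
    ("officer", "string", "officer_name", "string"),
    ("set_fine_amount", "string", "set_fine_amount", "float"),
]

CASTS = {"float": float, "string": str}

def apply_mapping(record, mappings):
    """Rename and cast fields per the mappings; drop everything else."""
    out = {}
    for src, _src_type, dst, dst_type in mappings:
        if src in record:
            out[dst] = CASTS[dst_type](record[src])
    return out

row = {"ticket_number": "1234", "officer": "A. Nguyen",
       "set_fine_amount": "50", "province": "ON"}
print(apply_mapping(row, MAPPINGS))
# province is dropped; officer is renamed; numeric fields become floats
```

This mirrors the edits you make in the procedure below: renaming officer to officer_name, casting ticket_number and set_fine_amount to float, and dropping unneeded keys.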

To edit the transform node

  1. Choose the Transform - ApplyMapping node in the job diagram to edit the data transformation properties.

  2. In the node details panel, on the Node properties tab, review the information.

    Change the name of the node to Ticket_Mapping.

  3. Choose the Transform tab in the node details panel.

    
            The screenshot shows the Apply mapping transform tab and fields.
  4. Choose to drop the keys by selecting the check box in the Drop column for each key:

    • location1

    • location2

    • location3

    • location4

    • province

  5. For the source key officer, change the Target key value to officer_name.

    Change the data type for the ticket_number and set_fine_amount keys to float. When changing the data type, you must verify that the data type is supported by your target.

  6. (Optional) Choose the Output schema tab in the node details panel to view the modified schema.

Notice that the Transform - ApplyMapping node in the job diagram now has a green check mark, indicating that the node has been edited and has all the required information.

Step 4: Edit the data target node of the job

A data target node determines where the transformed output is sent. The location can be an Amazon S3 bucket, a Data Catalog table, or a connector and connection. If you choose a Data Catalog table, the data is written to the location associated with that table. For example, if you use a crawler to create a table in the Data Catalog for a JDBC target, the data is written to that JDBC table.

To edit the data target node

  1. Choose the Data target - S3 bucket node in the job diagram to edit the data target properties.

  2. In the node details panel on the right, choose the Node properties tab. For Name, enter a unique name for the node.

  3. Choose the Data target properties - S3 tab.

    
            The screenshot shows the Data target properties - Amazon S3 tab and available fields.
  4. For each field, make the following selections.

    For more information about the available options, see Overview of data target options.

    • Format: Parquet

    • Compression Type: GZIP

    • S3 Target Location: Choose the Browse S3 button to see the Amazon S3 buckets that you have access to. Choose an Amazon S3 bucket as the target destination.

    • Data Catalog update options: Do not update the Data Catalog
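In the generated job script, these selections surface as options on the S3 sink. Below is a rough sketch of the settings expressed as a plain config dict with a minimal sanity check; the structure is illustrative only, and the bucket name is hypothetical:

```python
# Illustrative only: the data target choices from this step as a config dict.
sink_config = {
    "connection_type": "s3",
    "format": "parquet",        # Format: Parquet
    "compression": "gzip",      # Compression Type: GZIP
    "path": "s3://your-target-bucket/output/",  # hypothetical S3 target location
    "update_catalog": False,    # Do not update the Data Catalog
}

def check_sink(cfg):
    """Basic sanity checks: an s3:// path and a recognized output format."""
    assert cfg["path"].startswith("s3://"), "S3 target location must be an s3:// URI"
    assert cfg["format"] in {"parquet", "csv", "json"}
    return True

print(check_sink(sink_config))  # True
```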

Step 5: Specify the job details and save the job

Before you can save and run your extract, transform, and load (ETL) job, you must first enter additional information about the job itself.

To specify the job details and save the job

  1. Choose the Job details tab.

  2. Enter a name for the job. Provide a UTF-8 string with a maximum length of 255 characters.

    (Optional) Enter a description of the job. Descriptions can be up to 2048 characters long.

  3. For the IAM role, choose AWSGlueStudioRole from the list of available roles.

    Note

    The AWS Identity and Access Management (IAM) role is used to authorize access to resources that are used to run the job. You can only choose roles that already exist in your account. The role you choose must have permission to access your Amazon S3 sources, targets, temporary directory, scripts, and any libraries used by the job, as well as access to AWS Glue service resources.

    For the steps to create a role, see Create an IAM Role for AWS Glue in the AWS Glue Developer Guide.

    You might have to add access to the target Amazon S3 bucket to this role.

    If you have many roles to choose from, you can start entering part of the role name in the IAM role search field, and the roles with the matching text string will be displayed. For example, you can enter 'tutorial' in the search field to find all roles with tutorial (case-insensitive) in the name.

  4. For the remaining fields, use the default values.

  5. Choose Save in the top-right corner of the page.

    You should see a notification at the top of the page that the job was successfully saved.

    
            The screenshot shows a successful confirmation message when the Save button is clicked.
Note

If you don't see a notification that your job was successfully saved, then there is most likely information missing that prevents the job from being saved.

  • Review the job in the visual editor, and choose any node that doesn't have a green check mark.

  • If any of the tabs above the visual editor pane have a callout, choose that tab and look for any fields that are highlighted in red.
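The IAM role search behavior described in the steps above amounts to a case-insensitive substring filter over role names. A quick sketch of that behavior (the role names here are hypothetical):

```python
def filter_roles(roles, query):
    """Return roles whose names contain the query, ignoring case."""
    q = query.lower()
    return [r for r in roles if q in r.lower()]

# Hypothetical role names; only the one containing 'tutorial' matches.
roles = ["AWSGlueStudioRole", "Tutorial-Glue-Role", "AdminRole"]
print(filter_roles(roles, "tutorial"))  # ['Tutorial-Glue-Role']
```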

Step 6: Run the job

Now that the job has been saved, you can run the job.

  1. Choose the Run button at the top of the page. You should then see a notification that the job was successfully started. You can also choose the Runs tab and choose Run jobs.

    
            The screenshot shows a successful confirmation message when the Run button is clicked.
  2. To view the job run details, choose the Run Details link in the notification, or choose the Runs tab to view the run status of the job.

  3. To view the job run details in the Runs tab, view the job run detail card for the recent job run. For more information about job run information, see View information for recent job runs.

Congratulations on completing this tutorial! You have learned how to create a visual job, edit nodes, inspect the job script, save and run a job, and view run details.

Next steps

After you start the job run, you might want to try some of the following tasks: