
Using a data preparation recipe in AWS Glue Studio

AWS Glue Studio allows you to use an AWS Glue DataBrew recipe in a visual workflow. This allows your AWS Glue DataBrew recipes to run in an AWS Glue job along with other AWS Glue Studio nodes.

In DataBrew, a recipe is a set of data transformation steps. A DataBrew recipe prescribes how to transform data that has already been read; it doesn't describe where and how to read the data, or how and where to write it. That is configured in the Source and Target nodes in AWS Glue Studio. For more information on recipes, see Creating and using AWS Glue DataBrew recipes.

The Data Preparation Recipe node is available from the Resource panel. You can connect the Data Preparation Recipe node to another node in the visual workflow, whether it is a Data source node or another transform node. After you choose an AWS Glue DataBrew recipe and version, the applied steps in the recipe are visible in the node properties tab.

Prerequisites

  • You have an AWS Glue DataBrew recipe created in AWS Glue DataBrew.

  • You have the required IAM permissions as described in the following section.

IAM permissions for AWS Glue DataBrew

This topic provides information to help you, as an IAM administrator, understand the actions and resources that you can use in an AWS Identity and Access Management (IAM) policy for the Data Preparation Recipe transform.

For additional information about security in AWS Glue, see Access Management.

The following table lists the permissions that a user needs in order to perform specific operations to use the Data Preparation Recipe transform.

Data Preparation Recipe transform actions
Action Description
databrew:ListRecipes Grants permission to retrieve AWS Glue DataBrew recipes.
databrew:ListRecipeVersions Grants permission to retrieve AWS Glue DataBrew recipe versions.
databrew:DescribeRecipe Grants permission to retrieve the description of an AWS Glue DataBrew recipe.

The role that you use to access this functionality must have a policy that allows the AWS Glue DataBrew actions listed above. You can achieve this either by using the AWSGlueConsoleFullAccess managed policy, which includes the necessary actions, or by adding the following inline policy to your role:

{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "databrew:ListRecipes", "databrew:ListRecipeVersions", "databrew:DescribeRecipe" ], "Resource": [ "*" ] } ] }

To use the Data Preparation Recipe transform, you must also add the iam:PassRole action to the permissions policy.

Additional required permissions
Action Description
iam:PassRole Grants permission to pass the approved role to AWS Glue.

Without these permissions, the following error occurs:

"errorCode": "AccessDenied" "errorMessage": "User: arn:aws:sts::account_id:assumed-role/AWSGlueServiceRole is not authorized to perform: iam:PassRole on resource: arn:aws:iam::account_id:role/service-role/AWSGlueServiceRole because no identity-based policy allows the iam:PassRole action"

Limitations

  • Not all AWS Glue DataBrew recipes are supported by AWS Glue. Some recipes cannot be run in AWS Glue Studio.

    • Recipes with UNION and JOIN transforms are not supported. However, AWS Glue Studio already has Join and Union transform nodes, which can be used before or after a Data Preparation Recipe node instead.

  • Data Preparation Recipe nodes are supported for jobs starting with AWS Glue version 4.0. This version is selected automatically after a Data Preparation Recipe node is added to the job.

  • Data Preparation Recipe nodes require Python as the job language. This is set automatically when the Data Preparation Recipe node is added to the job; the corresponding job-level settings are illustrated in the sketch after this list.

  • When using Data Preview, you will need to restart your data preview session after adding a Data Preparation Recipe node to your job.
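
For reference, these are the job-level settings that AWS Glue Studio applies on your behalf. A minimal boto3 sketch that creates an equivalent job definition by hand might look like the following; the job name, role ARN, and script location are placeholders.

import boto3

glue = boto3.client("glue")

# Placeholder values - AWS Glue Studio fills these in when you author the job visually.
glue.create_job(
    Name="recipe-job-example",
    Role="arn:aws:iam::123456789012:role/service-role/AWSGlueServiceRole",
    GlueVersion="4.0",                 # required for Data Preparation Recipe nodes
    Command={
        "Name": "glueetl",
        "PythonVersion": "3",          # the node requires a Python job
        "ScriptLocation": "s3://amzn-s3-demo-bucket/scripts/recipe-job-example.py",
    },
    NumberOfWorkers=2,
    WorkerType="G.1X",
)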

How to use AWS Glue DataBrew recipes in AWS Glue Studio

To use AWS Glue DataBrew recipes in AWS Glue Studio, begin by creating recipes in AWS Glue DataBrew. If you already have recipes you want to use, you can skip this step.

To create an AWS Glue DataBrew recipe in AWS Glue DataBrew:
  1. Author a recipe in AWS Glue DataBrew. For more information, see Getting started with AWS Glue DataBrew.

  2. Save your recipe.

  3. Publish your recipe. This will publish your recipe as version 1.0.
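
If you manage recipes programmatically instead of through the DataBrew console, the publish step above can also be done with boto3, as in this minimal sketch; the recipe name is a placeholder.

import boto3

databrew = boto3.client("databrew")

# Publish the latest working version of the recipe; the first publish becomes version 1.0.
response = databrew.publish_recipe(
    Name="my-recipe",                  # placeholder recipe name
    Description="First published version",
)
print(response["Name"])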

To use a Data Preparation Recipe node in AWS Glue Studio:

You can use more than one Data Preparation Recipe node in a visual ETL job. To do this, add a Data Preparation Recipe node by following the steps below, then add another Data Preparation Recipe node to the job. For example, a workflow might follow this pattern:

  • Data source 1 > recipe 1 > output 1

  • Data source 2 > recipe 2 > output 2

  • output 1, output 2 > JOIN

  1. Start an AWS Glue job in AWS Glue Studio with a data source.

  2. Add the Data Preparation Recipe node to your data source.

  3. Filter for a recipe by typing the recipe name in the search field.

  4. Choose the published version. Only published versions are available.

  5. Finish authoring the job by adding other transform nodes as needed, and add Data target node(s) to save the job output.

  6. Make necessary configuration changes in the Job details tab, like naming your job and adjusting allocated capacity as needed, and save the job.

  7. Run the job by choosing Run from the Actions drop-down menu.
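
If you prefer to start the job outside the console, a minimal boto3 sketch like the following starts a run and waits for it to finish; the job name is a placeholder.

import time
import boto3

glue = boto3.client("glue")

# Placeholder job name - use the name you gave the job in the Job details tab.
run = glue.start_job_run(JobName="recipe-job-example")
run_id = run["JobRunId"]

# Poll until the run reaches a terminal state.
while True:
    state = glue.get_job_run(JobName="recipe-job-example", RunId=run_id)["JobRun"]["JobRunState"]
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        print("Job run ended with state:", state)
        break
    time.sleep(30)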

To change schema if the data source is Amazon S3 and the data format is CSV:

If the columns in a CSV file are initially loaded as the string data type in AWS Glue Studio, you need to ensure that the column data types are compatible with the rest of the steps in the AWS Glue DataBrew recipe.

AWS Glue DataBrew recipes only prescribe how to transform data that has already been read. They don't describe where and how to read the data.

  1. Add a Change Schema node before the Data Preparation Recipe node.

  2. Choose the Change Schema node and change the schema to match the column data types in AWS Glue DataBrew by selecting the new data type in the Transform tab for each column as needed.

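In the script that AWS Glue Studio generates, a Change Schema node corresponds to an ApplyMapping transform. The following is only a sketch of the casts this step performs; the column names and target types are placeholders, and the snippet assumes it runs inside the generated Glue 4.0 Python job script, where datasource_node is the DynamicFrame produced by the Data source node.

from awsglue.transforms import ApplyMapping

# Cast the string columns read from CSV to the types the DataBrew recipe expects.
# Column names and target types below are placeholder examples.
changed_schema_node = ApplyMapping.apply(
    frame=datasource_node,
    mappings=[
        ("order_id", "string", "order_id", "int"),
        ("price", "string", "price", "double"),
        ("order_date", "string", "order_date", "timestamp"),
    ],
)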

To change schema if the data source is headerless:

AWS Glue DataBrew recipes only prescribe how to transform data that has already been read. They don't describe where and how to read the data.

When you load headerless datasets in AWS Glue Studio, the default column names are different from those loaded in AWS Glue DataBrew.

  1. In the ETL job, add a Change Schema node before the Data Preparation Recipe node.

  2. Choose the Change Schema node and change the column names to the same names used in the AWS Glue DataBrew recipe.
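
In the generated script this renaming is again an ApplyMapping transform. A minimal sketch, assuming the headerless source arrives with default names such as col0 and col1 (placeholders), the recipe expects descriptive names, and datasource_node is the DynamicFrame produced by the Data source node:

from awsglue.transforms import ApplyMapping

# Rename the default headerless column names to the names the DataBrew recipe uses.
# col0/col1 and the target names are placeholder examples.
renamed_node = ApplyMapping.apply(
    frame=datasource_node,
    mappings=[
        ("col0", "string", "customer_id", "string"),
        ("col1", "string", "customer_name", "string"),
    ],
)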