Setting up for AWS Glue Studio

Complete the tasks in this section when you're using AWS Glue Studio for the first time.

Sign up for AWS

If you do not have an AWS account, complete the following steps to create one.

To sign up for an AWS account

  1. Open https://portal.aws.amazon.com/billing/signup.

  2. Follow the online instructions.

    Part of the sign-up procedure involves receiving a phone call and entering a verification code on the phone keypad.

Create an IAM administrator user

If your account already includes an IAM user with full AWS administrative permissions, you can skip this section.

To create an administrator user for yourself and add the user to an administrators group (console)

  1. Sign in to the IAM console as the account owner by choosing Root user and entering your AWS account email address. On the next page, enter your password.

    Note

    We strongly recommend that you use the Administrator IAM user created in the following steps and securely lock away the root user credentials. Sign in as the root user only to perform a few account and service management tasks.

  2. In the navigation pane, choose Users and then choose Add user.

  3. For User name, enter Administrator.

  4. Select the check box next to AWS Management Console access. Then select Custom password, and then enter your new password in the text box.

  5. (Optional) By default, AWS requires the new user to create a new password when first signing in. To skip this requirement, clear the check box next to User must create a new password at next sign-in.

  6. Choose Next: Permissions.

  7. Under Set permissions, choose Add user to group.

  8. Choose Create group.

  9. In the Create group dialog box, for Group name enter Administrators.

  10. Choose Filter policies, and then select AWS managed - job function to filter the table contents.

  11. In the policy list, select the check box for AdministratorAccess. Then choose Create group.

    Note

    You must activate IAM user and role access to Billing before you can use the AdministratorAccess permissions to access the AWS Billing and Cost Management console. To do this, follow the instructions in step 1 of the tutorial about delegating access to the billing console.

  12. Back in the list of groups, select the check box for your new group. Choose Refresh if necessary to see the group in the list.

  13. Choose Next: Tags.

  14. (Optional) Add metadata to the user by attaching tags as key-value pairs. For more information about using tags in IAM, see Tagging IAM entities in the IAM User Guide.

  15. Choose Next: Review to see the list of group memberships to be added to the new user. When you are ready to proceed, choose Create user.

You can use this same process to create more groups and users and to give your users access to your AWS account resources. To learn about using policies that restrict user permissions to specific AWS resources, see Access management and Example policies.

Signing in as an IAM user

Sign in to the IAM console by choosing IAM user and entering your AWS account ID or account alias. On the next page, enter your IAM user name and your password.

Note

For your convenience, the AWS sign-in page uses a browser cookie to remember your IAM user name and account information. If you previously signed in as a different user, choose the sign-in link beneath the button to return to the main sign-in page. From there, you can enter your AWS account ID or account alias to be redirected to the IAM user sign-in page for your account.

IAM permissions needed for the AWS Glue Studio user

To use AWS Glue Studio, the user must have access to various AWS resources. The user must be able to view and select Amazon S3 buckets, IAM policies and roles, and AWS Glue Data Catalog objects.

AWS Glue service permissions

AWS Glue Studio uses the actions and resources of the AWS Glue service. Your user needs permissions on these actions and resources to effectively use AWS Glue Studio. You can grant the AWS Glue Studio user the AWSGlueConsoleFullAccess managed policy, or create a custom policy with a smaller set of permissions.

Important

As a security best practice, tighten your policies to further restrict access to Amazon S3 buckets and Amazon CloudWatch log groups. For an example Amazon S3 policy, see Writing IAM Policies: How to Grant Access to an Amazon S3 Bucket.

Amazon CloudWatch permissions

You can monitor your AWS Glue Studio jobs using Amazon CloudWatch, which collects and processes raw data from AWS Glue into readable, near-real-time metrics. By default, AWS Glue metrics data is sent to CloudWatch automatically. For more information, see What Is Amazon CloudWatch? in the Amazon CloudWatch User Guide, and AWS Glue Metrics in the AWS Glue Developer Guide.

To access CloudWatch dashboards, the user accessing AWS Glue Studio needs one of the following:

  • The AdministratorAccess policy

  • The CloudWatchFullAccess policy

  • A custom policy that includes one or more of these specific permissions:

    • cloudwatch:GetDashboard and cloudwatch:ListDashboards to view dashboards

    • cloudwatch:PutDashboard to create or modify dashboards

    • cloudwatch:DeleteDashboards to delete dashboards
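
As an illustration of the custom policy option above, the following sketch builds an IAM policy document from the listed CloudWatch actions. The function name dashboard_policy and its flags are hypothetical, and the "Resource": "*" element is an assumption made for brevity; narrow it to specific dashboards where you can.

```python
import json


def dashboard_policy(allow_edit=False, allow_delete=False):
    """Build an IAM policy document for CloudWatch dashboard access."""
    # Viewing dashboards uses both of these actions.
    actions = ["cloudwatch:GetDashboard", "cloudwatch:ListDashboards"]
    if allow_edit:
        # Creating or modifying dashboards.
        actions.append("cloudwatch:PutDashboard")
    if allow_delete:
        # Deleting dashboards.
        actions.append("cloudwatch:DeleteDashboards")
    return {
        "Version": "2012-10-17",
        "Statement": [
            # Assumption: "*" is used only to keep the sketch short.
            {"Effect": "Allow", "Action": actions, "Resource": "*"}
        ],
    }


if __name__ == "__main__":
    # Print a view-only policy as JSON, ready to paste into the IAM console.
    print(json.dumps(dashboard_policy(), indent=2))
```

Calling dashboard_policy(allow_edit=True) adds the permission to create or modify dashboards while still leaving out the delete permission.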

For more information about changing permissions for an IAM user using policies, see Changing Permissions for an IAM User in the IAM User Guide.

Job-related permissions

When you create a job using AWS Glue Studio, the job assumes the permissions of the IAM role that you specify when you create it. This IAM role must have permission to extract data from your data source, write data to your target, and access AWS Glue resources.

The name of the role that you create for the job must start with the string AWSGlueServiceRole for it to be used correctly by AWS Glue Studio. For example, you might name your role AWSGlueServiceRole-FlightDataJob.
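
A quick way to catch a misnamed role before creating it is a prefix check like the one below. This is only a sketch of the naming rule stated above; the function name is hypothetical and the check is case-sensitive by assumption.

```python
REQUIRED_PREFIX = "AWSGlueServiceRole"


def is_valid_job_role_name(role_name: str) -> bool:
    """Return True if the role name starts with the required prefix."""
    # Assumption: the comparison is case-sensitive.
    return role_name.startswith(REQUIRED_PREFIX)


if __name__ == "__main__":
    for name in ("AWSGlueServiceRole-FlightDataJob", "FlightDataJobRole"):
        print(name, is_valid_job_role_name(name))
```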

Data source and data target permissions

An AWS Glue Studio job must have access to Amazon S3 for any sources, targets, scripts, and temporary directories that you use in your job. You can create a policy to provide fine-grained access to specific Amazon S3 resources.

  • Data sources require s3:ListBucket and s3:GetObject permissions.

  • Data targets require s3:ListBucket, s3:PutObject, and s3:DeleteObject permissions.
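
The two requirements above can be combined into one fine-grained policy for the job role. The following is a minimal sketch, assuming one source bucket and one target bucket; the function name and the bucket names are hypothetical.

```python
import json


def s3_job_policy(source_bucket: str, target_bucket: str) -> dict:
    """Build an IAM policy document for a job's S3 source and target."""
    source_arn = f"arn:aws:s3:::{source_bucket}"
    target_arn = f"arn:aws:s3:::{target_bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # s3:ListBucket applies to the bucket itself.
                "Sid": "ListSourceAndTarget",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": [source_arn, target_arn],
            },
            {
                # Object-level read access for the data source.
                "Sid": "ReadSource",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"{source_arn}/*",
            },
            {
                # Object-level write and delete access for the data target.
                "Sid": "WriteTarget",
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:DeleteObject"],
                "Resource": f"{target_arn}/*",
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical bucket names, for illustration only.
    policy = s3_job_policy("example-flight-data-raw", "example-flight-data-curated")
    print(json.dumps(policy, indent=2))
```

Script and temporary directories that live in other buckets would need their own statements; they are left out here to keep the sketch short.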

If you choose Amazon Redshift as your data source, you can provide a role for cluster permissions. Jobs that run against an Amazon Redshift cluster issue commands that access Amazon S3 for temporary storage using temporary credentials. If your job runs for more than an hour, these credentials expire, causing the job to fail. To avoid this problem, you can assign a role to the Amazon Redshift cluster itself that grants the necessary permissions to jobs using temporary credentials. For more information, see Moving Data to and from Amazon Redshift in the AWS Glue Developer Guide.

If the job uses data sources or targets other than Amazon S3, then you must attach the necessary permissions to the IAM role used by the job to access these data sources and targets. For more information, see Setting Up Your Environment to Access Data Stores in the AWS Glue Developer Guide.

If you're using connectors and connections for your data store, you need additional permissions, as described in Additional permissions when using connectors.

Permissions required for deleting jobs

In AWS Glue Studio, you can select multiple jobs in the console and delete them in one action. To perform this action, you must have the glue:BatchDeleteJob permission. This is different from the AWS Glue console, which requires the glue:DeleteJob permission for deleting jobs.
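
The difference between the two consoles comes down to a single action name in the policy. The sketch below shows that choice; the function name is hypothetical, and "Resource": "*" is an assumption made for brevity.

```python
import json


def job_delete_policy(use_glue_studio: bool = True) -> dict:
    """Build a policy that allows deleting jobs from the chosen console."""
    # AWS Glue Studio deletes jobs with glue:BatchDeleteJob;
    # the AWS Glue console uses glue:DeleteJob.
    action = "glue:BatchDeleteJob" if use_glue_studio else "glue:DeleteJob"
    return {
        "Version": "2012-10-17",
        "Statement": [
            # Assumption: "*" keeps the sketch short; restrict it if possible.
            {"Effect": "Allow", "Action": action, "Resource": "*"}
        ],
    }


if __name__ == "__main__":
    print(json.dumps(job_delete_policy(), indent=2))
```

A user who works in both consoles would need both actions in the same statement.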

AWS Key Management Service permissions

If you plan to access Amazon S3 sources and targets that use server-side encryption with AWS Key Management Service (AWS KMS), then attach a policy to the AWS Glue Studio role used by the job that enables the job to decrypt the data. The job role needs the kms:ReEncrypt, kms:GenerateDataKey, and kms:DescribeKey permissions. Additionally, the job role needs the kms:Decrypt permission to upload or download an Amazon S3 object that is encrypted with an AWS KMS customer master key (CMK).
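
A policy statement covering these AWS KMS permissions might look like the following sketch. The key ARN and the function name are hypothetical. The wildcard forms kms:ReEncrypt* and kms:GenerateDataKey* are an assumption about how the permission names above map to IAM action names; they are a common way to write these permissions in IAM policies.

```python
import json


def kms_job_statement(key_arn: str) -> dict:
    """Build a policy statement that lets a job use one KMS key."""
    return {
        "Sid": "UseKmsKeyForS3Data",
        "Effect": "Allow",
        "Action": [
            "kms:Decrypt",
            # Assumption: wildcard forms stand in for the ReEncrypt and
            # GenerateDataKey permissions named in the text.
            "kms:ReEncrypt*",
            "kms:GenerateDataKey*",
            "kms:DescribeKey",
        ],
        # Scope the statement to the single key that protects the data.
        "Resource": key_arn,
    }


if __name__ == "__main__":
    # Hypothetical key ARN, for illustration only.
    key_arn = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"
    policy = {"Version": "2012-10-17", "Statement": [kms_job_statement(key_arn)]}
    print(json.dumps(policy, indent=2))
```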

There are additional charges for using AWS KMS CMKs. For more information, see AWS Key Management Service Concepts - Customer Master Keys (CMKs) and AWS Key Management Service Pricing in the AWS Key Management Service Developer Guide.

Additional permissions when using connectors

If you're using an AWS Glue Custom Connector and connection to access a data store, the role used to run the AWS Glue ETL job needs additional permissions attached:

  • The AWS managed policy AmazonEC2ContainerRegistryReadOnly for accessing connectors purchased from AWS Marketplace.

  • The glue:GetJob and glue:GetJobs permissions.

  • AWS Secrets Manager permissions for accessing secrets that are used with connections. Refer to IAM policy examples for secrets in AWS Secrets Manager for example IAM policies.
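
The permissions in the list above can be sketched as one managed policy ARN to attach plus one inline policy document. The function name and the secret ARN are hypothetical, and secretsmanager:GetSecretValue is an assumption about the minimum Secrets Manager action a connection needs; check the Secrets Manager policy examples for your case.

```python
import json

# AWS managed policy for reading connector images from AWS Marketplace.
ECR_READ_ONLY_ARN = "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"


def connector_policy(secret_arn: str) -> dict:
    """Build the inline policy for a job role that uses a connector."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadJobDefinitions",
                "Effect": "Allow",
                "Action": ["glue:GetJob", "glue:GetJobs"],
                # Assumption: "*" keeps the sketch short.
                "Resource": "*",
            },
            {
                "Sid": "ReadConnectionSecret",
                "Effect": "Allow",
                # Assumption: reading the secret value is sufficient.
                "Action": "secretsmanager:GetSecretValue",
                "Resource": secret_arn,
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical secret ARN, for illustration only.
    secret = "arn:aws:secretsmanager:us-east-1:111122223333:secret:example-connection"
    print("Attach managed policy:", ECR_READ_ONLY_ARN)
    print(json.dumps(connector_policy(secret), indent=2))
```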

If your AWS Glue ETL job runs within a virtual private cloud (VPC) created with Amazon VPC, then the VPC must be configured as described in Configuring a VPC for your ETL job.

Set up IAM permissions for AWS Glue Studio

You can create the roles and assign policies to users and job roles by using the AWS administrator user.

To create an IAM policy and role for use with AWS Glue Studio

  1. Create an IAM policy for the AWS Glue service.

    You can use the AWSGlueConsoleFullAccess AWS managed policy.

    To create your own policy, follow the steps documented in Create an IAM Policy for the AWS Glue Service in the AWS Glue Developer Guide.

  2. Create an IAM role for AWS Glue and attach the IAM policy to this role.

    Follow the steps documented in Create an IAM Role for AWS Glue in the AWS Glue Developer Guide.

  3. Create a user for AWS Glue or AWS Glue Studio.

    You can either use the administrator user for configuring AWS Glue resources, or you can create a separate user for accessing AWS Glue Studio.

    To create additional users for AWS Glue and AWS Glue Studio, follow the steps in Creating Your First IAM Delegated User and Group in the IAM User Guide.
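
For readers who script the role creation in step 2 instead of using the console, the sketch below prepares the request parameters. It assumes the role is assumed by the AWS Glue service principal (glue.amazonaws.com) and that the parameters are later passed to the boto3 IAM client's create_role call. The function name is hypothetical, and the sketch only builds the parameters; it does not call AWS.

```python
import json

# Trust policy that lets the AWS Glue service assume the role.
GLUE_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# ARN of the AWS managed policy mentioned in step 1.
GLUE_CONSOLE_FULL_ACCESS_ARN = "arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess"


def create_role_request(role_name: str) -> dict:
    """Return keyword arguments for an IAM create_role call."""
    if not role_name.startswith("AWSGlueServiceRole"):
        raise ValueError("role name must start with AWSGlueServiceRole")
    return {
        "RoleName": role_name,
        # The IAM API expects the trust policy as a JSON string.
        "AssumeRolePolicyDocument": json.dumps(GLUE_TRUST_POLICY),
    }


if __name__ == "__main__":
    params = create_role_request("AWSGlueServiceRole-FlightDataJob")
    print(json.dumps(params, indent=2))
    # With boto3 installed and credentials configured, this could be sent as:
    #   boto3.client("iam").create_role(**params)
```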

Configuring a VPC for your ETL job

You can use Amazon Virtual Private Cloud (Amazon VPC) to define a virtual network in your own logically isolated area within the AWS Cloud, known as a virtual private cloud (VPC). You can launch your AWS resources, such as instances, into your VPC. Your VPC closely resembles a traditional network that you might operate in your own data center, with the benefits of using the scalable infrastructure of AWS. You can configure your VPC by selecting its IP address range, creating subnets, and configuring route tables, network gateways, and security settings. You can connect instances in your VPC to the internet. You can also connect your VPC to your own corporate data center, making the AWS Cloud an extension of your data center. To protect the resources in each subnet, you can use multiple layers of security, including security groups and network access control lists. For more information, see the Amazon VPC User Guide.

You can configure your AWS Glue ETL jobs to run within a VPC when using connectors. You must configure your VPC for the following, as needed:

  • Public network access for data stores not in AWS. All data stores that are accessed by the job must be available from the VPC subnet.

  • If your job needs to access both VPC resources and the public internet, the VPC needs to have a network address translation (NAT) gateway inside the VPC.

    For more information, see Setting Up Your Environment to Access Data Stores in the AWS Glue Developer Guide.

Populate the AWS Glue Data Catalog

AWS Glue Studio uses datasets that are defined in the AWS Glue Data Catalog. These datasets are used as sources and targets for ETL workflows in AWS Glue Studio. If you choose the Data Catalog for your data source or target, then the Data Catalog tables related to your data source or data target must exist prior to creating a job.

When reading from or writing to a data source, your ETL job needs to know the schema of the data. The ETL job can get this information from a table in the AWS Glue Data Catalog. You can use a crawler, the AWS Glue console, AWS CLI, or an AWS CloudFormation template file to add databases and tables to the Data Catalog. For more information about populating the Data Catalog, see Data Catalog in the AWS Glue Developer Guide.
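
As one way to add a table programmatically, the sketch below builds a table definition for CSV files in Amazon S3 and passes it to a Glue client's create_table call (the boto3 Glue client accepts DatabaseName and TableInput). The helper names, database name, table name, columns, and S3 location are all hypothetical. The input format, output format, and SerDe are the usual Hive classes for delimited text and are an assumption about your data.

```python
def csv_table_input(name: str, location: str, columns: list) -> dict:
    """Build a TableInput structure for comma-delimited files in S3."""
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "csv"},
        "StorageDescriptor": {
            # Each column is a (name, type) pair, for example ("year", "int").
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            "Location": location,
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
        },
    }


def register_table(glue_client, database: str, table_input: dict) -> None:
    """Create the table in an existing Data Catalog database."""
    glue_client.create_table(DatabaseName=database, TableInput=table_input)


if __name__ == "__main__":
    # Hypothetical table definition, for illustration only.
    table = csv_table_input(
        "flights",
        "s3://example-flight-data-raw/flights/",
        [("year", "int"), ("carrier", "string"), ("dep_delay", "double")],
    )
    print(table["StorageDescriptor"]["Columns"])
    # With boto3 installed and credentials configured:
    #   import boto3
    #   register_table(boto3.client("glue"), "example_db", table)
```

Passing the client in as a parameter keeps the table definition testable without AWS credentials.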

When using connectors, you can use the schema builder to enter the schema information when you configure the data source node of your ETL job in AWS Glue Studio. For more information, see Authoring jobs with custom connectors.

If you choose an Amazon S3 location as your data source, AWS Glue Studio can automatically infer the schema of the data it reads from the files at the specified location. For more information, see Using files in Amazon S3 for the data source.

If you choose a streaming data source, AWS Glue Studio can automatically infer the schema of the data it reads from the data stream. For more information, see Using a streaming data source.