Building a secure data pipeline - AWS Glue Best Practices: Building a Secure and Reliable Data Pipeline

Building a secure data pipeline

The Well-Architected Security pillar focuses on protecting data and information. It describes how to take advantage of cloud technologies to protect data and information in a way that can improve your security posture. Following are some of the best practices to consider to improve the security of data and information with your data pipeline in AWS Glue.

Data encryption

Encryption of data at rest

AWS Glue supports encryption at rest for authoring jobs and for developing scripts using development endpoints. You provide encryption settings by attaching a security configuration. A security configuration specifies either Amazon S3-managed server-side encryption keys (SSE-S3) or customer managed keys (CMKs) stored in AWS Key Management Service (AWS KMS) (SSE-KMS). As of this writing, AWS Glue supports only symmetric CMKs.

You can encrypt the metadata stored in the AWS Glue Data Catalog using AWS KMS. Additionally, you can use AWS KMS keys to encrypt job bookmarks, and the logs generated by AWS Glue crawlers and ETL jobs.
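As a minimal sketch of what such a security configuration might look like, the following Python builds the request for the AWS Glue CreateSecurityConfiguration API. The configuration name and the KMS key ARN are placeholder assumptions, and the boto3 call is shown only as a comment so that the snippet runs without AWS credentials.

```python
import json


def build_security_configuration(name: str, kms_key_arn: str) -> dict:
    """Build keyword arguments for the Glue CreateSecurityConfiguration API.

    Encrypts S3 output and CloudWatch logs with SSE-KMS, and job bookmarks
    with client-side KMS encryption (CSE-KMS).
    """
    return {
        "Name": name,
        "EncryptionConfiguration": {
            "S3Encryption": [
                {"S3EncryptionMode": "SSE-KMS", "KmsKeyArn": kms_key_arn}
            ],
            "CloudWatchEncryption": {
                "CloudWatchEncryptionMode": "SSE-KMS",
                "KmsKeyArn": kms_key_arn,
            },
            "JobBookmarksEncryption": {
                "JobBookmarksEncryptionMode": "CSE-KMS",
                "KmsKeyArn": kms_key_arn,
            },
        },
    }


# Placeholder values (assumptions for illustration only).
security_config_request = build_security_configuration(
    "example-security-configuration",
    "arn:aws:kms:us-west-2:123456789012:key/EXAMPLE-KEY-ID",
)
print(json.dumps(security_config_request, indent=2))

# With boto3 installed and credentials configured, the request could be sent as:
#   boto3.client("glue").create_security_configuration(**security_config_request)
```

The resulting configuration name can then be referenced when creating a job, crawler, or development endpoint.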

In AWS Glue, you control encryption settings in the following places:

  • The settings of your Data Catalog.

  • The security configurations that you create.

  • The server-side encryption setting (SSE-S3 or SSE-KMS) that is passed as a parameter to your AWS Glue ETL job.

Encrypting your Data Catalog

You can turn on encryption of your AWS Glue Data Catalog objects. The setting applies to the entire Data Catalog and can be turned on or off. When you turn it on, you specify an AWS KMS key that is automatically used when objects, such as tables, are written to the Data Catalog. The encrypted objects include the following:

  • Databases

  • Tables

  • Partitions

  • Table versions

  • Connections

  • User-defined functions

The AWS Glue Data Catalog also allows you to encrypt connection-specific passwords using AWS KMS keys.
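The two Data Catalog settings described above (encryption at rest and connection password encryption) can be sketched as a single request to the AWS Glue PutDataCatalogEncryptionSettings API. The key identifier below is a placeholder assumption, and using the same key for both settings is an illustrative choice, not a requirement.

```python
import json


def build_catalog_encryption_settings(kms_key_id: str) -> dict:
    """Build keyword arguments for the Glue PutDataCatalogEncryptionSettings API."""
    return {
        "DataCatalogEncryptionSettings": {
            # Encrypts catalog objects such as databases, tables, and partitions.
            "EncryptionAtRest": {
                "CatalogEncryptionMode": "SSE-KMS",
                "SseAwsKmsKeyId": kms_key_id,
            },
            # Encrypts passwords stored in connection definitions.
            "ConnectionPasswordEncryption": {
                "ReturnConnectionPasswordEncrypted": True,
                "AwsKmsKeyId": kms_key_id,
            },
        }
    }


# Placeholder key ARN (assumption for illustration only).
catalog_settings_request = build_catalog_encryption_settings(
    "arn:aws:kms:us-west-2:123456789012:key/EXAMPLE-KEY-ID"
)
print(json.dumps(catalog_settings_request, indent=2))

# With boto3 installed and credentials configured, the request could be sent as:
#   boto3.client("glue").put_data_catalog_encryption_settings(**catalog_settings_request)
```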

Encryption of data in transit

AWS provides Transport Layer Security (TLS) encryption for data in motion between AWS Glue and S3.

An AWS Glue connection allows you to configure TLS certificates for database configurations as well.
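As a rough illustration, a JDBC connection that enforces TLS might be defined with a request like the one below for the AWS Glue CreateConnection API. The connection name, JDBC URL, certificate location, and secret name are all hypothetical placeholders; storing credentials in AWS Secrets Manager (via SECRET_ID) is an assumption made here to avoid putting a password in the example.

```python
import json


def build_tls_jdbc_connection(name: str, jdbc_url: str, cert_s3_path: str, secret_id: str) -> dict:
    """Build keyword arguments for the Glue CreateConnection API (JDBC over TLS)."""
    return {
        "ConnectionInput": {
            "Name": name,
            "ConnectionType": "JDBC",
            "ConnectionProperties": {
                "JDBC_CONNECTION_URL": jdbc_url,
                # Fail the connection if it cannot be established over SSL/TLS.
                "JDBC_ENFORCE_SSL": "true",
                # Optional custom root certificate stored in Amazon S3.
                "CUSTOM_JDBC_CERT": cert_s3_path,
                # Credentials are looked up from AWS Secrets Manager.
                "SECRET_ID": secret_id,
            },
        }
    }


# All values below are hypothetical placeholders.
tls_connection_request = build_tls_jdbc_connection(
    "example-tls-connection",
    "jdbc:postgresql://db.example.internal:5432/sales",
    "s3://example-bucket/certs/root-ca.pem",
    "example/db-credentials",
)
print(json.dumps(tls_connection_request, indent=2))

# With boto3 installed and credentials configured, the request could be sent as:
#   boto3.client("glue").create_connection(**tls_connection_request)
```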

Network security

The AWS Glue Data Catalog and AWS Glue ETL are serverless services and, by default, can be accessed from outside a VPC using the AWS Glue APIs. AWS Glue provides three AWS Identity and Access Management (IAM) condition keys: glue:VpcIds, glue:SubnetIds, and glue:SecurityGroupIds. You can use these condition keys in IAM policies when granting permissions to create and update jobs, to ensure that jobs are not created (or updated) to run outside of a desired VPC environment.
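A minimal sketch of such a policy, generated in Python, is shown below. It allows glue:CreateJob and glue:UpdateJob only when the job's network settings match approved values. The statement ID and the VPC, subnet, and security group IDs are placeholder assumptions, and the choice of the ForAnyValue:StringLike operator is one possible way to express the condition.

```python
import json


def build_vpc_restricted_job_policy(vpc_ids, subnet_ids, security_group_ids) -> dict:
    """Build an IAM policy that permits job creation/update only inside approved networks."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowJobsOnlyInApprovedNetwork",
                "Effect": "Allow",
                "Action": ["glue:CreateJob", "glue:UpdateJob"],
                "Resource": "*",
                "Condition": {
                    # All three keys must match for the request to be allowed.
                    "ForAnyValue:StringLike": {
                        "glue:VpcIds": list(vpc_ids),
                        "glue:SubnetIds": list(subnet_ids),
                        "glue:SecurityGroupIds": list(security_group_ids),
                    }
                },
            }
        ],
    }


# Placeholder network identifiers (assumptions for illustration only).
vpc_policy = build_vpc_restricted_job_policy(
    ["vpc-0example"], ["subnet-0example"], ["sg-0example"]
)
print(json.dumps(vpc_policy, indent=2))
```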

To support certain use cases, AWS Glue ETL jobs may need to connect to services running inside a VPC - for example, a database running inside a private subnet. To support these use cases, AWS Glue provides an AWS Glue network connection feature. You can configure an AWS Glue network connection based on a VPC, VPC subnet, and security groups and attach it to an AWS Glue job. When the AWS Glue job runs, it creates an Elastic Network Interface (ENI) using the configuration defined in the network connection, and uses that ENI to access resources running within the VPC, making the connection secure and private without routing traffic through public networks.
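The network connection described above can be sketched as a request to the AWS Glue CreateConnection API with the NETWORK connection type. The connection name, subnet, security group, and Availability Zone below are placeholder assumptions.

```python
import json


def build_network_connection(name: str, subnet_id: str, security_group_ids, availability_zone: str) -> dict:
    """Build keyword arguments for the Glue CreateConnection API (NETWORK type)."""
    return {
        "ConnectionInput": {
            "Name": name,
            "ConnectionType": "NETWORK",
            # A NETWORK connection carries no data-store properties.
            "ConnectionProperties": {},
            # Where the job's elastic network interface is created.
            "PhysicalConnectionRequirements": {
                "SubnetId": subnet_id,
                "SecurityGroupIdList": list(security_group_ids),
                "AvailabilityZone": availability_zone,
            },
        }
    }


# Placeholder values (assumptions for illustration only).
network_connection_request = build_network_connection(
    "example-private-network", "subnet-0example", ["sg-0example"], "us-west-2a"
)
print(json.dumps(network_connection_request, indent=2))

# With boto3 installed and credentials configured, the request could be sent as:
#   boto3.client("glue").create_connection(**network_connection_request)
# A job then references the connection by name, for example:
#   Connections={"Connections": ["example-private-network"]}
```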

Applications and services running inside a VPC without a NAT gateway or internet gateway cannot connect to AWS Glue out of the box, because there is no network path to the AWS Glue API endpoints. To allow applications running in such environments to access AWS Glue, you can create AWS Glue VPC endpoints (VPCE) in the VPC subnets and route AWS Glue-specific connections through them.
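As a hedged sketch, an interface VPC endpoint for AWS Glue could be requested through the Amazon EC2 CreateVpcEndpoint API with parameters like the following. The Region, VPC, subnet, and security group IDs are placeholder assumptions, and enabling private DNS is an illustrative choice.

```python
import json


def build_glue_vpc_endpoint_request(region: str, vpc_id: str, subnet_ids, security_group_ids) -> dict:
    """Build keyword arguments for the EC2 CreateVpcEndpoint API targeting AWS Glue."""
    return {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        # Regional service name for the AWS Glue API.
        "ServiceName": f"com.amazonaws.{region}.glue",
        "SubnetIds": list(subnet_ids),
        "SecurityGroupIds": list(security_group_ids),
        # Lets the default Glue API hostname resolve to the endpoint inside the VPC.
        "PrivateDnsEnabled": True,
    }


# Placeholder values (assumptions for illustration only).
vpc_endpoint_request = build_glue_vpc_endpoint_request(
    "us-west-2", "vpc-0example", ["subnet-0example"], ["sg-0example"]
)
print(json.dumps(vpc_endpoint_request, indent=2))

# With boto3 installed and credentials configured, the request could be sent as:
#   boto3.client("ec2").create_vpc_endpoint(**vpc_endpoint_request)
```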

Managing Identity and Access in AWS Glue

Authentication

AWS provides two primary permission mechanisms to access AWS Glue: using an IAM user, or an IAM role.

  • IAM user — A user is an identity within your AWS account that has specific custom permissions; for example, permissions to create a table in AWS Glue, or run an AWS Glue ETL job.

  • IAM role – An IAM role is an IAM identity that you can create in your account that has specific permissions. An IAM role is similar to a user. However, instead of being uniquely associated with one person, a role is intended to be assumable by anyone who needs it. A role also does not have standard long-term credentials, such as a password or access keys, associated with it. Instead, when you assume a role, it provides you with temporary security credentials for your role session.

    IAM roles with temporary credentials are useful while using:

    • Federated user access from AWS Directory Service, your enterprise user directory, or a web identity provider.

    • AWS service access – A service role is an IAM role that a service assumes to perform actions on your behalf. An IAM administrator can create, modify, and delete a service role from within IAM. Example: When the AWS Glue service runs an AWS Glue ETL job on your behalf.
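To make the service role case concrete, here is a minimal sketch of the trust policy that lets the AWS Glue service assume a role. The role name in the trailing comment is a hypothetical placeholder.

```python
import json


def build_glue_trust_policy() -> dict:
    """Build a trust policy that allows the AWS Glue service to assume a role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "glue.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


glue_trust_policy = build_glue_trust_policy()
print(json.dumps(glue_trust_policy, indent=2))

# With boto3 installed and credentials configured, a role could be created as:
#   boto3.client("iam").create_role(
#       RoleName="ExampleGlueServiceRole",  # hypothetical name
#       AssumeRolePolicyDocument=json.dumps(glue_trust_policy),
#   )
```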

To provide access, add permissions to your users, groups, or roles.

Permission policy example:

The following policy grants all permissions on an AWS Glue table named books in database db1. This includes read and write permissions on the table itself, on archived versions of it, and on all its partitions. This policy can then be attached to a role to grant it these permissions.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "FullAccessOnTable",
      "Effect": "Allow",
      "Action": [
        "glue:CreateTable",
        "glue:GetTable",
        "glue:GetTables",
        "glue:UpdateTable",
        "glue:DeleteTable",
        "glue:BatchDeleteTable",
        "glue:GetTableVersion",
        "glue:GetTableVersions",
        "glue:DeleteTableVersion",
        "glue:BatchDeleteTableVersion",
        "glue:CreatePartition",
        "glue:BatchCreatePartition",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:BatchGetPartition",
        "glue:UpdatePartition",
        "glue:DeletePartition",
        "glue:BatchDeletePartition"
      ],
      "Resource": [
        "arn:aws:glue:us-west-2:123456789012:catalog",
        "arn:aws:glue:us-west-2:123456789012:database/db1",
        "arn:aws:glue:us-west-2:123456789012:table/db1/books"
      ]
    }
  ]
}

AWS Lake Formation

AWS Lake Formation makes it easier to set up a secure data lake by providing simple grant and revoke controls for access to the AWS Glue Data Catalog and the underlying S3 locations, without the need to manage granular permissions in IAM.

In the above permission example, access to the database db1 and the table books is granted through an IAM policy. This permission model works well if you have a small number of databases and tables to share among a handful of users. However, many customers manage hundreds of databases and thousands of tables using the AWS Glue Data Catalog. The data for those tables spans many S3 bucket locations, which need to be shared securely among multiple teams and users. In these scenarios, IAM policy-based access control becomes complex to manage.
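One way to see the scaling problem is to generate the Resource list that an IAM policy would need as the number of tables grows. The sketch below is illustrative only: the table names are synthetic, the action list is shortened, and the 6,144-character figure is the IAM size quota for a managed policy, used here simply as a yardstick.

```python
import json

# IAM quota for a managed policy document, in characters (whitespace excluded).
MANAGED_POLICY_CHAR_LIMIT = 6144


def build_table_policy(region: str, account_id: str, database: str, tables) -> dict:
    """Build an IAM policy that lists every table ARN explicitly."""
    prefix = f"arn:aws:glue:{region}:{account_id}"
    resources = [f"{prefix}:catalog", f"{prefix}:database/{database}"]
    resources += [f"{prefix}:table/{database}/{table}" for table in tables]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["glue:GetTable", "glue:GetPartitions"],
                "Resource": resources,
            }
        ],
    }


# Synthetic table names (assumption for illustration only).
table_names = [f"table_{i:04d}" for i in range(200)]
large_policy = build_table_policy("us-west-2", "123456789012", "db1", table_names)

# Compact encoding approximates the size IAM counts (no extra whitespace).
policy_size = len(json.dumps(large_policy, separators=(",", ":")))
print(f"{len(large_policy['Statement'][0]['Resource'])} resources, {policy_size} characters")
print("Exceeds managed policy limit:", policy_size > MANAGED_POLICY_CHAR_LIMIT)
```

Even a single database with a few hundred tables outgrows one policy document in this sketch, before adding per-team variations.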

AWS Lake Formation was built to simplify the permission management for data lakes that manage a large data catalog.

When you enable Lake Formation for a Region, all permission management for the AWS Glue Data Catalog is automatically governed by Lake Formation going forward. Any legacy IAM-based permissions for the AWS Glue Data Catalog may still work if you have enabled backward compatibility through IAMAllowedPrincipals. You can manage Lake Formation permissions through the AWS Management Console or through the Lake Formation API.

Lake Formation access management involves the following constructs:

  • Principal – A Principal is any one of an IAM role, user, Security Assertion Markup Language (SAML) group, or SAML user.

  • Resources – A resource is any of the following elements: data location, AWS Glue Data Catalog, databases, tables, columns, cells, and LF-tags.

Lake Formation permissions can be managed through two access control models:

Resource Based Access Control (RBAC) – In the RBAC model, you grant or revoke permissions on resources such as databases, tables, and columns for a principal such as an IAM role or user. The available permissions vary by resource type. For a detailed definition of resources and permissions, refer to Lake Formation Permissions Reference.
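As a minimal sketch of a resource-based grant, the request below gives a principal SELECT permission on the books table in db1 through the Lake Formation GrantPermissions API. The role ARN is a hypothetical placeholder, and SELECT is chosen only as an example permission.

```python
import json


def build_table_grant(principal_arn: str, database: str, table: str, permissions) -> dict:
    """Build keyword arguments for the Lake Formation GrantPermissions API."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": list(permissions),
    }


# Hypothetical role ARN (assumption for illustration only).
table_grant_request = build_table_grant(
    "arn:aws:iam::123456789012:role/ExampleAnalystRole", "db1", "books", ["SELECT"]
)
print(json.dumps(table_grant_request, indent=2))

# With boto3 installed and credentials configured, the request could be sent as:
#   boto3.client("lakeformation").grant_permissions(**table_grant_request)
# The same arguments passed to revoke_permissions would remove the grant.
```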

Tag Based Access Control (TBAC) – In the TBAC model, you create LF-tags, which are key-value pairs (for example, classification=confidential or pii=true), and attach them to resources and principals. You can then assign and revoke permissions on resources using these LF-tags. Lake Formation allows operations on those resources when the principal's tag matches the resource tag. This model decouples permissions from resource creation, which helps govern a large number of databases, tables, and columns by removing the need to update permissions every time a new resource is added to the data lake. For detailed information about TBAC, refer to Overview of Lake Formation Tag-Based Access Control.
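The tag-based flow can be sketched as three Lake Formation API requests: define an LF-tag, attach it to a table, and grant a principal permissions on every table carrying that tag value. The tag key and values follow the classification=confidential example above; the extra tag value "public", the role ARN, and the choice of SELECT are assumptions for illustration.

```python
import json


def build_lf_tag_requests(tag_key: str, allowed_values, assigned_value: str,
                          database: str, table: str, principal_arn: str) -> dict:
    """Build requests for CreateLFTag, AddLFTagsToResource, and GrantPermissions."""
    return {
        # 1. Define the LF-tag and the values it may take.
        "create_lf_tag": {"TagKey": tag_key, "TagValues": list(allowed_values)},
        # 2. Attach one tag value to a specific table.
        "add_lf_tags_to_resource": {
            "Resource": {"Table": {"DatabaseName": database, "Name": table}},
            "LFTags": [{"TagKey": tag_key, "TagValues": [assigned_value]}],
        },
        # 3. Grant SELECT on all tables that match the tag expression.
        "grant_permissions": {
            "Principal": {"DataLakePrincipalIdentifier": principal_arn},
            "Resource": {
                "LFTagPolicy": {
                    "ResourceType": "TABLE",
                    "Expression": [{"TagKey": tag_key, "TagValues": [assigned_value]}],
                }
            },
            "Permissions": ["SELECT"],
        },
    }


# Hypothetical role ARN and tag values (assumptions for illustration only).
lf_tag_requests = build_lf_tag_requests(
    "classification", ["confidential", "public"], "confidential",
    "db1", "books", "arn:aws:iam::123456789012:role/ExampleAnalystRole",
)
print(json.dumps(lf_tag_requests, indent=2))

# With boto3 installed and credentials configured, each request could be sent as:
#   lf = boto3.client("lakeformation")
#   lf.create_lf_tag(**lf_tag_requests["create_lf_tag"])
#   lf.add_lf_tags_to_resource(**lf_tag_requests["add_lf_tags_to_resource"])
#   lf.grant_permissions(**lf_tag_requests["grant_permissions"])
```

Because the grant targets a tag expression rather than a table name, a new table tagged classification=confidential would be covered without another grant.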