Building a secure data pipeline
The Well-Architected Security pillar focuses on protecting data and information. It describes how to take advantage of cloud technologies to protect data and information in a way that can improve your security posture. The following best practices can help you improve the security of data and information in your AWS Glue data pipeline.
Data encryption
Encryption of data at rest
AWS Glue supports encryption for authoring jobs in AWS Glue and for developing scripts using development endpoints. You provide encryption settings by attaching a security configuration. A security configuration specifies either Amazon S3-managed server-side encryption keys (SSE-S3) or customer managed keys (CMKs) stored in AWS Key Management Service (AWS KMS).
You can encrypt the metadata stored in the AWS Glue Data Catalog using AWS KMS. Additionally, you can use AWS KMS keys to encrypt job bookmarks, and the logs generated by AWS Glue crawlers and ETL jobs.
In AWS Glue, you control encryption settings in the following places:
-
The settings of your Data Catalog.
-
The security configurations that you create.
-
The server-side encryption setting (SSE-S3 or SSE-KMS) that is passed as a parameter to your AWS Glue ETL job.
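As an illustration of the second item, the following is a minimal sketch of creating a security configuration with boto3. It assumes a single KMS key is used for S3 data, CloudWatch logs, and job bookmarks; the function names and the idea of reusing one key are assumptions for this example, not a recommendation from the source.

```python
def build_encryption_configuration(kms_key_arn: str) -> dict:
    """Build the EncryptionConfiguration payload for a Glue security configuration.

    Assumption: the same KMS key protects S3 output, logs, and job bookmarks.
    """
    return {
        "S3Encryption": [
            {"S3EncryptionMode": "SSE-KMS", "KmsKeyArn": kms_key_arn}
        ],
        "CloudWatchEncryption": {
            "CloudWatchEncryptionMode": "SSE-KMS",
            "KmsKeyArn": kms_key_arn,
        },
        "JobBookmarksEncryption": {
            "JobBookmarksEncryptionMode": "CSE-KMS",
            "KmsKeyArn": kms_key_arn,
        },
    }


def create_security_configuration(name: str, kms_key_arn: str) -> dict:
    """Create the security configuration in AWS Glue (requires AWS credentials)."""
    import boto3  # imported lazily so the builder above runs without AWS access

    glue = boto3.client("glue")
    return glue.create_security_configuration(
        Name=name,
        EncryptionConfiguration=build_encryption_configuration(kms_key_arn),
    )
```

The resulting configuration name can then be referenced when a job is defined, for instance through the `SecurityConfiguration` parameter of `create_job`.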
Encrypting your Data Catalog
You can turn encryption on or off for the entire AWS Glue Data Catalog. When you turn it on, you specify an AWS KMS key that is automatically used when objects, such as tables, are written to the Data Catalog. The encrypted objects include the following:
-
Databases
-
Tables
-
Partitions
-
Table versions
-
Connections
-
User-defined functions
The AWS Glue Data Catalog also allows you to encrypt connection-specific passwords using AWS KMS keys.
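A minimal sketch of turning on both Data Catalog encryption and connection password encryption with boto3 might look like the following. The helper names are hypothetical, and using one KMS key for both settings is an assumption made to keep the example short.

```python
def build_catalog_encryption_settings(kms_key_id: str) -> dict:
    """Build the DataCatalogEncryptionSettings payload.

    Assumption: one KMS key is used for catalog objects and connection passwords.
    """
    return {
        "EncryptionAtRest": {
            "CatalogEncryptionMode": "SSE-KMS",
            "SseAwsKmsKeyId": kms_key_id,
        },
        "ConnectionPasswordEncryption": {
            "ReturnConnectionPasswordEncrypted": True,
            "AwsKmsKeyId": kms_key_id,
        },
    }


def enable_catalog_encryption(kms_key_id: str) -> None:
    """Apply the settings to the Data Catalog of the current account and Region."""
    import boto3  # imported lazily so the builder above runs without AWS access

    glue = boto3.client("glue")
    glue.put_data_catalog_encryption_settings(
        DataCatalogEncryptionSettings=build_catalog_encryption_settings(kms_key_id)
    )
```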
Encryption of data in transit
AWS provides Transport Layer Security (TLS) encryption for data in motion between AWS Glue and S3.
An AWS Glue connection allows you to configure TLS certificates for database configurations as well.
Network security
AWS Glue Data Catalog and AWS Glue ETL are serverless services and, by default, can be accessed from outside a VPC through the AWS Glue APIs. AWS Glue provides three AWS Identity and Access Management (IAM) condition keys: glue:VpcIds, glue:SubnetIds, and glue:SecurityGroupIds. You can use these condition keys in IAM policies when granting permissions to create and update jobs, to ensure that jobs are not created or updated to run outside of a desired VPC environment.
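One way such a policy could be assembled is sketched below. The statement layout, the `ForAnyValue:StringLike` and `Null` condition operators, and all identifiers passed in are assumptions for illustration; verify the operators against the IAM documentation for AWS Glue before relying on them.

```python
import json
from typing import Sequence


def build_vpc_restricted_job_policy(
    vpc_id: str, subnet_ids: Sequence[str], security_group_ids: Sequence[str]
) -> dict:
    """Build an IAM policy limiting job creation and updates to one VPC setup."""
    job_actions = ["glue:CreateJob", "glue:UpdateJob"]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Allow jobs whose connection settings match the approved values.
                "Sid": "AllowJobsInApprovedVpc",
                "Effect": "Allow",
                "Action": job_actions,
                "Resource": "*",
                "Condition": {
                    "ForAnyValue:StringLike": {
                        "glue:VpcIds": [vpc_id],
                        "glue:SubnetIds": list(subnet_ids),
                        "glue:SecurityGroupIds": list(security_group_ids),
                    }
                },
            },
            {
                # Deny jobs that carry no VPC information at all (assumed pattern).
                "Sid": "DenyJobsWithoutVpc",
                "Effect": "Deny",
                "Action": job_actions,
                "Resource": "*",
                "Condition": {"Null": {"glue:VpcIds": "true"}},
            },
        ],
    }


# Hypothetical identifiers, shown only to render the document.
policy = build_vpc_restricted_job_policy(
    "vpc-0example", ["subnet-0example"], ["sg-0example"]
)
print(json.dumps(policy, indent=2))
```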
Some AWS Glue ETL jobs need to connect to services running inside a VPC, for example a database running in a private subnet. For these use cases, AWS Glue provides a network connection feature. You configure an AWS Glue network connection with a VPC, a VPC subnet, and security groups, and attach it to an AWS Glue job. When the job runs, it creates an elastic network interface (ENI) using the configuration defined in the network connection and uses that ENI to access resources within the VPC. The connection stays private because traffic is not routed through public networks.
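The network connection described above could be defined with boto3 roughly as follows. This is a sketch under stated assumptions: the connection name, subnet, security groups, and Availability Zone are placeholders supplied by the caller, and the helper names are hypothetical.

```python
from typing import Sequence


def build_network_connection_input(
    name: str,
    subnet_id: str,
    security_group_ids: Sequence[str],
    availability_zone: str,
) -> dict:
    """Build the ConnectionInput payload for a NETWORK-type Glue connection."""
    return {
        "Name": name,
        "ConnectionType": "NETWORK",
        # A NETWORK connection carries no JDBC properties, only VPC placement.
        "ConnectionProperties": {},
        "PhysicalConnectionRequirements": {
            "SubnetId": subnet_id,
            "SecurityGroupIdList": list(security_group_ids),
            "AvailabilityZone": availability_zone,
        },
    }


def create_network_connection(connection_input: dict) -> None:
    """Create the connection in AWS Glue (requires AWS credentials)."""
    import boto3  # imported lazily so the builder above runs without AWS access

    boto3.client("glue").create_connection(ConnectionInput=connection_input)
```

A job can then reference the connection by name, for instance with `Connections={"Connections": [name]}` in `create_job`.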
Applications and services running inside a VPC without a NAT gateway or internet gateway cannot connect to AWS Glue by default, because there is no network path to the AWS Glue API. To allow applications in such environments to access AWS Glue, you can create AWS Glue interface VPC endpoints (VPCE) in the VPC subnets and route AWS Glue-specific connections through them.
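A minimal sketch of creating such an endpoint with boto3 is shown below. It assumes the AWS Glue endpoint service follows the usual `com.amazonaws.<region>.glue` naming and that private DNS is wanted; the helper names and all identifiers are hypothetical.

```python
from typing import Sequence


def build_glue_endpoint_request(
    region: str,
    vpc_id: str,
    subnet_ids: Sequence[str],
    security_group_ids: Sequence[str],
) -> dict:
    """Build the request for an interface VPC endpoint to the AWS Glue API."""
    return {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        # Assumed service name pattern for the AWS Glue API endpoint.
        "ServiceName": f"com.amazonaws.{region}.glue",
        "SubnetIds": list(subnet_ids),
        "SecurityGroupIds": list(security_group_ids),
        # Lets clients keep using the default AWS Glue hostname inside the VPC.
        "PrivateDnsEnabled": True,
    }


def create_glue_endpoint(region: str, request: dict) -> dict:
    """Create the endpoint (requires AWS credentials and EC2 permissions)."""
    import boto3  # imported lazily so the builder above runs without AWS access

    ec2 = boto3.client("ec2", region_name=region)
    return ec2.create_vpc_endpoint(**request)
```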
Managing Identity and Access in AWS Glue
Authentication
AWS provides two primary permission mechanisms to access AWS Glue: using an IAM user, or an IAM role.
-
IAM user – A user is an identity within your AWS account that has specific custom permissions; for example, permissions to create a table in AWS Glue, or run an AWS Glue ETL job.
-
IAM role – An IAM role is an IAM identity that you can create in your account that has specific permissions. An IAM role is similar to a user. However, instead of being uniquely associated with one person, a role is intended to be assumable by anyone who needs it. A role also has no standard long-term credentials, such as a password or access keys, associated with it. Instead, when you assume a role, it provides you with temporary security credentials for your role session.
IAM roles with temporary credentials are useful in the following situations:
-
Federated user access from AWS Directory Service, your enterprise user directory, or a web identity provider.
-
AWS service access – A service role is an IAM role that a service assumes to perform actions on your behalf. An IAM administrator can create, modify, and delete a service role from within IAM. Example: When the AWS Glue service runs an AWS Glue ETL job on your behalf.
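To make the idea of temporary credentials concrete, the following sketch assumes a role through AWS STS and builds an AWS Glue client from the returned session credentials. The role ARN and session name are supplied by the caller, and the helper names are hypothetical.

```python
def credentials_to_client_kwargs(credentials: dict) -> dict:
    """Map the Credentials block returned by sts.assume_role to boto3 client kwargs."""
    return {
        "aws_access_key_id": credentials["AccessKeyId"],
        "aws_secret_access_key": credentials["SecretAccessKey"],
        "aws_session_token": credentials["SessionToken"],
    }


def glue_client_for_role(role_arn: str, session_name: str = "glue-pipeline-session"):
    """Assume the role and return a Glue client that uses its temporary credentials."""
    import boto3  # imported lazily so the mapper above runs without AWS access

    sts = boto3.client("sts")
    response = sts.assume_role(RoleArn=role_arn, RoleSessionName=session_name)
    # The credentials expire at response["Credentials"]["Expiration"].
    return boto3.client("glue", **credentials_to_client_kwargs(response["Credentials"]))
```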
To provide access, add permissions to your users, groups, or roles:
-
Users and groups in AWS IAM Identity Center:
Create a permission set. Follow the instructions in Create a permission set in the AWS IAM Identity Center User Guide.
-
Users managed in IAM through an identity provider:
Create a role for identity federation. Follow the instructions in Creating a role for a third-party identity provider (federation) in the IAM User Guide.
-
IAM users:
-
Create a role that your user can assume. Follow the instructions in Creating a role for an IAM user in the IAM User Guide.
-
(Not recommended) Attach a policy directly to a user or add a user to a user group. Follow the instructions in Adding permissions to a user (console) in the IAM User Guide.
Permission policy example:
The following policy grants all permissions on an AWS Glue table named books in database db1. This includes read and write permissions on the table itself, on archived versions of it, and on all its partitions. You can attach this policy to a role to grant these permissions to the principals that assume it.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "FullAccessOnTable",
      "Effect": "Allow",
      "Action": [
        "glue:CreateTable",
        "glue:GetTable",
        "glue:GetTables",
        "glue:UpdateTable",
        "glue:DeleteTable",
        "glue:BatchDeleteTable",
        "glue:GetTableVersion",
        "glue:GetTableVersions",
        "glue:DeleteTableVersion",
        "glue:BatchDeleteTableVersion",
        "glue:CreatePartition",
        "glue:BatchCreatePartition",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:BatchGetPartition",
        "glue:UpdatePartition",
        "glue:DeletePartition",
        "glue:BatchDeletePartition"
      ],
      "Resource": [
        "arn:aws:glue:us-west-2:123456789012:catalog",
        "arn:aws:glue:us-west-2:123456789012:database/db1",
        "arn:aws:glue:us-west-2:123456789012:table/db1/books"
      ]
    }
  ]
}
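When the same pattern is needed for several tables, the resource ARNs in the policy above can be generated rather than typed by hand. The sketch below shows one possible way to do that and to attach the result to a role as an inline policy; the helper names, the shortened action list, and the role and policy names are assumptions for illustration.

```python
import json
from typing import Sequence


def build_table_resources(region: str, account_id: str, database: str, table: str) -> list:
    """Return the catalog, database, and table ARNs that a table-level policy needs."""
    prefix = f"arn:aws:glue:{region}:{account_id}"
    return [
        f"{prefix}:catalog",
        f"{prefix}:database/{database}",
        f"{prefix}:table/{database}/{table}",
    ]


def build_table_policy(resources: Sequence[str], actions: Sequence[str]) -> dict:
    """Wrap the given actions and resources in a single Allow statement."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "FullAccessOnTable",
                "Effect": "Allow",
                "Action": list(actions),
                "Resource": list(resources),
            }
        ],
    }


def attach_inline_policy(role_name: str, policy_name: str, policy: dict) -> None:
    """Attach the policy to a role as an inline policy (requires AWS credentials)."""
    import boto3  # imported lazily so the builders above run without AWS access

    boto3.client("iam").put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(policy),
    )


# Shortened action list for brevity; the full list appears in the policy above.
resources = build_table_resources("us-west-2", "123456789012", "db1", "books")
policy = build_table_policy(resources, ["glue:GetTable", "glue:UpdateTable"])
```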
AWS Lake Formation
In the preceding permission example, access to the database db1 and the table books is granted through an IAM policy. This permission model works well if you have a small number of databases and tables to share among a handful of users. However, many of our customers manage hundreds of databases and thousands of tables using the AWS Glue Data Catalog. The data for those tables spans many S3 bucket locations, which need to be shared securely among multiple teams and users. In these scenarios, IAM policy-based access control becomes complex to manage.
AWS Lake Formation was built to simplify the permission management for data lakes that manage a large data catalog.
When you enable Lake Formation for a Region, all permission management for the AWS Glue Data Catalog is automatically governed by Lake Formation going forward. Any legacy IAM-based permissions for the AWS Glue Data Catalog may still work if you have enabled backward compatibility through IAMAllowedPrincipals. You can start managing Lake Formation permissions through the AWS Management Console or through the Lake Formation API.
Lake Formation access management involves the following constructs:
-
Principal – A Principal is any one of an IAM role, user, Security Assertion Markup Language (SAML) group, or SAML user.
-
Resources – A resource is any of the following elements: data location, AWS Glue Data Catalog, databases, tables, columns, cells, and LF-tags.
Lake Formation permissions can be managed through two access control models:
Resource Based Access Control (RBAC) – In the RBAC model, you grant or revoke permissions on resources such as databases, tables, and columns for a principal such as an IAM role or user. The available permissions vary by resource type. For a detailed definition of resources and permissions, refer to Lake Formation Permissions Reference.
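As a sketch of a resource-based grant, the following builds a request that gives a principal permissions on a single table and submits it through the Lake Formation API. The principal ARN and the chosen permissions are placeholders, and the helper names are hypothetical.

```python
from typing import Sequence


def build_table_grant(
    principal_arn: str, database: str, table: str, permissions: Sequence[str]
) -> dict:
    """Build a grant_permissions request for one table and one principal."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": list(permissions),
    }


def grant(request: dict) -> None:
    """Submit the grant to Lake Formation (requires AWS credentials)."""
    import boto3  # imported lazily so the builder above runs without AWS access

    boto3.client("lakeformation").grant_permissions(**request)


# Hypothetical role ARN; db1 and books are the names used earlier in this section.
request = build_table_grant(
    "arn:aws:iam::123456789012:role/analyst", "db1", "books", ["SELECT", "DESCRIBE"]
)
```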
Tag Based Access Control (TBAC) – In the TBAC model, you create LF-tags, which are key-value pairs (for example, classification=confidential or pii=true), and attach them to resources and principals. You can then assign and revoke permissions on resources using these LF-tags. Lake Formation allows operations on those resources when the principal's tag matches the resource tag. This model decouples permissions from resource creation, which helps govern a large number of databases, tables, and columns by removing the need to update permissions every time a new resource is added to the data lake. For detailed information about TBAC, refer to Overview of Lake Formation Tag-Based Access Control.
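A tag-based grant could be sketched as follows. In the Lake Formation API, the principal side of the match is expressed by granting permissions on an LF-tag expression. The tag key and values reuse the classification=confidential example from the text; the principal ARN, table names, and helper names are assumptions.

```python
from typing import Sequence


def build_lf_tag_grant(
    principal_arn: str,
    tag_key: str,
    tag_values: Sequence[str],
    resource_type: str = "TABLE",
    permissions: Sequence[str] = ("SELECT",),
) -> dict:
    """Build a grant on every resource of the given type that carries the tag."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "LFTagPolicy": {
                "ResourceType": resource_type,
                "Expression": [{"TagKey": tag_key, "TagValues": list(tag_values)}],
            }
        },
        "Permissions": list(permissions),
    }


def tag_table_and_grant(database: str, table: str, grant_request: dict) -> None:
    """Create the LF-tag, attach it to a table, and apply the tag-based grant."""
    import boto3  # imported lazily so the builder above runs without AWS access

    lf = boto3.client("lakeformation")
    expression = grant_request["Resource"]["LFTagPolicy"]["Expression"][0]
    lf.create_lf_tag(TagKey=expression["TagKey"], TagValues=expression["TagValues"])
    lf.add_lf_tags_to_resource(
        Resource={"Table": {"DatabaseName": database, "Name": table}},
        LFTags=[expression],
    )
    lf.grant_permissions(**grant_request)
```

Because the grant targets the tag expression rather than a named table, tables tagged later with the same key and value are covered without a new grant.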