Using EMR Serverless with AWS Lake Formation for fine-grained access control (Preview)
Note

Amazon EMR Serverless with AWS Lake Formation is in preview release and is subject to change. The feature is provided as a Preview service as defined in the AWS Service Terms.
Overview
With Amazon EMR 6.15.0, you can use AWS Lake Formation to apply fine-grained access controls on Data Catalog tables that are backed by Amazon S3. With this capability, you can configure table-, row-, column-, and cell-level access controls for read queries within your Amazon EMR Serverless Spark jobs.
Using Amazon EMR Serverless with AWS Lake Formation incurs additional charges. For more information, see Amazon EMR pricing.
How EMR Serverless works with AWS Lake Formation
Using EMR Serverless with Lake Formation lets you enforce a layer of permissions on each Spark job, so that Lake Formation permissions control access when EMR Serverless runs jobs. EMR Serverless uses Spark resource profiles to separate the user profile from the system profile.

When you use pre-initialized capacity with Lake Formation, we recommend that you have a minimum of two Spark drivers, because each Lake Formation-enabled job uses two Spark drivers: one for the user profile and one for the system profile. For the best performance, provision double the number of drivers for Lake Formation-enabled jobs compared to jobs that don't use Lake Formation.
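The guidance above can be sketched as a `create-application` call with pre-initialized capacity. In this hypothetical example, the application name, worker counts, and worker sizes are placeholders, not values from this guide:

```shell
# Hypothetical example: create a Lake Formation-enabled application with
# pre-initialized capacity that reserves two Spark drivers (one pair per
# concurrent Lake Formation-enabled job: user profile + system profile).
aws emr-serverless create-application \
  --name "lf-spark-app" \
  --release-label emr-6.15.0 \
  --type "SPARK" \
  --runtime-configuration '{
    "classification": "spark-defaults",
    "properties": { "spark.emr-serverless.lakeformation.enabled": "true" }
  }' \
  --initial-capacity '{
    "DRIVER": {
      "workerCount": 2,
      "workerConfiguration": { "cpu": "4vCPU", "memory": "16GB" }
    },
    "EXECUTOR": {
      "workerCount": 4,
      "workerConfiguration": { "cpu": "4vCPU", "memory": "16GB" }
    }
  }'
```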
The following is a high-level overview of how EMR Serverless gets access to data protected by Lake Formation security policies.
![How Amazon EMR accesses data protected by Lake Formation security policies](/images/emr/latest/EMR-Serverless-UserGuide/images/lf-emr-s-architecture.png)
1. A user submits a Spark job to an AWS Lake Formation-enabled EMR Serverless application.

2. EMR Serverless sends the job to a user driver and runs it with the user profile. The user driver runs a lean version of Spark that has no ability to launch tasks, request executors, or access Amazon S3 or the AWS Glue Data Catalog. It builds a job plan.

3. EMR Serverless sets up a second driver called the system driver and runs it in the system space (with a privileged identity). EMR Serverless sets up an encrypted TLS channel between the two drivers for communication. The user driver uses the channel to send the job plan to the system driver. The system driver doesn't run user-submitted code. It runs full Spark and communicates with Amazon S3 and the Data Catalog for data access. It requests executors and compiles the job plan into a sequence of execution stages.

4. EMR Serverless then runs the stages on executors with the user profile or the system profile. User code in any stage runs exclusively on user-profile executors.

5. Stages that read data from Data Catalog tables protected by AWS Lake Formation, or that apply security filters, are delegated to system executors.
Enabling Lake Formation in Amazon EMR
To enable Lake Formation, you must set `spark.emr-serverless.lakeformation.enabled` to `true` under the `spark-defaults` classification for the `runtime-configuration` parameter when you create an EMR Serverless application.
```shell
aws emr-serverless create-application \
  --release-label emr-6.15.0 \
  --runtime-configuration '{
    "classification": "spark-defaults",
    "properties": {
      "spark.emr-serverless.lakeformation.enabled": "true"
    }
  }' \
  --type "SPARK"
```
You can also enable Lake Formation when you create a new application in EMR Studio. Choose Use Lake Formation for fine-grained access control, available under Additional configurations.
Inter-worker encryption is enabled by default when you use Lake Formation with EMR Serverless, so you don't have to explicitly enable inter-worker encryption again.
Job runtime role IAM permissions
Lake Formation permissions control access to AWS Glue Data Catalog resources, Amazon S3 locations, and the underlying data at those locations. IAM permissions control access to the Lake Formation and AWS Glue APIs and resources. Although you might have the Lake Formation permission to access a table in the Data Catalog (SELECT), your operation fails if you don't have the IAM permission for the `glue:Get*` API operations.
The following example policy shows how to provide IAM permissions to access a script in S3, upload logs to S3, call AWS Glue APIs, and access Lake Formation.
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ScriptAccess",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::*.DOC-EXAMPLE-BUCKET/scripts",
        "arn:aws:s3:::*.DOC-EXAMPLE-BUCKET/*"
      ]
    },
    {
      "Sid": "LoggingAccess",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject"
      ],
      "Resource": [
        "arn:aws:s3:::DOC-EXAMPLE-BUCKET/logs/*"
      ]
    },
    {
      "Sid": "GlueCatalogAccess",
      "Effect": "Allow",
      "Action": [
        "glue:Get*",
        "glue:Create*",
        "glue:Update*"
      ],
      "Resource": ["*"]
    },
    {
      "Sid": "LakeFormationAccess",
      "Effect": "Allow",
      "Action": [
        "lakeformation:GetDataAccess"
      ],
      "Resource": ["*"]
    }
  ]
}
```
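One way to apply a policy like the one above is to attach it inline to the job runtime role. In this hypothetical sketch, the role name, policy name, and file path are placeholders:

```shell
# Hypothetical example: attach the policy document above (saved locally as
# runtime-role-policy.json) to an existing job runtime role.
# "EMRServerlessJobRole" and "emr-serverless-lf-policy" are placeholder names.
aws iam put-role-policy \
  --role-name EMRServerlessJobRole \
  --policy-name emr-serverless-lf-policy \
  --policy-document file://runtime-role-policy.json
```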
Setting up Lake Formation permissions for the job runtime role
First, register the location of your Hive table with Lake Formation. Then create permissions for your job runtime role on your desired table. For more details about Lake Formation, see What is AWS Lake Formation? in the AWS Lake Formation Developer Guide.
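The two steps above might look like the following with the AWS CLI. This is a sketch; the bucket, database, table, account ID, and role names are placeholders:

```shell
# Hypothetical example, step 1: register the S3 location of the table
# with Lake Formation using the service-linked role.
aws lakeformation register-resource \
  --resource-arn arn:aws:s3:::DOC-EXAMPLE-BUCKET/data/ \
  --use-service-linked-role

# Hypothetical example, step 2: grant SELECT on a table to the job runtime role.
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::123456789012:role/EMRServerlessJobRole \
  --permissions SELECT \
  --resource '{
    "Table": {
      "DatabaseName": "example_db",
      "Name": "example_table"
    }
  }'
```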
After you set up the Lake Formation permissions, you can submit Spark jobs on Amazon EMR Serverless. For more information about Spark jobs, see Spark examples.
Submitting a job run
After you finish setting up the Lake Formation grants, you can submit Spark jobs on EMR Serverless. To run Iceberg jobs, you must provide the following `spark-submit` properties.
```
--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
--conf spark.sql.catalog.spark_catalog.warehouse=<S3_DATA_LOCATION>
--conf spark.sql.catalog.spark_catalog.glue.account-id=<AWS_ACCOUNT_ID>
--conf spark.sql.catalog.spark_catalog.client.region=<AWS_REGION>
--conf spark.sql.catalog.spark_catalog.glue.endpoint=https://glue.<AWS_REGION>.amazonaws.com
```
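As an illustration, the properties above can be passed in a `start-job-run` call. In this hypothetical example, the entry point script and role ARN are placeholders, and the `<...>` values are left for you to fill in:

```shell
# Hypothetical example: submit an Iceberg job run with the required
# spark-submit properties. The application ID, role ARN, script path,
# and <...> placeholders are illustrative.
aws emr-serverless start-job-run \
  --application-id <APPLICATION_ID> \
  --execution-role-arn arn:aws:iam::123456789012:role/EMRServerlessJobRole \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://DOC-EXAMPLE-BUCKET/scripts/iceberg-job.py",
      "sparkSubmitParameters": "--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog --conf spark.sql.catalog.spark_catalog.warehouse=<S3_DATA_LOCATION> --conf spark.sql.catalog.spark_catalog.glue.account-id=<AWS_ACCOUNT_ID> --conf spark.sql.catalog.spark_catalog.client.region=<AWS_REGION> --conf spark.sql.catalog.spark_catalog.glue.endpoint=https://glue.<AWS_REGION>.amazonaws.com"
    }
  }'
```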
Open-table format support
Amazon EMR release 6.15.0 includes support for fine-grained access control based on Lake Formation. EMR Serverless supports the Hive and Iceberg table formats. The following table describes all of the supported operations.
| Operations | Hive | Iceberg |
|---|---|---|
| DDL commands | With IAM role permissions only | Supported selectively (DDL that requires Spark extensions is not supported), with IAM role permissions only |
| Incremental queries | Not applicable | Not supported |
| Time travel queries | Not applicable to this table format | Fully supported |
| Metadata tables | Not applicable to this table format | Not supported |
| DML INSERT | With IAM permissions only | With IAM permissions only |
| DML UPDATE | Not applicable to this table format | Not supported |
| DML DELETE | Not applicable to this table format | Not supported |
| Read operations | Fully supported | Fully supported |