EMRFS S3 plugin - Amazon EMR

EMRFS S3 plugin

The EMRFS S3 plugin makes it easier to control access to objects in S3 on a multi-tenant cluster. It enforces access controls on S3 data that is accessed through EMRFS, and you can allow access to S3 resources at the user and group level.

To achieve this, when your application attempts to access data within S3, EMRFS sends a request for credentials to the Secret Agent process, where the request is authenticated and authorized against an Apache Ranger plugin. If the request is authorized, the Secret Agent assumes the IAM role for Apache Ranger Engines with a restricted policy to generate credentials that can access only the resources permitted by the Ranger policy that allowed the access. The credentials are then passed back to EMRFS to access S3.
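The idea of assuming a role with a restricted policy can be sketched with a standard IAM session policy. This is only an illustration: the helper name scoped_session_policy, the session name, and the exact policy that the Secret Agent generates are assumptions, not details taken from Amazon EMR.

```shell
#!/bin/sh
# Hypothetical helper: print an IAM session policy that allows a single
# S3 action on the objects under one bucket/prefix. The JSON shape is
# standard IAM; that the Secret Agent builds it this way is an assumption.
scoped_session_policy() {
  action=$1     # for example: GetObject
  resource=$2   # for example: sales-reports/americas/year=2020
  printf '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Action":"s3:%s","Resource":"arn:aws:s3:::%s/*"}]}\n' \
    "$action" "$resource"
}

# Print the policy for one Ranger-style grant.
scoped_session_policy GetObject sales-reports/americas/year=2020

# A policy like this could be passed to AWS STS as a session policy
# (role ARN and session name below are placeholders):
#   aws sts assume-role \
#     --role-arn "<IAM role for Apache Ranger Engines>" \
#     --role-session-name analyst-session \
#     --policy "$(scoped_session_policy GetObject sales-reports/americas/year=2020)"
```

Credentials issued with a session policy can do no more than the intersection of the role's permissions and the session policy, which is why the returned credentials stay limited to what was authorized.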

Supported features

The EMRFS S3 plugin provides storage-level authorization. You can create policies that grant users and groups access to S3 buckets and prefixes. Authorization is done only against EMRFS.

Installation of service configuration

To install the EMRFS service definition, you must set up the Ranger Admin server. To set up the server, see Set up Ranger Admin server.

Follow these steps to install the EMRFS service definition.

Step 1: SSH into the Apache Ranger Admin server.

For example:

ssh ec2-user@ip-xxx-xxx-xxx-xxx.ec2.internal

Step 2: Download the EMRFS service definition.

In a temporary directory, download the Amazon EMR service definition. This service definition is supported by Ranger 2.x versions.

wget https://s3.amazonaws.com/elasticmapreduce/ranger/service-definitions/version-2.0/ranger-servicedef-amazon-emr-emrfs.json

Step 3: Register EMRFS S3 service definition.

curl -u <admin users login>:<password for ranger admin user> -X POST -d @ranger-servicedef-amazon-emr-emrfs.json \
-H "Accept: application/json" \
-H "Content-Type: application/json" \
-k 'https://<RANGER SERVER ADDRESS>:6182/service/public/v2/api/servicedef'

If this command runs successfully, you see a new service in the Ranger Admin UI called "AMAZON-EMR-EMRFS", as shown in the following image (Ranger version 2.0 is shown).

Ranger Admin create EMRFS S3 service.
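As an alternative to checking the UI, the registration can be confirmed from the command line. The sketch below uses the Ranger public v2 REST API to look up a service definition by name. The function names are hypothetical, and the definition name amazon-emr-emrfs is an assumption; check the "name" field in the downloaded JSON file for the actual value.

```shell
#!/bin/sh
# Build the lookup URL for a service definition by name.
# $1 = Ranger Admin host, $2 = service definition name
servicedef_url() {
  printf 'https://%s:6182/service/public/v2/api/servicedef/name/%s\n' "$1" "$2"
}

# Query Ranger Admin for the definition. The -f flag makes curl exit
# non-zero on an HTTP error status, so a missing definition is reported
# as a failure. As in the registration step, -k skips certificate checks.
# $1 = Ranger Admin host, $2 = admin login, $3 = password, $4 = definition name
check_servicedef() {
  curl -sS -f -k -u "$2:$3" -H "Accept: application/json" \
    "$(servicedef_url "$1" "$4")"
}
```

For example, check_servicedef "<RANGER SERVER ADDRESS>" "<admin users login>" "<password for ranger admin user>" amazon-emr-emrfs prints the stored definition as JSON if it is registered.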

Step 4: Create an instance of the AMAZON-EMR-EMRFS application.

Create an instance of the service definition.

  • Click on the + next to AMAZON-EMR-EMRFS.

Fill in the following fields:

Service Name (If displayed): The suggested value is amazonemrspark. Note this service name as it will be needed when creating an EMR security configuration.

Display Name: The name displayed for this service. The suggested value is amazonemrspark.

Common Name For Certificate: The CN field within the certificate used to connect to the admin server from a client plugin. This value must match the CN field in the TLS certificate that was created for the plugin.

Ranger Admin edit EMRFS S3 service.

Note

The TLS certificate for this plugin should have been registered in the trust store on the Ranger Admin server. See TLS certificates for Apache Ranger integration with Amazon EMR for more details.

When the service is created, the Service Manager includes "AMAZON-EMR-EMRFS", as shown in the following image.

Ranger Admin showing new EMRFS S3 service.

Creating EMRFS S3 policies

To create a new policy, fill in the following fields on the Create policy page of the Service Manager.

Policy Name: The name of this policy.

Policy Label: A label that you can put on this policy.

S3 Resource: A resource starting with the bucket and optional prefix. See EMRFS S3 policies usage notes for information on best practices. Resources in Ranger Admin server should not contain s3://, s3a:// or s3n://.

Ranger Admin showing create policy for EMRFS S3 service.
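Because resources must be entered without a URI scheme, it can help to normalize S3 URIs before copying them into a policy. The following is a minimal sketch; the function name normalize_s3_resource is hypothetical.

```shell
#!/bin/sh
# Strip a leading s3://, s3a://, or s3n:// scheme so that the value
# starts with the bucket name, as the S3 Resource field expects.
normalize_s3_resource() {
  r=$1
  case "$r" in
    s3://*)  r=${r#s3://} ;;
    s3a://*) r=${r#s3a://} ;;
    s3n://*) r=${r#s3n://} ;;
  esac
  printf '%s\n' "$r"
}

normalize_s3_resource "s3://sales-reports/americas/year=2020"
```

Values that already start with the bucket name pass through unchanged.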

You can specify users and groups to grant permissions. You can also specify exclusions for allow conditions and deny conditions.

Ranger Admin showing user/group permissions for EMRFS S3 policy.

Note

A maximum of three resources are allowed for each policy. Adding more than three resources displays a reminder about the policy limit and may result in an error when this policy is used on an EMR cluster.

EMRFS S3 policies usage notes

When creating S3 policies within Apache Ranger, there are some usage considerations to be aware of.

Permissions to multiple S3 objects

You can use recursive policies and wildcard expressions to give permissions to multiple S3 objects with common prefixes. Recursive policies give permissions to all objects with a common prefix. Wildcard expressions select multiple prefixes. Together, they give permissions to all objects with multiple common prefixes as shown in the following examples.

Example Using a recursive policy

Suppose you want permissions to list all the parquet files in an S3 bucket organized as follows.

s3://sales-reports/americas/
    +- year=2000
    |      +- data-q1.parquet
    |      +- data-q2.parquet
    +- year=2019
    |      +- data-q1.json
    |      +- data-q2.json
    |      +- data-q3.json
    |      +- data-q4.json
    |
    +- year=2020
    |      +- data-q1.parquet
    |      +- data-q2.parquet
    |      +- data-q3.parquet
    |      +- data-q4.parquet
    |      +- annual-summary.parquet
    +- year=2021

First, consider the parquet files with the prefix s3://sales-reports/americas/year=2000. You can grant GetObject permissions to all of them in two ways:

Using non-recursive policies: One option is to use two separate non-recursive policies, one for the directory and the other for the files.

The first policy grants permission to the prefix s3://sales-reports/americas/year=2000 (there is no trailing /).

- S3 resource = "sales-reports/americas/year=2000"
- permission = "GetObject"
- user = "analyst"

The second policy uses a wildcard expression to grant permissions to all the files with the prefix sales-reports/americas/year=2000/ (note the trailing /).

- S3 resource = "sales-reports/americas/year=2000/*"
- permission = "GetObject"
- user = "analyst"

Using a recursive policy: A more convenient alternative is to use a single recursive policy and grant recursive permission to the prefix.

- S3 resource = "sales-reports/americas/year=2000"
- permission = "GetObject"
- user = "analyst"
- is recursive = "True"

So far, only the parquet files with the prefix s3://sales-reports/americas/year=2000 have been included. You can now also include the parquet files with a different prefix, s3://sales-reports/americas/year=2020, into the same recursive policy by introducing a wildcard expression as follows.

- S3 resource = "sales-reports/americas/year=20?0"
- permission = "GetObject"
- user = "analyst"
- is recursive = "True"
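The difference between recursive and non-recursive policies can be approximated with shell pattern matching. The sketch below is not Ranger's matcher. The function policy_covers is hypothetical and only mirrors the behavior described in this example: a non-recursive resource must match the whole path, while a recursive resource also covers everything beneath it.

```shell
#!/bin/sh
# Return 0 when the object path is covered by the policy resource.
# $1 = policy resource (may contain * and ?)
# $2 = "true" for a recursive policy, anything else for non-recursive
# $3 = object path, starting with the bucket name
policy_covers() {
  pattern=$1
  recursive=$2
  path=$3
  # The resource itself, matched in full.
  case "$path" in
    $pattern) return 0 ;;
  esac
  # A recursive policy also covers everything below the resource.
  if [ "$recursive" = "true" ]; then
    case "$path" in
      $pattern/*) return 0 ;;
    esac
  fi
  return 1
}

# Evaluate the recursive wildcard policy against sample objects.
for path in \
  "sales-reports/americas/year=2000/data-q1.parquet" \
  "sales-reports/americas/year=2019/data-q1.json" \
  "sales-reports/americas/year=2020/annual-summary.parquet"
do
  if policy_covers "sales-reports/americas/year=20?0" true "$path"; then
    printf 'covered:     %s\n' "$path"
  else
    printf 'not covered: %s\n' "$path"
  fi
done
```

In this approximation the objects under year=2000 and year=2020 are covered and those under year=2019 are not, because "?" must match exactly one character and the year must end in 0.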

Policies for PutObject and DeleteObject permissions

Writing policies for PutObject and DeleteObject permissions to files on EMRFS needs special care because, unlike GetObject permissions, they need additional recursive permissions granted to the prefix.

Example Policies for PutObject and DeleteObject permissions

For example, deleting the file annual-summary.parquet requires not only a DeleteObject permission to the actual file.

- S3 resource = "sales-reports/americas/year=2020/annual-summary.parquet"
- permission = "DeleteObject"
- user = "analyst"

It also requires a policy granting recursive GetObject and PutObject permissions to its prefix.

Similarly, modifying the file annual-summary.parquet requires not only a PutObject permission to the actual file.

- S3 resource = "sales-reports/americas/year=2020/annual-summary.parquet"
- permission = "PutObject"
- user = "analyst"

It also requires a policy granting recursive GetObject permission to its prefix.

- S3 resource = "sales-reports/americas/year=2020"
- permission = "GetObject"
- user = "analyst"
- is recursive = "True"
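The requirements in this example can be summarized as a small checklist generator. This is a sketch that restates the rules above; the function name required_policies and its output format are hypothetical.

```shell
#!/bin/sh
# Print the policy grants needed for an operation on one object,
# following the rules in this section.
# $1 = DeleteObject, PutObject, or GetObject
# $2 = object path, starting with the bucket name
required_policies() {
  op=$1
  key=$2
  prefix=${key%/*}   # parent prefix of the object
  case "$op" in
    DeleteObject)
      printf 'DeleteObject on %s\n' "$key"
      printf 'GetObject on %s (recursive)\n' "$prefix"
      printf 'PutObject on %s (recursive)\n' "$prefix"
      ;;
    PutObject)
      printf 'PutObject on %s\n' "$key"
      printf 'GetObject on %s (recursive)\n' "$prefix"
      ;;
    GetObject)
      printf 'GetObject on %s\n' "$key"
      ;;
    *)
      printf 'unknown operation: %s\n' "$op" >&2
      return 1
      ;;
  esac
}

required_policies DeleteObject "sales-reports/americas/year=2020/annual-summary.parquet"
```

Running the same function with PutObject lists the two grants needed to modify the file.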

Wildcards in policies

There are two areas in which wildcards can be specified. When specifying an S3 resource, you can use the "*" and "?" wildcards. The "*" wildcard matches against an S3 path and matches everything after the prefix. For example, consider the following policy.

S3 resource = "sales-reports/americas/*"

This matches the following S3 paths.

sales-reports/americas/year=2020/
sales-reports/americas/year=2019/
sales-reports/americas/year=2019/month=12/day=1/afile.parquet
sales-reports/americas/year=2018/month=6/day=1/afile.parquet
sales-reports/americas/year=2017/afile.parquet

The "?" wildcard matches only a single character. For example, consider the following policy.

S3 resource = "sales-reports/americas/year=201?/"

This matches the following S3 paths.

sales-reports/americas/year=2019/
sales-reports/americas/year=2018/
sales-reports/americas/year=2017/
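Shell patterns in a case statement treat "*" and "?" in a similar way, so they give a quick way to try out a resource expression before entering it in Ranger. This is an approximation only, and the function name glob_match is hypothetical.

```shell
#!/bin/sh
# glob_match PATTERN PATH: succeed when PATH matches PATTERN in full.
# In a case pattern, * matches any run of characters (including /)
# and ? matches exactly one character.
glob_match() {
  case "$2" in
    $1) return 0 ;;
    *)  return 1 ;;
  esac
}

for path in \
  "sales-reports/americas/year=2019/" \
  "sales-reports/americas/year=2019/month=12/day=1/afile.parquet" \
  "sales-reports/americas/year=2020/"
do
  for pattern in "sales-reports/americas/*" "sales-reports/americas/year=201?/"; do
    if glob_match "$pattern" "$path"; then
      printf 'match:    %s  <-  %s\n' "$pattern" "$path"
    else
      printf 'no match: %s  <-  %s\n' "$pattern" "$path"
    fi
  done
done
```

The "*" pattern matches all three sample paths, while the "?" pattern matches only the path that ends directly after a year beginning with 201.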

Wildcards in users

Two built-in wildcards are available when assigning users to a policy. The first is the "{USER}" wildcard, which provides access to all users. The second is the "{OWNER}" wildcard, which provides access to the owner of a particular object or directory. However, the "{USER}" wildcard is currently not supported.

Limitations

The following are current limitations of the EMRFS S3 plugin:

  • Apache Ranger policies can have at most three resources.

  • Access to S3 must go through EMRFS, which can be used with Hadoop-related applications. The following are not supported:

    - Boto3 libraries

    - AWS SDK and AWS CLI

    - S3A open source connector

  • Apache Ranger deny policies are not supported.

  • Operations on S3 with keys having CSE-KMS encryption are currently not supported.

  • Cross-Region access is not supported.

  • Apache Ranger’s Security Zone feature is not supported. Access control restrictions defined using the Security Zone feature are not applied on your Amazon EMR clusters.

  • The Hadoop user does not generate any audit events as Hadoop always accesses the EC2 Instance Profile.

  • It's recommended that you disable Amazon EMR Consistency View. S3 is strongly consistent, so it's no longer needed. See Amazon S3 strong consistency for more information.

  • The EMRFS S3 plugin makes numerous STS calls. It's advised that you do load testing on a development account and monitor STS call volume. It is also recommended that you make an STS request to raise AssumeRole service limits.

  • The Ranger Admin server doesn't support auto-complete.