AWS Key Management Service
Developer Guide

How Amazon EMR Uses AWS KMS

When you use an Amazon EMR cluster, you can configure the cluster to encrypt data at rest, which means the cluster encrypts data before saving it to a persistent storage location. You can encrypt data at rest on the EMR File System (EMRFS), on the storage volumes of cluster nodes, or both. To encrypt data at rest, you can use a customer master key (CMK) in AWS KMS. The following topics explain how an Amazon EMR cluster uses a CMK to encrypt data at rest.

Amazon EMR clusters also encrypt data in transit, which means the cluster encrypts data before sending it through the network. You cannot use a CMK to encrypt data in transit. For more information, see In-Transit Data Encryption in the Amazon EMR Release Guide.

For more information about all the encryption options available in Amazon EMR, see Understanding Encryption Options with Amazon EMR in the Amazon EMR Release Guide.

Encrypting Data on the EMR File System (EMRFS)

Amazon EMR clusters use two distributed file systems:

  • The Hadoop Distributed File System (HDFS). HDFS encryption does not use a CMK in AWS KMS.

  • The EMR File System (EMRFS). EMRFS is an implementation of HDFS that allows Amazon EMR clusters to store data in Amazon Simple Storage Service (Amazon S3). EMRFS supports four encryption options, two of which use a CMK in AWS KMS. For more information about all four of the EMRFS encryption options, see At-Rest Encryption for Amazon S3 with EMRFS in the Amazon EMR Release Guide.

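The two EMRFS encryption options that use a CMK rely on the following encryption features offered by Amazon S3:

  • Server-side encryption with AWS KMS-managed keys (SSE-KMS). The Amazon EMR cluster sends data to Amazon S3, and then Amazon S3 uses a CMK to encrypt the data before saving it to an S3 bucket.

  • Client-side encryption with AWS KMS-managed keys (CSE-KMS). The Amazon EMR cluster uses a CMK to encrypt data before sending it to Amazon S3 for storage.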

When you configure an Amazon EMR cluster to encrypt data on EMRFS with SSE-KMS or CSE-KMS, you choose the CMK in AWS KMS that you want Amazon S3 or the Amazon EMR cluster to use. With SSE-KMS, you can choose the AWS-managed CMK for Amazon S3 with the alias aws/s3, or a custom CMK that you create. With CSE-KMS, you must choose a custom CMK that you create. When you choose a custom CMK, you must ensure that your Amazon EMR cluster has permission to use the CMK. For more information, see Add the EMR Instance Role to an AWS KMS CMK in the Amazon EMR Release Guide.
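As a rough illustration of the choice described above, the following Python sketch builds the at-rest portion of an Amazon EMR security configuration for either KMS-backed EMRFS mode. The helper name emr_s3_encryption_config is hypothetical, and the field names (EncryptionConfiguration, S3EncryptionConfiguration, EncryptionMode, AwsKmsKey) are an assumption about the security configuration format; confirm them against the Amazon EMR Release Guide before use. The key ARN is the example ARN used elsewhere in this topic.

```python
import json

# Example CMK ARN taken from the encryption context example in this topic.
EXAMPLE_CMK_ARN = (
    "arn:aws:kms:us-east-2:111122223333:"
    "key/0987ab65-43cd-21ef-09ab-87654321cdef"
)


def emr_s3_encryption_config(mode, cmk_arn):
    """Hypothetical helper: build an at-rest security configuration dict.

    Field names are assumed, not taken from this topic; verify them in the
    Amazon EMR Release Guide.
    """
    if mode not in ("SSE-KMS", "CSE-KMS"):
        raise ValueError("this sketch covers only the two KMS-backed modes")
    return {
        "EncryptionConfiguration": {
            "EnableInTransitEncryption": False,
            "EnableAtRestEncryption": True,
            "AtRestEncryptionConfiguration": {
                "S3EncryptionConfiguration": {
                    "EncryptionMode": mode,
                    "AwsKmsKey": cmk_arn,
                }
            },
        }
    }


# Print the JSON document for a cluster that uses CSE-KMS.
print(json.dumps(emr_s3_encryption_config("CSE-KMS", EXAMPLE_CMK_ARN), indent=2))
```

Whichever mode you choose, the cluster still needs permission to use the CMK, as described above.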

For both SSE-KMS and CSE-KMS, the CMK you choose is the master key in an envelope encryption workflow. This means the data is encrypted with a unique data encryption key (or data key), and this data key is encrypted under the CMK in AWS KMS. The encrypted data and an encrypted copy of its data key are stored together as a single encrypted object in an S3 bucket. For more information about how this works, see the following topics.

Process for Encrypting Data on EMRFS with SSE-KMS

When you configure an Amazon EMR cluster to use SSE-KMS, the encryption process works like this:

  1. The cluster sends data to Amazon S3 for storage in an S3 bucket.

  2. Amazon S3 sends a GenerateDataKey request to AWS KMS, specifying the key ID of the CMK that you chose when you configured the cluster to use SSE-KMS. The request includes encryption context; for more information, see Encryption Context.

  3. AWS KMS generates a unique data encryption key (data key) and then sends two copies of this data key to Amazon S3. One copy is unencrypted (plaintext), and the other copy is encrypted under the CMK.

  4. Amazon S3 uses the plaintext data key to encrypt the data that it received in step 1, and then removes the plaintext data key from memory as soon as possible after use.

  5. Amazon S3 stores the encrypted data and the encrypted copy of the data key together as a single encrypted object in an S3 bucket.

The decryption process works like this:

  1. The cluster requests an encrypted data object from an S3 bucket.

  2. Amazon S3 extracts the encrypted data key from the S3 object, and then sends the encrypted data key to AWS KMS with a Decrypt request. The request includes encryption context; for more information, see Encryption Context.

  3. AWS KMS decrypts the encrypted data key using the same CMK that was used to encrypt it, and then sends the decrypted (plaintext) data key to Amazon S3.

  4. Amazon S3 uses the plaintext data key to decrypt the encrypted data, and then removes the plaintext data key from memory as soon as possible after use.

  5. Amazon S3 sends the decrypted data to the cluster.
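With SSE-KMS, all of the AWS KMS calls in the steps above are made by Amazon S3, so the client side of the exchange is just an ordinary S3 request that names the CMK. EMRFS issues this request on the cluster's behalf. The sketch below only shows what an equivalent request might look like if you built it yourself for the boto3 put_object call; the helper name sse_kms_put_params and the bucket and object names are hypothetical.

```python
def sse_kms_put_params(bucket, key, body, cmk_id=None):
    """Hypothetical helper: keyword arguments for an SSE-KMS S3 upload."""
    params = {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        # Ask Amazon S3 to encrypt the object with a data key from AWS KMS.
        "ServerSideEncryption": "aws:kms",
    }
    if cmk_id is not None:
        # A custom CMK. When this is omitted, Amazon S3 uses the
        # AWS-managed CMK with the alias aws/s3.
        params["SSEKMSKeyId"] = cmk_id
    return params


params = sse_kms_put_params(
    "S3_bucket_name",
    "S3_object_key",
    b"example data",
    cmk_id="0987ab65-43cd-21ef-09ab-87654321cdef",
)
print(params)

# With boto3 installed and credentials configured, the upload would be:
#   import boto3
#   boto3.client("s3").put_object(**params)
```

No decryption parameters are needed when reading the object back; Amazon S3 calls Decrypt itself and returns plaintext data to a caller that has permission to use the CMK.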

Process for Encrypting Data on EMRFS with CSE-KMS

When you configure an Amazon EMR cluster to use CSE-KMS, the encryption process works like this:

  1. When it's ready to store data in Amazon S3, the cluster sends a GenerateDataKey request to AWS KMS, specifying the key ID of the CMK that you chose when you configured the cluster to use CSE-KMS. The request includes encryption context; for more information, see Encryption Context.

  2. AWS KMS generates a unique data encryption key (data key) and then sends two copies of this data key to the cluster. One copy is unencrypted (plaintext), and the other copy is encrypted under the CMK.

  3. The cluster uses the plaintext data key to encrypt the data, and then removes the plaintext data key from memory as soon as possible after use.

  4. The cluster combines the encrypted data and the encrypted copy of the data key together into a single encrypted object.

  5. The cluster sends the encrypted object to Amazon S3 for storage.

The decryption process works like this:

  1. The cluster requests the encrypted data object from an S3 bucket.

  2. Amazon S3 sends the encrypted object to the cluster.

  3. The cluster extracts the encrypted data key from the encrypted object, and then sends the encrypted data key to AWS KMS with a Decrypt request. The request includes encryption context; for more information, see Encryption Context.

  4. AWS KMS decrypts the encrypted data key using the same CMK that was used to encrypt it, and then sends the decrypted (plaintext) data key to the cluster.

  5. The cluster uses the plaintext data key to decrypt the encrypted data, and then removes the plaintext data key from memory as soon as possible after use.
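The CSE-KMS steps above can be sketched end to end in Python. Everything here is a simplified model rather than the EMRFS implementation: FakeKms is a hypothetical in-memory stand-in whose methods mimic the request and response fields of the AWS KMS GenerateDataKey and Decrypt operations, toy_cipher is a deliberately insecure placeholder for a real cipher such as AES, and the dictionary layout of the "encrypted object" is an assumption made for illustration.

```python
import hashlib
import os


class FakeKms:
    """Hypothetical in-memory stand-in for AWS KMS (illustration only)."""

    def __init__(self):
        self._issued = {}  # encrypted data key -> (plaintext key, context)

    def generate_data_key(self, KeyId, KeySpec, EncryptionContext):
        plaintext = os.urandom(32)
        blob = os.urandom(16)  # opaque token standing in for the wrapped key
        self._issued[blob] = (plaintext, dict(EncryptionContext))
        return {"KeyId": KeyId, "Plaintext": plaintext, "CiphertextBlob": blob}

    def decrypt(self, CiphertextBlob, EncryptionContext):
        plaintext, context = self._issued[CiphertextBlob]
        if context != EncryptionContext:
            raise ValueError("encryption context mismatch")
        return {"Plaintext": plaintext}


def toy_cipher(key, data):
    """XOR with a SHA-256 keystream. NOT secure; placeholder for AES."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def encrypt_object(kms, cmk_arn, data):
    context = {"kms_cmk_id": cmk_arn}  # encryption context used with CSE-KMS
    # Steps 1-2: request a data key; two copies come back.
    reply = kms.generate_data_key(
        KeyId=cmk_arn, KeySpec="AES_256", EncryptionContext=context
    )
    # Step 3: encrypt locally with the plaintext copy.
    ciphertext = toy_cipher(reply["Plaintext"], data)
    # Step 4: keep the encrypted copy of the data key with the data.
    return {"encrypted_key": reply["CiphertextBlob"], "ciphertext": ciphertext}


def decrypt_object(kms, cmk_arn, obj):
    context = {"kms_cmk_id": cmk_arn}  # must match the context used above
    # Decryption steps 3-4: ask KMS to unwrap the data key.
    reply = kms.decrypt(
        CiphertextBlob=obj["encrypted_key"], EncryptionContext=context
    )
    # Decryption step 5: decrypt locally.
    return toy_cipher(reply["Plaintext"], obj["ciphertext"])


kms = FakeKms()
arn = "arn:aws:kms:us-east-2:111122223333:key/0987ab65-43cd-21ef-09ab-87654321cdef"
stored = encrypt_object(kms, arn, b"example record")
print(decrypt_object(kms, arn, stored))
```

The point of the model is the division of labor: the CMK never leaves AWS KMS, the data never goes to AWS KMS, and only the small data key travels between the two.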

Encrypting Data on the Storage Volumes of Cluster Nodes

An Amazon EMR cluster is a collection of Amazon Elastic Compute Cloud (Amazon EC2) instances. Each instance in the cluster is called a cluster node or node. Each node can have two types of storage volumes: instance store volumes, and Amazon Elastic Block Store (Amazon EBS) volumes. You can configure the cluster to use Linux Unified Key Setup (LUKS) to encrypt both types of storage volumes on the nodes (but not the boot volume of each node). This is called local disk encryption.

When you enable local disk encryption for a cluster, you can choose to encrypt the LUKS master key with a CMK in AWS KMS. You must choose a custom CMK that you create; you cannot use an AWS-managed CMK. When you choose a custom CMK, you must ensure that your Amazon EMR cluster has permission to use the CMK. For more information, see Add the EMR Instance Role to an AWS KMS CMK in the Amazon EMR Release Guide.

When you enable local disk encryption using a CMK, the encryption process works like this:

  1. When each cluster node launches, it sends a GenerateDataKey request to AWS KMS, specifying the key ID of the CMK that you chose when you enabled local disk encryption for the cluster.

  2. AWS KMS generates a unique data encryption key (data key) and then sends two copies of this data key to the node. One copy is unencrypted (plaintext), and the other copy is encrypted under the CMK.

  3. The node uses a base64-encoded version of the plaintext data key as the password that protects the LUKS master key. The node saves the encrypted copy of the data key on its boot volume.

  4. If the node reboots, the rebooted node sends the encrypted data key to AWS KMS with a Decrypt request.

  5. AWS KMS decrypts the encrypted data key using the same CMK that was used to encrypt it, and then sends the decrypted (plaintext) data key to the node.

  6. The node uses the base64-encoded version of the plaintext data key as the password to unlock the LUKS master key.
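A minimal sketch of the node-side logic in the steps above follows. The kms argument is assumed to be an object with generate_data_key and decrypt methods shaped like the boto3 AWS KMS client (for example, boto3.client("kms")). The function names are hypothetical, and the AES_256 key spec is an assumption, because this topic does not state the size of the data key. No encryption context is passed, which matches the local disk encryption scenario.

```python
import base64


def luks_passphrase(plaintext_data_key):
    """Base64-encode the plaintext data key for use as the LUKS password."""
    return base64.b64encode(plaintext_data_key).decode("ascii")


def first_boot(kms, cmk_id):
    """Steps 1-3: get a data key, derive the password, keep the wrapped key."""
    reply = kms.generate_data_key(KeyId=cmk_id, KeySpec="AES_256")  # assumed spec
    passphrase = luks_passphrase(reply["Plaintext"])
    # The caller would save the encrypted copy on the boot volume.
    return passphrase, reply["CiphertextBlob"]


def after_reboot(kms, saved_encrypted_key):
    """Steps 4-6: unwrap the saved data key and rebuild the same password."""
    reply = kms.decrypt(CiphertextBlob=saved_encrypted_key)
    return luks_passphrase(reply["Plaintext"])


# The encoding is deterministic, so the same data key always yields the
# same password. A 32-byte key encodes to 44 characters.
print(len(luks_passphrase(bytes(32))))
```

Because only the encrypted copy of the data key is stored on the node, the password cannot be rebuilt after a reboot unless the node is still permitted to call Decrypt on the CMK.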

Encryption Context

Each AWS service that is integrated with AWS KMS can specify encryption context when it uses AWS KMS to generate data keys or to encrypt or decrypt data. Encryption context is additional authenticated information that AWS KMS uses to check for data integrity. When a service specifies encryption context for an encryption operation, it must specify the same encryption context for the corresponding decryption operation or decryption will fail. Encryption context is also written to AWS CloudTrail log files, which can help you understand why a given CMK was used. For more information about encryption context, see Encryption Context.

The following sections explain the encryption context that is used in each Amazon EMR encryption scenario that uses a CMK.

Encryption Context for EMRFS Encryption with SSE-KMS

With SSE-KMS, the Amazon EMR cluster sends data to Amazon S3, and then Amazon S3 uses a CMK to encrypt the data before saving it to an S3 bucket. In this case, Amazon S3 uses the Amazon Resource Name (ARN) of the S3 object as encryption context with each GenerateDataKey and Decrypt request that it sends to AWS KMS. The following example shows a JSON representation of the encryption context that Amazon S3 uses.

{ "aws:s3:arn" : "arn:aws:s3:::S3_bucket_name/S3_object_key" }

Encryption Context for EMRFS Encryption with CSE-KMS

With CSE-KMS, the Amazon EMR cluster uses a CMK to encrypt data before sending it to Amazon S3 for storage. In this case, the cluster uses the Amazon Resource Name (ARN) of the CMK as encryption context with each GenerateDataKey and Decrypt request that it sends to AWS KMS. The following example shows a JSON representation of the encryption context that the cluster uses.

{ "kms_cmk_id" : "arn:aws:kms:us-east-2:111122223333:key/0987ab65-43cd-21ef-09ab-87654321cdef" }

Encryption Context for Local Disk Encryption with LUKS

When an Amazon EMR cluster uses local disk encryption with LUKS, the cluster nodes do not specify encryption context with the GenerateDataKey and Decrypt requests that they send to AWS KMS.
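The three scenarios can be summarized in one small Python helper, which may be handy when you search AWS CloudTrail logs for requests made on behalf of a cluster. The function name encryption_context and its parameters are hypothetical; the key names and value formats are the ones shown in the examples above.

```python
def encryption_context(scenario, bucket=None, object_key=None, cmk_arn=None):
    """Return the encryption context expected for an Amazon EMR scenario."""
    if scenario == "SSE-KMS":
        # Amazon S3 uses the ARN of the S3 object.
        return {"aws:s3:arn": f"arn:aws:s3:::{bucket}/{object_key}"}
    if scenario == "CSE-KMS":
        # The cluster uses the ARN of the CMK.
        return {"kms_cmk_id": cmk_arn}
    if scenario == "LUKS":
        # Cluster nodes do not specify encryption context.
        return None
    raise ValueError(f"unknown scenario: {scenario}")


print(encryption_context("SSE-KMS", bucket="S3_bucket_name", object_key="S3_object_key"))
print(encryption_context(
    "CSE-KMS",
    cmk_arn="arn:aws:kms:us-east-2:111122223333:key/0987ab65-43cd-21ef-09ab-87654321cdef",
))
print(encryption_context("LUKS"))
```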