How Amazon EMR uses AWS KMS - AWS Key Management Service

How Amazon EMR uses AWS KMS

When you use an Amazon EMR cluster, you can configure the cluster to encrypt data at rest before saving it to a persistent storage location. You can encrypt data at rest on the EMR File System (EMRFS), on the storage volumes of cluster nodes, or both. To encrypt data at rest, you can use an AWS KMS key. The following topics explain how an Amazon EMR cluster uses a KMS key to encrypt data at rest.

Important

Amazon EMR supports only symmetric KMS keys. You cannot use an asymmetric KMS key to encrypt data at rest in an Amazon EMR cluster. For help determining whether a KMS key is symmetric or asymmetric, see Identifying asymmetric KMS keys.

Amazon EMR clusters also encrypt data in transit, which means the cluster encrypts data before sending it through the network. You cannot use a KMS key to encrypt data in transit. For more information, see In-Transit Data Encryption in the Amazon EMR Management Guide.

For more information about all the encryption options available in Amazon EMR, see Encryption Options in the Amazon EMR Management Guide.

Encrypting data on the EMR file system (EMRFS)

Amazon EMR clusters use two distributed files systems:

  • The Hadoop Distributed File System (HDFS). HDFS encryption does not use a KMS key in AWS KMS.

  • The EMR File System (EMRFS). EMRFS is an implementation of HDFS that allows Amazon EMR clusters to store data in Amazon Simple Storage Service (Amazon S3). EMRFS supports four encryption options, two of which use a KMS key in AWS KMS. For more information about all four of the EMRFS encryption options, see Encryption Options in the Amazon EMR Management Guide.

The two EMRFS encryption options that use a KMS key use the following encryption features offered by Amazon S3:

When you configure an Amazon EMR cluster to encrypt data on EMRFS with a KMS key, you choose the KMS key that you want Amazon S3 or the Amazon EMR cluster to use. With SSE-KMS, you can choose the AWS managed key for Amazon S3 with the alias aws/s3, or a symmetric customer managed key that you create. With client-side encryption, you must choose a symmetric customer managed key that you create. When you choose a customer managed key, you must ensure that your Amazon EMR cluster has permission to use the KMS key. For more information, see Using AWS KMS keys for encryption in the Amazon EMR Management Guide.

For both server-side and client-side encryption, the KMS key you choose is the root key in an envelope encryption workflow. The data is encrypted with a unique data key that is encrypted under the KMS key in AWS KMS. The encrypted data and an encrypted copy of its data key are stored together as a single encrypted object in an S3 bucket. For more information about how this works, see the following topics.

Process for encrypting data on EMRFS with SSE-KMS

When you configure an Amazon EMR cluster to use SSE-KMS, the encryption process works like this:

  1. The cluster sends data to Amazon S3 for storage in an S3 bucket.

  2. Amazon S3 sends a GenerateDataKey request to AWS KMS, specifying the key ID of the KMS key that you chose when you configured the cluster to use SSE-KMS. The request includes encryption context; for more information, see Encryption context.

  3. AWS KMS generates a unique data encryption key (data key) and then sends two copies of this data key to Amazon S3. One copy is unencrypted (plaintext), and the other copy is encrypted under the KMS key.

  4. Amazon S3 uses the plaintext data key to encrypt the data that it received in step 1, and then removes the plaintext data key from memory as soon as possible after use.

  5. Amazon S3 stores the encrypted data and the encrypted copy of the data key together as a single encrypted object in an S3 bucket.

The decryption process works like this:

  1. The cluster requests an encrypted data object from an S3 bucket.

  2. Amazon S3 extracts the encrypted data key from the S3 object, and then sends the encrypted data key to AWS KMS with a Decrypt request. The request includes an encryption context.

  3. AWS KMS decrypts the encrypted data key using the same KMS key that was used to encrypt it, and then sends the decrypted (plaintext) data key to Amazon S3.

  4. Amazon S3 uses the plaintext data key to decrypt the encrypted data, and then removes the plaintext data key from memory as soon as possible after use.

  5. Amazon S3 sends the decrypted data to the cluster.

Process for encrypting data on EMRFS with CSE-KMS

When you configure an Amazon EMR cluster to use CSE-KMS, the encryption process works like this:

  1. When it's ready to store data in Amazon S3, the cluster sends a GenerateDataKey request to AWS KMS, specifying the key ID of the KMS key that you chose when you configured the cluster to use CSE-KMS. The request includes encryption context; for more information, see Encryption context.

  2. AWS KMS generates a unique data encryption key (data key) and then sends two copies of this data key to the cluster. One copy is unencrypted (plaintext), and the other copy is encrypted under the KMS key.

  3. The cluster uses the plaintext data key to encrypt the data, and then removes the plaintext data key from memory as soon as possible after use.

  4. The cluster combines the encrypted data and the encrypted copy of the data key together into a single encrypted object.

  5. The cluster sends the encrypted object to Amazon S3 for storage.

The decryption process works like this:

  1. The cluster requests the encrypted data object from an S3 bucket.

  2. Amazon S3 sends the encrypted object to the cluster.

  3. The cluster extracts the encrypted data key from the encrypted object, and then sends the encrypted data key to AWS KMS with a Decrypt request. The request includes encryption context.

  4. AWS KMS decrypts the encrypted data key using the same KMS key that was used to encrypt it, and then sends the decrypted (plaintext) data key to the cluster.

  5. The cluster uses the plaintext data key to decrypt the encrypted data, and then removes the plaintext data key from memory as soon as possible after use.

Encrypting data on the storage volumes of cluster nodes

An Amazon EMR cluster is a collection of Amazon Elastic Compute Cloud (Amazon EC2) instances. Each instance in the cluster is called a cluster node or node. Each node can have two types of storage volumes: instance store volumes, and Amazon Elastic Block Store (Amazon EBS) volumes. You can configure the cluster to use Linux Unified Key Setup (LUKS) to encrypt both types of storage volumes on the nodes (but not the boot volume of each node). This is called local disk encryption.

When you enable local disk encryption for a cluster, you can choose to encrypt the LUKS key with a KMS key in AWS KMS. You must choose a customer managed key that you create; you cannot use an AWS managed key. If you choose a customer managed key, you must ensure that your Amazon EMR cluster has permission to use the KMS key. For more information, see Using AWS KMS keys for encryption in the Amazon EMR Management Guide.

When you enable local disk encryption using a KMS key, the encryption process works like this:

  1. When each cluster node launches, it sends a GenerateDataKey request to AWS KMS, specifying the key ID of the KMS key that you chose when you enabled local disk encryption for the cluster.

  2. AWS KMS generates a unique data encryption key (data key) and then sends two copies of this data key to the node. One copy is unencrypted (plaintext), and the other copy is encrypted under the KMS key.

  3. The node uses a base64-encoded version of the plaintext data key as the password that protects the LUKS key. The node saves the encrypted copy of the data key on its boot volume.

  4. If the node reboots, the rebooted node sends the encrypted data key to AWS KMS with a Decrypt request.

  5. AWS KMS decrypts the encrypted data key using the same KMS key that was used to encrypt it, and then sends the decrypted (plaintext) data key to the node.

  6. The node uses the base64-encoded version of the plaintext data key as the password to unlock the LUKS key.

Encryption context

Each AWS service integrated with AWS KMS can specify an encryption context when the service uses AWS KMS to generate data keys or to encrypt or decrypt data. Encryption context is additional authenticated information that AWS KMS uses to check for data integrity. When a service specifies encryption context for an encryption operation, it must specify the same encryption context for the corresponding decryption operation or decryption will fail. Encryption context is also written to AWS CloudTrail log files, which can help you understand why a specific KMS key was used.

The following section explain the encryption context that is used in each Amazon EMR encryption scenario that uses a KMS key.

Encryption context for EMRFS encryption with SSE-KMS

With SSE-KMS, the Amazon EMR cluster sends data to Amazon S3, and then Amazon S3 uses a KMS key to encrypt the data before saving it to an S3 bucket. In this case, Amazon S3 uses the Amazon Resource Name (ARN) of the S3 object as encryption context with each GenerateDataKey and Decrypt request that it sends to AWS KMS. The following example shows a JSON representation of the encryption context that Amazon S3 uses.

{ "aws:s3:arn" : "arn:aws:s3:::S3_bucket_name/S3_object_key" }

Encryption context for EMRFS encryption with CSE-KMS

With CSE-KMS, the Amazon EMR cluster uses a KMS key to encrypt data before sending it to Amazon S3 for storage. In this case, the cluster uses the Amazon Resource Name (ARN) of the KMS key as encryption context with each GenerateDataKey and Decrypt request that it sends to AWS KMS. The following example shows a JSON representation of the encryption context that the cluster uses.

{ "kms_cmk_id" : "arn:aws:kms:us-east-2:111122223333:key/0987ab65-43cd-21ef-09ab-87654321cdef" }

Encryption context for local disk encryption with LUKS

When an Amazon EMR cluster uses local disk encryption with LUKS, the cluster nodes do not specify encryption context with the GenerateDataKey and Decrypt requests that they send to AWS KMS.