Instance storage - Amazon EMR

Instance storage

Instance store and Amazon EBS volume storage are used for HDFS data and for buffers, caches, scratch data, and other temporary content that some applications may "spill" to the local file system.

Amazon EBS works differently within Amazon EMR than it does with regular Amazon EC2 instances. Amazon EBS volumes attached to Amazon EMR clusters are ephemeral: the volumes are deleted when the cluster or the instance terminates (for example, when you shrink an instance group), so don't expect data on them to persist. Although the volumes are ephemeral, data in HDFS may be replicated, depending on the number and specialization of nodes in the cluster.

When you add Amazon EBS storage volumes, they are mounted as additional volumes and are not part of the boot volume. YARN is configured to use all the additional volumes, but you are responsible for allocating the additional volumes as local storage (for example, for local log files).

Other caveats for using Amazon EBS with Amazon EMR clusters are:

  • You can't snapshot an Amazon EBS volume and then restore it within Amazon EMR. To create reusable custom configurations, use a custom AMI (available in Amazon EMR version 5.7.0 and later). For more information, see Using a custom AMI.

  • An encrypted Amazon EBS root device volume is supported only when using a custom AMI. For more information, see Creating a custom AMI with an encrypted Amazon EBS root device volume.

  • If you apply tags using the Amazon EMR API, those tagging operations are applied to the EBS volumes.

  • There is a limit of 25 volumes per instance.

  • The Amazon EBS volumes on core nodes cannot be less than 5 GB.
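The two numeric caveats above (at most 25 volumes per instance, and at least 5 GB per volume on core nodes) can be checked before a cluster request is submitted. The following is a minimal sketch, not an Amazon EMR API: the function name `check_ebs_request` and the node-type strings are hypothetical, and it assumes the 25-volume limit is counted against the list of volumes passed in.

```python
MAX_VOLUMES_PER_INSTANCE = 25  # limit stated in the caveats above
MIN_CORE_VOLUME_GB = 5         # minimum volume size on core nodes, from the caveats above


def check_ebs_request(node_type, volume_sizes_gb):
    """Return a list of problems found; an empty list means both checks pass.

    node_type is a hypothetical label ("core", "task", ...) used only by this sketch.
    """
    problems = []
    if len(volume_sizes_gb) > MAX_VOLUMES_PER_INSTANCE:
        problems.append(
            f"{len(volume_sizes_gb)} volumes requested; "
            f"the limit is {MAX_VOLUMES_PER_INSTANCE} per instance"
        )
    if node_type == "core":
        for size in volume_sizes_gb:
            if size < MIN_CORE_VOLUME_GB:
                problems.append(
                    f"core node volume of {size} GB is below the "
                    f"{MIN_CORE_VOLUME_GB} GB minimum"
                )
    return problems


# Example: a core node with one undersized volume yields one reported problem.
issues = check_ebs_request("core", [32, 4])
```

Returning a list of problems rather than raising on the first one lets a caller report every violation in a single pass.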

Default Amazon EBS storage for instances

Amazon EMR automatically attaches an Amazon EBS General Purpose SSD (gp2) 10 GB volume as the root device for its AMIs to enhance performance. In addition, for EC2 instances with EBS-only storage, Amazon EMR allocates Amazon EBS gp2 storage volumes to instances. When you create a cluster with Amazon EMR release version 5.22.0 or later, the default amount of Amazon EBS storage increases with the size of the instance. The increased storage is split across multiple volumes, which gives increased IOPS performance and, in turn, increased performance for some standardized workloads.

If you want to use a different Amazon EBS gp2 instance storage configuration, you can specify it when you create an Amazon EMR cluster or add nodes to an existing cluster. At this time, Amazon EBS gp3 volumes can't be used as root volumes on an Amazon EMR cluster: root volumes must be gp2, and gp3 volumes can only be added as additional volumes. The following table identifies the default number of Amazon EBS gp2 storage volumes, sizes, and total sizes per instance type.

Amazon EBS costs are pro-rated by the hour based on the monthly charges for gp2 volumes in the AWS Region where the cluster runs. For example, in a Region that charges $0.10/GB/month, the Amazon EBS cost for the 10 GB root volume on each cluster node would be approximately $0.00139 per hour ($0.10/GB/month divided by 30 days, divided by 24 hours, times 10 GB).
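The arithmetic in the example above can be reproduced with a few lines of Python. This is only an illustration of the pro-rating formula as described here; the function name `ebs_hourly_cost` is hypothetical, and the 30-day month is the same simplification the example uses.

```python
def ebs_hourly_cost(price_per_gb_month, size_gb, days_per_month=30):
    """Pro-rate a monthly per-GB price to an hourly cost for one volume."""
    return price_per_gb_month / days_per_month / 24 * size_gb


# The 10 GB root volume in a Region that charges $0.10/GB/month.
cost = ebs_hourly_cost(0.10, 10)
print(f"${cost:.5f} per hour")  # prints "$0.00139 per hour"
```

To estimate the cost for a whole cluster, multiply the result by the number of nodes and repeat the calculation for any additional volumes.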

Default Amazon EBS gp2 storage volumes and size by instance type for Amazon EMR 5.22.0 and later
Instance size    Number of volumes    Volume size (GiB)    Total size (GiB)
*.large          1                    32                   32
*.xlarge         2                    32                   64
*.2xlarge        4                    32                   128
*.4xlarge        4                    64                   256
*.8xlarge        4                    128                  512
*.9xlarge        4                    144                  576
*.10xlarge       4                    160                  640
*.12xlarge       4                    192                  768
*.16xlarge       4                    256                  1024
*.18xlarge       4                    288                  1152
*.24xlarge       4                    384                  1536
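For capacity planning, the table above can be encoded as a small lookup. The sketch below is an illustration only: the function name `default_ebs_storage` is hypothetical, the instance type names in the usage line are examples, and the values apply only to the defaults listed in this table for Amazon EMR 5.22.0 and later.

```python
# Instance size suffix -> (number of volumes, volume size in GiB), copied from the table above.
DEFAULT_GP2_STORAGE = {
    "large": (1, 32),
    "xlarge": (2, 32),
    "2xlarge": (4, 32),
    "4xlarge": (4, 64),
    "8xlarge": (4, 128),
    "9xlarge": (4, 144),
    "10xlarge": (4, 160),
    "12xlarge": (4, 192),
    "16xlarge": (4, 256),
    "18xlarge": (4, 288),
    "24xlarge": (4, 384),
}


def default_ebs_storage(instance_type):
    """Return (volume count, volume size GiB, total GiB) for a type such as 'm5.xlarge'.

    Raises KeyError for sizes that the table does not list.
    """
    size = instance_type.rsplit(".", 1)[-1]
    count, volume_gib = DEFAULT_GP2_STORAGE[size]
    return count, volume_gib, count * volume_gib


count, volume_gib, total_gib = default_ebs_storage("m5.xlarge")
```

The total is computed rather than stored, so it always agrees with the volume count and per-volume size.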

Specifying additional EBS storage volumes

When you configure instance types in Amazon EMR, you can specify additional EBS volumes to add capacity beyond the instance store (if present) and the default EBS volume. Amazon EBS provides the following volume types: General Purpose (SSD), Provisioned IOPS (SSD), Throughput Optimized (HDD), Cold (HDD), and Magnetic. They differ in performance characteristics and price, so you can tailor your storage to the analytic and business needs of your applications. For example, some applications may need to spill to disk, while others can safely work in memory or use Amazon S3.

You can attach Amazon EBS volumes to instances only at cluster startup time and when you add an extra task node instance group. If an instance in an Amazon EMR cluster fails, both the instance and its attached Amazon EBS volumes are replaced with new ones. Consequently, if you manually detach an Amazon EBS volume, Amazon EMR treats that as a failure and replaces both the instance storage (if applicable) and the volume stores.

Amazon EMR doesn’t allow you to change the volume type from gp2 to gp3 on an existing EMR cluster. To use gp3 for your workloads, you need to launch a new EMR cluster. Additionally, we do not recommend updating the throughput and IOPS on a cluster that is in use or being provisioned, because Amazon EMR uses the throughput and IOPS values that you specified at cluster launch time for any new instance added during cluster scale-up. See Comparing Amazon EBS volume types gp2 and gp3 and Selecting IOPS and throughput when migrating to gp3.

Important

To use a gp3 volume with your EMR cluster, launch a new EMR cluster using the API, SDK or CLI.
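As a rough illustration of what an additional-volume request might look like when launching through the API or an SDK, the sketch below builds a plain dictionary in the shape of the `EbsConfiguration` structure that, to our understanding, the Amazon EMR `RunJobFlow` API accepts for an instance group. The helper name `ebs_configuration` is hypothetical, the sizes and performance values are example numbers only, and the field names should be verified against the current Amazon EMR API reference before use. The code makes no AWS calls.

```python
def ebs_configuration(volume_type, size_gb, volumes_per_instance,
                      iops=None, throughput=None):
    """Build an EbsConfiguration-shaped dict for one kind of additional volume.

    Field names are assumed from the EMR API's EbsConfiguration structure;
    optional performance settings are included only when supplied.
    """
    spec = {"VolumeType": volume_type, "SizeInGB": size_gb}
    if iops is not None:
        spec["Iops"] = iops
    if throughput is not None:
        spec["Throughput"] = throughput
    return {
        "EbsBlockDeviceConfigs": [
            {
                "VolumeSpecification": spec,
                "VolumesPerInstance": volumes_per_instance,
            }
        ]
    }


# Example values only: two 100 GB gp3 volumes per instance.
config = ebs_configuration("gp3", 100, 2, iops=3000, throughput=125)
```

With an SDK such as boto3, a dictionary like this would typically be supplied as the `EbsConfiguration` of an instance group in the cluster launch request. Because Amazon EMR reuses the launch-time IOPS and throughput values for instances added during scale-up, choose them with the scaled-out cluster in mind.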