Amazon EMR
Management Guide

Instance Store and Amazon EBS

There are two types of storage volumes available for EC2 instances: Amazon EBS volumes and the instance store. With Amazon EMR, both types of storage are ephemeral, meaning that the data on the volumes does not persist through instance termination. This ephemeral storage is well suited to temporary information that changes frequently, such as HDFS data, as well as buffers, caches, scratch data, and other temporary content that some applications may "spill" to the local file system. Because HDFS resides on ephemeral storage, you can use EMRFS to help ensure that there is a persistent "source of truth" for data stored in Amazon S3. For more information about EMRFS, see EMR File System (EMRFS). For more information about the EC2 instance store, see Amazon EC2 Instance Store in the Amazon EC2 User Guide for Linux Instances.

Note

Amazon EBS storage is available in Amazon EMR release 4.0 and later.

Whether the root device volume uses the instance store or an EBS volume depends on the AMI. Some AMIs are backed by Amazon EC2 instance store, and some are backed by Amazon EBS. For more information, see Amazon EC2 Root Device Volume in the Amazon EC2 User Guide for Linux Instances. To enhance performance, Amazon EMR automatically attaches a 10 GB EBS General Purpose SSD (gp2) volume as the root device for its AMIs. The EBS costs are pro-rated by the hour based on the monthly EBS charges for gp2 volumes in the region where the cluster runs. For example, in a region that charges $0.10/GB-month, the EBS cost of the root volume for each node in your EMR cluster would be approximately $0.00139 per hour ($0.10/GB-month divided by 30 days, divided by 24 hours, times 10 GB).
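The pro-rating arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the calculation described in the text: the function name and the 30-day month are taken from the example, not from any AWS billing API, and actual charges may be computed differently.

```python
def hourly_root_volume_cost(price_per_gb_month, size_gb=10.0, days_per_month=30):
    """Approximate hourly EBS cost of one node's root volume.

    Mirrors the example in the text: monthly price per GB, divided by
    30 days, divided by 24 hours, times the volume size in GB.
    """
    return price_per_gb_month / days_per_month / 24 * size_gb


# Region charging $0.10 per GB-month, default 10 GB gp2 root volume.
cost = hourly_root_volume_cost(0.10)
print(f"${cost:.5f} per hour per node")  # $0.00139 per hour per node
```

To estimate the root-volume cost for a whole cluster, multiply the result by the number of nodes and the number of hours the cluster runs.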

When you configure instance types in Amazon EMR, you can specify additional EBS volumes, which add capacity beyond the instance store (if present) and the default EBS volume. Amazon EBS provides the following volume types: General Purpose (SSD), Provisioned IOPS (SSD), Throughput Optimized (HDD), Cold (HDD), and Magnetic. They differ in performance characteristics and price, so you can tailor your storage to the analytic and business needs of your applications. For example, some applications may need to spill to disk, while others can safely work in memory or with Amazon S3.
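As an illustration of specifying additional EBS volumes, the following Python sketch builds an instance group definition in the shape that the Amazon EMR RunJobFlow API is understood to accept (for example, as an entry in `Instances.InstanceGroups` for boto3's `run_job_flow`). The helper name `ebs_configuration`, the name-to-identifier mapping, and the instance type and sizes are assumptions made for this example and do not come from this guide; check the API reference for the volume types supported by your release before relying on it.

```python
# Assumed mapping from the volume type names in the text to the
# identifiers commonly used by the EBS and EMR APIs.
VOLUME_TYPES = {
    "General Purpose (SSD)": "gp2",
    "Provisioned IOPS (SSD)": "io1",
    "Throughput Optimized (HDD)": "st1",
    "Cold (HDD)": "sc1",
    "Magnetic": "standard",
}


def ebs_configuration(volume_type, size_gb, volumes_per_instance=1, iops=None):
    """Build an EbsConfiguration dictionary (hypothetical helper)."""
    spec = {"VolumeType": VOLUME_TYPES[volume_type], "SizeInGB": size_gb}
    if iops is not None:
        # Iops is only meaningful for Provisioned IOPS volumes.
        spec["Iops"] = iops
    return {
        "EbsBlockDeviceConfigs": [
            {"VolumeSpecification": spec, "VolumesPerInstance": volumes_per_instance}
        ]
    }


# Illustrative core instance group with two additional 100 GB gp2 volumes
# per instance; the instance type and counts are placeholders.
core_group = {
    "Name": "Core",
    "InstanceRole": "CORE",
    "InstanceType": "m4.xlarge",
    "InstanceCount": 2,
    "EbsConfiguration": ebs_configuration(
        "General Purpose (SSD)", 100, volumes_per_instance=2
    ),
}
```

The sketch only constructs the request data; it does not call AWS, so it can be inspected or unit-tested before being passed to a client.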

Amazon EBS works differently within Amazon EMR than it does with regular Amazon EC2 instances. For example, EBS volumes attached to EMR clusters are ephemeral: the volumes are deleted upon cluster and instance termination (for example, when shrinking instance groups), so do not rely on them for data persistence. Although the data on these volumes is ephemeral, data in HDFS may still be replicated, depending on the number and specialization of nodes in the cluster. When you add EBS volumes, they are mounted as additional volumes. They are not part of the boot volume and represent additional storage. YARN is configured to use all the additional volumes, but you are responsible for allocating the additional volumes as local storage (for example, for local log files).

You can attach EBS volumes to instances only at cluster startup time, unless you add an extra task node instance group, in which case you can add EBS volumes to the new group. If an instance in an EMR cluster fails, both the instance and its attached EBS volumes are replaced with new ones. Likewise, if you manually detach an EBS volume, Amazon EMR treats that as a failure and replaces both the instance storage (if applicable) and the EBS volumes.

Other caveats for using Amazon EBS with EMR clusters are:

  • You cannot snapshot an EBS volume used with Amazon EMR.

  • EBS-encrypted volumes are not supported.

  • If you apply tags using the Amazon EMR web service API, those tagging operations are also applied to the EBS volumes.

  • There is a limit of 25 volumes per instance.
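Two of the caveats above lend themselves to a pre-flight check before you request a cluster. The following Python sketch is a hypothetical helper, not part of any AWS SDK. It assumes that the limit of 25 applies to the additional volumes you request per instance; the list does not say whether the root volume counts toward that limit.

```python
MAX_VOLUMES_PER_INSTANCE = 25  # limit stated in the list of caveats


def check_ebs_plan(volume_counts, encrypted=False):
    """Return a list of problems with a planned EBS layout (hypothetical helper).

    volume_counts: number of volumes requested per instance for each
                   volume specification in one instance group.
    encrypted:     whether the plan relies on EBS-encrypted volumes.
    """
    problems = []
    total = sum(volume_counts)
    if total > MAX_VOLUMES_PER_INSTANCE:
        problems.append(
            f"{total} volumes per instance exceeds the limit of "
            f"{MAX_VOLUMES_PER_INSTANCE}"
        )
    if encrypted:
        problems.append("EBS-encrypted volumes are not supported")
    return problems


# Example: 20 + 10 volumes per instance is over the limit.
for problem in check_ebs_plan([20, 10]):
    print(problem)
```

An empty list means that the plan passes both checks; it does not guarantee that the request will succeed.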