Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

File Systems Compatible with Amazon EMR

Amazon Elastic MapReduce (Amazon EMR) and Hadoop provide a variety of file systems that you can use when processing cluster steps. You specify which file system to use by the prefix of the URI used to access the data. For example, s3://myawsbucket/path references an Amazon S3 bucket using the S3 native file system. The following list describes the available file systems, with recommendations about when it's best to use each.

Amazon EMR and Hadoop typically use two or more of the following file systems when processing a cluster. HDFS and S3N are the two main file systems used with Amazon EMR.

HDFS (prefix: hdfs://, or no prefix)

HDFS is a distributed, scalable, and portable file system for Hadoop. An advantage of HDFS is data awareness between the Hadoop cluster nodes that manage the cluster and those that manage the individual steps. For more information about how HDFS works, see the Hadoop documentation.

HDFS is used by the master and core nodes. Its advantage is that it's fast; its disadvantage is that it's ephemeral storage that is reclaimed when the cluster ends. It's best used for caching the results produced by intermediate job-flow steps.
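
As an illustration of this pattern, the following Java fragment writes the output of an intermediate step to HDFS through the Hadoop FileSystem API. This is a minimal sketch; the class name and the hdfs:///intermediate/part-0 path are hypothetical placeholders.

    import java.io.BufferedWriter;
    import java.io.OutputStreamWriter;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsScratchExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical scratch location; this data is reclaimed when the cluster ends.
            Path scratch = new Path("hdfs:///intermediate/part-0");
            FileSystem fs = scratch.getFileSystem(conf);
            // Overwrite the file if it already exists.
            try (BufferedWriter writer = new BufferedWriter(
                    new OutputStreamWriter(fs.create(scratch, true)))) {
                writer.write("results of an intermediate job-flow step");
            }
        }
    }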

Amazon S3 native (prefix: s3n://)

The Amazon S3 Native File System (S3N) is a file system for reading and writing regular files on Amazon S3. The advantage of this file system is that you can access files on Amazon S3 that were written using other tools.

Important

On Amazon EMR, s3n:// and s3:// both map to the Amazon S3 native file system. In the default configuration of Apache Hadoop, s3:// is mapped to the Amazon S3 block storage system. For more information about how Amazon S3 and Hadoop work together, see Amazon S3 on the Hadoop Wiki.

S3N is a persistent and fault-tolerant file system. It continues to exist after the cluster ends. The disadvantage is that it's slower than HDFS because of the round-trip to Amazon S3. S3N is best used for storing the input to a cluster, the output of the cluster, and the results of intermediate cluster steps where re-computing the step would be onerous.

Note that paths that specify only a bucket name must end with a trailing slash. For example, specify s3n://myawsbucket/ instead of s3n://myawsbucket. Ending with a folder, for example s3n://myawsbucket/myfolder, is also fine.
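
As a minimal sketch of this usage, the following Java fragment points a MapReduce job's input and output at the S3 native file system. The bucket and folder names are hypothetical, and on older Hadoop 1.x clusters you would construct the job with new Job(conf, name) instead of Job.getInstance.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class S3nJobPaths {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "s3n example");
            // Hypothetical bucket and folders; this data persists in Amazon S3
            // after the cluster ends. Note the trailing slashes.
            FileInputFormat.addInputPath(job, new Path("s3n://myawsbucket/input/"));
            FileOutputFormat.setOutputPath(job, new Path("s3n://myawsbucket/output/"));
            // ...set the mapper and reducer classes, then submit the job...
        }
    }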

(Legacy) Amazon S3 block (prefix: s3bfs://)

The Amazon S3 block file system is a legacy file storage system. We strongly discourage its use.

Important

We do not recommend that you use this file system because it can trigger a race condition that might cause your cluster to fail. However, it might be required by legacy applications.

Local file system (no prefix)

The local file system refers to a locally connected disk. When a Hadoop cluster is created, each node is created from an EC2 instance that comes with a preconfigured block of preattached disk storage called an instance store. Data on instance store volumes persists only during the life of its EC2 instance. Instance store volumes are ideal to store temporary data that is continually changing, such as buffers, caches, scratch data, and other temporary content. For more information, see Amazon EC2 Instance Storage.
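
The following Java fragment sketches writing scratch data to the local file system. It assumes the instance store is mounted at /mnt, which is common on Amazon EMR AMIs but should be verified for your AMI.

    import java.io.File;
    import java.io.FileWriter;

    public class LocalScratchExample {
        public static void main(String[] args) throws Exception {
            // Assumption: the instance store is mounted at /mnt on this AMI.
            File scratch = new File("/mnt/var/tmp/scratch-data");
            scratch.getParentFile().mkdirs();
            try (FileWriter writer = new FileWriter(scratch)) {
                writer.write("temporary buffer contents");
            }
            // Anything written here disappears when the EC2 instance terminates.
        }
    }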

Upload Large Files with the S3 Native File System

The S3 native file system imposes a 5 GiB file-size limit. You might need to upload or store files larger than 5 GiB in Amazon S3. Amazon EMR makes this possible by extending the S3 file system through the AWS Java SDK to support multipart uploads. Using this feature of Amazon EMR, you can upload files of up to 5 TiB in size. Multipart upload is disabled by default; to learn how to enable it for your cluster, see Configure Multipart Upload for Amazon S3.
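
As a hypothetical illustration only, the following Java fragment sets a Hadoop property to enable multipart upload programmatically. The property name fs.s3n.multipart.uploads.enabled is an assumption here; verify the exact name and the supported configuration methods in Configure Multipart Upload for Amazon S3.

    import org.apache.hadoop.conf.Configuration;

    public class EnableMultipartUpload {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Assumed property name; confirm it in the multipart upload topic.
            conf.set("fs.s3n.multipart.uploads.enabled", "true");
            System.out.println(conf.get("fs.s3n.multipart.uploads.enabled"));
        }
    }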

Access File Systems

You specify which file system to use by the prefix of the uniform resource identifier (URI) used to access the data. The following procedures illustrate how to reference several different types of file systems.

To access a local HDFS

  • Specify the hdfs:/// prefix in the URI. Amazon EMR resolves paths that do not specify a prefix in the URI to the local HDFS. For example, both of the following URIs resolve to the same location in HDFS, as the sketch after this procedure demonstrates.

    hdfs:///path-to-data
    /path-to-data
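
The following Java sketch shows this equivalence: when HDFS is the default file system, both path forms qualify to the same location. The class name is a placeholder.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LocalHdfsPaths {
        public static void main(String[] args) throws Exception {
            // On an Amazon EMR cluster node, the default file system is HDFS.
            FileSystem fs = FileSystem.get(new Configuration());
            // Both statements print the same fully qualified hdfs:// URI.
            System.out.println(fs.makeQualified(new Path("hdfs:///path-to-data")));
            System.out.println(fs.makeQualified(new Path("/path-to-data")));
        }
    }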
                

To access a remote HDFS

  • Include the IP address of the master node in the URI, as shown in the following examples and in the sketch after this procedure.

    hdfs://master-ip-address/path-to-data
    master-ip-address/path-to-data
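
As a minimal sketch, the following Java fragment opens a connection to a remote HDFS by including the master node's address in the URI. The address 10.0.0.1 is a placeholder for master-ip-address, and /path-to-data is a placeholder path.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RemoteHdfsExample {
        public static void main(String[] args) throws Exception {
            // Placeholder address; substitute the master node's IP address.
            FileSystem fs = FileSystem.get(
                    URI.create("hdfs://10.0.0.1/"), new Configuration());
            for (FileStatus status : fs.listStatus(new Path("/path-to-data"))) {
                System.out.println(status.getPath());
            }
        }
    }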
                

To access the Amazon S3 native file system

  • Use the s3n:// or s3:// prefix. Amazon EMR resolves both of the following URIs to the same location.

    s3n://bucket-name/path-to-file-in-bucket
    s3://bucket-name/path-to-file-in-bucket

    Note

    Because s3:// maps to the S3 native file system on Amazon EMR but to the S3 block file system in standard Apache Hadoop, we recommend that you use the s3n:// prefix to make it explicit that you are using the S3 native file system.

To access the Amazon S3 block file system

  • Use only for legacy applications that require the Amazon S3 block file system. To access or store data with this file system, use the s3bfs:// prefix in the URI.

    The Amazon S3 block file system is a legacy file system that was used to support uploads to Amazon S3 larger than 5 GiB in size. With the multipart upload functionality that Amazon EMR provides through the AWS Java SDK, you can upload files of up to 5 TiB in size to the Amazon S3 native file system, so the Amazon S3 block file system is deprecated.

    Caution

    Because this legacy file system can create race conditions that can corrupt the file system, you should avoid this format and use the Amazon S3 native file system instead.

    s3bfs://bucket-name/path-to-file-in-bucket