| « PreviousNext » | |
![]() ![]() ![]() | Did this page help you? Yes | No | Tell us about it... |
Amazon Elastic MapReduce (Amazon EMR) and Hadoop provide a variety of file systems you can use when processing cluster steps. You specify which file system to use by the prefix of the URI used to access the data. For example, s3://myawsbucket/path references an Amazon S3 bucket using the S3 native file system. The following table lists the available file systems, with recommendations on when it’s best to use them.
Amazon EMR and Hadoop typically use two or more of the following file systems when processing a cluster. HDFS and S3N are the two main file systems used with Amazon EMR.
| File System | Prefix | Description |
|---|---|---|
| HDFS | hdfs:// or no prefix |
HDFS is a distributed, scalable, and portable file system for Hadoop. An advantage of HDFS is data awareness between the Hadoop cluster nodes managing the clusters and the Hadoop cluster nodes managing the individual steps. For more information about how HDFS works, see http://hadoop.apache.org/docs/stable/. HDFS is used by the master and core nodes. Its advantage is that it’s fast; its disadvantage is that it’s ephemeral storage which is reclaimed when the cluster ends. It’s best used for caching the results produced by intermediate job-flow steps. |
| Amazon S3 native | s3:// or s3n:// |
The Amazon S3 Native File System (S3N) is a file system for reading and writing regular files on Amazon S3. The advantage of this file system is that you can access files on Amazon S3 that were written with other tools. For information on how Amazon S3 and Hadoop work together, see http://wiki.apache.org/hadoop/AmazonS3. Amazon S3 native is a persistent and fault-tolerant file system. It continues to exist after the cluster ends. Its disadvantage is that it’s slower than HDFS because of the round-trip to Amazon S3. It’s best used for storing the input to a cluster, the output of the cluster, and the results of intermediate cluster steps where re-computing the step would be onerous. Note Paths that specify only a bucket name must end with a terminating slash. In other words, all Amazon S3 URIs must have at least three slashes. For example, specify s3n://myawsbucket/ instead of s3n://myawsbucket. The URI s3n://myawsbucket/myfolder, however, is also valid. |
| Amazon S3 block | s3bfs:// |
The Amazon S3 Block File System Files is a legacy file storage system. We strongly discourage the use of this system. Amazon S3 block is a deprecated file system that is not recommended because it can trigger a race condition that might cause your cluster to fail. It may be required by legacy applications. |
| local file system |
The local file system refers to a locally connected disk. When a Hadoop cluster is created, each node is created from an EC2 instance which comes with a preconfigured block of preattached disk storage called an Amazon EC2 local instance store. Data on instance store volumes persists only during the life of the associated EC2 instance. The amount of this disk storage varies by EC2 instance type. It is ideal for temporary storage of information that is continually changing, such as buffers, caches, scratch data, and other temporary content. For more information about EC2 instances, see Amazon Elastic Compute Cloud. |
Note
The configuration of Hadoop running on Amazon EMR differs from the default configuration provided by Apache Hadoop. On Amazon EMR, s3n:// and s3:// both map to the Amazon S3 native file system, while in the default configuration provided by Apache Hadoop s3:// is mapped to the Amazon S3 block storage system.
The S3 native file system imposes a 5 GiB file-size limit. You might need to upload or store files larger than 5 GiB with Amazon S3. Amazon EMR makes this possible by extending the S3 file system through the AWS Java SDK to support multipart uploads. Using this feature of Amazon EMR you can upload files of up to 5 TiB in size. Multipart upload is disabled by default; to learn how to enable it for your cluster, see the section called “Configure Multipart Upload for Amazon S3”.
You specify which file system to use by the prefix of the uniform resource identifier (URI) used to access the data. The following procedures illustrate how to reference several different types of file systems.
To access a local HDFS
Specify the hdfs:/// prefix in the URI. Amazon EMR resolves paths that do not specify a prefix in the URI to the local HDFS. For example, both of the following URIs would resolve to the same location in HDFS.
hdfs:///path-to-data/path-to-data
To access a remote HDFS
Include the IP address of the master node in the URI as shown in the following examples.
hdfs://master-ip-address/path-to-datamaster-ip-address/path-to-data
To access the Amazon S3 native file system
Use the s3n:// or s3:// prefix. Amazon EMR resolves both of the URIs below to the same location.
s3n://bucket-name/path-to-file-in-buckets3://bucket-name/path-to-file-in-bucket
Note
Because of the file syntax difference between Hadoop running on Amazon EMR and standard Apache Hadoop, it is recommended that you use the s3n:// prefix to highlight the fact that you are using the S3 native file system.
To access the Amazon S3 block file system
Use only for legacy applications that require the Amazon S3 block file system. To access or store data with this file system, use the s3bfs:// prefix in the URI.
The Amazon S3 block file system is a legacy file system that was used to support uploads to Amazon S3 that were larger than 5 GiB in size. With the multipart upload functionality Amazon EMR provides through the AWS Java SDK, you can upload files of up to 5 TiB in size to the Amazon S3 native file system, and the Amazon S3 block file system is deprecated.
Caution
Because this legacy file system can create race conditions that can corrupt the file system, you should avoid this format and use the Amazon S3 native file system instead.
s3bfs://bucket-name/path-to-file-in-bucket