Amazon Elastic MapReduce (Amazon EMR) and Hadoop provide a variety of file systems you can use when processing cluster steps. You specify which file system to use by the prefix of the URI used to access the data. For example, s3://myawsbucket/path references an Amazon S3 bucket using the S3 native file system. The following table lists the available file systems, with recommendations on when it's best to use them.
Amazon EMR and Hadoop typically use two or more of the following file systems when processing a cluster. HDFS and S3N are the two main file systems used with Amazon EMR.
HDFS is a distributed, scalable, and portable file system for Hadoop. An advantage of HDFS is data awareness between the Hadoop cluster nodes managing the clusters and the Hadoop cluster nodes managing the individual steps. For more information about how HDFS works, see the Hadoop documentation.
HDFS is used by the master and core nodes. Its advantage is that it's fast; its disadvantage is that it's ephemeral storage that is reclaimed when the cluster ends. It's best used for caching the results produced by intermediate cluster steps.
Amazon S3 native
The Amazon S3 Native File System (S3N) is a file system for reading and writing regular files on Amazon S3. The advantage of this file system is that you can access files on Amazon S3 that were written using other tools.
On Amazon EMR, S3N is a persistent and fault-tolerant file system. It continues to exist after the cluster ends. The disadvantage is that it's slower than HDFS because of the round-trip to Amazon S3. S3N is best used for storing the input to a cluster, the output of the cluster, and the results of intermediate cluster steps where re-computing the step would be onerous.
Note that paths that specify a bucket name only must end with a terminating slash.
For example, specify
(Legacy) Amazon S3 block
The Amazon S3 block file system is a legacy file storage system. We strongly discourage its use because it can trigger a race condition that might cause your cluster to fail. However, it might be required by legacy applications.
Local file system
The local file system refers to a locally connected disk. When a Hadoop cluster is created, each node is created from an EC2 instance that comes with a preconfigured block of preattached disk storage called an instance store. Data on instance store volumes persists only during the life of its EC2 instance. Instance store volumes are ideal to store temporary data that is continually changing, such as buffers, caches, scratch data, and other temporary content. For more information, see Amazon EC2 Instance Storage.
The S3 native file system imposes a 5 GiB file-size limit. You might need to upload or store files larger than 5 GiB with Amazon S3. Amazon EMR makes this possible by extending the S3 file system through the AWS Java SDK to support multipart uploads. Using this feature of Amazon EMR you can upload files of up to 5 TiB in size. Multipart upload is disabled by default; to learn how to enable it for your cluster, see Configure Multipart Upload for Amazon S3.
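The size limits above can be sketched as a small check. This is only an illustration, not part of Amazon EMR: the function name `upload_mode` and its return labels are hypothetical, and it assumes both limits are inclusive.

```python
GIB = 1024 ** 3
TIB = 1024 ** 4

# Limits quoted in the text above; treating them as inclusive is an assumption.
SINGLE_UPLOAD_LIMIT = 5 * GIB
MULTIPART_UPLOAD_LIMIT = 5 * TIB


def upload_mode(size_bytes: int) -> str:
    """Classify a file size against the 5 GiB and 5 TiB limits (hypothetical helper)."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes <= SINGLE_UPLOAD_LIMIT:
        return "single"       # fits without multipart upload
    if size_bytes <= MULTIPART_UPLOAD_LIMIT:
        return "multipart"    # needs multipart upload enabled on the cluster
    return "too_large"        # exceeds the 5 TiB ceiling


if __name__ == "__main__":
    for size in (1 * GIB, 6 * GIB, 6 * TIB):
        print(size, upload_mode(size))
```

A check like this could run before a step writes its output, so that a file over 5 GiB is flagged while multipart upload is still disabled.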
You specify which file system to use by the prefix of the uniform resource identifier (URI) used to access the data. The following procedures illustrate how to reference several different types of file systems.
To access a local HDFS
Specify the hdfs:/// prefix in the URI. Amazon EMR resolves paths that do not specify a prefix in the URI to the local HDFS. For example, both of the following URIs would resolve to the same location in HDFS.
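The resolution rule can be modeled in a few lines of Python. This sketch is not how Amazon EMR implements it: `resolve_to_hdfs` is a hypothetical helper, `/path-to-data` is a placeholder path, and only absolute paths are handled.

```python
from urllib.parse import urlsplit


def resolve_to_hdfs(path: str) -> str:
    """Model the rule that a path without a prefix resolves to the local HDFS.

    Hypothetical helper for illustration; relative paths are rejected
    because the text above does not describe how they resolve.
    """
    if urlsplit(path).scheme:
        return path  # already carries a prefix such as hdfs:// or s3n://
    if not path.startswith("/"):
        raise ValueError("only absolute paths are handled in this sketch")
    return "hdfs://" + path


if __name__ == "__main__":
    # Placeholder path; both forms print the same URI.
    print(resolve_to_hdfs("/path-to-data"))
    print(resolve_to_hdfs("hdfs:///path-to-data"))
```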
To access a remote HDFS
Include the IP address of the master node in the URI as shown in the following examples.
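As a rough sketch of building such a URI, the helper below validates the address before formatting it. The function name `remote_hdfs_uri`, the address `10.0.0.12`, and the path are all hypothetical, and the sketch assumes an IPv4 address with no explicit port.

```python
import ipaddress


def remote_hdfs_uri(master_ip: str, path: str) -> str:
    """Build an hdfs:// URI that names the master node by IP address.

    Hypothetical helper: assumes IPv4 and omits the port.
    """
    # IPv4Address raises a ValueError subclass for malformed input.
    ip = ipaddress.IPv4Address(master_ip)
    return f"hdfs://{ip}/{path.lstrip('/')}"


if __name__ == "__main__":
    # Placeholder address and path.
    print(remote_hdfs_uri("10.0.0.12", "/path-to-data"))
```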
To access the Amazon S3 native file system
Use the s3n:// or s3:// prefix. Amazon EMR resolves both of the URIs below to the same location.
Because the prefix syntax differs between Hadoop running on Amazon EMR and standard Apache Hadoop, we recommend that you use the s3n:// prefix to make it clear that you are using the S3 native file system.
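One way to apply this recommendation consistently is to normalize URIs before passing them to a step. The sketch below is an illustration only: `to_s3n` is a hypothetical helper, and the bucket name is the example bucket used earlier on this page.

```python
def to_s3n(uri: str) -> str:
    """Rewrite an s3:// URI to the equivalent s3n:// form (hypothetical helper).

    On Amazon EMR both prefixes refer to the S3 native file system, so only
    the prefix changes; the bucket and key are left untouched.
    """
    if uri.startswith("s3n://"):
        return uri
    if uri.startswith("s3://"):
        return "s3n://" + uri[len("s3://"):]
    raise ValueError("expected an s3:// or s3n:// URI")


if __name__ == "__main__":
    print(to_s3n("s3://myawsbucket/path"))
    print(to_s3n("s3n://myawsbucket/path"))
```

The helper deliberately rejects other prefixes, such as s3bfs://, rather than guessing how to convert them.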
To access the Amazon S3 block file system
Use only for legacy applications that require the Amazon S3 block file system. To access or store data with this file system, use the s3bfs:// prefix in the URI.
The Amazon S3 block file system is a legacy file system that was used to support uploads to Amazon S3 that were larger than 5 GiB in size. With the multipart upload functionality Amazon EMR provides through the AWS Java SDK, you can upload files of up to 5 TiB in size to the Amazon S3 native file system, so the Amazon S3 block file system is deprecated.
Because this legacy file system can create race conditions that can corrupt the file system, you should avoid it and use the Amazon S3 native file system instead.