Amazon Elastic MapReduce (Amazon EMR) allows you to choose which version of Hadoop to run. You do this using the CLI by setting the --hadoop-version option, as shown in the following table. We recommend using the latest version of Hadoop to take advantage of performance enhancements and new functionality.
| Hadoop Version | Configuration Parameters |
|---|---|
| 1.0.3 | --hadoop-version 1.0.3 --ami-version 2.3 |
| 0.20.205 | --hadoop-version 0.20.205 --ami-version 2.0 |
| 0.20 | --hadoop-version 0.20 --ami-version 1.0 |
| 0.18 | --hadoop-version 0.18 --ami-version 1.0 |
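The table above can be captured in a small shell helper so that a launch script always pairs a Hadoop version with its matching AMI version. The following is a minimal sketch. The function name hadoop_version_args is hypothetical and is not part of the Amazon EMR CLI; it only prints the options from the table and does not contact AWS.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Amazon EMR CLI): print the CLI
# options from the table above for a given Hadoop version.
hadoop_version_args() {
  case "$1" in
    1.0.3)    echo "--hadoop-version 1.0.3 --ami-version 2.3" ;;
    0.20.205) echo "--hadoop-version 0.20.205 --ami-version 2.0" ;;
    0.20)     echo "--hadoop-version 0.20 --ami-version 1.0" ;;
    0.18)     echo "--hadoop-version 0.18 --ami-version 1.0" ;;
    *)        echo "unsupported Hadoop version: $1" >&2; return 1 ;;
  esac
}

# Example: print the options for Hadoop 1.0.3.
hadoop_version_args 1.0.3
```

The printed options could then be appended to a launch command such as the elastic-mapreduce --create example later on this page.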
For details about the default configuration and software available on AMIs used by Amazon EMR, see Choose a Machine Image.
Note
The Asia Pacific (Sydney) Region and AWS GovCloud (US) support only Hadoop 1.0.3 and later. AWS GovCloud (US) additionally requires AMI 2.3.0 and later.
To specify the Hadoop version when creating a cluster with the CLI
Add the --hadoop-version option and specify the version number. The following example creates a waiting cluster running Hadoop 1.0.3. Amazon EMR then launches the appropriate AMI for that version of Hadoop. For details about the version of Hadoop available on an AMI, see AMI Versions Supported in Amazon EMR.
In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.
Linux, UNIX, and Mac OS X users:
./elastic-mapreduce --create --alive --name "Test Hadoop" \
--hadoop-version 1.0.3 \
--num-instances 5 --instance-type m1.small
Windows users:
ruby elastic-mapreduce --create --alive --name "Test Hadoop" --hadoop-version 1.0.3 --num-instances 5 --instance-type m1.small
Hadoop 1.0.3 support in Amazon EMR includes the features listed in Hadoop Common Releases, including:
A RESTful API to HDFS, providing a complete FileSystem implementation for accessing HDFS over HTTP.
Support for executing new writes in HBase while an hflush/sync is in progress.
Performance-enhanced access to local files for HBase.
The ability to run Hadoop, Hive, and Pig jobs as another user, similar to the following:
$ export HADOOP_USER_NAME=usernamehere
By exporting the HADOOP_USER_NAME environment variable, the job is then executed by the specified username.
Note
If HDFS is used, you must either change the permissions on HDFS to allow READ and WRITE access for the specified username, or disable permission checks on HDFS. To disable permission checks, set the configuration variable dfs.permissions to false in the mapred-site.xml file and then restart the namenodes, similar to the following:
<property> <name>dfs.permissions</name> <value>false</value> </property>
The S3 file split size variable is renamed from fs.s3.blockSize to fs.s3.block.size, and its default is set to 64 MB. This is for consistency with the variable name added in patch HADOOP-5861.
Setting access permissions on files written to Amazon S3 is also supported in Hadoop 1.0.3 with Amazon EMR. For more information, see How to write data to an Amazon S3 bucket you don't own.
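Exporting HADOOP_USER_NAME, as described in the list above, affects every later command in the same shell session. As a minimal sketch, the variable can instead be scoped to a single command with the standard env utility. The function name run_as_hadoop_user is hypothetical, and the Hadoop invocation shown in the comment is illustrative only.

```shell
#!/bin/sh
# Hypothetical wrapper: run one command with HADOOP_USER_NAME set,
# leaving the calling shell's environment unchanged.
run_as_hadoop_user() {
  user="$1"
  shift
  env HADOOP_USER_NAME="$user" "$@"
}

# Illustrative use on a cluster node (assumes the hadoop command is on PATH):
#   run_as_hadoop_user usernamehere hadoop fs -ls /

# Demonstration that does not need Hadoop: show the value the child sees.
run_as_hadoop_user usernamehere sh -c 'echo "child sees: $HADOOP_USER_NAME"'
```

Because the variable is set only for the child process, later commands in the same session keep running as the original user.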
For a list of the patches applied to the Amazon EMR version of Hadoop 1.0.3, see Hadoop 1.0.3 Patches.
Hadoop 0.18 was not designed to efficiently handle multiple small files. The following enhancements in Hadoop 0.20 and later improve the performance of processing small files:
Hadoop 0.20 and later assigns multiple tasks per heartbeat. A heartbeat is a periodic check of whether the client is still alive. By assigning multiple tasks per heartbeat, Hadoop can distribute tasks to slave nodes faster, thereby improving performance. The time taken to distribute tasks is a significant part of the total processing time.
Historically, Hadoop processes each task in its own Java Virtual Machine (JVM). If you have many small files that each take only a second to process, the overhead of starting a JVM for each task is significant. Hadoop 0.20 and later can share one JVM across multiple tasks, thus significantly improving your processing time.
Hadoop 0.20 and later allows you to process multiple files in a single map task, which reduces the overhead associated with setting up a task. A single task can now process multiple small files.
Hadoop 0.20 and later also supports the following features:
A new command line option, -libjars, enables you to include a specified JAR file in the class path of every task.
The ability to skip individual records rather than entire files. In previous versions of Hadoop, a failure in record processing caused the entire file containing the bad record to be skipped. Jobs that previously failed can now return partial results.
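The -libjars option described above takes a comma-separated list of JAR files. As a minimal sketch, the following helper joins JAR paths into that form. The function name join_libjars and the JAR and class names in the comment are hypothetical. The sketch also assumes that the job's main class parses Hadoop's generic options (for example through ToolRunner), which -libjars requires.

```shell
#!/bin/sh
# Hypothetical helper: join JAR paths with commas, the form -libjars expects.
join_libjars() {
  joined=""
  for jar in "$@"; do
    if [ -z "$joined" ]; then
      joined="$jar"
    else
      joined="$joined,$jar"
    fi
  done
  printf '%s\n' "$joined"
}

# Illustrative use (myjob.jar, MyJob, and the lib paths are placeholders):
#   hadoop jar myjob.jar MyJob -libjars "$(join_libjars lib/a.jar lib/b.jar)" input output

# Demonstration: print the joined list for two placeholder JARs.
join_libjars lib/a.jar lib/b.jar
```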
In addition to the Hadoop 0.18 streaming parameters, Hadoop 0.20 and later introduces the three new streaming parameters listed in the following table:
| Parameter | Definition |
|---|---|
| -files | Specifies comma-separated files to copy to the map reduce cluster. |
| -archives | Specifies comma-separated archives to restore to the compute machines. |
| -D | Specifies a value for the key you enter, in the form of <key>=<value>. |
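To show how the three parameters in the table fit together, the following sketch assembles a Hadoop streaming command line and prints it without running it. Everything specific here is an assumption for illustration: the function name, the streaming JAR path, the script and archive names, and the input and output paths. The property mapred.job.reuse.jvm.num.tasks is the Hadoop setting for sharing one JVM across tasks, where -1 means no limit. Generic options such as -D, -files, and -archives are placed before the streaming-specific options.

```shell
#!/bin/sh
# Assumed path; the streaming JAR location varies by installation.
STREAMING_JAR="${STREAMING_JAR:-/home/hadoop/contrib/streaming/hadoop-streaming.jar}"

# Hypothetical helper: print (do not run) a streaming command that uses
# -D, -files, and -archives. Generic options come before streaming options.
streaming_command() {
  input="$1"
  output="$2"
  printf '%s ' hadoop jar "$STREAMING_JAR" \
    -D mapred.job.reuse.jvm.num.tasks=-1 \
    -files mapper.py,reducer.py \
    -archives lookup.zip \
    -input "$input" -output "$output" \
    -mapper mapper.py
  printf '%s\n' "-reducer reducer.py"
}

# Placeholder bucket and paths; the command is printed, not executed.
streaming_command s3n://mybucket/input s3n://mybucket/output
```

Printing the command first makes it easy to review the option order before running it on a cluster.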
For a list of the patches applied to the Amazon EMR version of Hadoop 0.20.205, see Hadoop 0.20.205 Patches.