Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

Choose a Machine Image

Amazon Elastic MapReduce (Amazon EMR) uses Amazon Machine Images (AMIs) to initialize the EC2 instances it launches to run a cluster. The AMIs contain the Linux operating system, Hadoop, and other software used to run the cluster. These AMIs are specific to Amazon EMR and can be used only in the context of running a cluster. Periodically, Amazon EMR updates these AMIs with new versions of Hadoop and other software, so users can take advantage of improvements and new features.

For general information about AMIs, go to Using AMIs in the Amazon Elastic Compute Cloud User Guide. For details about the software versions included in the Amazon EMR AMIs, go to the section called “AMI Versions Supported in Amazon EMR”.

If your application depends on a specific version or configuration of Hadoop, you might want to delay upgrading to a new AMI until you have tested your application on it. AMI versioning lets you specify which AMI version your cluster uses to launch EC2 instances.

Specifying the AMI version during cluster creation is optional; if you do not provide an AMI-version parameter, and you are using the CLI, your clusters will run on the most recent AMI version. This means you always have the latest software running on your clusters, but you must ensure that your application will work with new changes as they are released.

If you specify an AMI version when you create a cluster, your instances will be created using that AMI. This provides stability for long-running or mission-critical applications. The trade-off is that your application will not have access to new features on more up-to-date AMI versions.

AMI Version Numbers

AMI version numbers are composed of three parts: major-version.minor-version.patch. The current version of the Amazon EMR CLI provides three ways to specify which version of the AMI to use to launch your cluster.

  • Fully specified: If you specify the AMI version using all three parts (for example, --ami-version 2.0.1), your cluster is launched on exactly that version. The preceding example would launch a cluster using AMI 2.0.1. This is useful if you are running an application that depends on a specific AMI version and you want to ensure that this AMI version is the one used to launch your clusters. The downside is that you do not benefit from new features and improvements released in subsequent AMIs.

  • Major-minor version specified: If you specify just the major and minor version for the AMI (for example, --ami-version 2.0), your cluster is launched on the AMI that matches those specifications and has the latest patches. The preceding example would launch a cluster using AMI 2.0.4, since .4 is the latest patch for the 2.0 AMI series that is not deprecated. This option provides a measure of stability in the AMI version while ensuring that you receive the benefits of new patches and bug fixes.

  • Latest version specified: If you use the keyword latest instead of a version number for the AMI (for example, --ami-version latest), the cluster is launched with the latest version available. This is the most dynamic way to run your clusters, as AMIs are updated regularly. This configuration is best for prototyping and testing, and is not recommended for production environments.
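The three forms above can be sketched in a few lines of Python. This is only an illustration of the selection rules as described, not the service's actual implementation: the function names, the sample version list, and the choice to reject a fully specified deprecated version are assumptions made for the example.

```python
from typing import Iterable, Optional, Set, Tuple

Version = Tuple[int, int, int]


def parse_version(text: str) -> Version:
    """Turn 'major.minor.patch' into a tuple of ints so versions sort numerically."""
    major, minor, patch = (int(part) for part in text.split("."))
    return (major, minor, patch)


def resolve_ami_version(spec: str, available: Iterable[str],
                        deprecated: Optional[Set[str]] = None) -> str:
    """Hypothetical helper: pick the AMI version an --ami-version value would select."""
    deprecated = deprecated or set()
    # Deprecated versions cannot be used for new clusters, so drop them first.
    usable = sorted((v for v in available if v not in deprecated), key=parse_version)
    if not usable:
        raise ValueError("no usable AMI versions")

    if spec == "latest":
        return usable[-1]

    parts = spec.split(".")
    if len(parts) == 3:
        # Fully specified: exactly that version or nothing.
        if spec not in usable:
            raise ValueError("AMI version %s is not available" % spec)
        return spec
    if len(parts) == 2:
        # Major-minor: newest non-deprecated patch in that series.
        matches = [v for v in usable if v.split(".")[:2] == parts]
        if not matches:
            raise ValueError("no AMI version matches %s" % spec)
        return matches[-1]
    raise ValueError("unrecognized AMI version spec: %s" % spec)


# Sample data only; the real list of AMI versions changes over time.
SAMPLE_VERSIONS = ["2.0.0", "2.0.1", "2.0.2", "2.0.3", "2.0.4", "2.0.5", "2.1.0"]
SAMPLE_DEPRECATED = {"2.0.5"}

print(resolve_ami_version("2.0.1", SAMPLE_VERSIONS, SAMPLE_DEPRECATED))   # 2.0.1
print(resolve_ami_version("2.0", SAMPLE_VERSIONS, SAMPLE_DEPRECATED))     # 2.0.4
print(resolve_ami_version("latest", SAMPLE_VERSIONS, SAMPLE_DEPRECATED))  # 2.1.0
```

Comparing versions as integer tuples rather than strings keeps a patch such as 2.0.10 ordered after 2.0.9.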

Default AMI and Hadoop Versions

If you don't specify the AMI for the cluster, Amazon EMR launches your cluster with the default version. The default versions returned depend on the interface you use to launch the cluster.

Note

The default AMI is unavailable in the Asia Pacific (Sydney) Region. Use --ami-version latest to specify the latest AMI for that region instead.

Interface                                  Default AMI and Hadoop versions
Amazon EMR console                         Latest AMI and Hadoop versions
API                                        AMI 1.0, Hadoop 0.18
SDK                                        AMI 1.0, Hadoop 0.18
CLI (version 2012-07-30 and later)         Latest AMI and Hadoop versions
CLI (versions 2011-12-08 to 2012-07-09)    AMI 2.1.3, Hadoop 0.20.205
CLI (version 2011-12-11 and earlier)       AMI 1.0, Hadoop 0.18

To determine which version of the CLI you have installed

  • In the directory where you installed the CLI, run the following from the command line:

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --version
    • Windows users:

      ruby elastic-mapreduce --version

Specifying the AMI Version for a New Cluster

You can specify which AMI version a new cluster should use when you create it. For details about the default configuration and applications available on AMI versions, see AMI Versions Supported in Amazon EMR.

To specify an AMI version using the CLI

  • When creating a cluster using the CLI, add the --ami-version parameter. If you do not specify this parameter, or if you specify --ami-version latest, the most recent version of the AMI is used.

    The following example specifies the AMI completely and will launch a cluster on AMI 2.0.1.

    In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --name "Static AMI Version" \
      --ami-version 2.0.1 \
      --num-instances 5 --instance-type m1.small  
    • Windows users:

      ruby elastic-mapreduce --create --alive --name "Static AMI Version" --ami-version 2.0.1 --num-instances 5 --instance-type m1.small  

    The following example specifies the AMI using just the major and minor version. It launches the cluster on the AMI that matches those specifications and has the latest patches. This example would launch a cluster using AMI 2.0.4, since .4 is the latest patch for the 2.0 AMI series that is not deprecated.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --name "Major-Minor AMI Version" \
      --ami-version 2.0 \
      --num-instances 5 --instance-type m1.small  
    • Windows users:

      ruby elastic-mapreduce --create --alive --name "Major-Minor AMI Version" --ami-version 2.0 --num-instances 5 --instance-type m1.small  

    The following example specifies that the cluster should be launched with the most current version available.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --name "Latest AMI Version" \
      --ami-version latest \
      --num-instances 5 --instance-type m1.small 
    • Windows users:

      ruby elastic-mapreduce --create --alive --name "Latest AMI Version" --ami-version latest --num-instances 5 --instance-type m1.small 

To specify an AMI version using the API

  • When creating a cluster using the API, add the AmiVersion and HadoopVersion parameters to the request string, as shown in the following example. If you do not specify these parameters, Amazon EMR creates the cluster using AMI version 1.0 and Hadoop 0.18. For more information, go to RunJobFlow in the Amazon Elastic MapReduce API Reference.

    https://elasticmapreduce.amazonaws.com?Operation=RunJobFlow
    &Name=MyJobFlowName
    &LogUri=s3n%3A%2F%2Fmybucket%2Fsubdir
    &AmiVersion=1.0
    &HadoopVersion=0.20
    &Instances.MasterInstanceType=m1.small
    &Instances.SlaveInstanceType=m1.small
    &Instances.InstanceCount=4
    &Instances.Ec2KeyName=myec2keyname
    &Instances.Placement.AvailabilityZone=us-east-1a
    &Instances.KeepJobFlowAliveWhenNoSteps=true
    &Steps.member.1.Name=MyStepName
    &Steps.member.1.ActionOnFailure=CONTINUE
    &Steps.member.1.HadoopJarStep.Jar=MyJarFile
    &Steps.member.1.HadoopJarStep.MainClass=MyMainClass
    &Steps.member.1.HadoopJarStep.Args.member.1=arg1
    &Steps.member.1.HadoopJarStep.Args.member.2=arg2
    &AuthParams	
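As a rough sketch, the version-related part of such a request string could be assembled in Python as follows. The helper name and its defaults are hypothetical, the parameter names are copied from the example above, and the authentication parameters (shown as AuthParams above) are deliberately left out, so the resulting URL is not a complete, signed request.

```python
from urllib.parse import urlencode

# Endpoint as shown in the example request above.
ENDPOINT = "https://elasticmapreduce.amazonaws.com"


def build_run_job_flow_query(name, log_uri, ami_version, hadoop_version,
                             instance_type="m1.small", instance_count=4):
    """Hypothetical helper: build an unsigned RunJobFlow query string."""
    params = [
        ("Operation", "RunJobFlow"),
        ("Name", name),
        ("LogUri", log_uri),
        # The two parameters that pin the software versions for the cluster.
        ("AmiVersion", ami_version),
        ("HadoopVersion", hadoop_version),
        ("Instances.MasterInstanceType", instance_type),
        ("Instances.SlaveInstanceType", instance_type),
        ("Instances.InstanceCount", str(instance_count)),
        ("Instances.KeepJobFlowAliveWhenNoSteps", "true"),
    ]
    # urlencode percent-encodes reserved characters such as ':' and '/'.
    return ENDPOINT + "?" + urlencode(params)


url = build_run_job_flow_query(
    name="MyJobFlowName",
    log_uri="s3n://mybucket/subdir",
    ami_version="1.0",
    hadoop_version="0.20",
)
print(url)
```

Building the string from a list of pairs keeps the parameter order stable and avoids hand-encoding values such as the log URI.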
    

Check the AMI Version of a Running Cluster

If you need to find out which AMI version a cluster is running, you can retrieve this information using the console, the CLI, or the API.

To check the current AMI version using the console

  1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/vnext/.

  2. Click on a cluster. The Ami Version and other details about the cluster are displayed in the Summary pane.

To check the current AMI version using the CLI

  • Use the --describe parameter to retrieve the AMI version on a cluster. In the following example JobFlowID is the identifier of the cluster. The AMI version will be returned along with other information about the cluster.

    In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --describe --jobflow JobFlowID
    • Windows users:

      ruby elastic-mapreduce --describe --jobflow JobFlowID

To check the current AMI version using the API

  • Call DescribeJobFlows to check which AMI version a cluster is using. The version will be returned as part of the response data, as shown in the following example. For the complete response syntax, go to DescribeJobFlows in the Amazon Elastic MapReduce API Reference.

    <DescribeJobFlowsResponse xmlns="http://elasticmapreduce.amazonaws.com/doc/2009-03-31">
       <DescribeJobFlowsResult> 
          <JobFlows> 
             <member>
    		...
                <AmiVersion>
                   2.1.3
                </AmiVersion>
    		...
             </member>
          </JobFlows> 
       </DescribeJobFlowsResult> 
       <ResponseMetadata>
          <RequestId> 
             9cea3229-ed85-11dd-9877-6fad448a8419 
          </RequestId>
       </ResponseMetadata> 
    </DescribeJobFlowsResponse> 	
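A response shaped like the one above can be read with Python's standard XML parser. The following is a minimal sketch: the function name is hypothetical, the sample document is abbreviated, and the code matches on the element's local name so that it does not depend on the exact namespace URI.

```python
import xml.etree.ElementTree as ET


def extract_ami_versions(response_xml):
    """Hypothetical helper: return the AmiVersion value of each job flow in the response."""
    root = ET.fromstring(response_xml.strip())
    versions = []
    for element in root.iter():
        # Tags look like '{namespace}AmiVersion'; compare only the local name.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "AmiVersion" and element.text:
            # The value may be surrounded by whitespace and newlines.
            versions.append(element.text.strip())
    return versions


# Abbreviated sample response, for illustration only.
SAMPLE_RESPONSE = """
<DescribeJobFlowsResponse xmlns="http://elasticmapreduce.amazonaws.com/doc/2009-03-31">
   <DescribeJobFlowsResult>
      <JobFlows>
         <member>
            <AmiVersion>
               2.1.3
            </AmiVersion>
         </member>
      </JobFlows>
   </DescribeJobFlowsResult>
</DescribeJobFlowsResponse>
"""

print(extract_ami_versions(SAMPLE_RESPONSE))  # ['2.1.3']
```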
    

Amazon EMR AMIs and Hadoop Versions

An AMI can contain multiple versions of Hadoop. If the AMI you specify has multiple versions of Hadoop available, you can select the version of Hadoop you want to run as described in Hadoop Configuration Reference. You cannot specify a Hadoop version that is not available on the AMI. For a list of the versions of Hadoop supported on each AMI, go to AMI Versions Supported in Amazon EMR.

Amazon EMR AMI Deprecation

Eighteen months after an AMI version is released, the Amazon EMR team might choose to deprecate that AMI version and no longer support it. In addition, the Amazon EMR team might deprecate an AMI before eighteen months have elapsed if a security risk or other issue is identified in the software or operating system of the AMI. If a cluster is running when its AMI is deprecated, the cluster is not affected. You will not, however, be able to create new clusters with the deprecated AMI version. The best practice is to plan for AMI obsolescence and move to new AMI versions as soon as is practical for your application.

Before an AMI is deprecated, the Amazon EMR team will send out an announcement specifying the date on which the AMI version will no longer be supported.
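For planning purposes, the eighteen-month mark for a given release date can be computed with a short Python sketch like the one below. The function names are hypothetical, and the result is only the date after which routine deprecation becomes possible under the policy described above; an AMI may be deprecated earlier, or not at all.

```python
import calendar
from datetime import date


def add_months(start, months):
    """Return the date `months` calendar months after `start`, clamping the day if needed."""
    total = start.year * 12 + (start.month - 1) + months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    # Clamp to the last day of the target month (for example, 30 Aug -> 28 Feb).
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def eighteen_month_mark(release_date):
    """Hypothetical helper: date on which an AMI release becomes eighteen months old."""
    return add_months(release_date, 18)


# A release dated 27 March 2014 reaches the eighteen-month mark on 27 September 2015.
print(eighteen_month_mark(date(2014, 3, 27)))  # 2015-09-27
```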

AMI Versions Supported in Amazon EMR

Amazon EMR supports the AMI versions listed in the following table. You can specify the AMI version to use when you create a cluster. If you do not specify an AMI version, Amazon EMR creates the cluster using the default AMI version. For information about default AMI configurations, see Default AMI and Hadoop Versions.

AMI Version    Description    Release Date
2.4.5

This Amazon EMR AMI version provides the following features:

  • Adds support for AWS SDK 1.7.0.

  • Adds support for Python 2.7.

  • Adds support for Hive 0.11.0.2.

  • Upgrades Protobuf to version 2.5.

    Note

    The upgrade to Protobuf 2.5 requires you to regenerate and recompile any of your Java code that was previously generated by the protoc tool.

  • Updates the Java version to 7u60 (early access release). For more information, go to JDK 7 Update 60 Early Access Release.

  • Updates Jetty to version 6.1.26.emr.1, which fixes the Hadoop MapReduce issue MAPREDUCE-2980.

  • Fixes an issue encountered when no log-uri is specified at cluster creation.

  • Fixes version utility to accurately display Amazon Hadoop Distribution version.

  • Other improvements and bug fixes.

27 March 2014
3.0.4

This Amazon EMR AMI version provides the following features:

  • Adds a connector for Amazon Kinesis, which allows users to process streaming data using standard Hadoop and ecosystem tools within Amazon EMR clusters. For more information, see Analyze Amazon Kinesis Data.

  • Fixes an issue in the yarn-site.xml configuration file, which resulted in the JobHistory server not being fully configured.

  • Adds support for AWS SDK 1.7.0.

Software versions:

  • Hadoop 2.2.0

  • Hive 0.11.0.2

  • Pig 0.11.1.1

  • Impala 1.2.1

  • Python 2.7.5

  • R 3.0.1

19 February 2014
3.0.3

This Amazon EMR AMI version provides the following features:

  • Adds support for AWS SDK 1.6.10.

  • Upgrades HttpClient to version 4.2 to be compatible with AWS SDK 1.6.10.

  • Fixes a problem related to orphaned Amazon EBS volumes.

  • Adds support for Hive 0.11.0.2.

  • Upgrades Protobuf to version 2.5.

    Note

    The upgrade to Protobuf 2.5 requires you to regenerate and recompile any of your Java code that was previously generated by the protoc tool.

11 February 2014
2.4.3

This Amazon EMR AMI version provides the following features:

  • Adds support for Python 2.7.

  • Updates Jetty to version 6.1.26.emr.1, which fixes the Hadoop MapReduce issue MAPREDUCE-2980.

  • Updates the Java version to 7u60 (early access release). For more information, go to JDK 7 Update 60 Early Access Release.

  • Adds support for Hive 0.11.0.2.

  • Upgrades Protobuf to version 2.5.

    Note

    The upgrade to Protobuf 2.5 requires you to regenerate and recompile any of your Java code that was previously generated by the protoc tool.

3 January 2014
3.0.2

This Amazon EMR AMI version provides the following features:

  • Adds support for Impala 1.2.1 with Hadoop 2. For more information, see Analyze Data with Impala.

  • Changes the uploadMultiParts function to use a retry policy.

12 December 2013
3.0.1

This Amazon EMR AMI version provides the following features:

  • Adds support for viewing Hadoop 2 task attempt logs in the EMR console.

  • Fixes an issue with R 3.0.1.

  • Includes Hadoop 2.2.0, Mahout 0.8, Perl 5.10.1, PHP 5.3.27, Python 2.6.8, Python 2.7.5, R 3.0.1, Ruby 1.9.3, and Oracle/Sun jdk-7u45.

8 November 2013
3.0.0

This new major Amazon EMR AMI version provides the following features:

  • This Amazon EMR AMI is based on the Amazon Linux Release 2012.09. For more information, see Amazon Linux AMI 2012.09 Release Notes.

  • Adds support for Hadoop 2.2.0. For more information, see Supported Hadoop Versions.

  • Adds support for HBase 0.94.7. For more information, go to the Apache HBase web site.

  • Adds Java 7 support for Hadoop, HBase, and Pig.

  • Includes Hadoop 2.2.0, Mahout 0.8, Perl 5.10.1, PHP 5.3.27, Python 2.6.8, Python 2.7.5, R 3.0.1, Ruby 1.9.3, and Oracle/Sun jdk-7u45.

28 October 2013
2.4.2

Same as the previous AMI version, with the following additions:

  • Fixes a bug in host resolution that limited map-side local data optimization. Customers who use Fair Scheduler may observe a change in job execution due to the emphasis the system puts on data locality. The scheduler may now hold back tasks to run them locally.

  • Includes Hadoop 1.0.3, Java 1.7, Perl 5.10.1, Python 2.6.6, and R 2.11

7 October 2013
2.4.1

Same as the previous AMI version, with the following additions:

  • Fixes a bug that causes the HBase shell not to work properly.

  • Fixes a bug that causes some clusters to fail with the error ‘concurrent modifications exception’.

  • Adds new logic in the instance controller to detect and reboot instances that have been blacklisted by Hadoop for an extended period of time.

  • Includes Hadoop 1.0.3, Java 1.7, Perl 5.10.1, Python 2.6.6, and R 2.11

20 August 2013
2.4

Same as the previous AMI version, with the following additions:

  • Adds support for Java 7 with Hadoop and HBase. Other Amazon EMR features, such as Hive and Pig, continue to require Java 6.

  • Improves JobTracker detection and response time when reducers become stuck due to a problematic mapper.

  • Fixes a problem in which some Hadoop reducers were unable to fetch map output data due to a bad mapper, causing job delays.

  • Adds FetchStatusMap to keep track of all fetch errors and successes along with their timestamps.

  • Fixes a problem with "Text File Busy" errors when launching tasks. For more information, go to MAPREDUCE-2374.

1 August 2013
2.3.6

Same as 2.3.5, with the following additions:

  • Fixes a problem in the Debian sources.list and preferences files that caused certain bootstrap actions to fail, including Ganglia. Customers using AMI versions 2.0.0 to 2.3.5 may notice an additional bootstrap action in their list named EMR Debian Patch.

17 May 2013
2.3.5

Same as 2.3.3, with the following additions:

  • Fixes an S3DistCp bug which created invalid manifest file entries for certain URL encoded file names.

  • Improves log pushing functionality and adds a 7 day retention policy for on-cluster log files. Log files not modified for 7 or more days are deleted from the cluster.

  • Adds a streaming configuration option for not emitting the mapper key. For more information, go to MAPREDUCE-1785.

  • Adds the --s3ServerSideEncryption option to the S3DistCp tool. For more information, see S3DistCp Options.

26 April 2013
2.3.4

Deprecated

16 April 2013
2.3.3

Same as 2.3.2, with the following additions:

  • Improved CloudWatch LiveTaskTracker metric to take into account expired Hadoop TaskTrackers and minor improvements in Hadoop.

01 March 2013
2.3.2

Same as 2.3.1, with the following additions:

  • Fixes an issue which prevented customers from using the debugging feature in the Amazon EMR console.

07 February 2013
2.3.1

Same as 2.3.0, with the following additions:

  • Improves support for clusters running on hs1.8xlarge instances.

24 December 2012
2.3.0

Same as 2.2.4, with the following additions:

20 December 2012
2.2.4

Same as 2.2.3, with the following additions:

  • Improves error handling in the Snappy decompressor. For more information, go to HADOOP-8151.

  • Fixes an issue with MapFile.Reader reading LZO or Snappy compressed files. For more information, go to HADOOP-8423.

  • Updates the kernel to the AWS version of 3.2.30-49.59.

6 December 2012
2.2.3

Same as 2.2.1, with the following additions:

  • Improves HBase backup functionality.

  • Updates the AWS SDK for Java to version 1.3.23.

  • Resolves issues with the job tracker user interface.

  • Improves Amazon S3 file system handling in Hadoop.

  • Improves NameNode functionality in Hadoop.

30 November 2012
2.2.2

Deprecated

23 November 2012
2.2.1

Same as 2.2.0, with the following additions:

  • Fixes an issue with HBase backup functionality.

  • Enables multipart upload by default for files larger than the Amazon S3 block size specified by fs.s3n.blockSize. For more information, see Configure Multipart Upload for Amazon S3.

30 August 2012
2.2.0

Same as 2.1.3, with the following additions:

  • Adds support for Hadoop 1.0.3.

  • No longer includes Hadoop 0.18 and Hadoop 0.20.205.

Operating system: Debian 6.0.5 (Squeeze)

Applications: Hadoop 1.0.3, Hive 0.8.1.3, Pig 0.9.2.2, HBase 0.92.0

Languages: Perl 5.10.1, PHP 5.3.3, Python 2.6.6, R 2.11.1, Ruby 1.8.7

File system: ext3 for root, xfs for ephemeral

Kernel: Amazon Linux

6 August 2012
2.1.4

Same as 2.1.3, with the following additions:

30 August 2012
2.1.3

Same as 2.1.2, with the following additions:

  • Fixes issues in HBase.

6 August 2012
2.1.2

Same as 2.1.1, with the following additions:

  • Adds support for CloudWatch metrics when using MapR.

  • Improves the reliability of reporting metrics to CloudWatch.

6 August 2012
2.1.1

Same as 2.1.0, with the following additions:

  • Improves the reliability of log pushing.

  • Adds support for HBase in Amazon VPC.

  • Improves DNS retry functionality.

3 July 2012
2.1.0

Same as AMI 2.0.5, with the following additions:

  • Supports launching HBase clusters. For more information, see Store Data with HBase.

  • Supports running MapR Edition M3 and Edition M5. For more information, see Using the MapR Distribution for Hadoop.

  • Enables HDFS append by default; dfs.support.append is set to true in hdfs/hdfs-default.xml. The default value in code is also set to true.

  • Fixes a race condition in instance controller.

  • Changes mapreduce.user.classpath.first to default to true. This configuration setting indicates whether to load classes first from the cluster's JAR file or the Hadoop system lib directory. This change was made to provide a way for you to easily override classes in Hadoop.

  • Uses Debian 6.0.5 (Squeeze) as the operating system.

12 June 2012
2.0.5

Note

Because of an issue with AMI 2.0.5, this version is deprecated. We recommend that you use a different AMI version instead.

Same as AMI 2.0.4, with the following additions:

  • Improves Hadoop performance by reinitializing the recycled compressor object for mappers only if they are configured to use the GZip compression codec for output.

  • Adds a configuration variable to Hadoop called mapreduce.jobtracker.system.dir.permission that can be used to set permissions on the system directory. For more information, see Setting Permissions on the System Directory.

  • Changes InstanceController to use an embedded database rather than the MySQL instance running on the box. MySQL remains installed and running by default.

  • Improves the collectd configuration. For more information about collectd, go to http://collectd.org/.

  • Fixes a rare race condition in InstanceController.

  • Changes the default shell from dash to bash.

  • Uses Debian 6.0.4 (Squeeze) as the operating system.

19 April 2012
2.0.4

Same as AMI 2.0.3, with the following additions:

  • Changes the default for fs.s3n.blockSize to 33554432 (32 MiB).

  • Fixes a bug in reading zero-length files from Amazon S3.

30 January 2012
2.0.3

Same as AMI 2.0.2, with the following additions:

  • Adds support for Amazon EMR metrics in CloudWatch.

  • Improves performance of seek operations in Amazon S3.

24 January 2012
2.0.2

Same as AMI 2.0.1, with the following additions:

  • Adds support for the Python API Dumbo. For more information about Dumbo, go to https://github.com/klbostee/dumbo/wiki/.

  • The AMI now runs the Network Time Protocol Daemon (NTPD) by default. For more information about NTPD, go to http://en.wikipedia.org/wiki/Ntpd.

  • Updates the Amazon Web Services SDK to version 1.2.16.

  • Improves the way Amazon S3 file system initialization checks for the existence of Amazon S3 buckets.

  • Adds support for configuring the Amazon S3 block size to facilitate splitting files in Amazon S3. You set this in the fs.s3n.blockSize parameter. You set this parameter by using the configure-hadoop bootstrap action. The default value is 9223372036854775807 (8 EiB).

  • Adds a /dev/sd symlink for each /dev/xvd device. For example, /dev/xvdb now has a symlink pointing to it called /dev/sdb. Now you can use the same device names for AMI 1.0 and 2.0.

17 January 2012
2.0.1

Same as AMI 2.0 except for the following bug fixes:

  • Task attempt logs are pushed to Amazon S3.

  • Fixed /mnt mounting on 32-bit AMIs.

  • Uses Debian 6.0.3 (Squeeze) as the operating system.

19 December 2011
2.0.0

Operating system: Debian 6.0.2 (Squeeze)

Applications: Hadoop 0.20.205, Hive 0.7.1, Pig 0.9.1

Languages: Perl 5.10.1, PHP 5.3.3, Python 2.6.6, R 2.11.1, Ruby 1.8.7

File system: ext3 for root, xfs for ephemeral

Kernel: Amazon Linux

Note: Added support for the Snappy compression/decompression library.

11 December 2011
1.0.1

Same as AMI 1.0 except for the following change:

  • Updates sources.list to the new location of the Lenny distribution in archive.debian.org.

3 April 2012
1.0.0

Operating system: Debian 5.0 (Lenny)

Applications: Hadoop 0.20 and 0.18 (default); Hive 0.5, 0.7 (default), 0.7.1; Pig 0.3 (on Hadoop 0.18), 0.6 (on Hadoop 0.20)

Languages: Perl 5.10.0, PHP 5.2.6, Python 2.5.2, R 2.7.1, Ruby 1.8.7

File system: ext3 for root and ephemeral

Kernel: Red Hat

Note: This was the last AMI released before the CLI was updated to support AMI versioning. For backward compatibility, job flows launched with versions of the CLI downloaded before 11 December 2011 use this version.

26 April 2011

Note

The cc2.8xlarge instance type is supported only on AMI 2.0.0 or later. The hi1.4xlarge and hs1.8xlarge instance types are supported only on AMI 2.3 or later.
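The constraints in this note can be expressed as a small lookup, sketched here in Python. The table and function names are hypothetical, and only the three instance types named in the note are covered; any other instance type is treated as unrestricted purely because the note says nothing about it.

```python
# Minimum AMI version per instance type, taken from the note above.
MINIMUM_AMI_VERSION = {
    "cc2.8xlarge": (2, 0, 0),
    "hi1.4xlarge": (2, 3, 0),
    "hs1.8xlarge": (2, 3, 0),
}


def parse_ami_version(text):
    """Convert '2.3' or '2.3.1' into a three-part tuple of ints, padding with zeros."""
    parts = [int(part) for part in text.split(".")]
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_instance_type_supported(instance_type, ami_version):
    """Hypothetical check: does the AMI version meet the minimum for this instance type?"""
    minimum = MINIMUM_AMI_VERSION.get(instance_type)
    if minimum is None:
        # Assumption: no restriction is stated for other instance types.
        return True
    return parse_ami_version(ami_version) >= minimum


print(is_instance_type_supported("cc2.8xlarge", "1.0.1"))  # False
print(is_instance_type_supported("hi1.4xlarge", "2.2.4"))  # False
print(is_instance_type_supported("hs1.8xlarge", "2.3.1"))  # True
```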