Amazon EMR
Amazon EMR Release Guide

About Amazon EMR Releases

This documentation is for versions 4.x and 5.x of Amazon EMR. For information about Amazon EMR AMI versions 2.x and 3.x, see the Amazon EMR Developer Guide (PDF).

This document provides information about Amazon EMR 4.x and 5.x software releases. A release is a set of software applications and components that can be installed and configured on an Amazon EMR cluster. Amazon EMR releases are packaged using a system based on Apache BigTop, which is an open source project associated with the Hadoop ecosystem. In addition to Hadoop and Spark ecosystem projects, each Amazon EMR release provides components that enable cluster and resource management, interoperability with other AWS services, and additional configuration optimizations for installed software.


Each Amazon EMR release contains several distributed applications available for installation on your cluster. Amazon EMR defines each application as not only the set of components that comprise that open source project but also a set of associated components that are required for the application to function. When you choose to install an application using the console, API, or CLI, Amazon EMR installs and configures this set of components across nodes in your cluster. The following applications are supported for this release: Flink, Ganglia, Hadoop, HBase, HCatalog, Hive, Hue, Mahout, Oozie, Phoenix, Pig, Presto, Spark, Sqoop, Tez, Zeppelin, and ZooKeeper.
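As a rough illustration of choosing applications through the API, the following Python sketch builds the parameters for the boto3 run_job_flow call. The helper names (build_cluster_request, launch_cluster), the instance types, the instance count, and the role names are assumptions made for this example, not values taken from this guide. Substitute the release label and settings that apply to your account.

```python
# Applications listed in this guide as supported for the release.
SUPPORTED_APPLICATIONS = {
    "Flink", "Ganglia", "Hadoop", "HBase", "HCatalog", "Hive", "Hue",
    "Mahout", "Oozie", "Phoenix", "Pig", "Presto", "Spark", "Sqoop",
    "Tez", "Zeppelin", "ZooKeeper",
}


def build_cluster_request(name, release_label, applications):
    """Build keyword arguments for the EMR RunJobFlow API (hypothetical helper)."""
    unknown = sorted(set(applications) - SUPPORTED_APPLICATIONS)
    if unknown:
        raise ValueError("Not listed for this release: " + ", ".join(unknown))
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        # Amazon EMR installs each application plus its associated components.
        "Applications": [{"Name": app} for app in applications],
        "Instances": {
            "MasterInstanceType": "m3.xlarge",   # assumption: example instance type
            "SlaveInstanceType": "m3.xlarge",    # assumption: example instance type
            "InstanceCount": 3,                  # assumption: example cluster size
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",    # assumption: default roles exist
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(request):
    """Submit the request with boto3; requires AWS credentials and starts a real cluster."""
    import boto3  # imported lazily so the request can be built without boto3 installed

    response = boto3.client("emr").run_job_flow(**request)
    return response["JobFlowId"]


# Usage (the release label below is a placeholder, not a value stated in this guide):
# request = build_cluster_request("example-cluster", "emr-5.2.1", ["Spark", "Hive"])
# cluster_id = launch_cluster(request)
```

Keeping the request construction separate from the API call makes it possible to inspect or validate the application list before any cluster is started.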


The Amazon EMR releases include various components that can be installed by specifying an application that uses them. The versions of these components are typically those found in the community. Amazon EMR makes an effort to make community releases available in a timely fashion. However, there may be a need to make changes to specific components. If a component is modified, its release version carries an -amzn- suffix appended to the community version, as in the example that follows.


As an example, assume that the component, ExampleComponent1, has not been modified by Amazon EMR; the version is 1.0, which is the community version. However, another component, ExampleComponent2, is modified and its Amazon EMR release version is 1.0.0-amzn-0.
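The naming pattern in the example above can be checked mechanically. The sketch below is a minimal illustration, assuming that a modified component's version always ends in -amzn- followed by an integer. The function names and the term "revision" for the trailing number are hypothetical, since this guide shows only the suffix itself.

```python
import re

# Matches versions such as "1.0.0-amzn-0": a base version, then "-amzn-", then digits.
_AMZN_VERSION = re.compile(r"^(?P<base>.+)-amzn-(?P<revision>\d+)$")


def parse_component_version(version):
    """Return (base_version, amzn_revision); the revision is None without an -amzn- suffix."""
    match = _AMZN_VERSION.match(version)
    if match is None:
        return version, None
    return match.group("base"), int(match.group("revision"))


def has_amzn_suffix(version):
    """True when the version string carries the -amzn- suffix used for modified components."""
    return parse_component_version(version)[1] is not None
```

For instance, parse_component_version("1.0.0-amzn-0") yields ("1.0.0", 0), while parse_component_version("1.0") yields ("1.0", None). A missing suffix does not by itself identify a community component, because components provided exclusively by Amazon EMR also use a plain version number.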

There are also components provided exclusively by Amazon EMR. For example, the DynamoDB connector component, emr-ddb, is provided by Amazon EMR for use with applications running on Amazon EMR clusters. Amazon components have just one version number. For example, an emr-ddb version is 2.1.0. For more information about using Hive to query DynamoDB and an example, see Amazon EMR Hive Queries to Accommodate Partial DynamoDB Schemas.

The following components are included with Amazon EMR:

Component                    Version          Description
emr-ddb                      4.2.0            Amazon DynamoDB connector for Hadoop ecosystem applications.
emr-goodies                  2.2.0            Extra convenience libraries for the Hadoop ecosystem.
emr-kinesis                  3.2.0            Amazon Kinesis connector for Hadoop ecosystem applications.
emr-s3-dist-cp               2.4.0            Distributed copy application optimized for Amazon S3.
emrfs                        2.13.0           Amazon S3 connector for Hadoop ecosystem applications.
flink-client                 1.1.3            Apache Flink command line client scripts and applications.
ganglia-monitor              3.7.2            Embedded Ganglia agent for Hadoop ecosystem applications along with the Ganglia monitoring agent.
ganglia-metadata-collector   3.7.2            Ganglia metadata collector for aggregating metrics from Ganglia monitoring agents.
ganglia-web                  3.7.1            Web application for viewing metrics collected by the Ganglia metadata collector.
hadoop-client                2.7.3-amzn-1     Hadoop command-line clients such as 'hdfs', 'hadoop', or 'yarn'.
hadoop-hdfs-datanode         2.7.3-amzn-1     HDFS node-level service for storing blocks.
hadoop-hdfs-library          2.7.3-amzn-1     HDFS command-line client and library.
hadoop-hdfs-namenode         2.7.3-amzn-1     HDFS service for tracking file names and block locations.
hadoop-httpfs-server         2.7.3-amzn-1     HTTP endpoint for HDFS operations.
hadoop-kms-server            2.7.3-amzn-1     Cryptographic key management server based on Hadoop's KeyProvider API.
hadoop-mapred                2.7.3-amzn-1     MapReduce execution engine libraries for running a MapReduce application.
hadoop-yarn-nodemanager      2.7.3-amzn-1     YARN service for managing containers on an individual node.
hadoop-yarn-resourcemanager  2.7.3-amzn-1     YARN service for allocating and managing cluster resources and distributed applications.
hadoop-yarn-timeline-server  2.7.3-amzn-1     Service for retrieving current and historical information for YARN applications.
hbase-hmaster                1.2.3            Service for an HBase cluster responsible for coordination of Regions and execution of administrative commands.
hbase-region-server          1.2.3            Service for serving one or more HBase regions.
hbase-client                 1.2.3            HBase command-line client.
hbase-rest-server            1.2.3            Service providing a RESTful HTTP endpoint for HBase.
hbase-thrift-server          1.2.3            Service providing a Thrift endpoint to HBase.
hcatalog-client              2.1.0-amzn-0     The 'hcat' command line client for manipulating hcatalog-server.
hcatalog-server              2.1.0-amzn-0     Service providing HCatalog, a table and storage management layer for distributed applications.
hcatalog-webhcat-server      2.1.0-amzn-0     HTTP endpoint providing a REST interface to HCatalog.
hive-client                  2.1.0-amzn-0     Hive command line client.
hive-metastore-server        2.1.0-amzn-0     Service for accessing the Hive metastore, a semantic repository storing metadata for SQL on Hadoop operations.
hive-server                  2.1.0-amzn-0     Service for accepting Hive queries as web requests.
hue-server                   3.10.0-amzn-0    Web application for analyzing data using Hadoop ecosystem applications.
mahout-client                0.12.2           Library for machine learning.
mysql-server                 5.5.52           MySQL database server.
oozie-client                 4.2.0            Oozie command-line client.
oozie-server                 4.2.0            Service for accepting Oozie workflow requests.
phoenix-library              4.7.0-HBase-1.2  The phoenix libraries for server and client.
phoenix-query-server         4.7.0-HBase-1.2  A light weight server providing JDBC access as well as Protocol Buffers and JSON format access to the Avatica API.
presto-coordinator           0.157.1          Service for accepting queries and managing query execution among presto-workers.
presto-worker                0.157.1          Service for executing pieces of a query.
pig-client                   0.16.0-amzn-0    Pig command-line client.
spark-client                 2.0.2            Spark command-line clients.
spark-history-server         2.0.2            Web UI for viewing logged events for the lifetime of a completed Spark application.
spark-on-yarn                2.0.2            In-memory execution engine for YARN.
spark-yarn-slave             2.0.2            Apache Spark libraries needed by YARN slaves.
sqoop-client                 1.4.6            Apache Sqoop command-line client.
tez-on-yarn                  0.8.4            The tez YARN application and libraries.
webserver                    2.4.23           Apache HTTP server.
zeppelin-server              0.6.2            Web-based notebook that enables interactive data analytics.
zookeeper-server             3.4.9            Centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
zookeeper-client             3.4.9            ZooKeeper command line client.

Learn More

If you are looking for additional information, see the following guides and sites: