Amazon EMR
Amazon EMR does all the work involved with provisioning, managing, and maintaining the infrastructure and software of a Hadoop cluster.
Ideal usage patterns
Amazon EMR’s flexible framework reduces large processing problems and data sets into smaller jobs and distributes them across many compute nodes in a Hadoop cluster. This capability lends itself to many usage patterns with big data analytics. Here are a few examples:
- Log processing and analytics
- Large ETL data movement
- Risk modeling and threat analytics
- Ad targeting and clickstream analytics
- Genomics
- Predictive analytics
- Ad hoc data mining and analytics
For more information, see the documentation for Amazon EMR.
Cost model
With Amazon EMR, you can launch a persistent cluster that stays up indefinitely, or a temporary cluster that ends after the analysis is complete. In either scenario, you pay only for the hours the cluster is up.
Amazon EMR supports a variety of Amazon EC2 instance types (standard, high CPU, high memory, high I/O, and so on) and all Amazon EC2 pricing options (On-Demand, Reserved, and Spot). When you launch an Amazon EMR cluster (also called a "job flow"), you choose how many and what type of Amazon EC2 instances to provision. The Amazon EMR price is in addition to the Amazon EC2 price.
For more information, see Amazon EMR pricing.
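As a sketch of this cost model, the total hourly cost is the per-instance EC2 rate plus the per-instance EMR surcharge, multiplied by the cluster size. The rates below are illustrative placeholders, not current AWS prices:

```python
# Rough cost estimate for an EMR cluster: the EMR per-instance-hour charge
# is added on top of the EC2 charge, and you pay only while the cluster is up.
def emr_hourly_cost(num_instances, ec2_price_per_hour, emr_price_per_hour):
    """Total hourly cost for a cluster of identical instances."""
    return num_instances * (ec2_price_per_hour + emr_price_per_hour)

# A hypothetical 10-node cluster at $0.192/hr EC2 plus $0.048/hr EMR:
total = emr_hourly_cost(10, 0.192, 0.048)
print(round(total, 2))  # 2.4
```

Spot or Reserved pricing changes only the EC2 term of this sum; the EMR surcharge applies per instance-hour regardless of the purchasing option.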
Performance
Amazon EMR performance is driven by the type of EC2 instances on which you choose to run your cluster, and by how many of them you choose to run your analytics. Choose an instance type suitable for your processing requirements, with sufficient memory, storage, and processing power. With the introduction of Graviton2 instances, you can see improved performance of up to 15% relative to equivalent previous-generation instances. For more information about EC2 instance specifications, see Amazon EC2 Instance Types.
An important consideration when you create an EMR cluster is how you configure Amazon EC2 instances. EC2 instances in an EMR cluster are organized into node types.
The node types in Amazon EMR are as follows:
- Primary node — A node that manages the cluster by running software components to coordinate the distribution of data and tasks among other nodes for processing. The primary node tracks the status of tasks and monitors the health of the cluster. Every cluster has a primary node, and it's possible to create a single-node cluster with only the primary node.
- Core node — A node with software components that run tasks and store data in the Hadoop Distributed File System (HDFS) on your cluster. Multi-node clusters have at least one core node.
- Task node — A node with software components that only runs tasks and does not store data in HDFS. Task nodes are optional.
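The node types above map directly onto the instance-group layout you supply when launching a cluster. The following is a minimal sketch of that layout, mirroring the InstanceGroups structure accepted by the EMR RunJobFlow API (for example, via boto3's run_job_flow); the instance types and counts are illustrative, not recommendations:

```python
# Sketch of the node-type layout for an EMR cluster, following the
# InstanceGroups structure of the EMR RunJobFlow API. Types and counts
# are illustrative placeholders.
instance_groups = [
    # Every cluster has exactly one primary node (API role name: MASTER).
    {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
    # Core nodes run tasks and store HDFS data; multi-node clusters need >= 1.
    {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
    # Task nodes run tasks only, store no HDFS data, and are optional.
    {"InstanceRole": "TASK", "InstanceType": "m5.xlarge", "InstanceCount": 4},
]

# Only the core group contributes HDFS storage capacity.
hdfs_data_nodes = sum(
    g["InstanceCount"] for g in instance_groups if g["InstanceRole"] == "CORE"
)
```

Because task nodes hold no HDFS data, they are the natural group to grow and shrink as load changes, as discussed under scalability below.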
Durability and availability
By default, Amazon EMR is fault tolerant for core node failures and continues job execution if a dependent node goes down. Amazon EMR will also provision a new node when a core node fails. However, Amazon EMR will not replace nodes if all nodes in the cluster are lost.
You can monitor the health of nodes and replace failed nodes with Amazon CloudWatch. When you launch an Amazon EMR cluster, you can choose to have one or three primary nodes in your cluster. Launching a cluster with three primary nodes is only supported by Amazon EMR version 5.23.0 and later. EMR can take advantage of EC2 placement groups to ensure primary nodes are placed on distinct underlying hardware to further improve cluster availability.
For more information, see EMR integration with EC2 placement groups.
Scalability and elasticity
With Amazon EMR, it's easy to resize a running cluster. You can add core nodes, which hold HDFS data, at any time to increase both processing power and storage capacity. You can also add and remove task nodes at any time; task nodes can process Hadoop jobs but do not maintain HDFS. Some customers add hundreds of instances to their clusters when batch processing occurs, and remove the extra instances when processing completes. For example, you may not know how much data your clusters will be handling in six months, or you may have spiky processing needs.
With Amazon EMR, you don't need to guess your future requirements or provision for peak demand because you can easily add or remove capacity at any time.
You can also launch multiple new clusters of various sizes and remove them at any time with a few clicks in the console or with a programmatic API call.
Additionally, you can configure instance fleets for a cluster to choose from a wide variety of provisioning options for EC2 instances. With instance fleets, you specify target capacities for On-Demand Instances and Spot Instances within each fleet. Amazon EMR tries to provide the capacity you need with the best mix of capacity and price, based on your selection of Availability Zones.
While a cluster is running, if Amazon EC2 reclaims a Spot Instance because of a price increase, or an instance fails, Amazon EMR tries to replace the instance with any of the instance types that you specify. This makes it easier to regain capacity if a node is lost for any reason.
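As a sketch, an instance-fleet configuration of the kind described above might look like the following. It mirrors the InstanceFleets structure of the EMR RunJobFlow API; the instance types, weights, and target capacities are illustrative:

```python
# Sketch of an instance-fleet configuration: a target capacity split between
# On-Demand and Spot, with several acceptable instance types per fleet.
core_fleet = {
    "InstanceFleetType": "CORE",
    "TargetOnDemandCapacity": 4,   # capacity units EMR must fill with On-Demand
    "TargetSpotCapacity": 8,       # capacity units EMR may fill with cheaper Spot
    "InstanceTypeConfigs": [
        # EMR picks the best mix from these candidates; if a Spot Instance is
        # reclaimed, it tries to replace it with any of the listed types.
        {"InstanceType": "m5.xlarge", "WeightedCapacity": 1},
        {"InstanceType": "m5.2xlarge", "WeightedCapacity": 2},
        {"InstanceType": "r5.xlarge", "WeightedCapacity": 1},
    ],
}

total_target = core_fleet["TargetOnDemandCapacity"] + core_fleet["TargetSpotCapacity"]
```

Listing several instance types per fleet is what lets EMR regain capacity quickly when a node is lost, since any listed type can satisfy the weighted capacity target.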
Interfaces
Amazon EMR supports many tools on top of Hadoop, each suited to different use cases:
Hive
Hive is an open-source data warehouse and analytics package that runs on top of Hadoop. It is operated by HiveQL, a SQL-based language that allows users to structure, summarize, and query data.
Hive allows user extensions via user-defined functions written in Java. Amazon EMR has made numerous improvements to Hive, including direct integration with DynamoDB and Amazon S3. For example, with Amazon EMR you can load table partitions automatically from S3, you can write data to tables in S3 without using temporary files, and you can access resources in S3, such as scripts for custom map and/or reduce operations and additional libraries.
For more information, see Apache Hive in the Amazon EMR Release Guide.
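As a sketch of the S3 integration described above, the following HiveQL (held here in a Python string; the bucket name and schema are hypothetical) defines an external table backed by S3 and uses EMR Hive's extension for loading table partitions directly from the bucket layout:

```python
# HiveQL illustrating EMR's Hive-on-S3 integration. Bucket name and table
# schema are hypothetical placeholders.
hive_script = """
CREATE EXTERNAL TABLE IF NOT EXISTS impressions (
    ad_id STRING,
    clicked BOOLEAN
)
PARTITIONED BY (dt STRING)
LOCATION 's3://my-bucket/impressions/';

-- EMR Hive extension: load table partitions automatically from S3.
ALTER TABLE impressions RECOVER PARTITIONS;
"""
```

Because the table's LOCATION points at S3 rather than HDFS, query results can also be written back to S3-backed tables without staging through temporary files, as noted above.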
Pig
Pig is an open-source analytics package that runs on top of Hadoop. It is operated by Pig Latin, a SQL-like language that allows users to structure, summarize, and query data.
Pig allows user extensions via user-defined functions written in Java. Amazon EMR has made
numerous improvements to Pig, including the ability to use multiple file systems (normally,
Pig can only access one remote file system), the ability to load customer JARs and scripts
from S3 (such as “REGISTER s3://my-bucket/piggybank.jar
”), and additional
functionality for String and DateTime processing.
For more information, see Apache Pig in the Amazon EMR Release Guide.
Spark
Spark is an open-source, distributed processing system that uses in-memory caching and optimized query execution for fast analytic queries against data of any size.
EMR features the Amazon EMR runtime for Apache Spark, a performance-optimized runtime environment for Apache Spark that is active by default on Amazon EMR clusters.
For more information, see Apache Spark on Amazon EMR.
HBase
HBase is an open-source, non-relational, distributed database modeled after Google's Bigtable and designed to run on top of HDFS.
HBase is optimized for sequential write operations, and it is highly efficient for batch inserts, updates, and deletes. HBase works seamlessly with Hadoop, sharing its file system and serving as a direct input and output to Hadoop jobs. HBase also integrates with Apache Hive, enabling SQL-like queries over HBase tables, joins with Hive-based tables, and support for Java Database Connectivity (JDBC). With Amazon EMR, you can back up HBase to Amazon S3 (full or incremental, manual or automated) and you can restore from a previously created backup.
For more information, see Apache HBase in the Amazon EMR Release Guide.
Presto
Presto is an open-source distributed SQL query engine optimized for low-latency, ad hoc analysis of data. It supports the ANSI SQL standard, including complex queries, aggregations, joins, and window functions.
Hudi
Apache Hudi is an open-source data management framework used to simplify incremental data processing and data pipeline development.
Kinesis Connector
The Kinesis Connector enables EMR to directly read and query data from Kinesis Data Streams. You can perform batch processing of Kinesis streams using existing Hadoop tools such as Hive, Pig, and MapReduce. Some use cases enabled by this integration include:
- Streaming log analysis — You can analyze streaming web logs to generate a list of the top ten error types every few minutes by Region, browser, and access domain.
- Complex data processing workflows — You can join a Kinesis stream with data stored in Amazon S3, DynamoDB tables, and HDFS. For example, you can write queries that join clickstream data from Kinesis with advertising campaign information stored in a DynamoDB table to identify the most effective categories of ads displayed on particular websites.
- Ad hoc queries — You can periodically load data from Kinesis into HDFS and make it available as a local Impala table for fast, interactive, analytic queries.
Other third-party tools
Amazon EMR also supports a variety of other popular applications and tools in the Hadoop ecosystem, such as R.
Additionally, you can install your own software on top of Amazon EMR to help solve your business needs. AWS provides the ability to quickly move large amounts of data from S3 to HDFS, from HDFS to S3, and between S3 buckets using Amazon EMR's S3DistCp, an extension of the open-source tool DistCp that is optimized to work with S3.
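As a sketch, an S3DistCp copy can be submitted as a cluster step. The dictionary below mirrors the Steps structure of the EMR AddJobFlowSteps/RunJobFlow APIs and invokes s3-dist-cp through command-runner.jar; the bucket and paths are hypothetical:

```python
# Sketch of an EMR step that runs S3DistCp to copy data from S3 into HDFS.
# Bucket name and paths are hypothetical placeholders.
s3distcp_step = {
    "Name": "Copy logs from S3 to HDFS",
    # Keep the cluster running even if this one step fails.
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        # command-runner.jar dispatches to the s3-dist-cp command on the cluster.
        "Jar": "command-runner.jar",
        "Args": [
            "s3-dist-cp",
            "--src", "s3://my-bucket/logs/",
            "--dest", "hdfs:///input/logs/",
        ],
    },
}
```

Reversing --src and --dest copies in the other direction (HDFS to S3), and both arguments may also name S3 locations for bucket-to-bucket copies.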
You can optionally use the EMR File System (EMRFS), an implementation of HDFS which allows Amazon EMR clusters to store data on S3. You can enable S3 server-side and client-side encryption.
EMR Studio
EMR Studio is an integrated development environment (IDE) that makes it easy for data scientists and data engineers to develop, visualize, and debug data engineering and data science applications written in R, Python, Scala, and PySpark.
EMR Studio provides fully managed Jupyter notebooks, and tools such as Spark UI and YARN Timeline Service to simplify debugging.
Anti-patterns
Amazon EMR has the following anti-pattern:
- Small datasets — Amazon EMR is built for massive parallel processing. If your dataset is small enough to run quickly on a single machine, in a single thread, the added overhead of map and reduce jobs may not be worth it, because such data can easily be processed in memory on a single system. For workloads with stringent ACID transaction requirements, Amazon RDS or a relational database running on Amazon EC2 may be a better option.