Tutorial: Getting Started with Amazon EMR
Overview
With Amazon EMR, you can set up a cluster to process and analyze data with big data frameworks in just a few minutes. This tutorial shows you how to launch a sample cluster using Spark, and how to run a simple PySpark script that you'll store in an Amazon S3 bucket. It covers essential Amazon EMR tasks in three main workflow categories: Plan and Configure, Manage, and Clean Up. You can also adapt this process for your own workloads.

Prerequisites
Before you launch an Amazon EMR cluster, make sure you complete the tasks in Setting Up Amazon EMR.
This tutorial introduces you to the following Amazon EMR tasks: planning and configuring a cluster, managing a running cluster, and cleaning up cluster resources.
You'll find links to more detailed topics as you work through this tutorial, and ideas for additional steps in the Next Steps section. If you have questions or get stuck, contact the Amazon EMR team on our Discussion Forum.
Cost
- The sample cluster that you create runs in a live environment. The cluster accrues minimal charges, and it runs only for the duration of this tutorial as long as you complete the cleanup tasks. Charges accrue at the per-second rate for Amazon EMR pricing and vary by Region. For more information, see Amazon EMR Pricing.
- Minimal charges might also accrue for small files that you store in Amazon S3 for the tutorial. Some or all of your charges for Amazon S3 might be waived if you are within the usage limits of the AWS Free Tier. For more information, see Amazon S3 Pricing and AWS Free Tier.
Step 1: Plan and Configure an Amazon EMR Cluster
In this step, you plan for and launch a simple Amazon EMR cluster with Apache Spark installed. The setup process includes creating an Amazon S3 bucket to store a sample PySpark script, an input dataset, and cluster output.
Prepare Storage for Cluster Input and Output
Create an Amazon S3 bucket to store an example PySpark script, input data, and output data. Create the bucket in the same AWS Region where you plan to launch your Amazon EMR cluster. For example, US West (Oregon) us-west-2. Buckets and folders that you use with Amazon EMR have the following limitations:
- Names can consist of only lowercase letters, numbers, periods (.), and hyphens (-).
- Names cannot end in numbers.
- A bucket name must be unique across all AWS accounts.
- An output folder must be empty.
To create a bucket for this tutorial, see How do I create an S3 Bucket? in the Amazon Simple Storage Service Console User Guide.
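You can also create the bucket programmatically. The following is a minimal boto3 sketch, assuming the placeholder name DOC-EXAMPLE-BUCKET and the us-west-2 Region; substitute your own globally unique bucket name and the Region where you plan to launch your cluster.

import boto3

s3 = boto3.client("s3", region_name="us-west-2")

# Outside us-east-1, S3 requires an explicit LocationConstraint that matches the Region.
s3.create_bucket(
    Bucket="DOC-EXAMPLE-BUCKET",  # placeholder; bucket names are globally unique
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
)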
Develop and Prepare an Application for Amazon EMR
In this step, you upload a sample PySpark script to Amazon S3. This is the most common way to prepare an application for Amazon EMR. EMR lets you specify the Amazon S3 location of the script when you submit work to your cluster. You also upload sample input data to Amazon S3 for the PySpark script to process.
We've provided the following PySpark script for you to use. The script processes food establishment inspection data and outputs a file listing the top ten establishments with the most "Red" type violations to your S3 bucket.
To prepare the example PySpark script for EMR
- Copy the example code below into a new file in your editor of choice.
- Save the file as health_violations.py.
- Upload health_violations.py to Amazon S3 into the bucket you designated for this tutorial. For information about how to upload objects to Amazon S3, see Uploading an object to a bucket in the Amazon Simple Storage Service Getting Started Guide.
import argparse

from pyspark.sql import SparkSession


def calculate_red_violations(data_source, output_uri):
    """
    Processes sample food establishment inspection data and queries the data to find
    the top 10 establishments with the most Red violations from 2006 to 2020.

    :param data_source: The URI where the food establishment data CSV is saved, typically
                        an Amazon S3 bucket, such as 's3://DOC-EXAMPLE-BUCKET/food-establishment-data.csv'.
    :param output_uri: The URI where the output is written, typically an Amazon S3 bucket,
                       such as 's3://DOC-EXAMPLE-BUCKET/restaurant_violation_results'.
    """
    with SparkSession.builder.appName("Calculate Red Health Violations").getOrCreate() as spark:
        # Load the restaurant violation CSV data
        if data_source is not None:
            restaurants_df = spark.read.option("header", "true").csv(data_source)

        # Create an in-memory DataFrame to query
        restaurants_df.createOrReplaceTempView("restaurant_violations")

        # Create a DataFrame of the top 10 restaurants with the most Red violations
        top_red_violation_restaurants = spark.sql(
            "SELECT name, count(*) AS total_red_violations " +
            "FROM restaurant_violations " +
            "WHERE violation_type = 'RED' " +
            "GROUP BY name " +
            "ORDER BY total_red_violations DESC LIMIT 10 ")

        # Write the results to the specified output URI
        top_red_violation_restaurants.write.option("header", "true").mode("overwrite").csv(output_uri)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--data_source', help="The URI where the CSV restaurant data is saved, typically an S3 bucket.")
    parser.add_argument(
        '--output_uri', help="The URI where output is saved, typically an S3 bucket.")
    args = parser.parse_args()

    calculate_red_violations(args.data_source, args.output_uri)
Input arguments
You must include values for the following arguments when you run the PySpark script as a step.
- --data_source – The Amazon S3 URI of the food establishment data CSV file. You will prepare this file below.
- --output_uri – The URI of the Amazon S3 bucket where the output results will be saved.
The input data is a modified version of a publicly available food establishment inspection dataset with Health Department inspection results in King County, Washington, from 2006 to 2020. For more information, see King County Open Data: Food Establishment Inspection Data.
The following is a sample of the data:
name, inspection_result, inspection_closed_business, violation_type, violation_points
100 LB CLAM, Unsatisfactory, FALSE, BLUE, 5
100 PERCENT NUTRICION, Unsatisfactory, FALSE, BLUE, 5
7-ELEVEN #2361-39423A, Complete, FALSE, , 0
To prepare the sample input data for EMR
- Download the zip file, food_establishment_data.zip.
- Unzip the content and save it locally as food_establishment_data.csv.
- Upload the CSV file to the S3 bucket that you created for this tutorial. For step-by-step instructions, see How do I upload files and folders to an S3 bucket? in the Amazon Simple Storage Service Console User Guide.
For more information about setting up data for EMR, see Prepare Input Data.
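If you prefer to script the uploads, the following is a minimal boto3 sketch, assuming the script and CSV file are in your current working directory and that DOC-EXAMPLE-BUCKET is the placeholder name of your tutorial bucket.

import boto3

s3 = boto3.client("s3", region_name="us-west-2")

# Upload the PySpark script and the sample input data to the tutorial bucket.
s3.upload_file("health_violations.py", "DOC-EXAMPLE-BUCKET", "health_violations.py")
s3.upload_file("food_establishment_data.csv", "DOC-EXAMPLE-BUCKET", "food_establishment_data.csv")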
Launch an Amazon EMR Cluster
Now that you've completed the prework, you can launch a sample cluster with Apache Spark installed using the latest Amazon EMR release.
If you created your AWS account after December 04, 2013, and you don't specify a VPC, Amazon EMR sets up the cluster in the default Amazon Virtual Private Cloud (VPC) for your selected Region.
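You can launch the cluster from the Amazon EMR console, or script the launch. The following boto3 sketch shows one way to do it. The release label, instance types, key pair name, and log bucket are assumptions to replace with values valid in your account and Region, and the default roles EMR_DefaultRole and EMR_EC2_DefaultRole must already exist in your account.

import boto3

emr = boto3.client("emr", region_name="us-west-2")

response = emr.run_job_flow(
    Name="My first EMR cluster",
    ReleaseLabel="emr-5.36.0",              # assumption: use the latest release in your Region
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "Ec2KeyName": "my-key-pair",        # assumption: an existing EC2 key pair for SSH
        "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster up so you can submit steps
    },
    ServiceRole="EMR_DefaultRole",
    JobFlowRole="EMR_EC2_DefaultRole",
    LogUri="s3://DOC-EXAMPLE-BUCKET/logs/",  # placeholder log location
)
print("Cluster ID:", response["JobFlowId"])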
For more information about reading the cluster summary, see View Cluster Status and Details. For information about cluster status, see Understanding the Cluster Lifecycle.
Step 2: Manage Amazon EMR Clusters
Now that your cluster is up and running, you can connect to it and manage it. You can also submit work to your running cluster to process and analyze data.
Submit Work to Amazon EMR
With your cluster up and running, you can submit health_violations.py as a step. A step is a unit of cluster work made up of one or more jobs. For example, you might submit a step to compute values, or to transfer and process data.
You can submit multiple steps to accomplish a set of tasks on a cluster when you create the cluster, or after it's already running. For more information, see Submit Work to a Cluster.
For more information about the step lifecycle, see Running Steps to Process Data.
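As one way to script the submission, the following boto3 sketch submits health_violations.py as a Spark step; the cluster ID, bucket name, and output folder are placeholders to replace with your own values.

import boto3

emr = boto3.client("emr", region_name="us-west-2")

response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder: your cluster ID
    Steps=[
        {
            "Name": "Calculate red health violations",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                # command-runner.jar runs the given command on the cluster.
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "s3://DOC-EXAMPLE-BUCKET/health_violations.py",
                    "--data_source", "s3://DOC-EXAMPLE-BUCKET/food_establishment_data.csv",
                    "--output_uri", "s3://DOC-EXAMPLE-BUCKET/myOutputFolder",
                ],
            },
        }
    ],
)
print("Step ID:", response["StepIds"][0])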
View Results
After a step runs successfully, you can view its output results in the Amazon S3 output folder you specified when you submitted the step.
To view the results of health_violations.py
- Open the Amazon S3 console at https://console.aws.amazon.com/s3/.
- Choose the Bucket name and then the output folder that you specified when you submitted the step. For example, DOC-EXAMPLE-BUCKET and then myOutputFolder.
- Verify that the following items are in your output folder:
  - A small-sized object called _SUCCESS, indicating the success of your step.
  - A CSV file starting with the prefix part-. This is the object with your results.
- Choose the object with your results, then choose Download to save it to your local file system.
- Open the results in your editor of choice. The output file lists the top ten food establishments with the most Red violations.
The following is an example of health_violations.py results:
name, total_red_violations
SUBWAY, 322
T-MOBILE PARK, 315
WHOLE FOODS MARKET, 299
PCC COMMUNITY MARKETS, 251
TACO TIME, 240
MCDONALD'S, 177
THAI GINGER, 153
SAFEWAY INC #1508, 143
TAQUERIA EL RINCONSITO, 134
HIMITSU TERIYAKI, 128
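To spot-check the output folder without opening the console, the following is a minimal boto3 sketch, assuming the placeholder bucket name and the myOutputFolder prefix.

import boto3

s3 = boto3.client("s3", region_name="us-west-2")

# List the objects in the output folder; expect a _SUCCESS marker and a part- results file.
response = s3.list_objects_v2(Bucket="DOC-EXAMPLE-BUCKET", Prefix="myOutputFolder/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])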
For more information about Amazon EMR cluster output, see Configure an Output Location.
(Optional) Set Up Cluster Connections
This step is not required, but you have the option to connect to cluster nodes with Secure Shell (SSH) for tasks like issuing commands, running applications interactively, and reading log files.
Configure security group rules
Before you connect to your cluster, you must set up a Port 22 inbound rule to allow SSH connections.
Security groups act as virtual firewalls to control inbound and outbound traffic to your cluster. When you create a cluster with the default security groups, Amazon EMR creates the following groups:
- ElasticMapReduce-master – The default Amazon EMR-managed security group associated with the master instance.
- ElasticMapReduce-slave – The default security group associated with core and task nodes.
To allow SSH access for trusted sources for the ElasticMapReduce-master security group
You must first be logged in to AWS as a root user or as an IAM principal that is allowed to manage security groups for the VPC that the cluster is in. For more information, see Changing Permissions for an IAM User and the Example Policy that allows managing EC2 security groups in the IAM User Guide.
- Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.
- Choose Clusters.
- Choose the Name of the cluster.
- Under Security and access, choose the Security groups for Master link.
- Choose ElasticMapReduce-master from the list.
- Choose Inbound, Edit.
- Check for an inbound rule that allows public access with the following settings. If it exists, choose Delete to remove it.
  - Type: SSH
  - Port: 22
  - Source: Custom 0.0.0.0/0
  Warning: Before December 2020, the default EMR-managed security group for the master instance in public subnets was created with a pre-configured rule to allow inbound traffic on Port 22 from all sources. This rule was created to simplify initial SSH connections to the master node. We strongly recommend that you remove this inbound rule and restrict traffic only from trusted sources.
- Scroll to the bottom of the list of rules and choose Add Rule.
- For Type, select SSH. This automatically enters TCP for Protocol and 22 for Port Range.
- For Source, select My IP. This automatically adds the IP address of your client computer as the source address. Alternatively, you can add a range of Custom trusted client IP addresses and choose Add rule to create additional rules for other clients. Many network environments dynamically allocate IP addresses, so you might need to periodically edit security group rules to update IP addresses for trusted clients.
- Choose Save.
- Optionally, choose ElasticMapReduce-slave from the list and repeat the steps above to allow SSH client access to core and task nodes from trusted clients.
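If you prefer to script the rule change, the following boto3 sketch adds the same inbound SSH rule. The security group ID and the client CIDR are placeholders; look up the group ID on the cluster's Security and access panel, and use your own IP address as a /32 CIDR.

import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # placeholder: the ElasticMapReduce-master group ID
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 22,
            "ToPort": 22,
            "IpRanges": [{"CidrIp": "203.0.113.25/32", "Description": "SSH from my IP"}],
        }
    ],
)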
Connect to the Cluster
After you configure your SSH rules, go to Connect to the Master Node Using SSH and follow the instructions to:
-
Retrieve the public DNS name of the node to which you want to connect.
-
Connect to your cluster using SSH.
For more information on how to authenticate to cluster nodes, see Authenticate to Amazon EMR Cluster Nodes.
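You can also retrieve the master node's public DNS name programmatically and pass it to your SSH client. The following is a minimal boto3 sketch, assuming a placeholder cluster ID.

import boto3

emr = boto3.client("emr", region_name="us-west-2")

# Look up the public DNS name of the master node for the SSH connection.
cluster = emr.describe_cluster(ClusterId="j-XXXXXXXXXXXXX")["Cluster"]
print("Master public DNS:", cluster["MasterPublicDnsName"])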
Step 3: Clean Up Amazon EMR Cluster Resources
Now that you've submitted work to your cluster and viewed the results of your PySpark application, you can shut the cluster down and delete your designated Amazon S3 bucket to avoid additional charges.
Shut down your cluster
Shutting down a cluster stops all of its associated Amazon EMR charges and Amazon EC2 instances.
Amazon EMR retains metadata about your cluster for two months at no charge after you terminate the cluster. This makes it easy to clone the cluster for a new job or revisit its configuration for reference purposes. Metadata does not include data that the cluster might have written to S3, or that was stored in HDFS on the cluster while it was running.
The Amazon EMR console does not let you delete a cluster from the list view after you shut down the cluster. A terminated cluster disappears from the console when Amazon EMR clears its metadata.
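To script the shutdown, the following is a minimal boto3 sketch, assuming a placeholder cluster ID; if termination protection is enabled on the cluster, turn it off first with set_termination_protection.

import boto3

emr = boto3.client("emr", region_name="us-west-2")

# Terminate the tutorial cluster; this stops EMR and EC2 charges for it.
emr.terminate_job_flows(JobFlowIds=["j-XXXXXXXXXXXXX"])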
Delete S3 resources
Delete the bucket you created earlier to remove all of the Amazon S3 objects used in this tutorial. This bucket should contain your input dataset, cluster output, PySpark script, and log files. You might need to take extra steps to delete stored files if you saved your PySpark script or output in an alternative location.
Your cluster must be completely shut down before you delete your bucket. Otherwise, you might run into issues when you try to empty the bucket.
Follow the instructions in How Do I Delete an S3 Bucket in the Amazon Simple Storage Service Getting Started Guide to empty your bucket and delete it from S3.
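To script the cleanup instead, the following is a minimal boto3 sketch that empties the bucket and then deletes it; DOC-EXAMPLE-BUCKET is a placeholder, and the deletion is permanent, so double-check the bucket name first.

import boto3

bucket = boto3.resource("s3", region_name="us-west-2").Bucket("DOC-EXAMPLE-BUCKET")
bucket.objects.all().delete()  # empty the bucket; permanently removes every object
bucket.delete()                # then delete the bucket itself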
Next Steps
You've now launched your first Amazon EMR cluster from start to finish and walked through essential EMR tasks like preparing and submitting big data applications, viewing results, and shutting down a cluster.
Here are some suggested topics to learn more about tailoring your Amazon EMR workflow.
Explore Big Data Applications for Amazon EMR
Discover and compare the big data applications you can install on a cluster in the Amazon EMR Release Guide. The Release Guide also contains details about each EMR version and tips on how to configure and use frameworks such as Spark and Hadoop on Amazon EMR.
Plan Cluster Hardware, Networking, and Security
In this tutorial, you created a simple EMR cluster without configuring advanced options such as instance types, networking, and security. For more information on planning and launching a cluster that meets your speed, capacity, and security requirements, see Plan and Configure Clusters and Security in Amazon EMR.
Manage Clusters
Dive deeper into working with running clusters in Manage Clusters, which covers how to connect to clusters, debug steps, and track cluster activities and health. You can also learn more about adjusting cluster resources in response to workload demands with EMR managed scaling.
Use a Different Interface
In addition to the Amazon EMR console, you can manage Amazon EMR using the AWS Command Line Interface, the web service API, or one of the many supported AWS SDKs. For more information, see Management Interfaces.
There are many ways you can interact with applications installed on Amazon EMR clusters. Some applications like Apache Hadoop publish web interfaces that you can view on cluster instances. For more information, see View Web Interfaces Hosted on Amazon EMR Clusters. With Amazon EMR clusters running Apache Spark, you can use an EMR notebook in the Amazon EMR console to run queries and code. For more information, see Amazon EMR Notebooks.
Browse the EMR Technical Blog
For sample walkthroughs and in-depth technical discussion of EMR features, see the AWS Big Data Blog.