Common EMR Applications - Teaching Big Data Skills with Amazon EMR

Common EMR Applications

Amazon EMR not only makes it simple to provision Hadoop infrastructure, it also simplifies the deployment of popular distributed applications such as Apache Spark, Apache Pig, and Apache Zeppelin. This document details three deployment strategies for provisioning EMR clusters that support these applications. For a full list of supported applications, see Amazon EMR 5.x Release Versions.

This document focuses on a few key applications that are relevant to teaching an introduction to big data with EMR.
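As a rough illustration of how these applications can be requested when a cluster is created, the sketch below builds the parameters for the boto3 EMR client's run_job_flow call. The cluster name, release label, instance types, instance count, and region are assumptions chosen for illustration, and the role names are the EMR default roles, which may not exist in every account. It is a minimal sketch, not necessarily one of the deployment strategies this document describes.

```python
def build_cluster_request(name, applications, release_label="emr-5.36.0"):
    """Return keyword arguments for the boto3 EMR client's run_job_flow call."""
    return {
        "Name": name,
        # Assumed 5.x release label; choose the release your course targets.
        "ReleaseLabel": release_label,
        # Each requested application is passed as {"Name": ...}.
        "Applications": [{"Name": app} for app in applications],
        "Instances": {
            # Assumed instance types and cluster size, for illustration only.
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            # Keep the cluster running so students can use it interactively.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # EMR default role names; your account may use different roles.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(request, region="us-east-1"):
    """Submit the request. Requires boto3 and valid AWS credentials."""
    import boto3  # imported here so building a request needs no AWS setup

    emr = boto3.client("emr", region_name=region)
    return emr.run_job_flow(**request)["JobFlowId"]


# Hypothetical cluster name; the application names match those covered here.
request = build_cluster_request("intro-big-data", ["Spark", "Pig", "Zeppelin"])
```

Building the request separately from submitting it lets the parameters be inspected or reviewed before any billable resources are created; calling launch_cluster(request) would then start the cluster.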

Apache Spark

Apache Spark is a unified analytics engine for large-scale data processing. In simple terms, Spark lets users run SQL queries and build data frames for analysis in several common languages, mainly Java, Python, and Scala. Spark is a native component of EMR and can be provisioned automatically when an Amazon EMR cluster is deployed.
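To make the data frame and SQL workflow concrete, here is a minimal PySpark sketch. The column names, view name, and application name are hypothetical, and the function assumes PySpark and a Java runtime are available, as they would be on an EMR cluster provisioned with Spark.

```python
def average_score_by_course(rows):
    """Return {course: average score} computed with a Spark SQL query.

    rows is a list of (student, course, score) tuples.
    """
    # Imported lazily so this module loads even where PySpark is not installed.
    from pyspark.sql import SparkSession

    # Reuse the active session if one exists (for example, in a notebook).
    spark = SparkSession.builder.appName("intro-spark-sql").getOrCreate()

    # Build a data frame from local rows and expose it to SQL as a view.
    df = spark.createDataFrame(rows, ["student", "course", "score"])
    df.createOrReplaceTempView("scores")

    # Run an ordinary SQL aggregation against the view.
    result = spark.sql(
        "SELECT course, AVG(score) AS avg_score FROM scores GROUP BY course"
    )
    return {row["course"]: row["avg_score"] for row in result.collect()}
```

Calling the function with a few tuples such as ("ana", "spark", 90.0) returns a plain dictionary keyed by course. In a classroom setting the same query would more typically read from a data set in Amazon S3 or HDFS rather than from in-memory rows.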

Apache Pig

Apache Pig is an open-source Apache library that runs on top of Hadoop. It provides a high-level scripting language, Pig Latin, for transforming large data sets. Apache Pig takes commands that are similar to SQL (written in Pig Latin) and converts them to Tez jobs for execution in the Hadoop environment. Apache Pig works with structured and unstructured data in a variety of formats.

Apache Zeppelin with Shiro

Apache Zeppelin is an open-source, multi-language, web-based notebook that gives users access to the data processing back-ends provided by Amazon EMR. Zeppelin is flexible enough to support data ingestion, discovery, analytics, and visualization. Zeppelin is included in Amazon EMR 5.0 and later, and provides built-in integration for Apache Spark. Instructions for setting up Zeppelin are provided in the Setting Up Access to Zeppelin Using Linux Credentials section.

For multi-tenant deployments of Zeppelin, Apache Shiro is recommended as the authentication method. Shiro is a Java security framework that performs authentication, authorization, cryptography, and session management for Zeppelin notebooks.
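As an illustration of what Shiro-based accounts can look like, the sketch below renders a [users] section in Shiro's INI format, where each entry has the form username = password, role1, role2. The account names, passwords, and roles are hypothetical, and plaintext passwords are used only to keep the example short. Which realm to configure for a real class (for example, Linux credentials as described elsewhere in this document) is a separate decision that this sketch does not cover.

```python
def render_shiro_users(accounts):
    """Render a Shiro INI [users] section.

    accounts maps username -> (password, list_of_roles).
    """
    lines = ["[users]"]
    # Sort by username so the generated file is stable between runs.
    for username, (password, roles) in sorted(accounts.items()):
        # Shiro's text format: username = password, role1, role2, ...
        lines.append(f"{username} = {', '.join([password, *roles])}")
    return "\n".join(lines) + "\n"


# Hypothetical accounts for illustration; never ship plaintext passwords.
section = render_shiro_users(
    {
        "student1": ("changeme1", ["student"]),
        "instructor": ("changeme0", ["admin"]),
    }
)
```

The resulting text is a fragment intended for Zeppelin's shiro.ini file; generating it from a roster can be convenient when a class has many students, though the rest of that file still needs to be configured by hand.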