Amazon EMR Release Guide

Apache Spark

Apache Spark is an open-source, distributed processing framework and programming model that you can use for machine learning, stream processing, and graph analytics on Amazon EMR clusters. Like Apache Hadoop, Spark is commonly used for big data workloads, but it differs from Hadoop MapReduce in notable ways: Spark has an optimized directed acyclic graph (DAG) execution engine and actively caches data in memory, which can boost performance, especially for iterative algorithms and interactive queries.

Spark natively supports applications written in Scala, Python, and Java. It also includes several tightly integrated libraries for SQL (Spark SQL), machine learning (MLlib), stream processing (Spark Streaming), and graph processing (GraphX). These tools make it easier to leverage the Spark framework for a wide variety of use cases.

You can install Spark on an EMR cluster along with other Hadoop applications, and it can also leverage the EMR file system (EMRFS) to directly access data in Amazon S3. Hive is also integrated with Spark so that you can use a HiveContext object to run Hive scripts using Spark. A Hive context is included in the spark-shell as sqlContext.
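As an illustration of running Hive queries through Spark, the following is a minimal PySpark sketch, not a definitive implementation. It assumes Spark 2.x, where a SparkSession built with enableHiveSupport() is the successor to the HiveContext object. The function name run_hive_query and the application name are hypothetical, and the script is assumed to run on a cluster where Spark and Hive are installed (for example, through spark-submit).

```python
def run_hive_query(query, app_name="hive-on-spark-example"):
    """Run a HiveQL statement through Spark and return the resulting DataFrame.

    The function name and default app name are illustrative assumptions.
    """
    # Imported here so this file can be loaded on machines without PySpark.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName(app_name)
        .enableHiveSupport()  # connect the session to the Hive metastore
        .getOrCreate()
    )
    return spark.sql(query)


if __name__ == "__main__":
    # List the tables that are visible through the Hive metastore.
    run_hive_query("SHOW TABLES").show()
```

In an interactive shell the session already exists, so the same statement can be passed directly to the preconfigured context instead of building a new session.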

For an example tutorial of setting up an EMR cluster with Spark and analyzing a sample data set, see New — Apache Spark on Amazon EMR on the AWS News blog.

To view a machine learning example using Spark on Amazon EMR, see Large-Scale Machine Learning with Spark on Amazon EMR on the AWS Big Data blog.

Important

Apache Spark version 2.3.1, available beginning with Amazon EMR release version 5.16.0, addresses CVE-2018-8024 and CVE-2018-1334. We recommend that you migrate earlier versions of Spark to Spark version 2.3.1 or later.
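To apply the recommendation above in a script, you could compare version numbers against the thresholds named in the note. The following is a minimal sketch that uses only the Python standard library. The function names are hypothetical, and the check assumes that every Amazon EMR release at or after 5.16.0 includes Spark 2.3.1 or later, as the note implies.

```python
import re

# Thresholds taken from the note above.
MIN_SPARK_VERSION = (2, 3, 1)
MIN_EMR_RELEASE = (5, 16, 0)


def parse_version(text):
    """Return the first dotted x.y.z version found in text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no x.y.z version found in {text!r}")
    return tuple(int(part) for part in match.groups())


def spark_is_patched(spark_version):
    """True if the Spark version is 2.3.1 or later."""
    # Tuples compare numerically, so 2.10.0 correctly sorts after 2.3.1.
    return parse_version(spark_version) >= MIN_SPARK_VERSION


def emr_release_has_patched_spark(release_label):
    """True if the release label (for example, emr-5.19.0) is 5.16.0 or later."""
    return parse_version(release_label) >= MIN_EMR_RELEASE


if __name__ == "__main__":
    print(spark_is_patched("Spark 2.3.2"))              # True
    print(spark_is_patched("2.2.1"))                    # False
    print(emr_release_has_patched_spark("emr-5.19.0"))  # True
```

Versions are compared as integer tuples rather than strings because a string comparison would rank "5.9.0" above "5.16.0".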

The following table lists the version of Spark included in the latest release of Amazon EMR, along with the components that Amazon EMR installs with Spark.

For the version of components installed with Spark in this release, see Release 5.19.0 Component Versions.

Spark Version Information for emr-5.19.0

| Amazon EMR Release Label | Spark Version | Components Installed With Spark |
| --- | --- | --- |
| emr-5.19.0 | Spark 2.3.2 | aws-sagemaker-spark-sdk, emrfs, emr-goodies, emr-ddb, emr-s3-select, hadoop-client, hadoop-hdfs-datanode, hadoop-hdfs-library, hadoop-hdfs-namenode, hadoop-httpfs-server, hadoop-kms-server, hadoop-yarn-nodemanager, hadoop-yarn-resourcemanager, hadoop-yarn-timeline-server, livy-server, nginx, r, spark-client, spark-history-server, spark-on-yarn, spark-yarn-slave |