Amazon EMR Release Guide

Apache Tez

Apache Tez is a framework for creating a complex directed acyclic graph (DAG) of tasks for processing data. In some cases, it is used as an alternative to Hadoop MapReduce. For example, Pig and Hive workflows can run using Hadoop MapReduce or they can use Tez as an execution engine. For more information, see https://tez.apache.org/.

Tez Release Information for This Release of Amazon EMR

Application: Tez 0.8.4

Amazon EMR Release Label: emr-5.10.0

Components installed with this application: emrfs, emr-goodies, hadoop-client, hadoop-mapred, hadoop-hdfs-datanode, hadoop-hdfs-library, hadoop-hdfs-namenode, hadoop-kms-server, hadoop-yarn-nodemanager, hadoop-yarn-resourcemanager, hadoop-yarn-timeline-server, tez-on-yarn

Creating a Cluster with Tez

Install Tez by choosing that application when you create the cluster.

To launch a cluster with Tez installed using the console

The following procedure creates a cluster with Tez installed. For more information about launching clusters with the console, see Step 3: Launch an Amazon EMR Cluster in the Amazon EMR Management Guide.

  1. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.

  2. Choose Create cluster to use Quick Create.

  3. For Software Configuration, choose Amazon Release Version emr-4.7.0 or later.

  4. For Select Applications, choose either All Applications or Tez.

  5. Select other options as necessary and then choose Create cluster.

To launch a cluster with Tez using the AWS CLI

  • Create the cluster with the following command:

    aws emr create-cluster --name "Cluster with Tez" --release-label emr-5.10.0 \
    --applications Name=Tez --ec2-attributes KeyName=myKey \
    --instance-type m3.xlarge --instance-count 3 --use-default-roles

    Note

    Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).
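When the launch is scripted rather than typed, it can help to assemble the same command as an argument list so that quoting is handled for you. The following is a minimal sketch, not part of the AWS CLI: the helper name build_create_cluster_command is hypothetical, and the release label emr-5.10.0 and key name myKey are placeholder values taken from this page. The code only builds and prints the command; it does not contact AWS.

```python
import shlex


def build_create_cluster_command(release_label, applications, key_name,
                                 instance_type="m3.xlarge", instance_count=3):
    """Return the `aws emr create-cluster` call as a list of arguments.

    Hypothetical helper for illustration: it only assembles the argument
    list and never contacts AWS.
    """
    command = [
        "aws", "emr", "create-cluster",
        "--name", "Cluster with Tez",
        "--release-label", release_label,
        "--applications",
    ]
    # Each application is passed as its own Name=<application> token.
    command += [f"Name={app}" for app in applications]
    command += [
        "--ec2-attributes", f"KeyName={key_name}",
        "--instance-type", instance_type,
        "--instance-count", str(instance_count),
        "--use-default-roles",
    ]
    return command


# emr-5.10.0 and myKey are placeholder values; substitute your own.
args = build_create_cluster_command("emr-5.10.0", ["Tez"], "myKey")
print(shlex.join(args))
```

The printed line is a single shell-quoted command without continuation characters, so it can be pasted into a Linux shell as is.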

Configuring Tez

You configure Tez by setting values in tez-site.xml using the tez-site configuration classification when you create your cluster. If you want to use Hive with Tez, you must also modify the hive-site configuration classification.

To change the root logging level in Tez

  • Create a cluster with Tez installed and set tez.am.log.level to DEBUG, using the following command:

    aws emr create-cluster --release-label emr-5.10.0 --applications Name=Tez \
    --instance-type m3.xlarge --instance-count 2 \
    --configurations https://s3.amazonaws.com/mybucket/myfolder/myConfig.json

    Note

    Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).

    myConfig.json:

    [
      {
        "Classification": "tez-site",
        "Properties": {
          "tez.am.log.level": "DEBUG"
        }
      }
    ]

    Note

    If you plan to store your configuration in Amazon S3, you must specify the URL location of the object. For example:

    aws emr create-cluster --release-label emr-5.10.0 --applications Name=Tez Name=Hive \
    --instance-type m3.xlarge --instance-count 3 \
    --configurations https://s3.amazonaws.com/mybucket/myfolder/myConfig.json
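If you would rather generate myConfig.json than write it by hand, the tez-site classification shown above can be produced from a small script. This is a minimal sketch under stated assumptions: the function names tez_site_configuration and to_json are hypothetical, and the script only prints the JSON text; saving it and uploading it to Amazon S3 is left to you.

```python
import json


def tez_site_configuration(log_level="DEBUG"):
    """Build the tez-site classification that sets tez.am.log.level."""
    return [
        {
            "Classification": "tez-site",
            "Properties": {"tez.am.log.level": log_level},
        }
    ]


def to_json(configuration):
    """Serialize the configuration and confirm the text parses back unchanged."""
    text = json.dumps(configuration, indent=2)
    if json.loads(text) != configuration:
        raise ValueError("configuration did not survive a JSON round trip")
    return text


config_text = to_json(tez_site_configuration())
print(config_text)
```

Redirecting the output to a file named myConfig.json gives the same content as the example in this procedure.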

To change the Hive or Pig execution engine to Tez

  1. Create a cluster with Hive and Tez installed and set hive.execution.engine to tez, using the following command:

    aws emr create-cluster --release-label emr-5.10.0 --applications Name=Tez Name=Hive \
    --instance-type m3.xlarge --instance-count 2 \
    --configurations https://s3.amazonaws.com/mybucket/myfolder/myConfig.json

    Note

    Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).

    myConfig.json:

    [
      {
        "Classification": "hive-site",
        "Properties": {
          "hive.execution.engine": "tez"
        }
      }
    ]

  2. To also set the execution engine for Pig, add the pig-properties classification so that myConfig.json contains the following:

    [
      {
        "Classification": "hive-site",
        "Properties": {
          "hive.execution.engine": "tez"
        }
      },
      {
        "Classification": "pig-properties",
        "Properties": {
          "exectype": "tez"
        }
      }
    ]

  3. Create the cluster as in step 1, but add Pig as an application (Name=Pig).
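Before uploading a combined configuration, a quick local check can catch simple mistakes such as a missing Properties object or a non-string value. The sketch below checks only the shape used in this section (a list of objects, each with a Classification string and a Properties object of string values); it is not an official validator, and the name check_configuration is hypothetical.

```python
import json

# The combined Hive and Pig configuration from this procedure.
CONFIGURATION = [
    {"Classification": "hive-site",
     "Properties": {"hive.execution.engine": "tez"}},
    {"Classification": "pig-properties",
     "Properties": {"exectype": "tez"}},
]


def check_configuration(configuration):
    """Return a list of problems; an empty list means the shape looks right."""
    problems = []
    for index, entry in enumerate(configuration):
        if not isinstance(entry.get("Classification"), str):
            problems.append(f"entry {index}: Classification must be a string")
        properties = entry.get("Properties")
        if not isinstance(properties, dict):
            problems.append(f"entry {index}: Properties must be an object")
            continue
        for key, value in properties.items():
            if not isinstance(value, str):
                problems.append(f"entry {index}: value of {key} must be a string")
    return problems


problems = check_configuration(CONFIGURATION)
print(problems if problems else json.dumps(CONFIGURATION, indent=2))
```

An empty problem list says only that the file is well shaped; whether a property name is honored is still decided by the application when the cluster starts.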

Using Tez

The following example shows how to use Tez with the data and scripts from the tutorial Getting Started: Analyzing Big Data with Amazon EMR (see Step 3 of that tutorial).

Compare the Hive runtimes of MapReduce vs. Tez

  1. Create a cluster as shown in the procedure called To launch a cluster with Tez installed using the console. Choose Hive as an application in addition to Tez.

  2. Connect to the cluster using SSH. For more information, see Connect to the Master Node Using SSH.

  3. Run the Hive_CloudFront.q script using MapReduce with the following command, where region is the region in which your cluster is located:

    hive -f s3://region.elasticmapreduce.samples/cloudfront/code/Hive_CloudFront.q \
    -d INPUT=s3://region.elasticmapreduce.samples -d OUTPUT=s3://myBucket/mr-test/

    The output should look something like the following:

    <snip>
    Starting Job = job_1464200677872_0002, Tracking URL = http://ec2-host:20888/proxy/application_1464200677872_0002/
    Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1464200677872_0002
    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
    2016-05-27 04:53:11,258 Stage-1 map = 0%, reduce = 0%
    2016-05-27 04:53:25,820 Stage-1 map = 13%, reduce = 0%, Cumulative CPU 10.45 sec
    2016-05-27 04:53:32,034 Stage-1 map = 33%, reduce = 0%, Cumulative CPU 16.06 sec
    2016-05-27 04:53:35,139 Stage-1 map = 40%, reduce = 0%, Cumulative CPU 18.9 sec
    2016-05-27 04:53:37,211 Stage-1 map = 53%, reduce = 0%, Cumulative CPU 21.6 sec
    2016-05-27 04:53:41,371 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 25.08 sec
    2016-05-27 04:53:49,675 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 29.93 sec
    MapReduce Total cumulative CPU time: 29 seconds 930 msec
    Ended Job = job_1464200677872_0002
    Moving data to: s3://myBucket/mr-test/os_requests
    MapReduce Jobs Launched:
    Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 29.93 sec HDFS Read: 599 HDFS Write: 0 SUCCESS
    Total MapReduce CPU Time Spent: 29 seconds 930 msec
    OK
    Time taken: 49.699 seconds

  4. Using a text editor, replace the hive.execution.engine value with tez in /etc/hive/conf/hive-site.xml.

  5. Kill the HiveServer2 process with the following command:

    sudo kill -9 $(pgrep -f HiveServer2)

    Upstart automatically restarts the Hive server with your configuration changes loaded.

  6. Now run the job with the Tez execution engine using the following command:

    hive -f s3://region.elasticmapreduce.samples/cloudfront/code/Hive_CloudFront.q \
    -d INPUT=s3://region.elasticmapreduce.samples -d OUTPUT=s3://myBucket/tez-test/

    The output should look something like the following:

    Time taken: 0.517 seconds
    Query ID = hadoop_20160527050505_dcdc075f-8338-4041-adc3-d2ffe69dfcdd
    Total jobs = 1
    Launching Job 1 out of 1

    Status: Running (Executing on YARN cluster with App id application_1464200677872_0003)

    --------------------------------------------------------------------------------
    VERTICES          STATUS      TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
    --------------------------------------------------------------------------------
    Map 1 ..........  SUCCEEDED       1          1        0        0       0       0
    Reducer 2 ......  SUCCEEDED       1          1        0        0       0       0
    --------------------------------------------------------------------------------
    VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 27.61 s
    --------------------------------------------------------------------------------
    Moving data to: s3://myBucket/tez-test/os_requests
    OK
    Time taken: 30.711 seconds

    Running the same query with Tez took approximately 19 seconds (about 38%) less time: 30.711 seconds, compared with 49.699 seconds using MapReduce.
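The comparison can be checked directly from the two "Time taken" values in the sample output above. The following sketch does only that arithmetic; the function name time_saving is hypothetical, and the numbers come from a single sample run, so they are not a general benchmark.

```python
def time_saving(baseline_seconds, candidate_seconds):
    """Return (seconds saved, percent saved relative to the baseline)."""
    saved = baseline_seconds - candidate_seconds
    return saved, saved / baseline_seconds * 100


# "Time taken" values from the MapReduce run and the Tez run above.
saved, percent = time_saving(49.699, 30.711)
print(f"{saved:.3f} s saved ({percent:.1f}% less time)")
# prints: 18.988 s saved (38.2% less time)
```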

Tez Web UI

Tez has its own web user interface. To view the web UI, see:

http://masterDNS:8080/tez-ui

Timeline Server

The YARN Timeline Service is configured to run when Tez is installed. To view jobs submitted through the Tez or MapReduce execution engines using the timeline service, view the web UI:

http://masterDNS:8188

For more information, see View Web Interfaces Hosted on Amazon EMR Clusters in the Amazon EMR Management Guide.
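If you keep the master node's DNS name in a script, both addresses above can be derived from it. This is a small illustrative sketch: the function name tez_web_urls is hypothetical, and masterDNS is the same placeholder used in the URLs above, to be replaced with your master node's public DNS name.

```python
def tez_web_urls(master_dns):
    """Return the Tez UI and YARN Timeline Server URLs for a master node."""
    return {
        "tez_ui": f"http://{master_dns}:8080/tez-ui",
        "timeline_server": f"http://{master_dns}:8188",
    }


# masterDNS is a placeholder for the master node's public DNS name.
urls = tez_web_urls("masterDNS")
for name, url in urls.items():
    print(f"{name}: {url}")
```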