Configuring Tez
You can customize Tez by setting values using the tez-site configuration classification, which configures settings in the tez-site.xml configuration file. For more information, see TezConfiguration. To set Tez as the execution engine for Hive or Pig, use the hive-site and pig-properties configuration classifications as appropriate. Examples are shown below.
Example configuration
Example: Customizing the Tez root logging level and setting Tez as the execution engine for Hive and Pig
The example create-cluster command shown below creates a cluster with Tez, Hive, and Pig installed. The command references a file stored in Amazon S3, myConfig.json, which specifies properties for the tez-site classification that set tez.am.log.level to DEBUG, and sets the execution engine to Tez for Hive and Pig using the hive-site and pig-properties configuration classifications.
Note
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).
aws emr create-cluster --release-label emr-7.2.0 \
--applications Name=Tez Name=Hive Name=Pig --ec2-attributes KeyName=myKey \
--instance-type m5.xlarge --instance-count 3 \
--configurations https://s3.amazonaws.com/amzn-s3-demo-bucket/myfolder/myConfig.json --use-default-roles
Example contents of myConfig.json are shown below.
[
  {
    "Classification": "tez-site",
    "Properties": {
      "tez.am.log.level": "DEBUG"
    }
  },
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.execution.engine": "tez"
    }
  },
  {
    "Classification": "pig-properties",
    "Properties": {
      "exectype": "tez"
    }
  }
]
Note
With Amazon EMR version 5.21.0 and later, you can override cluster configurations and specify additional configuration classifications for each instance group in a running cluster. You do this by using the Amazon EMR console, the AWS Command Line Interface (AWS CLI), or the AWS SDK. For more information, see Supplying a Configuration for an Instance Group in a Running Cluster.
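As a minimal sketch of that reconfiguration workflow, the following command applies a tez-site classification to one instance group in a running cluster. The cluster ID, instance group ID, and the file name instance-group-config.json are placeholders for illustration; substitute the values for your own cluster.

```shell
# Contents of instance-group-config.json (placeholder IDs shown):
# [
#   {
#     "InstanceGroupId": "ig-XXXXXXXXXXXXX",
#     "Configurations": [
#       {
#         "Classification": "tez-site",
#         "Properties": { "tez.am.log.level": "INFO" }
#       }
#     ]
#   }
# ]

# Apply the new configuration to the running instance group
aws emr modify-instance-groups \
    --cluster-id j-XXXXXXXXXXXXX \
    --instance-groups file://instance-group-config.json
```

Reconfiguring a running instance group in this way is available with Amazon EMR version 5.21.0 and later, as noted above.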
Tez asynchronous split opening
When there is a large number of small files in the table path and a query attempts to read them all, each small file, corresponding to an individual split, gets combined under one Tez grouped split. A single mapper then processes the single Tez grouped split. Because the execution is synchronous, each individual split under the grouped split is processed one by one, which requires RecordReader objects to synchronously process the splits.
Name | Classification | Description |
---|---|---|
tez.grouping.split.init.threads | tez-site | Specifies the number of daemon threads that Tez uses to pre-initiate the RecordReader objects. |
tez.grouping.split.init.recordreaders | tez-site | Specifies the number of RecordReader objects to pre-initiate asynchronously. |
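For illustration, a tez-site classification entry that sets both properties might look like the following sketch; the values 4 and 10 mirror the benchmark configuration described later in this section.

```json
[
  {
    "Classification": "tez-site",
    "Properties": {
      "tez.grouping.split.init.threads": "4",
      "tez.grouping.split.init.recordreaders": "10"
    }
  }
]
```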
Benchmarking for Tez asynchronous split opening
We used the following environments and configurations for benchmarking the Tez asynchronous split opening capability:
- Benchmark environment – Amazon EMR cluster with 1 primary node that uses m5.16xlarge, and 16 core nodes that use m5.16xlarge.
- Benchmark configurations – To simulate the scenario for benchmarking where a large number of input splits are in a single Tez grouped split, tez.grouping.split-count is set to 1.
- Table used for benchmarking – The table contains 200 partitions, with each partition containing a single file. The benchmark is done once when the table contains CSV files, and once when it contains Parquet files. Hive query for benchmarking: run SELECT COUNT(*) against the table ten times, and take the average runtime.
- Configurations to enable Tez asynchronous split opening – As follows:
  - tez.grouping.split.init.threads=4
  - tez.grouping.split.init.recordreaders=10
Dataset | Feature disabled (baseline) | Feature enabled | Improvement |
---|---|---|---|
CSV dataset | 90.26 seconds | 79.20 seconds | 12.25% |
Parquet dataset | 54.67 seconds | 42.23 seconds | 22.75% |