Hive jobs - Amazon EMR

Hive jobs

You can run Hive jobs on an application with the type parameter set to 'HIVE'. Jobs must be compatible with the Hive version referenced in the Amazon EMR release version. For example, when you run jobs on an application with Amazon EMR release 6.6.0, your job must be compatible with Apache Hive 3.1.2.

When you use the start-job-run API to run a Hive job, you must specify the following parameters.

Job runtime role (executionRoleArn)

This is an IAM role ARN that your application uses to execute Hive jobs. This role must contain the following permissions:

  • Read from S3 buckets or other data sources where your data resides

  • Read from S3 buckets or prefixes where your Hive query file and init query file reside

  • Read from and write to S3 buckets where your Hive scratch directory and Hive Metastore warehouse directory reside

  • Write to S3 buckets where you intend to write your final output

  • Write logs to an S3 bucket or prefix that S3MonitoringConfiguration specifies

  • Access to KMS keys if you use KMS keys to encrypt data in your S3 bucket

  • Access to the AWS Glue Data Catalog

If your Hive job reads data from or writes data to other data sources, specify the appropriate permissions in this IAM role. If you don't provide these permissions to the IAM role, your job might fail. For more information, see Job runtime roles.
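As a rough illustration of the permission list above, the following Python sketch assembles an IAM policy document. The bucket name, the KMS key ARN, the statement IDs, and the exact set of actions are assumptions chosen for the example; adjust them to the buckets, keys, and catalog resources that your job actually uses.

```python
import json
from typing import Optional

# Hypothetical names, used only for illustration.
DATA_BUCKET = "DOC-EXAMPLE-BUCKET"
KMS_KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"


def build_runtime_role_policy(bucket: str, kms_key_arn: Optional[str] = None) -> dict:
    """Return a policy document that roughly covers the permissions listed above."""
    statements = [
        {
            # Query files, scratch and warehouse directories, output, and logs.
            "Sid": "ReadWriteJobObjects",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
            "Resource": [f"arn:aws:s3:::{bucket}/*"],
        },
        {
            "Sid": "ListJobBucket",
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{bucket}"],
        },
        {
            # Assumed subset of AWS Glue Data Catalog actions; a real job may need more.
            "Sid": "GlueDataCatalogAccess",
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:GetTable",
                "glue:GetTables",
                "glue:GetPartition",
                "glue:GetPartitions",
                "glue:CreateTable",
                "glue:UpdateTable",
            ],
            "Resource": ["*"],
        },
    ]
    if kms_key_arn:
        # Only needed when the bucket data is encrypted with a KMS key.
        statements.append(
            {
                "Sid": "UseKmsKey",
                "Effect": "Allow",
                "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
                "Resource": [kms_key_arn],
            }
        )
    return {"Version": "2012-10-17", "Statement": statements}


policy = build_runtime_role_policy(DATA_BUCKET, KMS_KEY_ARN)
print(json.dumps(policy, indent=2))
```

The sketch grants access to a single bucket for simplicity. In practice you might scope each statement to separate prefixes for input data, scratch space, output, and logs.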

Job driver (jobDriver)

A job's driver provides input to the job. This parameter accepts only one value for the job type that you want to run. When you specify hive as the job type, a Hive query is passed to the job-driver parameter. This job type has the following parameters:

  • query – This is the reference in Amazon S3 to the Hive query file that you want to run.

  • parameters – These are the additional Hive configuration properties that you want to override. To override properties, pass them to this parameter as --hiveconf property=value. To override variables, pass them to this parameter as --hivevar key=value.

  • initQueryFile – This is the init Hive query file. It runs before your query, and you can use it to initialize tables.
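The three job driver fields above can be assembled programmatically. The following Python sketch is one possible helper, not an official one: the function name and the run_date variable are hypothetical, while the field names query, parameters, and initQueryFile and the --hiveconf and --hivevar flags come from the list above.

```python
from typing import Dict, Optional


def build_hive_job_driver(
    query: str,
    hiveconf: Optional[Dict[str, str]] = None,
    hivevar: Optional[Dict[str, str]] = None,
    init_query_file: Optional[str] = None,
) -> dict:
    """Assemble a jobDriver payload for a Hive job (hypothetical helper)."""
    args = []
    # Configuration property overrides: --hiveconf property=value
    for prop, value in (hiveconf or {}).items():
        args += ["--hiveconf", f"{prop}={value}"]
    # Variable overrides: --hivevar key=value
    for key, value in (hivevar or {}).items():
        args += ["--hivevar", f"{key}={value}"]

    hive = {"query": query}
    if args:
        # Values are joined with spaces and not quoted, so this sketch
        # assumes that no value contains whitespace.
        hive["parameters"] = " ".join(args)
    if init_query_file:
        hive["initQueryFile"] = init_query_file
    return {"hive": hive}


driver = build_hive_job_driver(
    "s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/query/hive-query.ql",
    hiveconf={"hive.log.explain.output": "false"},
    hivevar={"run_date": "2024-01-01"},  # hypothetical variable
)
print(driver)
```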

Configuration overrides (configurationOverrides)

Use this parameter to override application-level and monitoring-level configuration properties. This parameter accepts a JSON object with the following two fields:

  • applicationConfiguration – You can provide a configuration object in this field to override the default configurations for applications. You can use a shorthand syntax to provide the configuration, or you can reference the configuration object in a JSON file. Configuration objects consist of a classification, properties, and optional nested configurations. Properties consist of the settings that you want to override in that file. You can specify multiple classifications for multiple applications in a single JSON object. The configuration classifications that are available vary by specific release version for Amazon EMR. For a list of configuration classifications that are available for each release version of Amazon EMR, see Release versions.

    If you pass the same configuration in an application override and in Hive parameters, the Hive parameters take priority. The following list ranks configurations from highest priority to lowest priority.

    • Configuration that you provide as part of Hive parameters with --hiveconf property=value.

    • Configuration that you provide as part of application overrides.

    • Optimized configurations that Amazon EMR assigns for the release.

    • Default open source configurations for the application.

  • monitoringConfiguration – Use this field to specify the Amazon S3 URL (s3MonitoringConfiguration) where you want the EMR Serverless job to store logs of your Hive job. Make sure that you create this bucket with the same AWS account that hosts your application, and in the same AWS Region where your job is running.
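The priority ranking for application settings described above behaves like a layered lookup in which the first layer that defines a property wins. The following Python sketch models that idea with collections.ChainMap. It is only a mental model of the ranking, not how EMR Serverless resolves configuration internally, and the values in the lowest layer are illustrative rather than authoritative defaults.

```python
from collections import ChainMap


def effective_config(hive_parameters, application_overrides, emr_defaults, open_source_defaults):
    """Resolve each property from the highest-priority layer that defines it."""
    # ChainMap searches its maps from left to right, so earlier maps win.
    return dict(
        ChainMap(hive_parameters, application_overrides, emr_defaults, open_source_defaults)
    )


resolved = effective_config(
    # 1. --hiveconf property=value passed in the Hive parameters
    hive_parameters={"hive.log.explain.output": "false"},
    # 2. applicationConfiguration overrides
    application_overrides={
        "hive.log.explain.output": "true",
        "hive.tez.container.size": "4096",
    },
    # 3. Amazon EMR release defaults
    emr_defaults={"hive.tez.container.size": "6144", "hive.exec.reducers.max": "256"},
    # 4. Open source defaults (illustrative values only)
    open_source_defaults={"hive.exec.reducers.max": "1009", "hive.merge.tezfiles": "false"},
)
print(resolved)
```

In this example, hive.log.explain.output resolves to the value from the Hive parameters even though the application override sets it differently.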

Hive job properties

The following table lists the mandatory properties that you must configure when you submit a Hive job.

Setting | Description
hive.exec.scratchdir | The Amazon S3 location where temporary files are created during the Hive job execution.
hive.metastore.warehouse.dir | The Amazon S3 location of databases for managed tables in Hive.
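Because a job needs both mandatory properties, it can help to check for them before you submit. The following Python sketch builds a hive-site classification object and rejects input that omits either property. The function name is hypothetical, and the s3:// prefix check is an assumption based on the table describing both values as Amazon S3 locations.

```python
MANDATORY_HIVE_PROPERTIES = ("hive.exec.scratchdir", "hive.metastore.warehouse.dir")


def build_hive_site(properties: dict) -> dict:
    """Return a hive-site classification object after checking mandatory properties."""
    missing = [name for name in MANDATORY_HIVE_PROPERTIES if not properties.get(name)]
    if missing:
        raise ValueError(f"Missing mandatory Hive properties: {', '.join(missing)}")

    # Assumption: both locations are expected to be Amazon S3 URIs.
    not_s3 = [
        name for name in MANDATORY_HIVE_PROPERTIES
        if not properties[name].startswith("s3://")
    ]
    if not_s3:
        raise ValueError(f"Expected s3:// locations for: {', '.join(not_s3)}")

    return {"classification": "hive-site", "properties": dict(properties)}


hive_site = build_hive_site(
    {
        "hive.exec.scratchdir": "s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/hive/scratch",
        "hive.metastore.warehouse.dir": "s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/hive/warehouse",
    }
)
print(hive_site)
```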

The following table lists the optional Hive properties and their default values that you can override when you submit a Hive job.

Setting | Description | Default value
hive.driver.memory | The amount of memory to use per Hive driver process. The Hive CLI and Tez Application Master share this memory equally with 20% of headroom. | 6G
hive.driver.cores | The number of cores to use for the Hive driver process. | 2
hive.driver.disk | The disk size for the Hive driver. | 21G
hive.metastore.client.factory.class | The name of the factory class that produces objects that implement the IMetaStoreClient interface. | com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
hive.metastore.glue.catalogid | If the AWS Glue Data Catalog acts as a metastore but runs in a different AWS account than the jobs, the ID of the AWS account where the jobs are running. | NULL
javax.jdo.option.ConnectionDriverName | The driver class name for a JDBC metastore. | org.apache.derby.jdbc.EmbeddedDriver
javax.jdo.option.ConnectionURL | The JDBC connect string for a JDBC metastore. | jdbc:derby:;databaseName=metastore_db;create=true
javax.jdo.option.ConnectionUserName | The user name associated with a metastore database. | NULL
javax.jdo.option.ConnectionPassword | The password associated with a metastore database. | NULL
hive.metastore.uris | The Thrift URI that the metastore client uses to connect to a remote metastore. | NULL
hive.tez.disk.size | The disk size for each task container. | 21G
hive.prewarm.enabled | Option that turns on container prewarm for Tez. | FALSE
hive.prewarm.numcontainers | The number of containers to pre-warm for Tez. | 10
hive.tez.container.size | The amount of memory to use per Tez task process. | 6144
hive.tez.cpu.vcores | The number of cores to use for each Tez task. | 2
hive.max-task-containers | The maximum number of concurrent containers. The configured mapper memory is multiplied by this value to determine available memory that split computation and task preemption use. | 1000
hive.exec.reducers.max | The maximum number of reducers. | 256
[setting name missing] | A join converts directly to a mapjoin below this size. | Optimal value is calculated based on Tez task memory
[setting name missing] | The size of the soft buffer when output is sorted. | Optimal value is calculated based on Tez task memory
tez.runtime.unordered.output.buffer.size-mb | The size of the buffer to use if not writing directly to disk. | Optimal value is calculated based on Tez task memory
[setting name missing] | The maximum number of attempts that can fail for a particular task before the task fails. This number doesn't count manually terminated attempts. | 3
hive.exec.stagingdir | The name of the directory that stores temporary files that will be created inside table locations and in the scratch directory location specified in the hive.exec.scratchdir property. | .hive-staging
hive.compute.query.using.stats | Option that activates Hive to answer certain queries with statistics stored in the metastore. For basic statistics, set hive.stats.autogather to true. For a more advanced collection of queries, run analyze table queries. | TRUE
hive.vectorized.execution.enabled | Option that turns on vectorized mode of query execution. | TRUE
hive.cbo.enable | Option that turns on cost-based optimizations with the Calcite framework. | TRUE
[setting name missing] | Option that turns on the Tez auto-reducer parallelism feature. Hive still estimates data sizes and sets parallelism estimates. Tez samples the output sizes of source vertices and adjusts the estimates at runtime as necessary. | FALSE
hive.stats.fetch.column.stats | Option that turns off the fetch of column statistics from the metastore. A fetch of column statistics can be expensive when the number of columns is high. | FALSE
hive.vectorized.execution.reduce.enabled | Option that turns on vectorized mode of a query execution's reduce-side. | TRUE
hive.exec.max.dynamic.partitions.pernode | The maximum number of dynamic partitions allowed to be created in each mapper and reducer node. | 100
hive.exec.max.dynamic.partitions | The maximum number of dynamic partitions allowed to be created in total. | 1000
[setting name missing] | Option that turns on optimization in converting a common join into a mapjoin based on the input file size. | TRUE
hive.exec.dynamic.partition.mode | In strict mode, you must specify at least one static partition in case you accidentally overwrite all partitions. In non-strict mode, all partitions are allowed to be dynamic. | strict
hive.merge.tezfiles | Option that turns on a merge of small files at the end of a Tez DAG. | FALSE
hive.strict.checks.cartesian.product | Option that turns on strict Cartesian join checks. These checks disallow a Cartesian product (a cross join). | FALSE
hive.stats.autogather | Option that causes basic statistics to be gathered automatically during the INSERT OVERWRITE command. | TRUE
hive.exec.orc.split.strategy | Expects one of the following values: BI, ETL, or HYBRID. This isn't a user-level configuration. BI specifies that you want to spend less time in split generation as opposed to query execution (split generation does not read or cache file footers). ETL specifies that you want to spend more time in split generation (split generation reads and caches file footers). HYBRID specifies a choice of the above strategies based on heuristics. | HYBRID
[setting name missing] | Option that turns on auto-conversion of common joins into mapjoins, based on the input file size. | TRUE
hive.default.fileformat | The default file format for CREATE TABLE statements. You can explicitly override this if you specify STORED AS [FORMAT] in your CREATE TABLE command. | TEXTFILE
hive.exec.reducers.bytes.per.reducer | The size per reducer. The default is 256 MB. If the input size is 1G, the job uses 4 reducers. | 256000000
hive.exec.dynamic.partition | Option that turns on dynamic partitions in DML/DDL. | TRUE
hive.merge.size.per.task | The size of merged files at the end of the job. | 256000000
hive.merge.mapfiles | Option that causes small files to merge at the end of a map-only job. | TRUE
hive.fetch.task.conversion | Expects one of the following values: NONE, MINIMAL, or MORE. Some select queries can be converted to a single FETCH task. This minimizes latency. | MORE
hive.stats.gather.num.threads | The number of threads that the partialscan and noscan analyze commands use for partitioned tables. This only applies to file formats that implement StatsProvidingRecordReader (like ORC). | 10
hive.optimize.ppd | Option that turns on predicate pushdown. | TRUE
hive.input.format | The default input format. Set to HiveInputFormat if you encounter problems with CombineHiveInputFormat. | [default value missing]
[setting name missing] | Option that turns on predicate pushdown to storage handlers. | TRUE
hive.groupby.position.alias | Enables using a column position alias in GROUP BY statements. | FALSE
hive.orderby.position.alias | Enables using a column position alias in ORDER BY statements. | TRUE
hive.mapred.reduce.tasks.speculative.execution | Option that turns on speculative launch for reducers. | TRUE
[setting name missing] | Expects a value of NONE or COLUMN. NONE implies only alphanumeric and underscore characters are valid in identifiers. COLUMN implies column names can contain any character. | COLUMN
hive.tez.min.partition.factor | The lower limit of reducers that Tez specifies when you turn on auto-reducer parallelism. | 0.25
[setting name missing] | Option that turns on strict type safety checks and turns off comparison of bigint with both string and double. | TRUE
hive.log.explain.output | Option that turns on explanations of extended output for any query in your Hive log. | true
hive.log.level | The Hive logging level. | INFO
[setting name missing] | The root logging level passed to the Tez app master. | INFO
tez.task.log.level | The root logging level passed to the Tez tasks. | INFO
tez.grouping.max-size | The upper size limit (in bytes) of a grouped split. This limit prevents excessively large splits. | 1073741824
tez.grouping.min-size | The lower size limit (in bytes) of a grouped split. This limit prevents too many small splits. | 52428800
tez.shuffle-vertex-manager.min-src-fraction | The fraction of source tasks that must complete before tasks for the current vertex are scheduled (in case of a ScatterGather connection). | 0.25
tez.shuffle-vertex-manager.max-src-fraction | The fraction of source tasks that must complete before all tasks on the current vertex can be scheduled (in case of a ScatterGather connection). The number of tasks ready for scheduling on the current vertex scales linearly between min-fraction and max-fraction. This defaults to the default value or tez.shuffle-vertex-manager.min-src-fraction, whichever is greater. | 0.75
[setting name missing] | Option that causes speculative launch of slower tasks. This can help reduce job latency when some tasks are running slower due to bad or slow machines. | FALSE
[setting name missing] | Option that turns on cleanup of shuffle data when the DAG completes. | TRUE
tez.client.asynchronous-stop | Enables pushing of ATS events before ending the Hive driver. | FALSE
[setting name missing] | The amount of time after which ATS events should be pushed upon AM shutdown request. | 0
tez.yarn.ats.event.flush.timeout.millis | The maximum amount of time that AM should wait for events to be flushed before shutting down. | 300000
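The reducer arithmetic in the hive.exec.reducers.bytes.per.reducer row can be worked through in a few lines. The following Python sketch is a simplified estimate, not Hive's actual planner logic: it divides the input size by the bytes-per-reducer setting, rounds up, and caps the result at hive.exec.reducers.max. It also assumes that "1G" in the table means 1,000,000,000 bytes. Features such as Tez auto-reducer parallelism can change the real number at runtime.

```python
import math


def estimate_reducers(
    input_bytes: int,
    bytes_per_reducer: int = 256_000_000,  # hive.exec.reducers.bytes.per.reducer default
    max_reducers: int = 256,               # hive.exec.reducers.max default
) -> int:
    """Simplified reducer estimate: ceil(input / bytes per reducer), capped at the maximum."""
    wanted = math.ceil(input_bytes / bytes_per_reducer)
    # Always use at least one reducer and never more than the configured cap.
    return max(1, min(max_reducers, wanted))


# Assumption: 1G is read as 10**9 bytes, so 10**9 / 256000000 = 3.90625, rounded up to 4.
print(estimate_reducers(10**9))   # 4
# A 1 TB input would want 3907 reducers but is capped at hive.exec.reducers.max.
print(estimate_reducers(10**12))  # 256
```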

Hive job examples

The following code example shows how to run a Hive query with the StartJobRun API.

    aws emr-serverless start-job-run \
        --application-id application-id \
        --execution-role-arn job-role-arn \
        --job-driver '{
            "hive": {
                "query": "s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/query/hive-query.ql",
                "parameters": "--hiveconf hive.log.explain.output=false"
            }
        }' \
        --configuration-overrides '{
            "applicationConfiguration": [{
                "classification": "hive-site",
                "properties": {
                    "hive.exec.scratchdir": "s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/hive/scratch",
                    "hive.metastore.warehouse.dir": "s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/hive/warehouse",
                    "hive.driver.cores": "2",
                    "hive.driver.memory": "4g",
                    "hive.tez.container.size": "4096",
                    "hive.tez.cpu.vcores": "1"
                }
            }]
        }'
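If you prefer to submit the job from Python, the same request can be expressed as a dictionary and passed to the AWS SDK. The following sketch mirrors the CLI example above. The helper names are hypothetical, and the submit function assumes that boto3 is installed and that credentials are configured, so it is defined here but not called.

```python
from typing import Optional


def build_start_job_run_request(application_id: str, execution_role_arn: str, bucket: str) -> dict:
    """Build StartJobRun arguments that mirror the CLI example (hypothetical helper)."""
    base = f"s3://{bucket}/emr-serverless-hive"
    return {
        "applicationId": application_id,
        "executionRoleArn": execution_role_arn,
        "jobDriver": {
            "hive": {
                "query": f"{base}/query/hive-query.ql",
                "parameters": "--hiveconf hive.log.explain.output=false",
            }
        },
        "configurationOverrides": {
            "applicationConfiguration": [
                {
                    "classification": "hive-site",
                    "properties": {
                        "hive.exec.scratchdir": f"{base}/hive/scratch",
                        "hive.metastore.warehouse.dir": f"{base}/hive/warehouse",
                        "hive.driver.cores": "2",
                        "hive.driver.memory": "4g",
                        "hive.tez.container.size": "4096",
                        "hive.tez.cpu.vcores": "1",
                    },
                }
            ]
        },
    }


def submit(request: dict, region_name: Optional[str] = None) -> str:
    """Submit the request with boto3 and return the job run ID."""
    import boto3  # third-party dependency, imported lazily so the sketch loads without it

    client = boto3.client("emr-serverless", region_name=region_name)
    response = client.start_job_run(**request)
    return response["jobRunId"]


# Placeholder IDs, as in the CLI example; replace them with real values before submitting.
request = build_start_job_run_request("application-id", "job-role-arn", "DOC-EXAMPLE-BUCKET")
print(request["jobDriver"])
```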

You can find additional examples of how to run Hive jobs in the EMR Serverless Samples GitHub repository.