Amazon EMR
Developer Guide

Using the AWS Glue Data Catalog as the Metastore for Hive

Using Amazon EMR version 5.8.0 or later, you can configure Hive to use the AWS Glue Data Catalog as its metastore. We recommend this configuration when you require a persistent metastore or a metastore shared by different clusters, services, and applications.

AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores. The AWS Glue Data Catalog provides a unified metadata repository across a variety of data sources and data formats, integrating with Amazon EMR as well as Amazon RDS, Amazon Redshift, Redshift Spectrum, Athena, and any application compatible with the Apache Hive metastore. AWS Glue crawlers can automatically infer schema from source data in Amazon S3 and store the associated metadata in the Data Catalog. For more information about the Data Catalog, see Populating the AWS Glue Data Catalog in the AWS Glue Developer Guide.

Separate charges apply for AWS Glue. There is a monthly rate for storing and accessing the metadata in the Data Catalog, an hourly rate billed per minute for AWS Glue ETL jobs and crawler runtime, and an hourly rate billed per minute for each provisioned development endpoint. The Data Catalog allows you to store up to a million objects at no charge. If you store more than a million objects, you are charged $1 (USD) for each 100,000 objects over a million. An object in the Data Catalog is a table, partition, or database. For more information, see Glue Pricing.
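As a worked illustration of the storage rate described above, the following Python sketch computes the monthly object-storage charge. It covers only the object-storage component, not access, ETL, crawler, or development endpoint charges. The function name and the assumption that a partial block of 100,000 objects is prorated linearly are illustrative and are not stated in this guide; see Glue Pricing for the authoritative rules.

```python
FREE_OBJECTS = 1_000_000      # objects stored at no charge
BLOCK_SIZE = 100_000          # pricing unit above the free tier
PRICE_PER_BLOCK_USD = 1.00    # charge per 100,000 objects over a million


def monthly_storage_charge_usd(object_count: int) -> float:
    """Estimate the monthly Data Catalog storage charge in USD.

    Assumption: partial blocks are prorated linearly (not confirmed here).
    An object is a table, partition, or database.
    """
    if object_count < 0:
        raise ValueError("object_count must be non-negative")
    billable = max(0, object_count - FREE_OBJECTS)
    return billable / BLOCK_SIZE * PRICE_PER_BLOCK_USD


if __name__ == "__main__":
    for count in (750_000, 1_000_000, 1_500_000):
        print(count, monthly_storage_charge_usd(count))
```

Under these assumptions, a catalog holding 1,500,000 objects is 500,000 objects over the free tier, which is five blocks of 100,000.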

Important

If you created tables using Amazon Athena or Amazon Redshift Spectrum before August 14, 2017, databases and tables are stored in an Athena-managed catalog, which is separate from the AWS Glue Data Catalog. To integrate Amazon EMR with these tables, you must upgrade to the AWS Glue Data Catalog. For more information, see Upgrading to the AWS Glue Data Catalog in the Amazon Athena User Guide.

Specifying AWS Glue Data Catalog as the Metastore

You can specify the AWS Glue Data Catalog as the metastore using the AWS Management Console, AWS CLI, or Amazon EMR API. When you create a cluster using the CLI or API, you use the hive-site configuration classification to specify the Data Catalog. When you create a cluster using the console, you can specify the Data Catalog using Advanced Options or Quick Options.

Note

The option to use the Data Catalog is also available with HCatalog because Hive is installed with HCatalog.

To specify AWS Glue Data Catalog as the metastore using the console

  1. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.

  2. Choose Create cluster, Go to advanced options.

  3. For Release, choose emr-5.8.0 or later.

  4. Under Release, select Hive or HCatalog.

  5. Under AWS Glue Data Catalog settings, select Use for Hive table metadata.

  6. Choose other options for your cluster as appropriate, choose Next, and then configure the remaining cluster settings as appropriate for your application.

To specify the AWS Glue Data Catalog as the metastore using the AWS CLI or Amazon EMR API

  • Specify the value for hive.metastore.client.factory.class using the hive-site classification as shown in the following example. For more information, see Configuring Applications.

    Example Configuration JSON for Using the AWS Glue Data Catalog

    [
      {
        "Classification": "hive-site",
        "Properties": {
          "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
        }
      }
    ]
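One way to produce this configuration programmatically is sketched below in Python. The script builds the hive-site classification, confirms that it serializes to valid JSON, and assembles (without running) an `aws emr create-cluster` command that passes the file through `--configurations`. The file name `glue-metastore.json`, the instance type, and the instance count are hypothetical values chosen for illustration; adjust them for your environment.

```python
import json
import shlex

GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def glue_metastore_configuration():
    """Return the hive-site classification that points Hive at the Data Catalog."""
    return [
        {
            "Classification": "hive-site",
            "Properties": {"hive.metastore.client.factory.class": GLUE_FACTORY},
        }
    ]


def create_cluster_command(config_path, release_label="emr-5.8.0"):
    """Build, but do not run, an `aws emr create-cluster` argument list."""
    return [
        "aws", "emr", "create-cluster",
        "--release-label", release_label,
        "--applications", "Name=Hive",
        "--configurations", f"file://{config_path}",
        "--instance-type", "m4.large",   # hypothetical instance type
        "--instance-count", "3",         # hypothetical cluster size
        "--use-default-roles",
    ]


if __name__ == "__main__":
    config_json = json.dumps(glue_metastore_configuration(), indent=2)
    json.loads(config_json)  # raises ValueError if the JSON is malformed
    print(config_json)       # save this output as glue-metastore.json
    print(shlex.join(create_cluster_command("./glue-metastore.json")))
```

Generating the JSON with a serializer rather than writing it by hand avoids syntax slips, such as a trailing comma, that would cause the configuration to be rejected.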

IAM Permissions

The EMR_EC2_DefaultRole must have IAM permissions for AWS Glue actions. This matters only if you attach a customer-managed policy to the role instead of the default AmazonElasticMapReduceforEC2Role managed policy. In that case, configure your policy to allow the required AWS Glue actions. To see which actions are required, open the IAM console (https://console.aws.amazon.com/iam/) and view the contents of the AmazonElasticMapReduceforEC2Role managed policy.
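If you maintain a customer-managed policy, a quick local check can show which AWS Glue actions it does not yet allow. The Python sketch below is a simplified illustration, not an IAM policy evaluator: it considers only Allow statements and their Action patterns, and it ignores Deny statements, resources, and conditions. The action list is a small illustrative subset; take the authoritative list from the AmazonElasticMapReduceforEC2Role managed policy. The function and constant names are hypothetical.

```python
from fnmatch import fnmatchcase

# Illustrative subset only; copy the full list from the managed policy.
ILLUSTRATIVE_GLUE_ACTIONS = [
    "glue:GetDatabase",
    "glue:GetTable",
    "glue:GetPartitions",
    "glue:CreateTable",
]


def _as_list(value):
    """IAM allows a single string or a list for Action and Statement."""
    return [value] if isinstance(value, (str, dict)) else list(value)


def missing_glue_actions(policy, required):
    """Return the required actions that no Allow statement matches.

    Simplification: Deny statements, Resource, and Condition are ignored.
    """
    patterns = []
    for statement in _as_list(policy.get("Statement", [])):
        if statement.get("Effect") == "Allow" and "Action" in statement:
            patterns.extend(_as_list(statement["Action"]))
    # Action names are case-insensitive; "*" and "?" act as wildcards.
    return [
        action
        for action in required
        if not any(fnmatchcase(action.lower(), p.lower()) for p in patterns)
    ]


if __name__ == "__main__":
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": ["glue:Get*"], "Resource": "*"}
        ],
    }
    print(missing_glue_actions(policy, ILLUSTRATIVE_GLUE_ACTIONS))
```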

Unsupported Configurations, Functions, and Known Issues

The following limitations apply when you use the AWS Glue Data Catalog as a metastore:

  • Adding auxiliary JARs using the Hive shell is not supported. As a workaround, add auxiliary JARs into the Hive classpath (specified using hive.aux.jars.path).

  • Renaming tables from within AWS Glue is not supported.

  • Partition values containing quotes and apostrophes are not supported (for example, PARTITION (owner="Doe's")).

  • Table and partition statistics are not supported.

  • Hive constraints are not supported.

  • Using Hive authorization is not supported.

  • Setting hive.metastore.partition.inherit.table.properties is not supported.

  • Using the following metastore constants is not supported: BUCKET_COUNT, BUCKET_FIELD_NAME, DDL_TIME, FIELD_TO_DIMENSION, FILE_INPUT_FORMAT, FILE_OUTPUT_FORMAT, HIVE_FILTER_FIELD_LAST_ACCESS, HIVE_FILTER_FIELD_OWNER, HIVE_FILTER_FIELD_PARAMS, IS_ARCHIVED, META_TABLE_COLUMNS, META_TABLE_COLUMN_TYPES, META_TABLE_DB, META_TABLE_LOCATION, META_TABLE_NAME, META_TABLE_PARTITION_COLUMNS, META_TABLE_SERDE, META_TABLE_STORAGE, ORIGINAL_LOCATION.

  • When you use a predicate expression, explicit values must be on the right side of the comparison operator, or queries might fail.

    • Correct: SELECT * FROM mytable WHERE time > 11

    • Incorrect: SELECT * FROM mytable WHERE 11 > time

  • Using user-defined functions (UDFs) in predicate expressions is not recommended. Queries may fail because of the way Hive tries to optimize query execution.
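To illustrate the predicate rule in the list above, the following Python sketch rewrites a simple comparison so that the explicit value ends up on the right side of the operator. It is a hypothetical helper for single comparisons supplied as three separate tokens, not a SQL parser, and it recognizes only plain numbers and single-quoted strings as explicit values.

```python
import re

# Swapping operands requires mirroring the operator to keep the meaning.
MIRRORED = {
    "<": ">", ">": "<", "<=": ">=", ">=": "<=",
    "=": "=", "!=": "!=", "<>": "<>",
}

_NUMBER = re.compile(r"^-?\d+(\.\d+)?$")


def is_literal(token: str) -> bool:
    """Treat plain numbers and single-quoted strings as explicit values."""
    token = token.strip()
    if len(token) >= 2 and token[0] == "'" and token[-1] == "'":
        return True
    return bool(_NUMBER.match(token))


def normalize_comparison(left: str, op: str, right: str) -> str:
    """Return the comparison with any explicit value moved to the right."""
    if op not in MIRRORED:
        raise ValueError(f"unsupported operator: {op}")
    if is_literal(left) and not is_literal(right):
        return f"{right.strip()} {MIRRORED[op]} {left.strip()}"
    return f"{left.strip()} {op} {right.strip()}"


if __name__ == "__main__":
    predicate = normalize_comparison("11", ">", "time")
    print(f"SELECT * FROM mytable WHERE {predicate}")
```

Note that moving the value to the right means mirroring the operator: 11 > time becomes time < 11, not time > 11.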