Using Presto with the AWS Glue Data Catalog

Using Amazon EMR release version 5.10.0 and later, you can specify the AWS Glue Data Catalog as the default Hive metastore for Presto. You can specify this option when you create a cluster using the AWS Management Console, or using the presto-connector-hive configuration classification when using the AWS CLI or Amazon EMR API. For more information, see Configuring Applications.

To specify the AWS Glue Data Catalog as the default Hive metastore using the console

  1. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.

  2. Choose Create cluster, Go to advanced options.

  3. Under Software Configuration, choose a Release of emr-5.10.0 or later and select Presto.

  4. Select Use for Presto table metadata, choose Next, and then complete other settings for your cluster as appropriate for your application.

To specify the AWS Glue Data Catalog as the default Hive metastore using the CLI or API

  • Amazon EMR 5.16.0 and later

    Set the hive.metastore property to glue, as shown in the following JSON example.

    [
      {
        "Classification": "presto-connector-hive",
        "Properties": {
          "hive.metastore": "glue"
        }
      }
    ]

  • Amazon EMR 5.10.0 through 5.15.0

    Set the hive.metastore.glue.datacatalog.enabled property to true, as shown in the following JSON example.

    [
      {
        "Classification": "presto-connector-hive",
        "Properties": {
          "hive.metastore.glue.datacatalog.enabled": "true"
        }
      }
    ]
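The choice between the two properties depends only on the release version, so it can be scripted. The following Python sketch is illustrative rather than an official tool: the function name presto_glue_configuration is hypothetical, and it assumes release labels of the form emr-X.Y.Z. It returns the same presto-connector-hive classification shown in the JSON examples.

```python
import json


def presto_glue_configuration(release_label):
    """Build the presto-connector-hive classification for a release label.

    Hypothetical helper: assumes labels look like 'emr-5.16.0'.
    """
    # Strip the 'emr-' prefix if present, then parse '5.16.0' into (5, 16, 0).
    raw = release_label[4:] if release_label.startswith("emr-") else release_label
    version = tuple(int(part) for part in raw.split("."))

    if version < (5, 10, 0):
        raise ValueError("Glue Data Catalog support for Presto requires 5.10.0 or later")

    if version >= (5, 16, 0):
        # Amazon EMR 5.16.0 and later
        properties = {"hive.metastore": "glue"}
    else:
        # Amazon EMR 5.10.0 through 5.15.0
        properties = {"hive.metastore.glue.datacatalog.enabled": "true"}

    return [{"Classification": "presto-connector-hive", "Properties": properties}]


if __name__ == "__main__":
    print(json.dumps(presto_glue_configuration("emr-5.16.0"), indent=2))
```

The printed JSON can be saved to a file and supplied as the configuration when you create the cluster, for example through the --configurations option of the aws emr create-cluster command.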

To switch metastores on a long-running cluster, you can set these values manually: connect to the master node, edit the property values for your release version directly in the /etc/presto/conf/catalog/hive.properties file, and restart the Presto server (sudo restart presto-server). If you use this method on a release version earlier than 5.16.0, make sure that hive.table-statistics-enabled is set to false. This setting is not required with release versions 5.16.0 and later, but table and partition statistics are still not supported.
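The manual edit described above amounts to setting a few key=value lines in hive.properties. The following Python sketch shows one way to compute the updated file contents. The helper names set_properties and glue_updates are hypothetical, and the sample file contents are invented for illustration; the sketch only transforms text and leaves reading the file, writing it back, and restarting Presto to you.

```python
def set_properties(text, updates):
    """Return properties text with each key in `updates` set.

    Existing key=value lines are replaced in place; missing keys are appended.
    Comment lines and unrelated properties are left untouched.
    """
    remaining = dict(updates)
    lines = []
    for line in text.splitlines():
        stripped = line.strip()
        key = None
        if "=" in stripped and not stripped.startswith("#"):
            key = stripped.split("=", 1)[0].strip()
        if key in remaining:
            lines.append("{}={}".format(key, remaining.pop(key)))
        else:
            lines.append(line)
    # Append any properties that were not already present in the file.
    for key, value in remaining.items():
        lines.append("{}={}".format(key, value))
    return "\n".join(lines) + "\n"


def glue_updates(version):
    """Pick the properties to set for a release version tuple such as (5, 12, 0)."""
    if version >= (5, 16, 0):
        return {"hive.metastore": "glue"}
    # Earlier than 5.16.0: also turn table statistics off.
    return {
        "hive.metastore.glue.datacatalog.enabled": "true",
        "hive.table-statistics-enabled": "false",
    }


if __name__ == "__main__":
    # Invented sample contents, not the actual file shipped on a cluster.
    sample = "connector.name=hive-hadoop2\nhive.table-statistics-enabled=true\n"
    print(set_properties(sample, glue_updates((5, 12, 0))))
```

After writing the result back to /etc/presto/conf/catalog/hive.properties, restart the Presto server as described above for the change to take effect.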

Considerations When Using AWS Glue Data Catalog

Consider the following items when using AWS Glue Data Catalog as a metastore with Presto:

  • Renaming tables from within AWS Glue is not supported.

  • When you create a Hive table without specifying a LOCATION, the table definition is stored in the location specified by the hive.metastore.warehouse.dir property. By default, this is a location in HDFS. If another cluster needs to access the table, it fails unless it has adequate permissions to the cluster that created the table. Furthermore, because HDFS storage is transient, if the cluster terminates, the table definition is lost, and the table must be recreated. We recommend that you specify a LOCATION in Amazon S3 when you create a Hive table using AWS Glue. Alternatively, you can use the hive-site configuration classification to specify a location in Amazon S3 for hive.metastore.warehouse.dir, which applies to all Hive tables. If a table is created in an HDFS location and the cluster that created it is still running, you can update the table location to Amazon S3 from within AWS Glue. For more information, see Working with Tables on the AWS Glue Console in the AWS Glue Developer Guide.

  • Partition values containing quotes and apostrophes are not supported (for example, PARTITION (owner="Doe's")).

  • Column statistics are not supported.

  • Using Hive authorization is not supported.
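For the warehouse location item above, the hive-site classification that points hive.metastore.warehouse.dir at Amazon S3 can be generated the same way as the Presto classification. This is a minimal sketch: the function name hive_warehouse_configuration is hypothetical, and the bucket and prefix in the example are placeholders, not real locations.

```python
import json


def hive_warehouse_configuration(s3_location):
    """Build a hive-site classification that stores Hive tables in Amazon S3.

    Hypothetical helper; `s3_location` is expected to be an s3:// URI.
    """
    if not s3_location.startswith("s3://"):
        raise ValueError("expected an Amazon S3 location such as s3://bucket/prefix")
    return [
        {
            "Classification": "hive-site",
            "Properties": {"hive.metastore.warehouse.dir": s3_location},
        }
    ]


if __name__ == "__main__":
    # Placeholder bucket and prefix for illustration only.
    print(json.dumps(hive_warehouse_configuration("s3://example-bucket/warehouse"), indent=2))
```

Because this setting applies to all Hive tables, it avoids having to specify a LOCATION on each table individually.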