Presto

Use Presto as a fast SQL query engine for large data sources. For more information, see the Presto website.

Presto Release Information for This Release of Amazon EMR

Application: Presto 0.187

Amazon EMR Release Label: emr-5.11.0

Components installed with this application: emrfs, emr-goodies, hadoop-client, hadoop-hdfs-datanode, hadoop-hdfs-library, hadoop-hdfs-namenode, hadoop-kms-server, hadoop-yarn-nodemanager, hadoop-yarn-resourcemanager, hadoop-yarn-timeline-server, hive-client, hcatalog-server, mysql-server, presto-coordinator, presto-worker

Limitations and Known Issues with Presto on Amazon EMR

  • Certain Presto configuration files cannot be configured directly with the configuration API. You can configure log.properties and config.properties. However, the following files cannot be configured:

    • node.properties (configurable in Amazon EMR version 5.6.0 and later)

    • jvm.config

    For more information about these configuration files, see the Presto documentation.

  • Presto is not configured to use EMRFS. Instead, it uses PrestoS3FileSystem.

  • You can access the Presto web interface on the Presto coordinator using port 8889.
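As a rough illustration of the first point above, configuration objects for the two configurable files might be generated as in the following sketch. The classification names presto-config and presto-log, and the property values, are assumptions chosen for illustration and are not taken from this topic. Check them against Configuring Applications for your release.

```python
import json


def classification(name, properties):
    """Build one Amazon EMR configuration object."""
    return {"Classification": name, "Properties": dict(properties)}


# Assumed classification names: presto-config for config.properties and
# presto-log for log.properties. The property values are illustrative only.
configurations = [
    classification("presto-config", {"query.max-memory": "10GB"}),
    classification("presto-log", {"com.facebook.presto": "INFO"}),
]

# JSON text suitable for saving to a file and passing to --configurations.
config_json = json.dumps(configurations, indent=2)
print(config_json)
```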

Using Presto with the AWS Glue Data Catalog

Using Amazon EMR release version 5.10.0 and later, you can specify the AWS Glue Data Catalog as the default Hive metastore for Presto. You can specify this option when you create a cluster using the AWS Management Console, or using the presto-connector-hive configuration classification when using the AWS CLI or Amazon EMR API. For more information, see Configuring Applications.

To specify the AWS Glue Data Catalog as the default Hive metastore using the console

  1. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.

  2. Choose Create cluster, Go to advanced options.

  3. Under Software Configuration, choose a Release of emr-5.10.0 or later and select Presto.

  4. Select Use for Presto table metadata, choose Next, and then complete other settings for your cluster as appropriate for your application.

To specify the AWS Glue Data Catalog as the default Hive metastore using the CLI or API

  • Set the hive.metastore.glue.datacatalog.enabled property to true, as shown in the following JSON example.

    [
      {
        "Classification": "presto-connector-hive",
        "Properties": {
          "hive.metastore.glue.datacatalog.enabled": "true"
        }
      }
    ]

Alternatively, you can manually set hive.metastore.glue.datacatalog.enabled=true in the /etc/presto/conf/catalog/hive.properties file on the master node. If you use this method, also set hive.table-statistics-enabled=false in the same file, because the Data Catalog does not support Hive table and partition statistics. If you change the value on a long-running cluster to switch metastores, you must restart the Presto server on the master node (sudo restart presto-server).
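When editing hive.properties by hand, a quick consistency check can catch a missing statistics setting before you restart Presto. The following is a minimal sketch, not an Amazon EMR tool. The function names and the sample file content are hypothetical, and it treats a missing hive.table-statistics-enabled entry as not disabled, which is a conservative assumption.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def glue_settings_consistent(props):
    """Return False only when Glue is enabled and statistics are not disabled."""
    glue_on = props.get("hive.metastore.glue.datacatalog.enabled", "false").lower() == "true"
    stats_off = props.get("hive.table-statistics-enabled", "").lower() == "false"
    return (not glue_on) or stats_off


# Hypothetical file content; on a cluster you would read
# /etc/presto/conf/catalog/hive.properties instead.
sample = """
connector.name=hive-hadoop2
hive.metastore.glue.datacatalog.enabled=true
hive.table-statistics-enabled=false
"""

print(glue_settings_consistent(parse_properties(sample)))
```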

Unsupported Configurations, Functions, and Known Issues

The limitations listed below apply when using the AWS Glue Data Catalog as a metastore:

  • Renaming tables from within AWS Glue is not supported.

  • Partition values containing quotes and apostrophes are not supported (for example, PARTITION (owner="Doe's")).

  • Table and partition statistics are not supported.

  • Using Hive authorization is not supported.
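The partition-value limitation can be screened for before data is written. The helper below is a hypothetical sketch that flags values containing the plain ASCII quote or apostrophe characters. Whether other quote-like characters are also affected is not stated here, so the check may be narrower than the actual restriction.

```python
def unsupported_partition_value(value):
    """Return True if the value contains an ASCII quote or apostrophe."""
    return any(ch in value for ch in ('"', "'"))


# Print each candidate value alongside whether it would be flagged.
for candidate in ["Doe's", "Doe", 'say "hi"']:
    print(repr(candidate), unsupported_partition_value(candidate))
```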

Adding Database Connectors

You can add JDBC connectors at cluster launch using the configuration classifications. For more information about connectors, see https://prestodb.io/docs/current/connector.html.

These classifications are named as follows:

  • presto-connector-blackhole

  • presto-connector-cassandra

  • presto-connector-hive

  • presto-connector-jmx

  • presto-connector-kafka

  • presto-connector-localfile

  • presto-connector-mongodb

  • presto-connector-mysql

  • presto-connector-postgresql

  • presto-connector-raptor

  • presto-connector-redis

  • presto-connector-tpch

Example: Configuring a Cluster with the PostgreSQL JDBC Connector

To launch a cluster with the PostgreSQL connector installed and configured, create a file, myConfig.json, with the following content:

[
  {
    "Classification": "presto-connector-postgresql",
    "Properties": {
      "connection-url": "jdbc:postgresql://example.net:5432/database",
      "connection-user": "MYUSER",
      "connection-password": "MYPASS"
    },
    "Configurations": []
  }
]

Use the following command to create the cluster:

aws emr create-cluster --name PrestoConnector --release-label emr-5.11.0 --instance-type m3.xlarge \
--instance-count 2 --applications Name=Hadoop Name=Hive Name=Pig Name=Presto \
--use-default-roles --no-auto-terminate --ec2-attributes KeyName=myKey \
--log-uri s3://my-bucket/logs --enable-debugging \
--configurations file://./myConfig.json
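If you generate the configuration file from a script rather than writing it by hand, something like the following sketch could produce the same JSON and reject a classification name that is not in the list above. The function name connector_configuration is hypothetical, and the connection values are the placeholders from the example.

```python
import json

# Connector classifications listed in this topic.
CONNECTOR_CLASSIFICATIONS = {
    "presto-connector-blackhole", "presto-connector-cassandra",
    "presto-connector-hive", "presto-connector-jmx",
    "presto-connector-kafka", "presto-connector-localfile",
    "presto-connector-mongodb", "presto-connector-mysql",
    "presto-connector-postgresql", "presto-connector-raptor",
    "presto-connector-redis", "presto-connector-tpch",
}


def connector_configuration(classification, properties):
    """Build the configuration list for one Presto connector."""
    if classification not in CONNECTOR_CLASSIFICATIONS:
        raise ValueError("unknown connector classification: " + classification)
    return [{
        "Classification": classification,
        "Properties": dict(properties),
        "Configurations": [],
    }]


config = connector_configuration("presto-connector-postgresql", {
    "connection-url": "jdbc:postgresql://example.net:5432/database",
    "connection-user": "MYUSER",
    "connection-password": "MYPASS",
})

# JSON text with the same content as the myConfig.json example.
config_text = json.dumps(config, indent=2)
print(config_text)
```

Saving config_text to myConfig.json yields a file that can be passed to the --configurations option.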

Using LDAP Authentication with Presto

Amazon EMR version 5.5.0 and later supports using Lightweight Directory Access Protocol (LDAP) authentication with Presto. To use LDAP, you must enable HTTPS access for the Presto coordinator (set http-server.https.enabled=true in config.properties on the master node). For configuration details, see LDAP Authentication in the Presto documentation.

Enabling SSL/TLS for Internal Communication Between Nodes

With Amazon EMR version 5.6.0 and later, you can enable SSL/TLS secured communication between Presto nodes by using a security configuration to enable in-transit encryption. For more information, see Specifying Amazon EMR Encryption Options Using a Security Configuration. The default port for internal HTTPS is 8446. The port used for internal communication must be the same port used for HTTPS access to the Presto coordinator. The http-server.https.port=port_num parameter in the Presto config.properties file specifies the port.

When in-transit encryption is enabled, Amazon EMR does the following for Presto:

  • Distributes the artifacts you specify for in-transit encryption throughout the Presto cluster. For more information about encryption artifacts, see Providing Certificates for In-Transit Data Encryption.

  • Modifies the config.properties file for Presto as follows:

    • Sets http-server.http.enabled=false on core and task nodes, which disables HTTP in favor of HTTPS.

    • Sets http-server.https.*, internal-communication.https.*, and other values to enable HTTPS and specify implementation details, including LDAP parameters if you have enabled and configured LDAP.
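For orientation, a security configuration that enables in-transit encryption might be assembled as in the following sketch. The JSON structure reflects the security configuration format as commonly documented for Amazon EMR, but treat the key names as assumptions and confirm them in Specifying Amazon EMR Encryption Options Using a Security Configuration. The function name and the S3 path are hypothetical placeholders.

```python
import json


def in_transit_security_configuration(certs_s3_object):
    """Build a security configuration that enables in-transit encryption.

    certs_s3_object is the S3 location of the zipped PEM certificate
    artifacts (hypothetical value in the example below).
    """
    return {
        "EncryptionConfiguration": {
            "EnableInTransitEncryption": True,
            "EnableAtRestEncryption": False,
            "InTransitEncryptionConfiguration": {
                "TLSCertificateConfiguration": {
                    "CertificateProviderType": "PEM",
                    "S3Object": certs_s3_object,
                }
            },
        }
    }


security_config = in_transit_security_configuration("s3://my-bucket/certs/my-certs.zip")
security_config_json = json.dumps(security_config, indent=2)
print(security_config_json)
```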