Metastore configuration - Amazon EMR

Metastore configuration

A Hive metastore is a centralized location that stores structural information about your tables, including schemas, partition names, and data types. With EMR Serverless, you can persist this table metadata in a metastore that has access to your jobs.

You have two options for a Hive metastore:

  • The AWS Glue Data Catalog

  • An external Apache Hive metastore

Using the AWS Glue Data Catalog as a metastore

You can configure your Spark and Hive jobs to use the AWS Glue Data Catalog as its metastore. We recommend this configuration when you require a persistent metastore or a metastore shared by different applications, services, or AWS accounts. For more information about the Data Catalog, see Populating the AWS Glue Data Catalog. For information about AWS Glue pricing, see AWS Glue pricing.

You can configure your EMR Serverless job to use the AWS Glue Data Catalog either in the same AWS account as your application, or in a different AWS account.

Configure the AWS Glue Data Catalog

To configure the Data Catalog, choose which type of EMR Serverless application that you want to use.

Spark

When you use EMR Studio to run your jobs with EMR Serverless Spark applications, the AWS Glue Data Catalog is the default metastore.

When you use SDKs or AWS CLI, you can set the spark.hadoop.hive.metastore.client.factory.class configuration to com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory in the sparkSubmit parameters of your job run. The following example shows how to configure the Data Catalog with the AWS CLI.

aws emr-serverless start-job-run \ --application-id application-id \ --execution-role-arn job-role-arn \ --job-driver '{ "sparkSubmit": { "entryPoint": "s3://DOC-EXAMPLE-BUCKET/code/pyspark/extreme_weather.py", "sparkSubmitParameters": "--conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory --conf spark.driver.cores=1 --conf spark.driver.memory=3g --conf spark.executor.cores=4 --conf spark.executor.memory=3g" } }'

Alternatively, you can set this configuration when you create a new SparkSession in your Spark code.

from pyspark.sql import SparkSession spark = ( SparkSession.builder.appName("SparkSQL") .config( "spark.hadoop.hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory", ) .enableHiveSupport() .getOrCreate() ) # we can query tables with SparkSQL spark.sql("SHOW TABLES").show() # we can also them with native Spark print(spark.catalog.listTables())
Hive

For EMR Serverless Hive applications, the Data Catalog is the default metastore. That is, when you run jobs on a EMR Serverless Hive application, Hive records metastore information in the Data Catalog in the same AWS account as your application. You don't need a virtual private cloud (VPC) to use the Data Catalog as your metastore.

To access the Hive metastore tables, add the required AWS Glue policies outlined in Setting up IAM Permissions for AWS Glue.

Configure cross-account access for EMR Serverless and AWS Glue Data Catalog

To set up cross-account access for EMR Serverless, you must first sign in to the following AWS accounts:

  • AccountA – An AWS account where you have created an EMR Serverless application.

  • AccountB – An AWS account that contains a AWS Glue Data Catalog that you want your EMR Serverless job runs to access.

  1. Make sure an administrator or other authorized identity in AccountB attaches a resource policy to the Data Catalog in AccountB. This policy grants AccountA specific cross-account permissions to perform operations on resources in the AccountB catalog.

    { "Version" : "2012-10-17", "Statement" : [ { "Effect" : "Allow", "Principal": { "AWS": [ "arn:aws:iam::accountA:role/job-runtime-role-A" ]}, "Action" : [ "glue:GetDatabase", "glue:CreateDatabase", "glue:GetDataBases", "glue:CreateTable", "glue:GetTable", "glue:UpdateTable", "glue:DeleteTable", "glue:GetTables", "glue:GetPartition", "glue:GetPartitions", "glue:CreatePartition", "glue:BatchCreatePartition", "glue:GetUserDefinedFunctions" ], "Resource": ["arn:aws:glue:region:AccountB:catalog"] } ] }
  2. Add an IAM policy to the EMR Serverless job runtime role in AccountA so that role can access Data Catalog resources in AccountB.

    { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "glue:GetDatabase", "glue:CreateDatabase", "glue:GetDataBases", "glue:CreateTable", "glue:GetTable", "glue:UpdateTable", "glue:DeleteTable", "glue:GetTables", "glue:GetPartition", "glue:GetPartitions", "glue:CreatePartition", "glue:BatchCreatePartition", "glue:GetUserDefinedFunctions" ], "Resource": ["arn:aws:glue:region:AccountB:catalog"] } ] }
  3. Start your job run. This step is slightly different depending on AccountA's EMR Serverless application type.

    Spark

    Set the spark.hadoop.hive.metastore.glue.catalogid property in the hive-site classification as shown in the following example. Replace AccountB-catalog-id with the ID of the Data Catalog in AccountB.

    aws emr-serverless start-job-run \ --application-id "application-id" \ --execution-role-arn "job-role-arn" \ --job-driver '{ "sparkSubmit": { "query": "s3://DOC-EXAMPLE-BUCKET/hive/scripts/create_table.sql", "parameters": "--hiveconf hive.exec.scratchdir=s3://DOC-EXAMPLE-BUCKET/hive/scratch --hiveconf hive.metastore.warehouse.dir=s3://DOC-EXAMPLE-BUCKET/hive/warehouse" } }' \ --configuration-overrides '{ "applicationConfiguration": [{ "classification": "hive-site", "properties": { "spark.hadoop.hive.metastore.glue.catalogid": "AccountB-catalog-id" } }] }'
    Hive

    Set the hive.metastore.glue.catalogid property in the hive-site classification as shown in the following example. Replace AccountB-catalog-id with the ID of the Data Catalog in AccountB.

    aws emr-serverless start-job-run \ --application-id "application-id" \ --execution-role-arn "job-role-arn" \ --job-driver '{ "hive": { "query": "s3://DOC-EXAMPLE-BUCKET/hive/scripts/create_table.sql", "parameters": "--hiveconf hive.exec.scratchdir=s3://DOC-EXAMPLE-BUCKET/hive/scratch --hiveconf hive.metastore.warehouse.dir=s3://DOC-EXAMPLE-BUCKET/hive/warehouse" } }' \ --configuration-overrides '{ "applicationConfiguration": [{ "classification": "hive-site", "properties": { "hive.metastore.glue.catalogid": "AccountB-catalog-id" } }] }'

Considerations when using the AWS Glue Data Catalog

You can add auxiliary JARs with ADD JAR in your Hive scripts. For additional considerations, see Considerations when using AWS Glue Data Catalog.

Using an external Hive metastore

You can configure your EMR Serverless Spark and Hive jobs to connect to an external Hive metastore, such as Amazon Aurora or Amazon RDS for MySQL. This section describes how to set up an Amazon RDS Hive metastore, configure your VPC, and configure your EMR Serverless jobs to use an external metastore.

Create an external Hive metastore

  1. Create an Amazon Virtual Private Cloud (Amazon VPC) with private subnets by following the instructions in Create a VPC.

  2. Create your EMR Serverless application with your new Amazon VPC and private subnets. When you configure your EMR Serverless application with a VPC, it first provisions an elastic network interface for each subnet that you specify. It then attaches your specified security group to that network interface. This gives your application access control. For more details about how to set up your VPC, see Configuring VPC access.

  3. Create a MySQL or Aurora PostgreSQL database in a private subnet in your Amazon VPC. For information about how to create an Amazon RDS database, see Creating an Amazon RDS DB instance.

  4. Modify the security group of your MySQL or Aurora database to allow JDBC connections from your EMR Serverless security group by following the steps in Modifying an Amazon RDS DB instance. Add a rule for inbound traffic to the RDS security group from one of your EMR Serverless security groups.

    Type Protocol Port range Source

    All TCP

    TCP

    3306

    emr-serverless-security-group

Configure Spark options

Using JDBC

To configure your EMR Serverless Spark application to connect to a Hive metastore based on an Amazon RDS for MySQL or Amazon Aurora MySQL instance, use a JDBC connection. Pass the mariadb-connector-java.jar with --jars in the spark-submit parameters of your job run.

aws emr-serverless start-job-run \ --application-id "application-id" \ --execution-role-arn "job-role-arn" \ --job-driver '{ "sparkSubmit": { "entryPoint": "s3://DOC-EXAMPLE-BUCKET/scripts/spark-jdbc.py", "sparkSubmitParameters": "--jars s3://DOC-EXAMPLE-BUCKET/mariadb-connector-java.jar --conf spark.hadoop.javax.jdo.option.ConnectionDriverName=org.mariadb.jdbc.Driver --conf spark.hadoop.javax.jdo.option.ConnectionUserName=<connection-user-name> --conf spark.hadoop.javax.jdo.option.ConnectionPassword=<connection-password> --conf spark.hadoop.javax.jdo.option.ConnectionURL=<JDBC-Connection-string> --conf spark.driver.cores=2 --conf spark.executor.memory=10G --conf spark.driver.memory=6G --conf spark.executor.cores=4" } }' \ --configuration-overrides '{ "monitoringConfiguration": { "s3MonitoringConfiguration": { "logUri": "s3://DOC-EXAMPLE-BUCKET/spark/logs/" } } }'

The following code example is a Spark entrypoint script that interacts with a Hive metastore on Amazon RDS.

from os.path import expanduser, join, abspath from pyspark.sql import SparkSession from pyspark.sql import Row # warehouse_location points to the default location for managed databases and tables warehouse_location = abspath('spark-warehouse') spark = SparkSession \ .builder \ .config("spark.sql.warehouse.dir", warehouse_location) \ .enableHiveSupport() \ .getOrCreate() spark.sql("SHOW DATABASES").show() spark.sql("CREATE EXTERNAL TABLE `sampledb`.`sparknyctaxi`(`dispatching_base_num` string, `pickup_datetime` string, `dropoff_datetime` string, `pulocationid` bigint, `dolocationid` bigint, `sr_flag` bigint) STORED AS PARQUET LOCATION 's3://<s3 prefix>/nyctaxi_parquet/'") spark.sql("SELECT count(*) FROM sampledb.sparknyctaxi").show() spark.stop()

Using the thrift service

You can configure your EMR Serverless Hive application to connect to a Hive metastore based on an Amazon RDS for MySQL or Amazon Aurora MySQL instance. To do this, run a thrift server on the master node of an existing Amazon EMR cluster. This option is ideal if you already have an Amazon EMR cluster with a thrift server that you want to use to simplify your EMR Serverless job configurations.

aws emr-serverless start-job-run \ --application-id "application-id" \ --execution-role-arn "job-role-arn" \ --job-driver '{ "sparkSubmit": { "entryPoint": "s3://DOC-EXAMPLE-BUCKET/thriftscript.py", "sparkSubmitParameters": "--jars s3://DOC-EXAMPLE-BUCKET/mariadb-connector-java.jar --conf spark.driver.cores=2 --conf spark.executor.memory=10G --conf spark.driver.memory=6G --conf spark.executor.cores=4" } }' \ --configuration-overrides '{ "monitoringConfiguration": { "s3MonitoringConfiguration": { "logUri": "s3://DOC-EXAMPLE-BUCKET/spark/logs/" } } }'

The following code example is an entrypoint script (thriftscript.py) that uses thrift protocol to connect to a Hive metastore. Note that the hive.metastore.uris property needs to be set to read from an external Hive metastore.

from os.path import expanduser, join, abspath from pyspark.sql import SparkSession from pyspark.sql import Row # warehouse_location points to the default location for managed databases and tables warehouse_location = abspath('spark-warehouse') spark = SparkSession \ .builder \ .config("spark.sql.warehouse.dir", warehouse_location) \ .config("hive.metastore.uris","thrift://thrift-server-host:thift-server-port") \ .enableHiveSupport() \ .getOrCreate() spark.sql("SHOW DATABASES").show() spark.sql("CREATE EXTERNAL TABLE sampledb.`sparknyctaxi`( `dispatching_base_num` string, `pickup_datetime` string, `dropoff_datetime` string, `pulocationid` bigint, `dolocationid` bigint, `sr_flag` bigint) STORED AS PARQUET LOCATION 's3://<s3 prefix>/nyctaxi_parquet/'") spark.sql("SELECT * FROM sampledb.sparknyctaxi").show() spark.stop()

Configure Hive options

Using JDBC

If you want to specify an external Hive database location on either an Amazon RDS MySQL or Amazon Aurora instance, you can override the default metastore configuration.

Note

In Hive, you can perform multiple writes to metastore tables at the same time. If you share metastore information between two jobs, make sure that you don't write to the same metastore table simultaneously unless you write to different partitions of the same metastore table.

Set the following configurations in the hive-site classification to activate the external Hive metastore.

{ "classification": "hive-site", "properties": { "hive.metastore.client.factory.class": "org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClientFactory", "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver", "javax.jdo.option.ConnectionURL": "jdbc:mysql://db-host:db-port/db-name", "javax.jdo.option.ConnectionUserName": "username", "javax.jdo.option.ConnectionPassword": "password" } }

Using a thrift server

You can configure your EMR Serverless Hive application to connect to a Hive metastore based on an Amazon RDS for MySQL or Amazon Aurora MySQLinstance. To do this, run a thrift server on the main node of an existing Amazon EMR cluster. This option is ideal if you already have an Amazon EMR cluster that runs a thrift server and you want to use your EMR Serverless job configurations.

Set the following configurations in the hive-site classification so that EMR Serverless can access the remote thrift metastore. Note that you must set the hive.metastore.uris property to read from an external Hive metastore.

{ "classification": "hive-site", "properties": { "hive.metastore.client.factory.class": "org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClientFactory", "hive.metastore.uris": "thrift://thrift-server-host:thirft-server-port" } }

Considerations when using an external metastore

  • You can configure databases that are compatible with MariaDB JDBC as your metastore. Examples of these databases are RDS for MariaDB, MySQL, and Amazon Aurora.

  • Metastores aren't auto-initialized. If your metastore isn't initialized with a schema for your Hive version, use the Hive Schema Tool.

  • EMR Serverless doesn't support Kerberos authentication. You can't use a thrift metastore server with Kerberos authentication with EMR Serverless Spark or Hive jobs.