Menu
Amazon EMR
Amazon EMR Release Guide

Create a Hive Metastore Outside the Cluster

Hive records metastore information in a MySQL database that is located, by default, on the master node. The metastore contains a description of the input data, including the partition names and data types, contained in the input files.

When a cluster terminates, all associated cluster nodes shut down. All data stored on a cluster node, including the Hive metastore, is deleted. Information stored elsewhere, such as in your Amazon S3 bucket, persists.

If you have multiple clusters that share common data and update the metastore, you should locate the shared metastore on persistent storage.

To share the metastore between clusters, override the default location of the MySQL database to an external persistent storage location on an Amazon RDS instance.

Note

Hive neither supports nor prevents concurrent write access to metastore tables. If you share metastore information between two clusters, you must ensure that you do not write to the same metastore table concurrently—unless you are writing to different partitions of the same metastore table.

The following procedure shows you how to override the default configuration values for the Hive metastore location and start a cluster using the reconfigured metastore location.

To create a metastore located outside of the cluster

  1. Create a MySQL database.

    Relational Database Service (RDS) provides a cloud-based MySQL database. For information about how to create an Amazon RDS database, see https://aws.amazon.com/rds/.

  2. Modify your security groups to allow JDBC connections between your MySQL database and the ElasticMapReduce-Master security group.

    For information about how to modify your security groups for access, see https://aws.amazon.com/rds/faqs/#security.

  3. Set JDBC configuration values in hive-site.xml:

    1. Important

      If you supply sensitive information, such as passwords, to the Amazon EMR configuration API, this information is displayed for those accounts that have sufficient permissions. If you are concerned that this information could be displayed to other users, create the cluster with an administrative account and limit other users (IAM users or those with delegated credentials) to accessing services on the cluster by creating a role which explicitly denies permissions to the elasticmapreduce:DescribeCluster API key.

      Create a configuration file called hiveConfiguration.json containing edits to hive-site.xml:

      [
          {
            "Classification": "hive-site",
            "Properties": {
              "javax.jdo.option.ConnectionURL": "jdbc:mysql:\/\/hostname:3306\/hive?createDatabaseIfNotExist=true",
              "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
              "javax.jdo.option.ConnectionUserName": "username",
              "javax.jdo.option.ConnectionPassword": "password"
            }
          }
        ]
      

      Note

      For Amazon EMR versions 4.0.0 or below, the driver used is org.mysql.jdbc.Driver for javax.jdo.option.ConnectionDriverName.

      Use hiveConfiguration.json with the following AWS CLI command to create the cluster:

      aws emr create-cluster --release-label emr-5.2.1 --instance-type m3.xlarge --instance-count 2 \
      --applications Name=Hive --configurations ./hiveConfiguration.json --use-default-roles

      Note

      For Windows, replace the above Linux line continuation character (\) with the caret (^).

      Note

      If you plan to store your configuration in Amazon S3, you must specify the URL location of the object. For example:

      aws emr create-cluster --release-label emr-5.2.1 --instance-type m3.xlarge --instance-count 3 \
      --applications Name=Hive --configurations https://s3.amazonaws.com/mybucket/myfolder/hiveConfiguration.json --use-default-roles

      <hostname> is the DNS address of the Amazon RDS instance running MySQL. <username> and <password> are the credentials for your MySQL database. javax.jdo.option.ConnectionURL is the JDBC connect string for a JDBC metastore. javax.jdo.option.ConnectionDriverName is the driver class name for a JDBC metastore.

      The MySQL JDBC drivers are installed by Amazon EMR.

      Note

      The value property should not contain any spaces or carriage returns. It should appear all on one line.

  4. Connect to the master node of your cluster.

    Instructions on how to connect to the master node are available at Connect to the Master Node Using SSH in the Amazon EMR Management Guide.

  5. Create your Hive tables specifying the location on Amazon S3 by entering a command similar to the following:

    CREATE EXTERNAL TABLE IF NOT EXISTS table_name
    (
    key int,
    value int
    )
    LOCATION s3://mybucket/hdfs/
  6. Add your Hive script to the running cluster.

Your Hive cluster runs using the metastore located in Amazon RDS. Launch all additional Hive clusters that share this metastore by specifying the metastore location.