Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

Create a Metastore Outside the Hadoop Cluster

Hive records metastore information in a MySQL database that is located, by default, on the master node. The metastore contains a description of the input data, including the partition names and data types, contained in the input files.

When a cluster terminates, all associated cluster nodes shut down. All data stored on a cluster node, including the Hive metastore, is deleted. Information stored elsewhere, such as in your Amazon S3 bucket, persists.

If you have multiple clusters that share common data and update the metastore, you should locate the shared metastore on persistent storage.

To share the metastore between clusters, override the default location of the MySQL database to an external persistent storage location.

Note

Hive neither supports nor prevents concurrent write access to metastore tables. If you share metastore information between two clusters, do not write to the same metastore table at the same time, unless you are writing to different partitions of that table.

The following procedure shows you how to override the default configuration values for the Hive metastore location and start a cluster using the reconfigured metastore location.

To create a metastore located outside of the cluster

  1. Create a MySQL database.

    Amazon Relational Database Service (Amazon RDS) provides a managed MySQL database in the cloud. For instructions on how to create an Amazon RDS database, go to http://aws.amazon.com/rds/.

  2. Modify your security groups to allow JDBC connections from the ElasticMapReduce-Master security group to your MySQL database.

    Instructions on how to modify your security groups for access are at http://aws.amazon.com/rds/faqs/#31.
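    The authorization in this step can be sketched as a shell command. This is a sketch only: it assumes the legacy RDS DB-security-group model, and the group names and account ID below are placeholders you must replace. The script composes and prints the command rather than running it.

    ```shell
    # Placeholders -- substitute your own values before running the command.
    DB_SECURITY_GROUP="default"                # RDS DB security group (assumption)
    EMR_MASTER_GROUP="ElasticMapReduce-Master" # EMR master security group
    AWS_ACCOUNT_ID="111122223333"              # placeholder AWS account ID

    # Compose the command that grants the EMR master group inbound access
    # to the MySQL database. Run it yourself once the values are real.
    AUTHORIZE_CMD="aws rds authorize-db-security-group-ingress \
      --db-security-group-name $DB_SECURITY_GROUP \
      --ec2-security-group-name $EMR_MASTER_GROUP \
      --ec2-security-group-owner-id $AWS_ACCOUNT_ID"
    echo "$AUTHORIZE_CMD"
    ```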

  3. Set the JDBC configuration values in hive-site.xml:

    1. Create a hive-site.xml configuration file containing the following information:

      <configuration>
        <property>
          <name>javax.jdo.option.ConnectionURL</name>
          <value>jdbc:mysql://hostname:3306/hive?createDatabaseIfNotExist=true</value>
          <description>JDBC connect string for a JDBC metastore</description>
        </property>
        <property>
          <name>javax.jdo.option.ConnectionDriverName</name>
          <value>com.mysql.jdbc.Driver</value>
          <description>Driver class name for a JDBC metastore</description>
        </property>
        <property>
          <name>javax.jdo.option.ConnectionUserName</name>
          <value>username</value>
          <description>Username to use against metastore database</description>
        </property>
        <property>
          <name>javax.jdo.option.ConnectionPassword</name>
          <value>password</value>
          <description>Password to use against metastore database</description>
        </property>
      </configuration>

      In the example above, hostname is the DNS address of the Amazon RDS instance running MySQL, and username and password are the credentials for your MySQL database. Replace these placeholder values with your own.

      The MySQL JDBC drivers are installed by Amazon EMR.

      Note

      The value of javax.jdo.option.ConnectionURL must not contain any spaces or line breaks; it must appear on a single line.

    2. Save your hive-site.xml file to a location on Amazon S3, such as s3://myawsbucket/conf/hive-site.xml.
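    The two sub-steps above can be sketched as a small shell script that fills in the JDBC values and writes hive-site.xml; the endpoint, credentials, and bucket name are placeholders, and the upload command is left commented out until you supply real values.

    ```shell
    #!/bin/sh
    # Placeholders -- substitute your RDS endpoint and MySQL credentials.
    RDS_HOST="mydbinstance.abc123.us-east-1.rds.amazonaws.com"
    DB_USER="hiveuser"
    DB_PASS="hivepass"

    # Generate hive-site.xml. The ConnectionURL value must stay on one line.
    cat > hive-site.xml <<EOF
    <configuration>
      <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://${RDS_HOST}:3306/hive?createDatabaseIfNotExist=true</value>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>${DB_USER}</value>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>${DB_PASS}</value>
      </property>
    </configuration>
    EOF

    # Upload to Amazon S3 (uncomment after filling in real values):
    # aws s3 cp hive-site.xml s3://myawsbucket/conf/hive-site.xml
    ```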

  4. Create a cluster and specify the Amazon S3 location of the new Hive configuration file, for example:

    In the directory where you installed the Amazon EMR CLI, run the following from the command line. For more information, see the Command Line Interface Reference for Amazon EMR.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive \
        --name "Hive cluster"    \
        --hive-interactive \
        --hive-site=s3://myawsbucket/conf/hive-site.xml
    • Windows users:

      ruby elastic-mapreduce --create --alive --name "Hive cluster" --hive-interactive --hive-site=s3://myawsbucket/conf/hive-site.xml

    The --hive-site parameter installs the configuration values from the hive-site.xml file at the specified Amazon S3 location. It overrides only the properties defined in that file; all other Hive settings keep their default values.

  5. Connect to the master node of your cluster.

    Instructions on how to connect to the master node are available in the Amazon Elastic MapReduce (Amazon EMR) Getting Started Guide.
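    As a sketch, the connection in this step is an SSH login to the master node as the hadoop user; the key pair file and master public DNS name below are placeholders. The script composes and prints the command rather than running it.

    ```shell
    # Placeholders: your EC2 key pair file and the cluster's master public DNS.
    KEY_PAIR="mykeypair.pem"
    MASTER_DNS="ec2-203-0-113-25.compute-1.amazonaws.com"

    # EMR accepts SSH logins on the master node as the "hadoop" user.
    SSH_CMD="ssh -i $KEY_PAIR hadoop@$MASTER_DNS"
    echo "$SSH_CMD"
    ```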

  6. Create your Hive tables specifying the location on Amazon S3 by entering a command similar to the following:

    CREATE EXTERNAL TABLE IF NOT EXISTS table_name
    (
    key int,
    value int
    )
    LOCATION 's3://myawsbucket/hdfs/';
  7. Add your Hive script to the running cluster.

Your Hive cluster runs using the metastore located in Amazon RDS. Launch all additional Hive clusters that share this metastore by specifying the same metastore location in their hive-site.xml file.
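To confirm that table definitions are being recorded in the shared metastore rather than on the cluster, you can query the MySQL database directly. This is a sketch: the endpoint and user name are placeholders, TBLS is the table the Hive metastore schema uses for table metadata, and "hive" is the database named in the ConnectionURL from step 3. The script composes and prints the command for you to run.

```shell
# Placeholders -- use your RDS endpoint and MySQL credentials.
RDS_HOST="mydbinstance.abc123.us-east-1.rds.amazonaws.com"
DB_USER="hiveuser"

# List the Hive tables recorded in the shared metastore. TBLS holds
# table metadata in the metastore schema; -p prompts for the password.
CHECK_CMD="mysql -h $RDS_HOST -u $DB_USER -p hive -e 'SELECT TBL_NAME FROM TBLS;'"
echo "$CHECK_CMD"
```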