Menu
Amazon EMR
Developer Guide

Using an External MySQL Database or Amazon Aurora

To use an external MySQL database or Amazon Aurora as your Hive metastore, you override the default configuration values for the metastore in Hive to specify the external database location, either on an Amazon RDS MySQL instance or an Amazon Aurora instance.

Note

Hive neither supports nor prevents concurrent write access to metastore tables. If you share metastore information between two clusters, you must ensure that you do not write to the same metastore table concurrently, unless you are writing to different partitions of the same metastore table.

The following procedure shows you how to override the default configuration values for the Hive metastore location and start a cluster using the reconfigured metastore location.

To create a metastore located outside of the EMR cluster

  1. Create a MySQL or Aurora database.

    For information about how to create an Amazon RDS database, see https://aws.amazon.com/rds/.

  2. Modify your security groups to allow JDBC connections between your database and the ElasticMapReduce-Master security group.

    For information about how to modify your security groups for access, see https://aws.amazon.com/rds/faqs/#security.

  3. Set the JDBC configuration values in hive-site.xml:

    1. Create a hive-site.xml configuration file containing the following information:

      Copy
      <configuration> <property> <name>javax.jdo.option.ConnectionURL</name> <value>jdbc:mariadb://hostname:3306/hive?createDatabaseIfNotExist=true</value> <description>JDBC connect string for a JDBC metastore</description> </property> <property> <name>javax.jdo.option.ConnectionUserName</name> <value>hive</value> <description>Username to use against metastore database</description> </property> <property> <name>javax.jdo.option.ConnectionPassword</name> <value>password</value> <description>Password to use against metastore database</description> </property> </configuration>

      hostname is the DNS address of the Amazon RDS instance running the database. username and password are the credentials for your database. For more information about connecting to MySQL and Aurora database instances, see Connecting to a DB Instance Running the MySQL Database Engine and Connecting to an Aurora DB Cluster in the Amazon Relational Database Service User Guide.

      The JDBC drivers are installed by Amazon EMR.

      Note

      The value property should not contain any spaces or carriage returns. It should appear all on one line.

    2. Save your hive-site.xml file to a location on Amazon S3, such as s3://mybucket/hive-site.xml.

  4. Create a cluster and specify the Amazon S3 location of the new Hive configuration file.

    AWS CLI

    To specify the location of the configuration file using the AWS CLI, type the following command, replace myKey with the name of your EC2 key pair, and replace mybucket with your Amazon S3 bucket name.

    • Linux, UNIX, and Mac OS X users:

      Copy
      aws emr create-cluster --name "Test cluster" --ami-version 3.10 --applications Name=Hue Name=Hive Name=Pig \ --use-default-roles --ec2-attributes KeyName=myKey \ --instance-type m3.xlarge --instance-count 3 \ --bootstrap-actions Name="Install Hive Site Configuration",Path="s3://region.elasticmapreduce/libs/hive/hive-script",\ Args=["--base-path","s3://elasticmapreduce/libs/hive","--install-hive-site","--hive-site=s3://mybucket/hive-site.xml","--hive-versions","latest"]

      Where region is the region where you are launching your cluster.

    • Windows users:

      Copy
      aws emr create-cluster --name "Test cluster" --ami-version 3.10 --applications Name=Hive Name=Pig --use-default-roles --ec2-attributes KeyName=myKey --instance-type m3.xlarge --instance-count 3 --bootstrap-actions Name="Install Hive Site Configuration",Path="s3://elasticmapreduce/libs/hive/hive-script",Args=["--base-path","s3://region.elasticmapreduce/libs/hive","--install-hive-site","--hive-site=s3://mybucket/hive-site.xml","--hive-versions","latest"]

      Where region is the region where you are launching your cluster.

    The us-east-1 endpoint used above, https://s3.amazonaws.com/, should be replaced with the Amazon S3 endpoint for the region in which your cluster will be located.

    Note

    Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).

    When you specify the instance count without using the --instance-groups parameter, a single Master node is launched, and the remaining instances are launched as core nodes. All nodes will use the instance type specified in the command.

    Note

    If you have not previously created the default EMR service role and EC2 instance profile, type aws emr create-default-roles to create them before typing the create-cluster subcommand.

    For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

  5. Connect to the master node of your cluster.

    Instructions on how to connect to the master node are available at Connect to the Master Node Using SSH.

  6. Create your Hive tables specifying the location on Amazon S3 by entering a command similar to the following:

    Copy
    CREATE EXTERNAL TABLE IF NOT EXISTS table_name ( key int, value int ) LOCATION s3://mybucket/hdfs/
  7. Add your Hive script to the running cluster.

Your Hive cluster runs using the metastore located in Amazon RDS. Launch all additional Hive clusters that share this metastore by specifying the metastore location.