Menu
Amazon EMR
Amazon EMR Release Guide

Apache Phoenix

Apache Phoenix is used for OLTP and operational analytics, allowing you to use standard SQL queries and JDBC APIs to work with an Apache HBase backing store. For more information, see Phoenix in 15 minutes or less.

Note

If you upgrade from an earlier version of Amazon EMR to Amazon EMR version 5.4.0 or later and use secondary indexing, upgrade local indexes as described in the Apache Phoenix documentation. Amazon EMR removes the required configurations from the hbase-site classification, but indexes need to be repopulated. Online and offline upgrade of indexes are supported. Online upgrades are the default, which means indexes are repopulated while initializing from Phoenix clients of version 4.8.0 or greater. To specify offline upgrades, set the phoenix.client.localIndexUpgrade configuration to false in the phoenix-site classification, and then SSH to the master node to run psql [zookeeper] -1.

Release Information

Application Amazon EMR Release Label Components installed with this application

Phoenix 4.9.0

emr-5.5.0

emrfs, emr-ddb, emr-goodies, emr-kinesis, emr-s3-dist-cp, hadoop-client, hadoop-hdfs-datanode, hadoop-hdfs-library, hadoop-hdfs-namenode, hadoop-httpfs-server, hadoop-kms-server, hadoop-mapred, hadoop-yarn-nodemanager, hadoop-yarn-resourcemanager, hbase-hmaster, hbase-client, hbase-region-server, phoenix-library, phoenix-query-server, zookeeper-client, zookeeper-server

Creating a Cluster with Phoenix

You install Phoenix by choosing the application when you create a cluster in the console or using the AWS CLI. The following procedures and examples show how to create a cluster with Phoenix and HBase. For more information about creating clusters using the console, including Advanced Options see Plan and Configure Clusters in the Amazon EMR Management Guide.

To launch a cluster with Phoenix installed using Quick Options for creating a cluster in the console

  1. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.

  2. Choose Create cluster to use Quick Create.

  3. For Software Configuration, choose the most recent release appropriate for your application. Phoenix appears as an option only when Amazon Release Version emr-4.7.0 or later is selected.

  4. For Applications, choose the second option, HBase: HBase ver with Ganglia ver, Hadoop ver, Hive ver, Hue ver, Phoenix ver, and ZooKeeper ver.

  5. Select other options as necessary and then choose Create cluster.

Note

Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).

The following example launches a cluster with Phoenix installed using default configuration settings.

To launch a cluster with Phoenix and HBase using the AWS CLI

  • Create the cluster with the following command:

    Copy
    aws emr create-cluster --name "Cluster with Phoenix" --release-label emr-5.5.0 \ --applications Name=Phoenix Name=HBase --ec2-attributes KeyName=myKey \ --instance-type m3.xlarge --instance-count 3 --use-default-roles

Customizing Phoenix Configurations When Creating a Cluster

When creating a cluster, you configure Phoenix by setting values in hbase-site.xml using the hbase-site configuration classification.

For more information, see Configuration and Tuning in the Phoenix documentation.

The following example demonstrates using a JSON file stored in Amazon S3 to specify the value of false for the phoenix.schema.dropMetaData property. Multiple properties can be specified for a single classification. For more information, see Configuring Applications. The create cluster command then references the JSON file as the --configurations parameter.

The contents of the JSON file saved to /mybucket/myfolder/myconfig.json is the following.

Copy
[ { "Classification": "hbase-site", "Properties": { "phoenix.schema.dropMetaData": "false" } } ]

The create cluster command that references the JSON file is shown in the following example.

Copy
aws emr create-cluster --release-label emr-5.5.0 --applications Name=Phoenix \ Name=HBase --instance-type m3.xlarge --instance-count 2 \ --configurations https://s3.amazonaws.com/mybucket/myfolder/myconfig.json

Phoenix Clients

You connect to Phoenix using either a JDBC client built with full dependencies or using the "thin client" that uses the Phoenix Query Server and can only be run on a master node of a cluster (e.g. by using an SQL client, a step, command line, SSH port forwarding, etc.). When using the "fat" JDBC client, it still needs to have access to all nodes of the cluster because it connects to HBase services directly. The "thin" Phoenix client only needs access to the Phoenix Query Server at a default port 8765. There are several scripts within Phoenix that use these clients.

Use an Amazon EMR step to query using Phoenix

The following procedure restores a snapshot from HBase and uses that data to run a Phoenix query. You can extend this example or create a new script that leverages Phoenix's clients to suit your needs.

  1. Create a cluster with Phoenix installed, using the following command:

    Copy
    aws emr create-cluster --name "Cluster with Phoenix" --log-uri s3://myBucket/myLogFolder --release-label emr-5.5.0 \ --applications Name=Phoenix Name=HBase --ec2-attributes KeyName=myKey \ --instance-type m3.xlarge --instance-count 3 --use-default-roles
  2. Create then upload the following files to Amazon S3:

    copySnapshot.sh

    Copy
    sudo su hbase -s /bin/sh -c 'hbase snapshot export \ -D hbase.rootdir=s3://us-east-1.elasticmapreduce.samples/hbase-demo-customer-data/snapshot/.hbase-snapshot \ -snapshot customer_snapshot1 \ -copy-to hdfs://masterDNSName:8020/user/hbase \ -mappers 2 -chuser hbase -chmod 700'

    runQuery.sh

    Copy
    aws s3 cp s3://myBucket/phoenixQuery.sql /home/hadoop/ /usr/lib/phoenix/bin/sqlline-thin.py http://localhost:8765 /home/hadoop/phoenixQuery.sql

    phoenixQuery.sql

    Copy
    CREATE VIEW "customer" ( pk VARCHAR PRIMARY KEY, "address"."state" VARCHAR, "address"."street" VARCHAR, "address"."city" VARCHAR, "address"."zip" VARCHAR, "cc"."number" VARCHAR, "cc"."expire" VARCHAR, "cc"."type" VARCHAR, "contact"."phone" VARCHAR); CREATE INDEX my_index ON "customer" ("customer"."state") INCLUDE("PK", "customer"."city", "customer"."expire", "customer"."type"); SELECT "customer"."type" AS credit_card_type, count(*) AS num_customers FROM "customer" WHERE "customer"."state" = 'CA' GROUP BY "customer"."type";

    Use the AWS CLI to submit the files to the S3 bucket:

    Copy
    aws s3 cp copySnapshot.sh s3://myBucket/ aws s3 cp runQuery.sh s3://myBucket/ aws s3 cp phoenixQuery.sql s3://myBucket/
  3. Create a table using the following step submitted to the cluster that you created in Step 1:

    createTable.json

    Copy
    [ { "Name": "Create HBase Table", "Args": ["bash", "-c", "echo $'create \"customer\",\"address\",\"cc\",\"contact\"' | hbase shell"], "Jar": "command-runner.jar", "ActionOnFailure": "CONTINUE", "Type": "CUSTOM_JAR" } ]

    Copy
    aws emr add-steps --cluster-id j-2AXXXXXXGAPLF \ --steps file://./createTable.json
  4. Use script-runner.jar to run the copySnapshot.sh script that you previously uploaded to your S3 bucket:

    Copy
    aws emr add-steps --cluster-id j-2AXXXXXXGAPLF \ --steps Type=CUSTOM_JAR,Name="HBase Copy Snapshot",ActionOnFailure=CONTINUE,\ Jar=s3://region.elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://myBucket/copySnapshot.sh"]

    This runs a MapReduce job to copy your snapshot data to the cluster HDFS.

  5. Restore the snapshot that you copied to the cluster using the following step:

    restoreSnapshot.json

    Copy
    [ { "Name": "restore", "Args": ["bash", "-c", "echo $'disable \"customer\"; restore_snapshot \"customer_snapshot1\"; enable \"customer\"' | hbase shell"], "Jar": "command-runner.jar", "ActionOnFailure": "CONTINUE", "Type": "CUSTOM_JAR" } ]

    Copy
    aws emr add-steps --cluster-id j-2AXXXXXXGAPLF \ --steps file://./restoreSnapshot.json
  6. Use script-runner.jar to run the runQuery.sh script that you previously uploaded to your S3 bucket:

    Copy
    aws emr add-steps --cluster-id j-2AXXXXXXGAPLF \ --steps Type=CUSTOM_JAR,Name="Phoenix Run Query",ActionOnFailure=CONTINUE,\ Jar=s3://region.elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://myBucket/runQuery.sh"]

    The query runs and returns the results to the step's stdout. It may take a few minutes for this step to complete.

  7. Inspect the results of the step's stdout at the log URI that you used when you created the cluster in Step 1. The results should look like the following:

    Copy
    +------------------------------------------+-----------------------------------+ | CREDIT_CARD_TYPE | NUM_CUSTOMERS | +------------------------------------------+-----------------------------------+ | american_express | 5728 | | dankort | 5782 | | diners_club | 5795 | | discover | 5715 | | forbrugsforeningen | 5691 | | jcb | 5762 | | laser | 5769 | | maestro | 5816 | | mastercard | 5697 | | solo | 5586 | | switch | 5781 | | visa | 5659 | +------------------------------------------+-----------------------------------+