Menu
Amazon EMR
Developer Guide

Impala Examples Included on the Amazon EMR AMI

There are example data sets and queries included with the Impala installation on the Amazon EMR AMI.

Topics

TPCDS

The TPCDS example is derived from Cloudera's Impala demo virtual machine.

To run the TPCDS example

  1. On the master node of the cluster, navigate to the examples directory and run the following scripts:

    Copy
    cd ~/impala/examples/tpcds/ ./tpcds-setup.sh ./tpcds-samplequery.sh

    The tpcds-setup.sh script loads data into HDFS and creates Hive tables. The tpcds-samplequery.sh script uses the following query to demonstrate how to use Impala to query data:

    Copy
    select i_item_id, s_state, avg(ss_quantity) agg1, avg(ss_list_price) agg2, avg(ss_coupon_amt) agg3, avg(ss_sales_price) agg4 FROM store_sales JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk) JOIN item on (store_sales.ss_item_sk = item.i_item_sk) JOIN customer_demographics on (store_sales.ss_cdemo_sk = customer_demographics.cd_demo_sk) JOIN store on (store_sales.ss_store_sk = store.s_store_sk) where cd_gender = 'M' and cd_marital_status = 'S' and cd_education_status = 'College' and d_year = 2002 and s_state in ('TN','SD', 'SD', 'SD', 'SD', 'SD') group by i_item_id, s_state order by i_item_id, s_state limit 100;
  2. Impala can create and manage Parquet tables. Parquet is a column-oriented binary file format intended to be highly efficient for scanning particular columns within a table. For more information, go to http://parquet.io/. After running the query, test the Parquet format by running the following script:

    Copy
    ./tpcds-samplequery-parquet.sh

Wikipedia

The Wikipedia example uses the data and sample queries from the Shark example in GitHub. For more information, go to https://github.com/amplab/shark/wiki/Running-Shark-on-EC2.

To run the Wikipedia example

  • On the master node of the cluster, navigate to the examples directory and run the following script:

    Copy
    cd ~/impala/examples/wikipedia/ ./wikipedia.sh

    Alternatively, you can use this script instead:

    Copy
    ./wikipedia-with-s3distcp.sh

    The wikipedia.sh and wikipedia-with-s3distcp.sh scripts copy 42 GB of data from Amazon S3 to HDFS, create Hive tables, and use Impala to select data from the Hive tables. The difference between wikipedia.sh and wikipedia-with-s3distcp.sh is that wikipedia.sh uses Hadoop distcp to copy data from Amazon S3 to HDFS, but wikipedia-with-s3distcp.sh uses Amazon EMR S3DistCp for the same purpose.

    The wikipedia-with-s3distcp.sh script contains the following code:

    Copy
    #!/bin/bash . /home/hadoop/impala/conf/impala.conf # Copy wikipedia data from s3 to hdfs hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar -Dmapreduce.job.reduces=30 --src s3://spark-data/ --dest hdfs://$HADOOP_NAMENODE_HOST:$HADOOP_NAMENODE_PORT/spark-data/ --outputCodec 'none' # Create hive tables hive -e "CREATE EXTERNAL TABLE wiki_small (id BIGINT, title STRING, last_modified STRING, xml STRING, text STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/spark-data/wikipedia-sample/'" hive -e "CREATE EXTERNAL TABLE wiki_full (id BIGINT, title STRING, last_modified STRING, xml STRING, text STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/spark-data/wikipedia-2010-09-12/'" hive -e "show tables" # Check client hostname client="127.0.0.1" echo "Checking client list..." nodelist=`curl -s http://$HADOOP_NAMENODE_HOST:9026/ws/v1/cluster/hostStatus` echo "Found client list: $nodelist" arr=$(echo $nodelist | tr "\"" "\n") for a in $arr do if [[ $a == ip-* || $a == domU-* || $a =~ ^[0-9] ]]; then client=$a fi done echo "Choose client $client" # Show tables impala-shell -r -i $client:21000 --query="show tables" # Query wiki_small table impala-shell -r -i $client:21000 --query="SELECT COUNT(1) FROM wiki_small WHERE TEXT LIKE '%Berkeley%'" impala-shell -r -i $client:21000 --query="SELECT title FROM wiki_small WHERE TEXT LIKE '%Berkeley%'" # Query wiki_full table impala-shell -r -i $client:21000 --query="SELECT COUNT(1) FROM wiki_full WHERE TEXT LIKE '%Berkeley%'" impala-shell -r -i $client:21000 --query="SELECT title FROM wiki_full WHERE TEXT LIKE '%Berkeley%'"

On this page: