Step 5: Query Your Data Using Hue

This documentation is for AMI versions 2.x and 3.x of Amazon EMR. For information about Amazon EMR releases 4.0.0 and above, see the Amazon EMR Release Guide. For information about managing the Amazon EMR service in 4.x releases, see the Amazon EMR Management Guide.

After you run the Hive script that creates your table and loads the data, log into the Hue web interface and submit an interactive query against the data. Hue is an open source web user interface for Hadoop that allows technical and non-technical users to take advantage of Hive, Pig, and many of the other tools that are part of the Hadoop and Amazon EMR ecosystem. Using Hue gives data analysts and data scientists a simple way to query data and create scripts interactively.

Before logging into Hue, create an SSH tunnel to the Amazon EMR master node.

Create an SSH Tunnel to the Master Node

To connect to Hue and run the script, you must connect to the master node via SSH and establish a tunnel to the Hue interface running on port 8888. Creating the SSH connection requires:

  • An SSH client such as PuTTY (Windows) or OpenSSH (Linux, Mac OS X)

  • An Amazon EC2 key pair private key file (.ppk for Windows or .pem for Linux and Mac OS X)

Your Linux, Unix, or Mac OS X computer most likely includes an SSH client by default; OpenSSH, for example, is installed on most of these operating systems. You can check for an SSH client by typing ssh at a shell command line. If your computer doesn't recognize the command, you must install an SSH client to connect to the master node. The OpenSSH project provides a free implementation of the full suite of SSH tools. For more information, go to http://www.openssh.org.
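As a minimal sketch of the check described above, the following shell snippet tests whether an ssh command is on the PATH instead of relying on an error message. The helper name have_cmd is hypothetical and is introduced only for this example.

```shell
# Return success if the named command is available on the PATH.
# have_cmd is a hypothetical helper name used only for this example.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

if have_cmd ssh; then
  echo "SSH client found at: $(command -v ssh)"
else
  echo "No ssh command found; install an SSH client such as OpenSSH."
fi
```

If an OpenSSH client is installed, typing ssh -V also prints its version.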

For more information about creating an SSH tunnel, see Connect to the Master Node Using SSH.

To create an SSH tunnel to the master node on Linux and Mac OS X using OpenSSH

  1. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.

  2. In the console, on the Cluster List page, select a cluster.

  3. On the Cluster Details page, note the Master public DNS value that appears at the top. This value is required to establish your SSH connection.

  4. In a terminal window, type the following command to open an SSH tunnel from your local machine. This command makes the Hue web interface reachable locally by forwarding traffic from local port 8157 (an arbitrarily chosen, unused local port) to port 8888 on the master node. In the command, replace ~/mykeypair.pem with the location and file name of your .pem file, and replace both occurrences of ec2-###-##-##-###.compute-1.amazonaws.com with the master public DNS name of your cluster.

    ssh -i ~/mykeypair.pem -N -L 8157:ec2-###-##-##-###.compute-1.amazonaws.com:8888 hadoop@ec2-###-##-##-###.compute-1.amazonaws.com

    After you issue this command, the terminal remains open and does not return a response.

    Note

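    -L specifies local port forwarding: connections to the given local port are forwarded through the SSH connection to the specified host and port, in this case the Hue web server listening on port 8888 of the master node. -N tells ssh not to run a remote command, which is why the terminal stays open without returning a response.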
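To make the parts of the tunnel command easier to see, the following sketch assembles it from four values and prints the result. The variable names and the build_tunnel_cmd helper are hypothetical; the values are the same placeholders used in the procedure above and must be replaced with your own.

```shell
# Hypothetical variable names; replace the values with your own.
KEY_FILE="$HOME/mykeypair.pem"                          # path to your .pem file
MASTER_DNS="ec2-###-##-##-###.compute-1.amazonaws.com"  # Master public DNS value
LOCAL_PORT=8157                                         # any unused local port
HUE_PORT=8888                                           # Hue port on the master node

# Print the tunnel command assembled from its four parts.
# Arguments: key file, local port, master DNS name, remote port.
build_tunnel_cmd() {
  printf 'ssh -i %s -N -L %s:%s:%s hadoop@%s\n' "$1" "$2" "$3" "$4" "$3"
}

build_tunnel_cmd "$KEY_FILE" "$LOCAL_PORT" "$MASTER_DNS" "$HUE_PORT"
```

The function only prints the command so that you can review it before running it. OpenSSH usually refuses a private key file that other users can read, so you might need to run chmod 400 on the .pem file first.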

To set up an SSH tunnel to the master node on Windows using PuTTY

Windows users can use an SSH client such as PuTTY to connect to the master node. Before connecting to the Amazon EMR master node, you should download and install PuTTY and PuTTYgen. You can download these tools from the PuTTY download page.

PuTTY does not natively support the private key file format (.pem) generated by Amazon EC2. Use PuTTYgen to convert your key file to the PuTTY format (.ppk) before you attempt to connect to the master node using PuTTY.

For more information about converting your key, see Converting Your Private Key Using PuTTYgen in the Amazon EC2 User Guide for Linux Instances.

  1. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.

  2. In the console, on the Cluster List page, select a cluster.

  3. On the Cluster Details page, note the Master public DNS value that appears at the top. This value is required to establish your SSH connection.

  4. Open putty.exe to start PuTTY. You can also launch PuTTY from the Windows programs list.

  5. If necessary, in the Category list, choose Session.

  6. For Host Name (or IP address), type hadoop@MasterPublicDNS. For example: hadoop@ec2-###-##-##-###.compute-1.amazonaws.com.

  7. In the Category list, expand Connection, expand SSH, and then choose Auth.

  8. For Private key file for authentication, choose Browse and select the .ppk file that you generated.

  9. In the Category list, under Connection and SSH, choose Tunnels.

  10. For Source port, type an unused local port number, for example 8157.

  11. For Destination, type MasterPublicDNS:8888 to access the Hue interface, for example ec2-###-##-##-###.compute-1.amazonaws.com:8888.

  12. Leave the Local and Auto options selected.

  13. Choose Add. You should see an entry in the Forwarded ports field similar to: L8157 ec2-###-##-##-###.compute-1.amazonaws.com:8888.

  14. Choose Open, and then choose Yes to dismiss the PuTTY security alert.

    Important

    When you log into the master node and are prompted for a user name, type hadoop.

Log into Hue and Submit an Interactive Hive Query

After configuring your SSH tunnel to the Amazon EMR master node, log into Hue and run the Hive script.
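Before opening the browser, it can help to confirm that the tunnel is forwarding requests. The following is a minimal sketch, assuming curl is installed on your local machine and that the tunnel uses local port 8157 as in the previous procedures. The function names hue_url and check_hue are hypothetical.

```shell
# Build the local Hue URL for a given forwarded port.
hue_url() {
  printf 'http://localhost:%s\n' "$1"
}

# Print the HTTP status code returned through the tunnel, followed by a
# hint if the request fails. Assumes curl is installed; both function
# names are hypothetical.
check_hue() {
  curl -s -o /dev/null --max-time 5 -w '%{http_code}\n' "$(hue_url "$1")" ||
    echo "No response on port $1; check that the SSH tunnel is still running."
}

hue_url 8157
```

Calling check_hue 8157 while the tunnel is open should print an HTTP status code; a failed connection prints the hint instead.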

To run the Hive script in Hue

  1. Type the following URL in your browser: http://localhost:8157.

  2. At the Hue welcome page, type a Username and Password. The name and password used the first time you log into Hue become the Hue superuser credentials.

    Note

    The password must be at least 8 characters long, and must contain both uppercase and lowercase letters, at least one number, and at least one special character.

  3. At the Did you know? dialog, choose Got it, prof! When the My Documents page opens, the sample projects are displayed.

  4. From the menu options, choose Query Editors > Hive.

  5. Delete the sample text and type:

    SELECT browser, COUNT(*) count FROM cloudfront_logs WHERE date BETWEEN '2014-07-05' AND '2014-08-05' GROUP BY browser;

    This HiveQL query returns the total number of requests per browser for dates from 2014-07-05 through 2014-08-05, inclusive.

  6. Choose Execute. As the query runs, log entries are displayed on the Log tab in the window below. When the query completes, the Results tab is displayed.

  7. Review the output data from the query.

  8. After examining the output, close your browser, or as time permits, continue to explore Hue.
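The query in step 5 is fixed to one time frame. As a sketch of how the same query could be produced for other dates, the following shell function fills in the start and end dates. The helper name browser_query is hypothetical, and running the result with the hive command-line client assumes that client is available on the master node.

```shell
# Print the HiveQL query from step 5 for a given start and end date.
# browser_query is a hypothetical helper name.
browser_query() {
  printf "SELECT browser, COUNT(*) count FROM cloudfront_logs WHERE date BETWEEN '%s' AND '%s' GROUP BY browser;\n" "$1" "$2"
}

QUERY="$(browser_query 2014-07-05 2014-08-05)"
echo "$QUERY"

# On the master node, where the hive command-line client is assumed to be
# available, the same query could be run non-interactively:
#   hive -e "$QUERY"
```

The printed text can also be pasted into the Hue query editor in place of the query in step 5.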