Connecting Athena to a Hive Metastore Using an Existing IAM Execution Role - Amazon Athena

To connect your external Hive metastore to Athena with a Lambda function that uses an existing IAM role, you can use Athena's reference implementation of the Athena connector for external Hive metastore.

The three major steps are as follows:

  1. Clone and Build – Clone the Athena reference implementation and build the JAR file that contains the Lambda function code.

  2. AWS Lambda console – In the AWS Lambda console, create a Lambda function, assign it an existing IAM execution role, and upload the function code that you generated.

  3. Amazon Athena console – In the Amazon Athena console, create a data catalog name that you can use to refer to your external Hive metastore in your Athena queries.

If you already have permissions to create a custom IAM role, you can use a simpler workflow that uses the Athena console and the AWS Serverless Application Repository to create and configure a Lambda function. For more information, see Connecting Athena to an Apache Hive Metastore.

Prerequisites

Clone and Build the Lambda Function

The function code for the Athena reference implementation is a Maven project located on GitHub at awslabs/aws-athena-hive-metastore. For detailed information about the project, see the corresponding README file on GitHub or the Reference Implementation topic in this documentation.

To clone and build the Lambda function code

  1. Enter the following command to clone the Athena reference implementation:

    git clone https://github.com/awslabs/aws-athena-hive-metastore

  2. Run the following command to build the .jar file for the Lambda function:

    mvn clean install

    After the project builds successfully, the following .jar file is created in the target folder of your project:

    hms-lambda-func-1.0-SNAPSHOT-withdep.jar

    In the next section, you use the AWS Lambda console to upload this file to your Amazon Web Services account.
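Before you move on, it can help to confirm that the build produced the expected artifact. The following is a minimal Python sketch and is not part of the reference implementation. The helper name find_built_jar is hypothetical, and the recursive search is an assumption made so that the exact location of the target folder does not need to be known in advance.

```python
from pathlib import Path
from typing import Optional

# File name taken from the build step above.
JAR_NAME = "hms-lambda-func-1.0-SNAPSHOT-withdep.jar"


def find_built_jar(project_dir: str, jar_name: str = JAR_NAME) -> Optional[Path]:
    """Return the path of the built .jar under project_dir, or None if absent.

    Hypothetical helper: it looks in every folder named 'target' below
    project_dir rather than assuming one fixed module layout.
    """
    for candidate in sorted(Path(project_dir).rglob(jar_name)):
        # Accept only regular files that sit directly inside a 'target' folder.
        if candidate.is_file() and candidate.parent.name == "target":
            return candidate
    return None
```

For example, find_built_jar("aws-athena-hive-metastore") returns the path of the .jar file if the build succeeded in the cloned directory, and None otherwise.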

Create and Configure the Lambda Function in the AWS Lambda Console

In this section, you use the AWS Lambda console to create a function that uses an existing IAM execution role. After you configure a VPC for the function, you upload the function code and configure the environment variables for the function.

Create the Lambda Function

In this step, you create a function in the AWS Lambda console that uses an existing IAM role.

To create a Lambda function that uses an existing IAM role

  1. Sign in to the AWS Management Console and open the AWS Lambda console at https://console.aws.amazon.com/lambda/.

  2. In the navigation pane, choose Functions.

  3. Choose Create function.

  4. Choose Author from scratch.

  5. For Function name, enter the name of your Lambda function (for example, EHMSBasedLambda).

  6. For Runtime, choose Java 8.

    
  7. Under Permissions, expand Change default execution role.

  8. For Execution role, choose Use an existing role.

  9. For Existing role, choose the IAM execution role that your Lambda function will use for Athena (this example uses a role called AthenaLambdaExecutionRole).

    
  10. Expand Advanced settings.

  11. For VPC, choose the VPC that your function will have access to.

  12. For Subnets, choose the VPC subnets for Lambda to use.

  13. For Security groups, choose the VPC security groups for Lambda to use.

  14. Choose Create function. The AWS Lambda console opens the configuration page for your function and begins creating your function.

    
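If you prefer to script these console steps, they map roughly onto the Lambda CreateFunction API. The following Python sketch only assembles the request and does not send it. The function name build_create_function_request is hypothetical. The handler class and the runtime identifier are not given in this topic, so both are left as parameters that you must supply from the project README and the current Lambda documentation. Unlike the console flow, the API expects the function code in the same request.

```python
from typing import Any, Dict, List


def build_create_function_request(
    function_name: str,
    role_arn: str,
    runtime: str,
    handler: str,
    jar_bytes: bytes,
    subnet_ids: List[str],
    security_group_ids: List[str],
) -> Dict[str, Any]:
    """Assemble keyword arguments for the Lambda CreateFunction API.

    Hypothetical helper: runtime and handler are caller-supplied because
    this topic does not state their exact values.
    """
    if not role_arn.startswith("arn:"):
        raise ValueError("role_arn must be the ARN of an existing IAM role")
    if not subnet_ids or not security_group_ids:
        raise ValueError("a VPC configuration needs subnets and security groups")
    return {
        "FunctionName": function_name,
        "Runtime": runtime,
        "Role": role_arn,  # existing IAM execution role
        "Handler": handler,
        "Code": {"ZipFile": jar_bytes},  # contents of the built .jar file
        "VpcConfig": {
            "SubnetIds": list(subnet_ids),
            "SecurityGroupIds": list(security_group_ids),
        },
    }
```

With the AWS SDK for Python (Boto3) installed and credentials configured, passing the result to boto3.client("lambda").create_function(**request) would submit it. Treat this as a starting point under the stated assumptions, not as a replacement for the console procedure.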

Upload the Code and Configure the Lambda Function

When the console informs you that your function has been successfully created, you are ready to upload the function code and configure its environment variables.

To upload your Lambda function code and configure its environment variables

  1. In the Lambda console, navigate to the page for your function if necessary.

  2. For Function code, choose Actions, and then choose Upload a .zip or .jar file.

    
  3. Upload the hms-lambda-func-1.0-SNAPSHOT-withdep.jar file that you generated previously.

  4. In the Environment variables section of the configuration page for your function, choose Edit.

    
  5. On the Edit environment variables page, add the following environment variable keys and values:

    • HMS_URIS – Enter the URI of your Hive metastore host, which uses the Thrift protocol on port 9083, in the following syntax:

      thrift://<host_name>:9083

    • SPILL_LOCATION – Specify an Amazon S3 location in your Amazon Web Services account to hold spillover metadata if the Lambda function response size exceeds 4 MB.

      
  6. Choose Save.
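A mistyped value in either variable is easy to miss in the console. The following Python sketch checks the two values before you enter them. It is an illustration only: the function names are hypothetical, it assumes that HMS_URIS holds a single URI, and it assumes that the Amazon S3 location is written in the s3://bucket/prefix form.

```python
from typing import Dict, List
from urllib.parse import urlparse


def validate_hms_uri(uri: str) -> bool:
    """True if uri looks like thrift://<host_name>:<port> with nothing after it."""
    parsed = urlparse(uri)
    try:
        port = parsed.port  # raises ValueError for a malformed port such as "9083."
    except ValueError:
        return False
    return (
        parsed.scheme == "thrift"
        and bool(parsed.hostname)
        and port is not None
        and parsed.path == ""
    )


def validate_spill_location(location: str) -> bool:
    """True if location has the assumed s3://bucket/prefix form."""
    parsed = urlparse(location)
    return parsed.scheme == "s3" and bool(parsed.netloc)


def check_environment(variables: Dict[str, str]) -> List[str]:
    """Return a list of problems found in the two environment variables."""
    problems = []
    if not validate_hms_uri(variables.get("HMS_URIS", "")):
        problems.append("HMS_URIS should look like thrift://<host_name>:9083")
    if not validate_spill_location(variables.get("SPILL_LOCATION", "")):
        problems.append("SPILL_LOCATION should be an s3:// location")
    return problems
```

An empty list from check_environment means that both values passed these format checks. The sketch does not verify that the metastore host is reachable or that the bucket exists.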

Connect Athena to Your Hive Metastore

Now you can use the Athena console to prepare Athena to use your Hive metastore. In this step, you create a data catalog name to use in your Athena queries that refers to your external Hive metastore.

To connect Athena to your Hive metastore

  1. Open the Athena console at https://console.aws.amazon.com/athena/.

  2. Do one of the following:

    • In the Query Editor navigation pane, choose Connect data source.

      
    • Choose the Data sources tab, and then choose Connect data source.

      
  3. On the Connect data source page, for Choose a metadata catalog, choose Apache Hive metastore.

    
  4. Choose Next.

  5. On the Connection details page, for Lambda function, use the Choose Lambda function option to choose the Lambda function that you created.

    

    A new Lambda function ARN entry shows the ARN of your Lambda function.

    
  6. For Catalog name, enter a unique name that you will use in your SQL queries to reference your Hive data source.

    Note

    The names awsdatacatalog, hive, jmx, and system are reserved by Athena and cannot be used for custom catalog names.

  7. Choose Connect. This connects Athena to your Hive metastore catalog.

  8. You can now use the Catalog name that you specified to reference the Hive metastore in your SQL queries. In your SQL queries, use the following example syntax, replacing ehms-catalog with the catalog name that you specified earlier.

    SELECT * FROM "ehms-catalog".CustomerData.customers;
  9. To view, edit, or delete the data sources that you create, see Managing Data Sources.
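The catalog name rules above can be checked in code before you build queries. The following Python sketch rejects the reserved names and double-quotes any name part that contains characters other than letters, digits, and underscores, such as the hyphen in ehms-catalog. The function names are hypothetical, the case-insensitive comparison against the reserved names is an assumption, and the sketch does not handle SQL reserved words.

```python
import re

# Names that this topic lists as reserved by Athena.
RESERVED_CATALOG_NAMES = {"awsdatacatalog", "hive", "jmx", "system"}
_PLAIN_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_allowed_catalog_name(name: str) -> bool:
    """False for empty names and for the names that Athena reserves."""
    # Comparing in lowercase is an assumption, not stated in this topic.
    return bool(name) and name.lower() not in RESERVED_CATALOG_NAMES


def quote_identifier(part: str) -> str:
    """Double-quote part unless it contains only letters, digits, underscores."""
    if _PLAIN_IDENTIFIER.fullmatch(part):
        return part
    return '"' + part.replace('"', '""') + '"'


def build_select_all(catalog: str, database: str, table: str) -> str:
    """Build a SELECT * query against catalog.database.table."""
    if not is_allowed_catalog_name(catalog):
        raise ValueError(f"catalog name {catalog!r} is empty or reserved")
    qualified = ".".join(quote_identifier(p) for p in (catalog, database, table))
    return f"SELECT * FROM {qualified};"
```

For the example names used in this topic, build_select_all("ehms-catalog", "CustomerData", "customers") quotes only the catalog name, because it is the only part that contains a hyphen.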