Menu
Amazon Redshift
Database Developer Guide (API Version 2012-12-01)

Creating External Schemas for Amazon Redshift Spectrum

All external tables must be created in an external schema, which you create using a CREATE EXTERNAL SCHEMA statement.

Note

Some applications use the term database and schema interchangeably. In Amazon Redshift, we use the term schema.

An Amazon Redshift external schema references an external database in an external data catalog. You can create the external database in Amazon Redshift, in Amazon Athena, or in an Apache Hive metastore, such as Amazon EMR. If you create an external database in Amazon Redshift, the database resides in the Athena data catalog. To create a database in a Hive metastore, you need to create the database in your Hive application.

Amazon Redshift needs authorization to access the data catalog in Athena and the data files in Amazon S3 on your behalf. To provide that authorization, you first create an AWS Identity and Access Management (IAM) role. Then you attach the role to your cluster and provide Amazon Resource Name (ARN) for the role in the Amazon Redshift CREATE EXTERNAL SCHEMA statement. For more information about authorization, see IAM Policies for Amazon Redshift Spectrum.

To create an external database at the same time you create an external schema, specify FROM DATA CATALOG and include the CREATE EXTERNAL DATABASE clause in your CREATE EXTERNAL SCHEMA statement.

The following example creates an external schema named spectrum_schema using the external database spectrum_db.

Copy
create external schema spectrum_schema from data catalog database 'spectrum_db' iam_role 'arn:aws:iam::123456789012:role/MySpectrumRole' create external database if not exists;

If you manage your data catalog using Athena, specify the Athena database name and the AWS Region in which the Athena data catalog is located.

The following example creates an external schema using the default sampledb database in the Athena data catalog.

Copy
create external schema athena_schema from data catalog database 'sampledb' iam_role 'arn:aws:iam::123456789012:role/MySpectrumRole' region 'us-east-2';

Note

The region parameter references the region in which the Athena data catalog is located, not the location of the data files in Amazon S3.

When using the Athena data catalog, the following limits apply:

  • A maximum of 100 databases per account.

  • A maximum of 100 tables per database.

  • A maximum of 20,000 partitions per table.

You can request a limit increase by contacting AWS Support.

To avoid the limits, use a Hive metastore instead of an Athena data catalog.

If you manage your data catalog using a Hive metastore, such as Amazon EMR, your security groups must be configured to allow traffic between the clusters.

In the CREATE EXTERNAL SCHEMA statement, specify FROM HIVE METASTORE and include the metastore's URI and port number. The following example creates an external schema using a Hive metastore database named hive_db.

Copy
create external schema hive_schema from hive metastore database 'hive_db' uri '172.10.10.10' port 99 iam_role 'arn:aws:iam::123456789012:role/MySpectrumRole'

To view external schemas for your cluster, query the PG_EXTERNAL_SCHEMA catalog table or the SVV_EXTERNAL_SCHEMAS view. The following example queries SVV_EXTERNAL_SCHEMAS, which joins PG_EXTERNAL_SCHEMA and PG_NAMESPACE.

Copy
select * from svv_external_schemas

For the full command syntax and examples, see CREATE EXTERNAL SCHEMA.

Working with Amazon Redshift Spectrum External Catalogs

The metadata for Amazon Redshift Spectrum external databases and external tables is stored in an external data catalog. By default, Redshift Spectrum metadata is stored in an Athena data catalog. You can view and manage Redshift Spectrum databases and tables in your Athena console.

You can also create and manage external databases and external tables using Hive data definition language (DDL) using Athena or a Hive metastore, such as Amazon EMR.

Note

We recommend using Amazon Redshift to create and manage your Redshift Spectrum external databases and external tables.

Viewing Redshift Spectrum Databases in Athena

If you created an external database by including the CREATE EXTERNAL DATABASE IF NOT EXISTS clause as part of your CREATE EXTERNAL SCHEMA statement, the external database metadata is stored in your Athena data catalog. The metadata for external tables that you create qualified by the external schema is also stored in your Athena data catalog.

Athena maintains a data catalog for each supported AWS Region. To view table metadata, log in to the Athena console and choose Catalog Manager. The following example shows the Athena Catalog Manager for the US West (Oregon) Region.

If you create and manage your external tables using Athena, register the database using CREATE EXTERNAL SCHEMA. For example, the following command registers the Athena database named sampledb.

Copy
create external schema athena_sample from data catalog database 'sampledb' iam_role 'arn:aws:iam::123456789012:role/mySpectrumRole' region 'us-east-1';

When you query the SVV_EXTERNAL_TABLES system view, you see tables in the Athena sampledb database and also tables you created in Amazon Redshift.

Copy
select * from svv_external_tables;
schemaname    | tablename        | location                                               
--------------+------------------+--------------------------------------------------------
athena_sample | elb_logs         | s3://athena-examples/elb/plaintext           
athena_sample | lineitem_1t_csv  | s3://myspectrum/tpch/1000/lineitem_csv                
athena_sample | lineitem_1t_part | s3://myspectrum/tpch/1000/lineitem_partition          
spectrum      | sales            | s3://awssampledbuswest2/tickit/spectrum/sales          
spectrum      | sales_part       | s3://awssampledbuswest2/tickit/spectrum/sales_part

Registering an Apache Hive Metastore Database

If you create external tables in an Apache Hive metastore, you can use CREATE EXTERNAL SCHEMA to register those tables in Redshift Spectrum.

In the CREATE EXTERNAL SCHEMA statement, specify the FROM HIVE METASTORE clause and provide the Hive metastore URI and port number. The IAM role must include permission to access Amazon S3 but does not need any Athena permissions. The following example registers a Hive metastore.

Copy
create external schema if not exists hive_schema from hive metastore database 'hive_database' uri 'ip-10-0-111-111.us-west-2.compute.internal' port 9083 iam_role 'arn:aws:iam::123456789012:role/mySpectrumRole';

Enabling Your Amazon Redshift Cluster to Access Your Amazon EMR Cluster

If your Hive metastore is in Amazon EMR, you must give your Amazon Redshift cluster access to your Amazon EMR cluster. To do so, you create an Amazon EC2 security group and allow all inbound traffic to the EC2 security group from your Amazon Redshift cluster's security group and your Amazon EMR cluster's security group. Then you add the EC2 security to both your Amazon Redshift cluster and your Amazon EMR cluster.

To enable your Amazon Redshift cluster to access your Amazon EMR cluster

  1. In Amazon Redshift, make a note of your cluster's security group name. In the Amazon Redshift dashboard, choose your cluster. Find your cluster security groups in the Cluster Properties group.

  2. In Amazon EMR, make a note of the EMR master node security group name.

  3. Create or modify an Amazon EC2 security group to allow connection between Amazon Redshift and Amazon EMR:

    1. In the Amazon EC2 dashboard, choose Security Groups.

    2. Choose Create Security Group.

    3. If using VPC, choose the VPC that both your Amazon Redshift and Amazon EMR clusters are in.

    4. Add an inbound rule.

    5. For Type, choose TCP.

    6. For Source, choose Custom.

    7. Type the name of your Amazon Redshift security group.

    8. Add another inbound rule.

    9. For Type, choose TCP.

    10. For Port Range, type 9083.

      Note

      The default port for an EMR HMS is 9083. If your HMS uses a different port, specify that port in the inbound rule and in the external schema definition.

    11. For Source, choose Custom.

    12. Type the name of your Amazon EMR security group.

    13. Choose Create.

  4. Add the Amazon EC2 security group you created in the previous step to your Amazon Redshift cluster and to your Amazon EMR cluster:

    1. In Amazon Redshift, choose your cluster.

    2. Choose Cluster, Modify.

    3. In VPC Security Groups, add the new security group by pressing CRTL and choosing the new security group name.

    4. In Amazon EMR, choose your cluster.

    5. Under Hardware, choose the link for the Master node.

    6. Choose the link in the EC2 Instance ID column.

    7. Choose Actions, Networking, Change Security Groups.

    8. Choose the new security group.

    9. Choose Assign Security Groups.