Menu
Amazon Redshift
Database Developer Guide (API Version 2012-12-01)

Creating External Tables for Amazon Redshift Spectrum

Amazon Redshift Spectrum uses external tables to query data that is stored in Amazon S3. You can query an external table using the same SELECT syntax you use with other Amazon Redshift tables. External tables are read-only. You can't write to an external table.

You can create an external table in Amazon Redshift, in Amazon Athena, or in an Apache Hive metastore. For information about creating tables in Athena, see Getting Started in the Amazon Athena User Guide. For information about using a Hive metastore, see Apache Hive in the Amazon EMR documentation.

Important

Redshift Spectrum doesn't support nested data types, such as STRUCT, ARRAY, and MAP.

If your external table is defined in Athena or a Hive metastore, you first create an external schema that references the external database. Then you can reference the external table in your SELECT statement by prefixing the table name with the schema name, without needing to create the table in Amazon Redshift. For more information, see Creating External Schemas for Amazon Redshift Spectrum.

For example, suppose that you have an external table named lineitem_athena defined in an Athena external catalog. In this case, you can define an external schema named athena_schema, then query the table using the following SELECT statement.

Copy
select count(*) from athena_schema.lineitem_athena;

To define an external table in Amazon Redshift, use the CREATE EXTERNAL TABLE command. The external table statement defines the table columns, the format of your data files, and the location of your data in Amazon S3. Redshift Spectrum scans the files in the specified folder and any subfolders. Redshift Spectrum ignores hidden files and files that begin with a period, underscore, or hash mark ( . , _, or #) or end with a tilde (~).

The following example creates a table named SALES in the Amazon Redshift external schema named spectrum. The data is in tab-delimited text files.

Copy
create external table spectrum.sales( salesid integer, listid integer, sellerid integer, buyerid integer, eventid integer, dateid smallint, qtysold smallint, pricepaid decimal(8,2), commission decimal(8,2), saletime timestamp) row format delimited fields terminated by '\t' stored as textfile location 's3://awssampledbuswest2/tickit/spectrum/sales/' table properties ('numRows'='172000');

To view external tables, query the SVV_EXTERNAL_TABLES system view.

Partitioning Redshift Spectrum External Tables

When you partition your data, you can restrict the amount of data Redshift Spectrum scans by filtering on the partition key. You can partition your data by any key. A common practice is to partition the data based on time. For example, you might choose to partition by year, month, date, and hour. If you have data coming from multiple sources, you might partition by a data source identifier and date.

The following procedure describes how to partition your data.

To partition your data

  1. Store your data in folders in Amazon S3 according to your partition key.

    Create one folder for each partition value and name the folder with the partition key and value. For example, if you partition by date, you might have folders named saledate=2017-04-31, saledate=2017-04-30, and so on. Redshift Spectrum scans the files in the partition folder, but doesn't scan subfolders. Redshift Spectrum ignores hidden files and files that begin with a period, underscore, or hash mark ( . , _, or #) or end with a tilde (~).

  2. Create an external table and specify the partition key in the PARTITIONED BY clause.

    The partition key can't be the name of a table column. The data type can be any standard Amazon Redshift data type except TIMESTAMPTZ. You can use the DATE type only for a partition column, not for other external table columns.

  3. Add the partitions.

    Using ALTER TABLE … ADD PARTITION, add each partition, specifying the partition column and key value, and the location of the partition folder in Amazon S3. You can add only one partition in each ALTER TABLE statement. The following example adds partitions for '2008-01-01' and '2008-02-01'.

    Copy
    alter table spectrum.sales_part add partition(saledate='2008-01-01') location 's3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-01/'; alter table spectrum.sales_part add partition(saledate='2008-02-01') location 's3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-02/';

Partitioning Data Example

In this example, you'll create an external table that is partitioned by month.

The sample data for this example is located in an Amazon S3 buckets that gives read access to all authenticated AWS users. Your cluster and your external data files must be in the same region. The sample data bucket is in the US West (Oregon) Region (us-west-2). To access the data using Redshift Spectrum, your cluster must also be in us-west-2. To list the folders in Amazon S3, run the following command.

Copy
aws s3 ls s3://awssampledbuswest2/tickit/spectrum/sales_partition/
PRE saledate=2008-01/
   PRE saledate=2008-02/
   PRE saledate=2008-03/
   PRE saledate=2008-04/
   PRE saledate=2008-05/
   PRE saledate=2008-06/
   PRE saledate=2008-07/
   PRE saledate=2008-08/
   PRE saledate=2008-09/
   PRE saledate=2008-10/
   PRE saledate=2008-11/
   PRE saledate=2008-12/

If you don't already have an external schema, run the following command, substituting the Amazon Resource Name (ARN) for your AWS Identity and Access Management (IAM) role.

Copy
create external schema spectrum from data catalog database 'spectrumdb' iam_role 'arn:aws:iam::123456789012:role/myspectrumrole' create external database if not exists;

To create an external table partitioned by date, run the following command.

Copy
create external table spectrum.sales_part( salesid integer, listid integer, sellerid integer, buyerid integer, eventid integer, dateid smallint, qtysold smallint, pricepaid decimal(8,2), commission decimal(8,2), saletime timestamp) partitioned by (saledate date) row format delimited fields terminated by '|' stored as textfile location 's3://awssampledbuswest2/tickit/spectrum/sales_partition/' table properties ('numRows'='172000');

To add the partitions, run the following ALTER TABLE commands.

Copy
alter table spectrum.sales_part add partition(saledate='2008-01-01') location 's3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-01/'; alter table spectrum.sales_part add partition(saledate='2008-02-01') location 's3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-02/'; alter table spectrum.sales_part add partition(saledate='2008-03-01') location 's3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-03/'; alter table spectrum.sales_part add partition(saledate='2008-04-01') location 's3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-04/'; alter table spectrum.sales_part add partition(saledate='2008-05-01') location 's3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-05/'; alter table spectrum.sales_part add partition(saledate='2008-06-01') location 's3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-06/'; alter table spectrum.sales_part add partition(saledate='2008-07-01') location 's3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-07/'; alter table spectrum.sales_part add partition(saledate='2008-08-01') location 's3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-08/'; alter table spectrum.sales_part add partition(saledate='2008-09-01') location 's3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-09/'; alter table spectrum.sales_part add partition(saledate='2008-10-01') location 's3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-10/'; alter table spectrum.sales_part add partition(saledate='2008-11-01') location 's3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-11/'; alter table spectrum.sales_part add partition(saledate='2008-12-01') location 's3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-12/';

Run the following query to select data from the partitioned table.

Copy
select top 10 spectrum.sales_part.eventid, sum(spectrum.sales_part.pricepaid) from spectrum.sales_part, event where spectrum.sales_part.eventid = event.eventid and spectrum.sales_part.pricepaid > 30 and saledate = '2008-12-01' group by spectrum.sales_part.eventid order by 2 desc;
eventid | sum     
--------+---------
    914 | 36173.00
   5478 | 27303.00
   5061 | 26383.00
   4406 | 26252.00
   5324 | 24015.00
   1829 | 23911.00
   3601 | 23616.00
   3665 | 23214.00
   6069 | 22869.00
   5638 | 22551.00

To view external table partitions, query the SVV_EXTERNAL_PARTITIONS system view.

Copy
select schemaname, tablename, values, location from svv_external_partitions where tablename = 'sales_part';
schemaname | tablename  | values         | location                                         
-----------+------------+----------------+--------------------------------------------------
spectrum   | sales_part | ["2008-01-01"] | s3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-01
spectrum   | sales_part | ["2008-02-01"] | s3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-02
spectrum   | sales_part | ["2008-03-01"] | s3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-03
spectrum   | sales_part | ["2008-04-01"] | s3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-04
spectrum   | sales_part | ["2008-05-01"] | s3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-05
spectrum   | sales_part | ["2008-06-01"] | s3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-06
spectrum   | sales_part | ["2008-07-01"] | s3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-07
spectrum   | sales_part | ["2008-08-01"] | s3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-08
spectrum   | sales_part | ["2008-09-01"] | s3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-09
spectrum   | sales_part | ["2008-10-01"] | s3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-10
spectrum   | sales_part | ["2008-11-01"] | s3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-11
spectrum   | sales_part | ["2008-12-01"] | s3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-12