示例 - Amazon Redshift

示例

以下示例在名为 spectrum 的 Amazon Redshift 外部 schema 中创建一个名为 SALES 的表。数据位于制表符分隔的文本文件中。TABLE PROPERTIES 子句将 numRows 属性设置为 170000 行。

根据您用于运行 CREATE EXTERNAL TABLE 的身份,可能需要配置 IAM 权限。作为最佳实践,我们建议将权限策略附加到 IAM 角色,然后根据需要将其分配给用户和组。有关更多信息,请参阅 Amazon Redshift 中的 Identity and Access Management

create external table spectrum.sales( salesid integer, listid integer, sellerid integer, buyerid integer, eventid integer, saledate date, qtysold smallint, pricepaid decimal(8,2), commission decimal(8,2), saletime timestamp) row format delimited fields terminated by '\t' stored as textfile location 's3://redshift-downloads/tickit/spectrum/sales/' table properties ('numRows'='170000');

以下示例创建一个使用 JsonSerDe 以 JSON 格式引用数据的表。

create external table spectrum.cloudtrail_json ( event_version int, event_id bigint, event_time timestamp, event_type varchar(10), awsregion varchar(20), event_name varchar(max), event_source varchar(max), requesttime timestamp, useragent varchar(max), recipientaccountid bigint) row format serde 'org.openx.data.jsonserde.JsonSerDe' with serdeproperties ( 'dots.in.keys' = 'true', 'mapping.requesttime' = 'requesttimestamp' ) location 's3://mybucket/json/cloudtrail';

以下 CREATE EXTERNAL TABLE AS 示例创建一个未分区的外部表。然后,它将 SELECT 查询的结果以 Apache Parquet 格式写入到目标 Amazon S3 位置。

CREATE EXTERNAL TABLE spectrum.lineitem STORED AS parquet LOCATION 'S3://mybucket/cetas/lineitem/' AS SELECT * FROM local_lineitem;

以下示例创建分区的外部表,并在 SELECT 查询中包含分区列。

CREATE EXTERNAL TABLE spectrum.partitioned_lineitem PARTITIONED BY (l_shipdate, l_shipmode) STORED AS parquet LOCATION 'S3://mybucket/cetas/partitioned_lineitem/' AS SELECT l_orderkey, l_shipmode, l_shipdate, l_partkey FROM local_table;

如需外部数据目录中的现有数据库的列表,请查询 SVV_EXTERNAL_DATABASES 系统视图。

select eskind,databasename,esoptions from svv_external_databases order by databasename;
eskind | databasename | esoptions -------+--------------+---------------------------------------------------------------------------------- 1 | default | {"REGION":"us-west-2","IAM_ROLE":"arn:aws:iam::123456789012:role/mySpectrumRole"} 1 | sampledb | {"REGION":"us-west-2","IAM_ROLE":"arn:aws:iam::123456789012:role/mySpectrumRole"} 1 | spectrumdb | {"REGION":"us-west-2","IAM_ROLE":"arn:aws:iam::123456789012:role/mySpectrumRole"}

要查看外部表的详细信息,请查询 SVV_EXTERNAL_TABLESSVV_EXTERNAL_COLUMNS 系统视图。

以下示例将查询 SVV_EXTERNAL_TABLES 视图。

select schemaname, tablename, location from svv_external_tables;
schemaname | tablename | location -----------+----------------------+-------------------------------------------------------- spectrum | sales | s3://redshift-downloads/tickit/spectrum/sales spectrum | sales_part | s3://redshift-downloads/tickit/spectrum/sales_partition

以下示例将查询 SVV_EXTERNAL_COLUMNS 视图。

select * from svv_external_columns where schemaname like 'spectrum%' and tablename ='sales';
schemaname | tablename | columnname | external_type | columnnum | part_key -----------+-----------+------------+---------------+-----------+--------- spectrum | sales | salesid | int | 1 | 0 spectrum | sales | listid | int | 2 | 0 spectrum | sales | sellerid | int | 3 | 0 spectrum | sales | buyerid | int | 4 | 0 spectrum | sales | eventid | int | 5 | 0 spectrum | sales | saledate | date | 6 | 0 spectrum | sales | qtysold | smallint | 7 | 0 spectrum | sales | pricepaid | decimal(8,2) | 8 | 0 spectrum | sales | commission | decimal(8,2) | 9 | 0 spectrum | sales | saletime | timestamp | 10 | 0

要查看表分区,请使用以下查询。

select schemaname, tablename, values, location from svv_external_partitions where tablename = 'sales_part';
schemaname | tablename | values | location -----------+------------+----------------+------------------------------------------------------------------------- spectrum | sales_part | ["2008-01-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-01 spectrum | sales_part | ["2008-02-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-02 spectrum | sales_part | ["2008-03-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-03 spectrum | sales_part | ["2008-04-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-04 spectrum | sales_part | ["2008-05-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-05 spectrum | sales_part | ["2008-06-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-06 spectrum | sales_part | ["2008-07-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-07 spectrum | sales_part | ["2008-08-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-08 spectrum | sales_part | ["2008-09-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-09 spectrum | sales_part | ["2008-10-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-10 spectrum | sales_part | ["2008-11-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-11 spectrum | sales_part | ["2008-12-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-12

以下示例将为外部表返回相关数据文件的总大小。

select distinct "$path", "$size" from spectrum.sales_part; $path | $size --------------------------------------------------------------------------+------- s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-01/ | 1616 s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-02/ | 1444 s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-02/ | 1444

分区示例

要创建按日期分区的外部表,请运行以下命令。

create external table spectrum.sales_part( salesid integer, listid integer, sellerid integer, buyerid integer, eventid integer, dateid smallint, qtysold smallint, pricepaid decimal(8,2), commission decimal(8,2), saletime timestamp) partitioned by (saledate date) row format delimited fields terminated by '|' stored as textfile location 's3://redshift-downloads/tickit/spectrum/sales_partition/' table properties ('numRows'='170000');

要添加分区,请运行以下 ALTER TABLE 命令。

alter table spectrum.sales_part add if not exists partition (saledate='2008-01-01') location 's3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-01/'; alter table spectrum.sales_part add if not exists partition (saledate='2008-02-01') location 's3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-02/'; alter table spectrum.sales_part add if not exists partition (saledate='2008-03-01') location 's3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-03/'; alter table spectrum.sales_part add if not exists partition (saledate='2008-04-01') location 's3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-04/'; alter table spectrum.sales_part add if not exists partition (saledate='2008-05-01') location 's3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-05/'; alter table spectrum.sales_part add if not exists partition (saledate='2008-06-01') location 's3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-06/'; alter table spectrum.sales_part add if not exists partition (saledate='2008-07-01') location 's3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-07/'; alter table spectrum.sales_part add if not exists partition (saledate='2008-08-01') location 's3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-08/'; alter table spectrum.sales_part add if not exists partition (saledate='2008-09-01') location 's3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-09/'; alter table spectrum.sales_part add if not exists partition (saledate='2008-10-01') location 's3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-10/'; alter table spectrum.sales_part add if not exists partition (saledate='2008-11-01') location 's3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-11/'; alter table spectrum.sales_part add if not exists partition (saledate='2008-12-01') location 's3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-12/';

要从分区表中选择数据,请运行以下查询。

select top 10 spectrum.sales_part.eventid, sum(spectrum.sales_part.pricepaid) from spectrum.sales_part, event where spectrum.sales_part.eventid = event.eventid and spectrum.sales_part.pricepaid > 30 and saledate = '2008-12-01' group by spectrum.sales_part.eventid order by 2 desc;
eventid | sum --------+--------- 914 | 36173.00 5478 | 27303.00 5061 | 26383.00 4406 | 26252.00 5324 | 24015.00 1829 | 23911.00 3601 | 23616.00 3665 | 23214.00 6069 | 22869.00 5638 | 22551.00

要查看外部表分区,请查询 SVV_EXTERNAL_PARTITIONS 系统视图。

select schemaname, tablename, values, location from svv_external_partitions where tablename = 'sales_part';
schemaname | tablename | values | location -----------+------------+----------------+-------------------------------------------------- spectrum | sales_part | ["2008-01-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-01 spectrum | sales_part | ["2008-02-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-02 spectrum | sales_part | ["2008-03-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-03 spectrum | sales_part | ["2008-04-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-04 spectrum | sales_part | ["2008-05-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-05 spectrum | sales_part | ["2008-06-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-06 spectrum | sales_part | ["2008-07-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-07 spectrum | sales_part | ["2008-08-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-08 spectrum | sales_part | ["2008-09-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-09 spectrum | sales_part | ["2008-10-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-10 spectrum | sales_part | ["2008-11-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-11 spectrum | sales_part | ["2008-12-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-12

行格式示例

下面显示为以 AVRO 格式存储的数据文件指定 ROW FORMAT SERDE 参数的示例。

create external table spectrum.sales(salesid int, listid int, sellerid int, buyerid int, eventid int, dateid int, qtysold int, pricepaid decimal(8,2), comment VARCHAR(255)) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' WITH SERDEPROPERTIES ('avro.schema.literal'='{\"namespace\": \"dory.sample\",\"name\": \"dory_avro\",\"type\": \"record\", \"fields\": [{\"name\":\"salesid\", \"type\":\"int\"}, {\"name\":\"listid\", \"type\":\"int\"}, {\"name\":\"sellerid\", \"type\":\"int\"}, {\"name\":\"buyerid\", \"type\":\"int\"}, {\"name\":\"eventid\",\"type\":\"int\"}, {\"name\":\"dateid\",\"type\":\"int\"}, {\"name\":\"qtysold\",\"type\":\"int\"}, {\"name\":\"pricepaid\", \"type\": {\"type\": \"bytes\", \"logicalType\": \"decimal\", \"precision\": 8, \"scale\": 2}}, {\"name\":\"comment\",\"type\":\"string\"}]}') STORED AS AVRO location 's3://mybucket/avro/sales' ;

下面显示了使用 RegEx 指定 ROW FORMAT SERDE 参数的示例。

create external table spectrum.types( cbigint bigint, cbigint_null bigint, cint int, cint_null int) row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe' with serdeproperties ('input.regex'='([^\\x01]+)\\x01([^\\x01]+)\\x01([^\\x01]+)\\x01([^\\x01]+)') stored as textfile location 's3://mybucket/regex/types';

下面显示了使用 Grok 指定 ROW FORMAT SERDE 参数的示例。

create external table spectrum.grok_log( timestamp varchar(255), pid varchar(255), loglevel varchar(255), progname varchar(255), message varchar(255)) row format serde 'com.amazonaws.glue.serde.GrokSerDe' with serdeproperties ('input.format'='[DFEWI], \\[%{TIMESTAMP_ISO8601:timestamp} #%{POSINT:pid:int}\\] *(?<loglevel>:DEBUG|FATAL|ERROR|WARN|INFO) -- +%{DATA:progname}: %{GREEDYDATA:message}') stored as textfile location 's3://mybucket/grok/logs';

下面显示了一个有关在 S3 桶中定义 Amazon S3 服务器访问日志的示例。您可以使用 Redshift Spectrum 查询 Amazon S3 访问日志。

CREATE EXTERNAL TABLE spectrum.mybucket_s3_logs( bucketowner varchar(255), bucket varchar(255), requestdatetime varchar(2000), remoteip varchar(255), requester varchar(255), requested varchar(255), operation varchar(255), key varchar(255), requesturi_operation varchar(255), requesturi_key varchar(255), requesturi_httpprotoversion varchar(255), httpstatus varchar(255), errorcode varchar(255), bytessent bigint, objectsize bigint, totaltime varchar(255), turnaroundtime varchar(255), referrer varchar(255), useragent varchar(255), versionid varchar(255) ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ( 'input.regex' = '([^ ]*) ([^ ]*) \\[(.*?)\\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) \"([^ ]*)\\s*([^ ]*)\\s*([^ ]*)\" (- |[^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\") ([^ ]*).*$') LOCATION 's3://mybucket/s3logs’;

以下示例为 ION 格式的数据指定了 ROW FORMAT SERDE 参数。

CREATE EXTERNAL TABLE tbl_name (columns) ROW FORMAT SERDE 'com.amazon.ionhiveserde.IonHiveSerDe' STORED AS INPUTFORMAT 'com.amazon.ionhiveserde.formats.IonInputFormat' OUTPUTFORMAT 'com.amazon.ionhiveserde.formats.IonOutputFormat' LOCATION 's3://s3-bucket/prefix'

数据处理示例

以下示例访问该文件:spi_global_rankings.csv。您可以将 spi_global_rankings.csv 文件上载到 Amazon S3 桶以尝试这些示例。

以下示例创建外部架构 schema_spectrum_uddh 和数据库 spectrum_db_uddh。对于 aws-account-id,请输入您的 AWS 账户 ID,而对于 role-name,请输入您的 Redshift Spectrum 角色名称。

create external schema schema_spectrum_uddh from data catalog database 'spectrum_db_uddh' iam_role 'arn:aws:iam::aws-account-id:role/role-name' create external database if not exists;

以下示例在外部架构 schema_spectrum_uddh 中创建外部表 soccer_league

CREATE EXTERNAL TABLE schema_spectrum_uddh.soccer_league ( league_rank smallint, prev_rank smallint, club_name varchar(15), league_name varchar(20), league_off decimal(6,2), league_def decimal(6,2), league_spi decimal(6,2), league_nspi integer ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n\l' stored as textfile LOCATION 's3://spectrum-uddh/league/' table properties ('skip.header.line.count'='1');

请检查 soccer_league 表中的行数。

select count(*) from schema_spectrum_uddh.soccer_league;

此时将显示行数。

count 645

以下查询显示前 10 个俱乐部。由于俱乐部 Barcelona 字符串中包含无效字符,因此对该名称显示 NULL。

select league_rank,club_name,league_name,league_nspi from schema_spectrum_uddh.soccer_league where league_rank between 1 and 10;
league_rank club_name league_name league_nspi 1 Manchester City Barclays Premier Lea 34595 2 Bayern Munich German Bundesliga 34151 3 Liverpool Barclays Premier Lea 33223 4 Chelsea Barclays Premier Lea 32808 5 Ajax Dutch Eredivisie 32790 6 Atletico Madrid Spanish Primera Divi 31517 7 Real Madrid Spanish Primera Divi 31469 8 NULL Spanish Primera Divi 31321 9 RB Leipzig German Bundesliga 31014 10 Paris Saint-Ger French Ligue 1 30929

以下示例更改了 soccer_league 表,以指定用于插入一个问号 (?) 来替换意外字符的 invalid_char_handlingreplacement_chardata_cleansing_enabled 外部表属性。

alter table schema_spectrum_uddh.soccer_league set table properties ('invalid_char_handling'='REPLACE','replacement_char'='?','data_cleansing_enabled'='true');

以下示例将查询排名从 1 到 10 的团队的表 soccer_league

select league_rank,club_name,league_name,league_nspi from schema_spectrum_uddh.soccer_league where league_rank between 1 and 10;

由于表属性已更改,结果显示了前 10 位俱乐部,在第八行中对于俱乐部 Barcelona 采用问号 (?) 替换字符。

league_rank club_name league_name league_nspi 1 Manchester City Barclays Premier Lea 34595 2 Bayern Munich German Bundesliga 34151 3 Liverpool Barclays Premier Lea 33223 4 Chelsea Barclays Premier Lea 32808 5 Ajax Dutch Eredivisie 32790 6 Atletico Madrid Spanish Primera Divi 31517 7 Real Madrid Spanish Primera Divi 31469 8 Barcel?na Spanish Primera Divi 31321 9 RB Leipzig German Bundesliga 31014 10 Paris Saint-Ger French Ligue 1 30929

以下示例更改了 soccer_league 表,以指定用于剔除包含意外字符的行的 invalid_char_handling 外部表属性。

alter table schema_spectrum_uddh.soccer_league set table properties ('invalid_char_handling'='DROP_ROW','data_cleansing_enabled'='true');

以下示例将查询排名从 1 到 10 的团队的表 soccer_league

select league_rank,club_name,league_name,league_nspi from schema_spectrum_uddh.soccer_league where league_rank between 1 and 10;

结果显示排名靠前的俱乐部,不包括对应于俱乐部 Barcelona 的第八行。

league_rank club_name league_name league_nspi 1 Manchester City Barclays Premier Lea 34595 2 Bayern Munich German Bundesliga 34151 3 Liverpool Barclays Premier Lea 33223 4 Chelsea Barclays Premier Lea 32808 5 Ajax Dutch Eredivisie 32790 6 Atletico Madrid Spanish Primera Divi 31517 7 Real Madrid Spanish Primera Divi 31469 9 RB Leipzig German Bundesliga 31014 10 Paris Saint-Ger French Ligue 1 30929