Exemplos - Amazon Redshift

Exemplos

O exemplo a seguir cria uma tabela chamada SALES no esquema externo do Amazon Redshift denominado spectrum. Os dados estão em arquivos de texto delimitados por tabulação. A cláusula TABLE PROPERTIES define a propriedade numRows como 170.000 linhas.

Dependendo da identidade usada para executar CREATE EXTERNAL TABLE, pode haver permissões do IAM que você precisa configurar. Como prática recomendada, anexe políticas de permissões a um perfil do IAM e, depois, atribua-as a usuários e grupos, conforme necessário. Para obter mais informações, consulte Gerenciamento de identidade e acesso no Amazon Redshift.

create external table spectrum.sales( salesid integer, listid integer, sellerid integer, buyerid integer, eventid integer, saledate date, qtysold smallint, pricepaid decimal(8,2), commission decimal(8,2), saletime timestamp) row format delimited fields terminated by '\t' stored as textfile location 's3://redshift-downloads/tickit/spectrum/sales/' table properties ('numRows'='170000');

O exemplo a seguir cria uma tabela que usa o JsonSerDe para fazer referência aos dados em formato JSON.

create external table spectrum.cloudtrail_json ( event_version int, event_id bigint, event_time timestamp, event_type varchar(10), awsregion varchar(20), event_name varchar(max), event_source varchar(max), requesttime timestamp, useragent varchar(max), recipientaccountid bigint) row format serde 'org.openx.data.jsonserde.JsonSerDe' with serdeproperties ( 'dots.in.keys' = 'true', 'mapping.requesttime' = 'requesttimestamp' ) location 's3://mybucket/json/cloudtrail';

O exemplo de CREATE EXTERNAL TABLE AS a seguir cria uma tabela externa não particionada. Depois, ele grava o resultado da consulta SELECT como Apache Parquet no local de destino do Amazon S3.

CREATE EXTERNAL TABLE spectrum.lineitem STORED AS parquet LOCATION 'S3://mybucket/cetas/lineitem/' AS SELECT * FROM local_lineitem;

O exemplo a seguir cria uma tabela externa particionada e inclui as colunas de partição na consulta SELECT.

CREATE EXTERNAL TABLE spectrum.partitioned_lineitem PARTITIONED BY (l_shipdate, l_shipmode) STORED AS parquet LOCATION 'S3://mybucket/cetas/partitioned_lineitem/' AS SELECT l_orderkey, l_shipmode, l_shipdate, l_partkey FROM local_table;

Para obter uma lista de bancos de dados existentes no catálogo de dados externo, consulte a exibição de sistema SVV_EXTERNAL_DATABASES.

select eskind,databasename,esoptions from svv_external_databases order by databasename;
eskind | databasename | esoptions -------+--------------+---------------------------------------------------------------------------------- 1 | default | {"REGION":"us-west-2","IAM_ROLE":"arn:aws:iam::123456789012:role/mySpectrumRole"} 1 | sampledb | {"REGION":"us-west-2","IAM_ROLE":"arn:aws:iam::123456789012:role/mySpectrumRole"} 1 | spectrumdb | {"REGION":"us-west-2","IAM_ROLE":"arn:aws:iam::123456789012:role/mySpectrumRole"}

Para visualizar detalhes das tabelas externas, consulte as exibições SVV_EXTERNAL_TABLES e SVV_EXTERNAL_COLUMNS do sistema.

O exemplo a seguir consulta a exibição SVV_EXTERNAL_TABLES.

select schemaname, tablename, location from svv_external_tables;
schemaname | tablename | location -----------+----------------------+-------------------------------------------------------- spectrum | sales | s3://redshift-downloads/tickit/spectrum/sales spectrum | sales_part | s3://redshift-downloads/tickit/spectrum/sales_partition

O exemplo a seguir consulta a exibição SVV_EXTERNAL_COLUMNS.

select * from svv_external_columns where schemaname like 'spectrum%' and tablename ='sales';
schemaname | tablename | columnname | external_type | columnnum | part_key -----------+-----------+------------+---------------+-----------+--------- spectrum | sales | salesid | int | 1 | 0 spectrum | sales | listid | int | 2 | 0 spectrum | sales | sellerid | int | 3 | 0 spectrum | sales | buyerid | int | 4 | 0 spectrum | sales | eventid | int | 5 | 0 spectrum | sales | saledate | date | 6 | 0 spectrum | sales | qtysold | smallint | 7 | 0 spectrum | sales | pricepaid | decimal(8,2) | 8 | 0 spectrum | sales | commission | decimal(8,2) | 9 | 0 spectrum | sales | saletime | timestamp | 10 | 0

Para exibir as partições de tabela, use a consulta a seguir.

select schemaname, tablename, values, location from svv_external_partitions where tablename = 'sales_part';
schemaname | tablename | values | location -----------+------------+----------------+------------------------------------------------------------------------- spectrum | sales_part | ["2008-01-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-01 spectrum | sales_part | ["2008-02-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-02 spectrum | sales_part | ["2008-03-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-03 spectrum | sales_part | ["2008-04-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-04 spectrum | sales_part | ["2008-05-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-05 spectrum | sales_part | ["2008-06-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-06 spectrum | sales_part | ["2008-07-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-07 spectrum | sales_part | ["2008-08-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-08 spectrum | sales_part | ["2008-09-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-09 spectrum | sales_part | ["2008-10-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-10 spectrum | sales_part | ["2008-11-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-11 spectrum | sales_part | ["2008-12-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-12

O exemplo a seguir retorna o tamanho total de arquivos de dados relacionados de uma tabela externa.

select distinct "$path", "$size" from spectrum.sales_part; $path | $size --------------------------------------------------------------------------+------- s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-01/ | 1616 s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-02/ | 1444 s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-02/ | 1444

Exemplos de particionamento

Para criar uma tabela externa particionada por data, execute o seguinte comando.

create external table spectrum.sales_part( salesid integer, listid integer, sellerid integer, buyerid integer, eventid integer, dateid smallint, qtysold smallint, pricepaid decimal(8,2), commission decimal(8,2), saletime timestamp) partitioned by (saledate date) row format delimited fields terminated by '|' stored as textfile location 's3://redshift-downloads/tickit/spectrum/sales_partition/' table properties ('numRows'='170000');

Para adicionar as partições, execute os seguintes comandos ALTER TABLE.

alter table spectrum.sales_part add if not exists partition (saledate='2008-01-01') location 's3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-01/'; alter table spectrum.sales_part add if not exists partition (saledate='2008-02-01') location 's3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-02/'; alter table spectrum.sales_part add if not exists partition (saledate='2008-03-01') location 's3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-03/'; alter table spectrum.sales_part add if not exists partition (saledate='2008-04-01') location 's3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-04/'; alter table spectrum.sales_part add if not exists partition (saledate='2008-05-01') location 's3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-05/'; alter table spectrum.sales_part add if not exists partition (saledate='2008-06-01') location 's3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-06/'; alter table spectrum.sales_part add if not exists partition (saledate='2008-07-01') location 's3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-07/'; alter table spectrum.sales_part add if not exists partition (saledate='2008-08-01') location 's3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-08/'; alter table spectrum.sales_part add if not exists partition (saledate='2008-09-01') location 's3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-09/'; alter table spectrum.sales_part add if not exists partition (saledate='2008-10-01') location 's3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-10/'; alter table spectrum.sales_part add if not exists partition (saledate='2008-11-01') location 's3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-11/'; alter table spectrum.sales_part add if not exists partition (saledate='2008-12-01') location 's3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-12/';

Para selecionar dados na tabela particionada, execute a consulta a seguir.

select top 10 spectrum.sales_part.eventid, sum(spectrum.sales_part.pricepaid) from spectrum.sales_part, event where spectrum.sales_part.eventid = event.eventid and spectrum.sales_part.pricepaid > 30 and saledate = '2008-12-01' group by spectrum.sales_part.eventid order by 2 desc;
eventid | sum --------+--------- 914 | 36173.00 5478 | 27303.00 5061 | 26383.00 4406 | 26252.00 5324 | 24015.00 1829 | 23911.00 3601 | 23616.00 3665 | 23214.00 6069 | 22869.00 5638 | 22551.00

Para visualizar as partições da tabela externa, consulte a exibição do sistema SVV_EXTERNAL_PARTITIONS.

select schemaname, tablename, values, location from svv_external_partitions where tablename = 'sales_part';
schemaname | tablename | values | location -----------+------------+----------------+-------------------------------------------------- spectrum | sales_part | ["2008-01-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-01 spectrum | sales_part | ["2008-02-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-02 spectrum | sales_part | ["2008-03-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-03 spectrum | sales_part | ["2008-04-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-04 spectrum | sales_part | ["2008-05-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-05 spectrum | sales_part | ["2008-06-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-06 spectrum | sales_part | ["2008-07-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-07 spectrum | sales_part | ["2008-08-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-08 spectrum | sales_part | ["2008-09-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-09 spectrum | sales_part | ["2008-10-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-10 spectrum | sales_part | ["2008-11-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-11 spectrum | sales_part | ["2008-12-01"] | s3://redshift-downloads/tickit/spectrum/sales_partition/saledate=2008-12

Exemplos de formato de linha

Este é um exemplo da especificação dos parâmetros ROW FORMAT SERDE para arquivos de dados armazenados no formato AVRO.

create external table spectrum.sales(salesid int, listid int, sellerid int, buyerid int, eventid int, dateid int, qtysold int, pricepaid decimal(8,2), comment VARCHAR(255)) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' WITH SERDEPROPERTIES ('avro.schema.literal'='{\"namespace\": \"dory.sample\",\"name\": \"dory_avro\",\"type\": \"record\", \"fields\": [{\"name\":\"salesid\", \"type\":\"int\"}, {\"name\":\"listid\", \"type\":\"int\"}, {\"name\":\"sellerid\", \"type\":\"int\"}, {\"name\":\"buyerid\", \"type\":\"int\"}, {\"name\":\"eventid\",\"type\":\"int\"}, {\"name\":\"dateid\",\"type\":\"int\"}, {\"name\":\"qtysold\",\"type\":\"int\"}, {\"name\":\"pricepaid\", \"type\": {\"type\": \"bytes\", \"logicalType\": \"decimal\", \"precision\": 8, \"scale\": 2}}, {\"name\":\"comment\",\"type\":\"string\"}]}') STORED AS AVRO location 's3://mybucket/avro/sales' ;

O exemplo a seguir mostra como especificar os parâmetros ROW FORMAT SERDE usando RegEx.

create external table spectrum.types( cbigint bigint, cbigint_null bigint, cint int, cint_null int) row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe' with serdeproperties ('input.regex'='([^\\x01]+)\\x01([^\\x01]+)\\x01([^\\x01]+)\\x01([^\\x01]+)') stored as textfile location 's3://mybucket/regex/types';

O exemplo a seguir mostra como especificar os parâmetros ROW FORMAT SERDE usando Grok.

create external table spectrum.grok_log( timestamp varchar(255), pid varchar(255), loglevel varchar(255), progname varchar(255), message varchar(255)) row format serde 'com.amazonaws.glue.serde.GrokSerDe' with serdeproperties ('input.format'='[DFEWI], \\[%{TIMESTAMP_ISO8601:timestamp} #%{POSINT:pid:int}\\] *(?<loglevel>:DEBUG|FATAL|ERROR|WARN|INFO) -- +%{DATA:progname}: %{GREEDYDATA:message}') stored as textfile location 's3://mybucket/grok/logs';

A tabela a seguir mostra um exemplo de definição de um log de acesso ao servidor do Amazon S3 em um bucket do S3. Use o Redshift Spectrum para consultar logs de acesso do Amazon S3.

CREATE EXTERNAL TABLE spectrum.mybucket_s3_logs( bucketowner varchar(255), bucket varchar(255), requestdatetime varchar(2000), remoteip varchar(255), requester varchar(255), requested varchar(255), operation varchar(255), key varchar(255), requesturi_operation varchar(255), requesturi_key varchar(255), requesturi_httpprotoversion varchar(255), httpstatus varchar(255), errorcode varchar(255), bytessent bigint, objectsize bigint, totaltime varchar(255), turnaroundtime varchar(255), referrer varchar(255), useragent varchar(255), versionid varchar(255) ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ( 'input.regex' = '([^ ]*) ([^ ]*) \\[(.*?)\\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) \"([^ ]*)\\s*([^ ]*)\\s*([^ ]*)\" (- |[^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\") ([^ ]*).*$') LOCATION 's3://mybucket/s3logs’;

Este é um exemplo da especificação dos parâmetros ROW FORMAT SERDE para dados no formato ION.

CREATE EXTERNAL TABLE tbl_name (columns) ROW FORMAT SERDE 'com.amazon.ionhiveserde.IonHiveSerDe' STORED AS INPUTFORMAT 'com.amazon.ionhiveserde.formats.IonInputFormat' OUTPUTFORMAT 'com.amazon.ionhiveserde.formats.IonOutputFormat' LOCATION 's3://s3-bucket/prefix'

Exemplos de tratamento de dados

Os exemplos a seguir acessam o arquivo: spi_global_rankings.csv. Você pode carregar o arquivo spi_global_rankings.csv em um bucket do Amazon S3 para experimentar esses exemplos.

O exemplo a seguir cria o esquema externo schema_spectrum_uddh e o banco de dados externo spectrum_db_uddh. Para aws-account-id, insira o ID da conta da AWS e, para role-name, insira seu nome de função do Redshift Spectrum.

create external schema schema_spectrum_uddh from data catalog database 'spectrum_db_uddh' iam_role 'arn:aws:iam::aws-account-id:role/role-name' create external database if not exists;

O exemplo a seguir cria uma tabela externa soccer_league no esquema externo schema_spectrum_uddh.

CREATE EXTERNAL TABLE schema_spectrum_uddh.soccer_league ( league_rank smallint, prev_rank smallint, club_name varchar(15), league_name varchar(20), league_off decimal(6,2), league_def decimal(6,2), league_spi decimal(6,2), league_nspi integer ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n\l' stored as textfile LOCATION 's3://spectrum-uddh/league/' table properties ('skip.header.line.count'='1');

Confira o número de linhas da tabela soccer_league.

select count(*) from schema_spectrum_uddh.soccer_league;

Os números de linhas são exibidos.

count 645

A consulta a seguir exibe os 10 principais clubes. Como clube Barcelona tem um caractere inválido na string, um NULL é exibido para o nome.

select league_rank,club_name,league_name,league_nspi from schema_spectrum_uddh.soccer_league where league_rank between 1 and 10;
league_rank club_name league_name league_nspi 1 Manchester City Barclays Premier Lea 34595 2 Bayern Munich German Bundesliga 34151 3 Liverpool Barclays Premier Lea 33223 4 Chelsea Barclays Premier Lea 32808 5 Ajax Dutch Eredivisie 32790 6 Atletico Madrid Spanish Primera Divi 31517 7 Real Madrid Spanish Primera Divi 31469 8 NULL Spanish Primera Divi 31321 9 RB Leipzig German Bundesliga 31014 10 Paris Saint-Ger French Ligue 1 30929

O exemplo a seguir altera a tabela soccer_league para especificar as propriedades invalid_char_handling, replacement_char e data_cleansing_enabled da tabela externa e inserir um ponto de interrogação (?) como substituto de caracteres inesperados.

alter table schema_spectrum_uddh.soccer_league set table properties ('invalid_char_handling'='REPLACE','replacement_char'='?','data_cleansing_enabled'='true');

O exemplo a seguir consulta a tabela soccer_league para times com classificação de 1 a 10.

select league_rank,club_name,league_name,league_nspi from schema_spectrum_uddh.soccer_league where league_rank between 1 and 10;

Como as propriedades da tabela foram alteradas, os resultados mostram os dez principais clubes, com o ponto de interrogação (?) como caractere substituto na oitava linha para o clube Barcelona.

league_rank club_name league_name league_nspi 1 Manchester City Barclays Premier Lea 34595 2 Bayern Munich German Bundesliga 34151 3 Liverpool Barclays Premier Lea 33223 4 Chelsea Barclays Premier Lea 32808 5 Ajax Dutch Eredivisie 32790 6 Atletico Madrid Spanish Primera Divi 31517 7 Real Madrid Spanish Primera Divi 31469 8 Barcel?na Spanish Primera Divi 31321 9 RB Leipzig German Bundesliga 31014 10 Paris Saint-Ger French Ligue 1 30929

O exemplo a seguir altera a tabela soccer_league para especificar que as propriedades invalid_char_handling da tabela externa descartem as linhas com caracteres inesperados.

alter table schema_spectrum_uddh.soccer_league set table properties ('invalid_char_handling'='DROP_ROW','data_cleansing_enabled'='true');

O exemplo a seguir consulta a tabela soccer_league para times com classificação de 1 a 10.

select league_rank,club_name,league_name,league_nspi from schema_spectrum_uddh.soccer_league where league_rank between 1 and 10;

Os resultados exibem os principais clubes, sem incluir a oitava linha para o clube Barcelona.

league_rank club_name league_name league_nspi 1 Manchester City Barclays Premier Lea 34595 2 Bayern Munich German Bundesliga 34151 3 Liverpool Barclays Premier Lea 33223 4 Chelsea Barclays Premier Lea 32808 5 Ajax Dutch Eredivisie 32790 6 Atletico Madrid Spanish Primera Divi 31517 7 Real Madrid Spanish Primera Divi 31469 9 RB Leipzig German Bundesliga 31014 10 Paris Saint-Ger French Ligue 1 30929