Unloading semistructured data
With Amazon Redshift, you can export semistructured data from your Amazon Redshift cluster to Amazon S3 in a variety of formats, including text, Apache Parquet, Apache ORC, and Avro. The following sections will guide you through the process of configuring and executing unload operations for your semistructured data in Amazon Redshift.
- CSV or text formats
-
You can unload tables with SUPER data columns to Amazon S3 in a comma-separated value (CSV) or text format. Using a combination of navigation and unnest clauses, Amazon Redshift unloads hierarchical data in SUPER data format to Amazon S3 in CSV or text formats. Subsequently, you can create external tables against unloaded data and query them using Redshift Spectrum. For information on using UNLOAD and the required IAM permissions, see UNLOAD.
Before running the following example, populate the region_nations table using the processes in Loading semistructured data into Amazon Redshift. For information on the tables used in the following example, see SUPER sample dataset.
The following example unloads data into Amazon S3.
UNLOAD ('SELECT * FROM region_nations') TO 's3://xxxxxx/' IAM_ROLE 'arn:aws:iam::xxxxxxxxxxxx:role/Redshift-S3-Write' DELIMITER AS '|' GZIP ALLOWOVERWRITE;
Unlike other data types where a user-defined string represents a null value, Amazon Redshift exports the SUPER data columns using the JSON format and represents it as null as determined by the JSON format. As a result, SUPER data columns ignore the NULL [AS] option used in UNLOAD commands.
- Parquet format
-
You can unload tables with SUPER data columns to Amazon S3 in the Parquet format. Amazon Redshift represents SUPER columns in Parquet as the JSON data type. This enables semistructured data to be represented in Parquet. You can query these columns using Redshift Spectrum or ingest them back to Amazon Redshift using the COPY command. For information on using UNLOAD and the required IAM permissions, see UNLOAD.
The following example unloads data into Amazon S3 in the Parquet format.
UNLOAD ('SELECT * FROM region_nations') TO 's3://xxxxxx/' IAM_ROLE 'arn:aws:iam::xxxxxxxxxxxx:role/Redshift-S3-Write' FORMAT PARQUET;