UNLOAD - Amazon Redshift

UNLOAD

Unloads the result of a query to one or more text or Apache Parquet files on Amazon S3, using Amazon S3 server-side encryption (SSE-S3). You can also specify server-side encryption with an AWS Key Management Service key (SSE-KMS) or client-side encryption with a customer-managed key (CSE-CMK).

You can manage the size of files on Amazon S3, and by extension the number of files, by setting the MAXFILESIZE parameter.

You can unload the result of an Amazon Redshift query to your Amazon S3 data lake in Apache Parquet, an efficient open columnar storage format for analytics. Parquet format is up to 2x faster to unload and consumes up to 6x less storage in Amazon S3, compared with text formats. This enables you to save data transformation and enrichment you have done in Amazon S3 into your Amazon S3 data lake in an open format. You can then analyze your data with Redshift Spectrum and other AWS services such as Amazon Athena, Amazon EMR, and SageMaker.

Syntax

UNLOAD ('select-statement') TO 's3://object-path/name-prefix' authorization [ option [ ... ] ] where option is { [ FORMAT [ AS ] ] CSV | PARQUET | PARTITION BY ( column_name [, ... ] ) [ INCLUDE ] | MANIFEST [ VERBOSE ] | HEADER | DELIMITER [ AS ] 'delimiter-char' | FIXEDWIDTH [ AS ] 'fixedwidth-spec' | ENCRYPTED [ AUTO ] | BZIP2 | GZIP | ZSTD | ADDQUOTES | NULL [ AS ] 'null-string' | ESCAPE | ALLOWOVERWRITE | PARALLEL [ { ON | TRUE } | { OFF | FALSE } ] | MAXFILESIZE [AS] max-size [ MB | GB ] | REGION [AS] 'aws-region' }

Parameters

('select-statement')

A SELECT query. The results of the query are unloaded. In most cases, it is worthwhile to unload data in sorted order by specifying an ORDER BY clause in the query. This approach saves the time required to sort the data when it is reloaded.

The query must be enclosed in single quotation marks as shown following:

('select * from venue order by venueid')
Note

If your query contains quotation marks (for example to enclose literal values), put the literal between two sets of single quotation marks—you must also enclose the query between single quotation marks:

('select * from venue where venuestate=''NV''')
TO 's3://object-path/name-prefix'

The full path, including bucket name, to the location on Amazon S3 where Amazon Redshift writes the output file objects, including the manifest file if MANIFEST is specified. The object names are prefixed with name-prefix. If you use PARTITION BY, a forward slash (/) is automatically added to the end of the name-prefix value if needed. For added security, UNLOAD connects to Amazon S3 using an HTTPS connection. By default, UNLOAD writes one or more files per slice. UNLOAD appends a slice number and part number to the specified name prefix as follows:

<object-path>/<name-prefix><slice-number>_part_<part-number>.

If MANIFEST is specified, the manifest file is written as follows:

<object_path>/<name_prefix>manifest.

UNLOAD automatically creates encrypted files using Amazon S3 server-side encryption (SSE), including the manifest file if MANIFEST is used. The COPY command automatically reads server-side encrypted files during the load operation. You can transparently download server-side encrypted files from your bucket using either the Amazon S3 Management Console or API. For more information, go to Protecting Data Using Server-Side Encryption.

To use Amazon S3 client-side encryption, specify the ENCRYPTED option.

Important

REGION is required when the Amazon S3 bucket isn't in the same AWS Region as the Amazon Redshift cluster.

Authorization

The UNLOAD command needs authorization to write data to Amazon S3. The UNLOAD command uses the same parameters the COPY command uses for authorization. For more information, see Authorization parameters in the COPY command syntax reference.

[ FORMAT [AS] ] CSV | PARQUET

When CSV, unloads to a text file in CSV format using a comma ( , ) character as the default delimiter. If a field contains delimiters, double quotation marks, newline characters, or carriage returns, then the field in the unloaded file is enclosed in double quotation marks. A double quotation mark within a data field is escaped by an additional double quotation mark.

When PARQUET, unloads to a file in Apache Parquet version 1.0 format. By default, each row group is compressed using SNAPPY compression. For more information about Apache Parquet format, see Parquet.

The FORMAT and AS keywords are optional. You can't use CSV with FIXEDWIDTH. You can't use PARQUET with DELIMITER, FIXEDWIDTH, ADDQUOTES, ESCAPE, NULL AS, HEADER, GZIP, BZIP2, or ZSTD. PARQUET with ENCRYPTED is only supported with server-side encryption with an AWS Key Management Service key (SSE-KMS).

PARTITION BY ( column_name [, ... ] ) [INCLUDE]

Specifies the partition keys for the unload operation. UNLOAD automatically partitions output files into partition folders based on the partition key values, following the Apache Hive convention. For example, a Parquet file that belongs to the partition year 2019 and the month September has the following prefix: s3://my_bucket_name/my_prefix/year=2019/month=September/000.parquet.

The value for column_name must be a column in the query results being unloaded.

If you specify PARTITION BY with the INCLUDE option, partition columns aren't removed from the unloaded files.

Amazon Redshift doesn't support string literals in PARTITION BY clauses.

MANIFEST [ VERBOSE ]

Creates a manifest file that explicitly lists details for the data files that are created by the UNLOAD process. The manifest is a text file in JSON format that lists the URL of each file that was written to Amazon S3.

If MANIFEST is specified with the VERBOSE option, the manifest includes the following details:

  • The column names and data types, and for CHAR, VARCHAR, or NUMERIC data types, dimensions for each column. For CHAR and VARCHAR data types, the dimension is the length. For a DECIMAL or NUMERIC data type, the dimensions are precision and scale.

  • The row count unloaded to each file. If the HEADER option is specified, the row count includes the header line.

  • The total file size of all files unloaded and the total row count unloaded to all files. If the HEADER option is specified, the row count includes the header lines.

  • The author. Author is always "Amazon Redshift".

You can specify VERBOSE only following MANIFEST.

The manifest file is written to the same Amazon S3 path prefix as the unload files in the format <object_path_prefix>manifest. For example, if UNLOAD specifies the Amazon S3 path prefix 's3://mybucket/venue_', the manifest file location is 's3://mybucket/venue_manifest'.

HEADER

Adds a header line containing column names at the top of each output file. Text transformation options, such as CSV, DELIMITER, ADDQUOTES, and ESCAPE, also apply to the header line. You can't use HEADER with FIXEDWIDTH.

DELIMITER AS 'delimiter_character'

Specifies a single ASCII character that is used to separate fields in the output file, such as a pipe character ( | ), a comma ( , ), or a tab ( \t ). The default delimiter for text files is a pipe character. The default delimiter for CSV files is a comma character. The AS keyword is optional. You can't use DELIMITER with FIXEDWIDTH. If the data contains the delimiter character, you need to specify the ESCAPE option to escape the delimiter, or use ADDQUOTES to enclose the data in double quotation marks. Alternatively, specify a delimiter that isn't contained in the data.

FIXEDWIDTH 'fixedwidth_spec'

Unloads the data to a file where each column width is a fixed length, rather than separated by a delimiter. The fixedwidth_spec is a string that specifies the number of columns and the width of the columns. The AS keyword is optional. Because FIXEDWIDTH doesn't truncate data, the specification for each column in the UNLOAD statement needs to be at least as long as the length of the longest entry for that column. The format for fixedwidth_spec is shown below:

'colID1:colWidth1,colID2:colWidth2, ...'

You can't use FIXEDWIDTH with DELIMITER or HEADER.

ENCRYPTED [AUTO]

Specifies that the output files on Amazon S3 are encrypted using Amazon S3 server-side encryption or client-side encryption. If MANIFEST is specified, the manifest file is also encrypted. For more information, see Unloading encrypted data files. If you don't specify the ENCRYPTED parameter, UNLOAD automatically creates encrypted files using Amazon S3 server-side encryption with AWS-managed encryption keys (SSE-S3).

For ENCRYPTED, you might want to unload to Amazon S3 using server-side encryption with an AWS KMS key (SSE-KMS). If so, use the KMS_KEY_ID parameter to provide the key ID. You can't use the CREDENTIALS parameter with the KMS_KEY_ID parameter. If you run an UNLOAD command for data using KMS_KEY_ID, you can then do a COPY operation for the same data without specifying a key.

To unload to Amazon S3 using client-side encryption with a customer-supplied symmetric key (CSE-CMK), provide the key in one of two ways. To provide the key, use the MASTER_SYMMETRIC_KEY parameter or the master_symmetric_key portion of a CREDENTIALS credential string. If you unload data using a master symmetric key, make sure that you supply the same key when you perform a COPY operation for the encrypted data.

UNLOAD doesn't support Amazon S3 server-side encryption with a customer-supplied key (SSE-C).

If ENCRYPTED AUTO is used, the UNLOAD command fetches the default KMS encryption key on the target Amazon S3 bucket property and encrypts the files written to Amazon S3 with the KMS key. If the bucket doesn't have the default KMS encryption key, UNLOAD automatically creates encrypted files using Amazon Redshift server-side encryption with AWS-managed encryption keys (SSE-S3). You can't use this option with KMS_KEY_ID, MASTER_SYMMETRIC_KEY, or CREDENTIALS that contains master_symmetric_key.

KMS_KEY_ID 'key-id'

Specifies the key ID for an AWS Key Management Service (AWS KMS) key to be used to encrypt data files on Amazon S3. For more information, see What is AWS Key Management Service? If you specify KMS_KEY_ID, you must specify the ENCRYPTED parameter also. If you specify KMS_KEY_ID, you can't authenticate using the CREDENTIALS parameter. Instead, use either IAM_ROLE or ACCESS_KEY_ID and SECRET_ACCESS_KEY.

MASTER_SYMMETRIC_KEY 'master_key'

Specifies the master symmetric key to be used to encrypt data files on Amazon S3. If you specify MASTER_SYMMETRIC_KEY, you must specify the ENCRYPTED parameter also. You can't use MASTER_SYMMETRIC_KEY with the CREDENTIALS parameter. For more information, see Loading encrypted data files from Amazon S3.

BZIP2

Unloads data to one or more bzip2-compressed files per slice. Each resulting file is appended with a .bz2 extension.

GZIP

Unloads data to one or more gzip-compressed files per slice. Each resulting file is appended with a .gz extension.

ZSTD

Unloads data to one or more Zstandard-compressed files per slice. Each resulting file is appended with a .zst extension.

ADDQUOTES

Places quotation marks around each unloaded data field, so that Amazon Redshift can unload data values that contain the delimiter itself. For example, if the delimiter is a comma, you could unload and reload the following data successfully:

"1","Hello, World"

Without the added quotation marks, the string Hello, World would be parsed as two separate fields.

If you use ADDQUOTES, you must specify REMOVEQUOTES in the COPY if you reload the data.

NULL AS 'null-string'

Specifies a string that represents a null value in unload files. If this option is used, all output files contain the specified string in place of any null values found in the selected data. If this option isn't specified, null values are unloaded as:

  • Zero-length strings for delimited output

  • Whitespace strings for fixed-width output

If a null string is specified for a fixed-width unload and the width of an output column is less than the width of the null string, the following behavior occurs:

  • An empty field is output for non-character columns

  • An error is reported for character columns

ESCAPE

For CHAR and VARCHAR columns in delimited unload files, an escape character (\) is placed before every occurrence of the following characters:

  • Linefeed: \n

  • Carriage return: \r

  • The delimiter character specified for the unloaded data.

  • The escape character: \

  • A quotation mark character: " or ' (if both ESCAPE and ADDQUOTES are specified in the UNLOAD command).

Important

If you loaded your data using a COPY with the ESCAPE option, you must also specify the ESCAPE option with your UNLOAD command to generate the reciprocal output file. Similarly, if you UNLOAD using the ESCAPE option, you need to use ESCAPE when you COPY the same data.

ALLOWOVERWRITE

By default, UNLOAD fails if it finds files that it would possibly overwrite. If ALLOWOVERWRITE is specified, UNLOAD overwrites existing files, including the manifest file.

PARALLEL

By default, UNLOAD writes data in parallel to multiple files, according to the number of slices in the cluster. The default option is ON or TRUE. If PARALLEL is OFF or FALSE, UNLOAD writes to one or more data files serially, sorted absolutely according to the ORDER BY clause, if one is used. The maximum size for a data file is 6.2 GB. So, for example, if you unload 13.4 GB of data, UNLOAD creates the following three files.

s3://mybucket/key000 6.2 GB s3://mybucket/key001 6.2 GB s3://mybucket/key002 1.0 GB
Note

The UNLOAD command is designed to use parallel processing. We recommend leaving PARALLEL enabled for most cases, especially if the files are used to load tables using a COPY command.

MAXFILESIZE [AS] max-size [ MB | GB ]

Specifies the maximum size of files that UNLOAD creates in Amazon S3. Specify a decimal value between 5 MB and 6.2 GB. The AS keyword is optional. The default unit is MB. If MAXFILESIZE isn't specified, the default maximum file size is 6.2 GB. The size of the manifest file, if one is used, isn't affected by MAXFILESIZE.

REGION [AS] 'aws-region'

Specifies the AWS Region where the target Amazon S3 bucket is located. REGION is required for UNLOAD to an Amazon S3 bucket that isn't in the same AWS Region as the Amazon Redshift cluster.

The value for aws_region must match an AWS Region listed in the Amazon Redshift regions and endpoints table in the AWS General Reference.

By default, UNLOAD assumes that the target Amazon S3 bucket is located in the same AWS Region as the Amazon Redshift cluster.

Usage notes

Using ESCAPE for all delimited text UNLOAD operations

When you UNLOAD using a delimiter, your data can include that delimiter or any of the characters listed in the ESCAPE option description. In this case, you must use the ESCAPE option with the UNLOAD statement. If you don't use the ESCAPE option with the UNLOAD, subsequent COPY operations using the unloaded data might fail.

Important

We strongly recommend that you always use ESCAPE with both UNLOAD and COPY statements. The exception is if you are certain that your data doesn't contain any delimiters or other characters that might need to be escaped.

Loss of floating-point precision

You might encounter loss of precision for floating-point data that is successively unloaded and reloaded.

Limit clause

The SELECT query can't use a LIMIT clause in the outer SELECT. For example, the following UNLOAD statement fails.

unload ('select * from venue limit 10') to 's3://mybucket/venue_pipe_' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';

Instead, use a nested LIMIT clause, as in the following example.

unload ('select * from venue where venueid in (select venueid from venue order by venueid desc limit 10)') to 's3://mybucket/venue_pipe_' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';

You can also populate a table using SELECT…INTO or CREATE TABLE AS using a LIMIT clause, then unload from that table.

Unloading a column of the GEOMETRY data type

You can only unload GEOMETRY columns to text or CSV format. You can't unload GEOMETRY data with the FIXEDWIDTH option. The data is unloaded in the hexadecimal form of the extended well-known binary (EWKB) format. If the size of the EWKB data is more than 4 MB, then a warning occurs because the data can't later be loaded into a table.

Unloading the HLLSKETCH data type

You can only unload HLLSKETCH columns to text or CSV format. You can't unload HLLSKETCH data with the FIXEDWIDTH option. The data is unloaded in the Base64 format for dense HyperLogLog sketches or in the JSON format for sparse HyperLogLog sketches. For more information, see HyperLogLog functions.

The following example exports a table containing HLLSKETCH columns into a file.

CREATE TABLE a_table(an_int INT, b_int INT); INSERT INTO a_table VALUES (1,1), (2,1), (3,1), (4,1), (1,2), (2,2), (3,2), (4,2), (5,2), (6,2); CREATE TABLE hll_table (sketch HLLSKETCH); INSERT INTO hll_table select hll_create_sketch(an_int) from a_table group by b_int; UNLOAD ('select * from hll_table') TO 's3://mybucket/unload/' IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole' NULL AS 'null' ALLOWOVERWRITE CSV;

FORMAT AS PARQUET clause

Be aware of these considerations when using FORMAT AS PARQUET:

  • Unload to Parquet doesn't use file level compression. Each row group is compressed with SNAPPY.

  • If MAXFILESIZE isn't specified, the default maximum file size is 6.2 GB. You can use MAXFILESIZE to specify a file size of 5 MB–6.2 GB. The actual file size is approximated when the file is being written, so it might not be exactly equal to the number you specify.

    To maximize scan performance, Amazon Redshift tries to create Parquet files that contain equally sized 32-MB row groups. The MAXFILESIZE value that you specify is automatically rounded down to the nearest multiple of 32 MB. For example, if you specify MAXFILESIZE 200 MB, then each Parquet file unloaded is approximately 192 MB (32 MB row group x 6 = 192 MB).

  • If a column uses TIMESTAMPTZ data format, only the timestamp values are unloaded. The time zone information isn't unloaded.

  • Don't specify file name prefixes that begin with underscore (_) or period (.) characters. Redshift Spectrum treats files that begin with these characters as hidden files and ignores them.

PARTITION BY clause

Be aware of these considerations when using PARTITION BY:

  • Partition columns aren't included in the output file.

  • Make sure to include partition columns in the SELECT query used in the UNLOAD statement. You can specify any number of partition columns in the UNLOAD command. However, there is a limitation that there should be at least one nonpartition column to be part of the file.

  • If the partition key value is null, Amazon Redshift automatically unloads that data into a default partition called partition_column=__HIVE_DEFAULT_PARTITION__.

  • The UNLOAD command doesn't make any calls to an external catalog. To register your new partitions to be part of your existing external table, use a separate ALTER TABLE ... ADD PARTITION ... command. Or you can run a CREATE EXTERNAL TABLE command to register the unloaded data as a new external table. You can also use an AWS Glue crawler to populate your Data Catalog. For more information, see Defining Crawlers in the AWS Glue Developer Guide.

  • If you use the MANIFEST option, Amazon Redshift generates only one manifest file in the root Amazon S3 folder.

  • The column data types that you can use as the partition key are SMALLINT, INTEGER, BIGINT, DECIMAL, REAL, BOOLEAN, CHAR, VARCHAR, DATE, and TIMESTAMP.