Amazon Redshift
Database Developer Guide (API Version 2012-12-01)

COPY

Loads data into a table from flat files located in an Amazon S3 bucket, Amazon EMR cluster, Amazon DynamoDB table, or remote host using an SSH connection. The COPY command appends the new input data to any existing rows in the table.

The maximum size of a single input row from any source is 4 MB.

Note

To use the COPY command, you must have INSERT privilege for the Amazon Redshift table.

COPY from Amazon S3, Amazon EMR, Remote Hosts (SSH)

The following gives the syntax for COPY from Amazon S3, COPY from Amazon EMR, and COPY from a remote host (COPY from SSH). The syntax for a COPY from Amazon DynamoDB follows separately.

COPY table_name [ (column1 [,column2, ...]) ]
FROM 
{ 's3://copy_from_s3_objectpath' 
| 's3://copy_from_s3_manifest_file' 
| 's3://copy_from_ssh_manifest_file' 
| 'emr://emr_cluster_id/hdfs_filepath'  
}
[ WITH ] CREDENTIALS [AS] 'aws_access_credentials'
[ option [ ... ] ]

where option is 

{ FIXEDWIDTH 'fixedwidth_spec' 
| DELIMITER [ AS ] 'delimiter_char'  
| CSV [ QUOTE [ AS ] 'quote_character' ]
| JSON [AS] { 'auto' | 's3://jsonpaths_file' } }
| REGION [AS] 'aws_region'
| MANIFEST
| SSH
| ENCRYPTED
| GZIP
| LZOP
| REMOVEQUOTES
| EXPLICIT_IDS
| ACCEPTINVCHARS [ AS ] ['replacement_char']
| MAXERROR [ AS ] error_count
| DATEFORMAT [ AS ] { 'dateformat_string' | 'auto' }
| TIMEFORMAT [ AS ] { 'timeformat_string' | 'auto' | 'epochsecs' | 'epochmillisecs' }
| IGNOREHEADER [ AS ] number_rows
| ACCEPTANYDATE
| IGNOREBLANKLINES
| TRUNCATECOLUMNS
| FILLRECORD
| TRIMBLANKS
| NOLOAD
| NULL [ AS ] 'null_string'
| EMPTYASNULL
| BLANKSASNULL
| COMPROWS numrows
| COMPUPDATE [ { ON | TRUE } | { OFF | FALSE } ]
| STATUPDATE [ { ON | TRUE } | { OFF | FALSE } ]
| ESCAPE
| ROUNDEC 

COPY from Amazon DynamoDB Syntax

COPY table_name [ (column1 [,column2, ...]) ]
FROM 'dynamodb://table_name'
[ WITH ] CREDENTIALS [AS] 'aws_access_credentials'
READRATIO ratio
[ option [ ... ] ]

where option is 

| REGION [AS] 'aws_region'
| EXPLICIT_IDS
| MAXERROR [ AS ] error_count
| DATEFORMAT [ AS ] { 'dateformat_string' | 'auto' }
| TIMEFORMAT [ AS ] { 'timeformat_string' | 'auto' | 'epochsecs' | 'epochmillisecs' }
| ACCEPTANYDATE
| TRUNCATECOLUMNS
| TRIMBLANKS
| NOLOAD
| EMPTYASNULL
| BLANKSASNULL
| COMPROWS numrows
| COMPUPDATE [ { ON | TRUE } | { OFF | FALSE } ]
| STATUPDATE [ { ON | TRUE } | { OFF | FALSE } ]
| ROUNDEC 

Parameters

table_name

Target table for the COPY command. The table must already exist in the database. The table can be temporary or persistent. The COPY command appends the new input data to any existing rows in the table.

(column1 [, column2, ...])

Specifies an optional column list to load data fields into specific columns. The columns can be in any order in the COPY statement, but when loading from flat files, such as in an Amazon S3 bucket, their order must match the order of the source data. Order does not matter when loading from an Amazon DynamoDB table. Any columns omitted from the column list are assigned either the defined DEFAULT expression or NULL if the omitted column is nullable and has no defined DEFAULT expression. If an omitted column is NOT NULL but has no defined DEFAULT expression, the COPY command fails.

If an IDENTITY column is included in the column list, then EXPLICIT_IDS must also be specified; if an IDENTITY column is omitted, then EXPLICIT_IDS cannot be specified. If no column list is specified, the command behaves as if a complete, in-order column list was specified (with IDENTITY columns omitted if EXPLICIT_IDS was also not specified).
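
For illustration, the following COPY command loads only three columns of a hypothetical venue table; the table, column, and bucket names are placeholders. Columns omitted from the list are populated with their DEFAULT expressions or NULL, as described above.

COPY venue (venueid, venuename, venuecity)
FROM 's3://mybucket/data/venue_subset.txt'
CREDENTIALS 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
DELIMITER '|';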

FROM

Specifies the source of the data to be loaded. You can use the COPY command to load data to an Amazon Redshift table from data files located in an Amazon S3 bucket, Amazon EMR cluster, Amazon DynamoDB table, or remote host using an SSH connection. To export data from a table to a set of files, use the UNLOAD command.

The maximum size of a single input row from any source is 4 MB.

Important

If the Amazon S3 bucket or Amazon DynamoDB table does not reside in the same region as your Amazon Redshift cluster, you must use the REGION option to specify the region in which the data is located. For COPY from Amazon EMR, the Amazon EMR cluster must reside in the same region as your Amazon Redshift cluster.

For COPY from Amazon EMR, if the Amazon Redshift cluster is in a VPC, the Amazon EMR cluster must be in the same VPC group. If your Amazon Redshift cluster is in EC2-Classic (not in a VPC), the Amazon EMR cluster must also be in EC2-Classic.

's3://copy_from_s3_objectpath'

The path to the Amazon S3 objects that contain the data—for example, 's3://mybucket/cust.txt'. The s3://copy_from_s3_objectpath parameter can reference a single file or a set of objects or folders that have the same key prefix. For example, the name custdata.txt is a key prefix that refers to a number of physical files: custdata.txt.1, custdata.txt.2, and so on. The key prefix can also reference a number of folders. For example, 's3://mybucket/custfolder' refers to the folders custfolder_1, custfolder_2, and so on. If a key prefix references multiple folders, all of the files in the folders will be loaded.

Important

If the Amazon S3 bucket that holds the data files does not reside in the same region as your cluster, you must use the REGION option to specify the region in which the data is located.

For more information, see Loading Data from Amazon S3.

's3://copy_from_s3_manifest_file'

The Amazon S3 object key for a manifest file that lists the data files to be loaded. The 's3://copy_from_s3_manifest_file' parameter must explicitly reference a single file—for example, 's3://mybucket/manifest.txt'. It cannot be a key prefix.

The manifest is a text file in JSON format that lists the URL of each file that is to be loaded from Amazon S3. The URL includes the bucket name and full object path for the file. The files that are specified in the manifest can be in different buckets, but all the buckets must be in the same region as the Amazon Redshift cluster. The following example shows the JSON for a manifest that loads three files.

{
  "entries": [
    {"url":"s3://mybucket-alpha/custdata.1","mandatory":true},
    {"url":"s3://mybucket-alpha/custdata.2","mandatory":true},
    {"url":"s3://mybucket-beta/custdata.1","mandatory":false}
  ]
}

The double quotes are required. Each entry in the manifest can optionally include a mandatory flag. If mandatory is set to true, COPY terminates if it does not find the file for that entry; otherwise, COPY will continue. Regardless of any mandatory settings, COPY will terminate if no files are found. The default value for mandatory is false.

The manifest file must not be encrypted or compressed, even if the ENCRYPTED, GZIP, or LZOP options are specified. COPY returns an error if the specified manifest file is not found or the manifest file is not properly formed.

If a manifest file is used, the MANIFEST option must be specified with the COPY command. If the MANIFEST option is not specified, COPY assumes that the file specified with FROM is a data file.

For more information, see Using a Manifest to Specify Data Files.
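
For illustration, a COPY command that loads the files listed in a manifest might resemble the following; the target table name is a placeholder, and the manifest path is the example path used above.

COPY customer
FROM 's3://mybucket/manifest.txt'
CREDENTIALS 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
DELIMITER '|'
MANIFEST;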

's3://copy_from_ssh_manifest_file'

The COPY command can connect to multiple hosts using SSH, and can create multiple SSH connections to each host. COPY executes a command through each host connection, and then loads the output from the commands in parallel into the table. The s3://copy_from_ssh_manifest_file parameter specifies the Amazon S3 object key for the manifest file that provides the information COPY will use to open SSH connections and execute the remote commands.

The s3://copy_from_ssh_manifest_file parameter must explicitly reference a single file; it cannot be a key prefix. For example:

's3://mybucket/ssh_manifest.txt'

The manifest file is a text file in JSON format that Amazon Redshift uses to connect to the host. The manifest file specifies the SSH host endpoints and the commands that will be executed on the hosts to return data to Amazon Redshift. Optionally, you can include the host public key, the login user name, and a mandatory flag for each entry. The following example shows a manifest file that creates two SSH connections:

{
  "entries": [
    {"endpoint": "<ssh_endpoint_or_IP>",
     "command": "<remote_command>",
     "mandatory": true,
     "publickey": "<public_key>",
     "username": "<host_user_name>"},
    {"endpoint": "<ssh_endpoint_or_IP>",
     "command": "<remote_command>",
     "mandatory": true,
     "publickey": "<public_key>",
     "username": "<host_user_name>"}
  ]
}

The manifest file contains one entry in the "entries" array for each SSH connection. You can have multiple connections to a single host or multiple connections to multiple hosts. The double quotes are required as shown, both for the field names and the values. The only value that does not need double quotes is the Boolean value true or false for the "mandatory" field.

The following list describes the fields in the manifest file.

endpoint

The URL address or IP address of the host—for example, "ec2-111-222-333.compute-1.amazonaws.com", or "198.51.100.0".

command

The command that will be executed by the host to generate text or binary (gzip or lzop) output. The command can be any command that the user "host_user_name" has permission to run. The command can be as simple as printing a file, or it could query a database or launch a script. The output (text file, gzip binary file, or lzop binary file) must be in a form the Amazon Redshift COPY command can ingest. For more information, see Preparing Your Input Data.

publickey

(Optional) The public key of the host. If provided, Amazon Redshift will use the public key to identify the host. If the public key is not provided, Amazon Redshift will not attempt host identification. For example, if the remote host's public key is: ssh-rsa AbcCbaxxx…Example root@amazon.com, enter the following text in the publickey field: "AbcCbaxxx…Example"

mandatory

(Optional) Indicates whether the COPY command should fail if the connection fails. The default is false. If Amazon Redshift does not successfully make at least one connection, the COPY command fails.

username

(Optional) The user name that will be used to log in to the host system and execute the remote command. The user login name must be the same as the login that was used to add the Amazon Redshift cluster's public key to the host's authorized keys file. The default username is redshift.

For more information about creating a manifest file, see Loading Data Process.

To COPY from a remote host, the SSH option must be specified with the COPY command. If the SSH option is not specified, COPY assumes that the file specified with FROM is a data file and will fail.

The following options cannot be used with COPY from SSH:

  • MANIFEST

  • ENCRYPTED

If you use automatic compression, the COPY command performs two data read operations, which means it will execute the remote command twice. The first read operation is to provide a data sample for compression analysis, then the second read operation actually loads the data. If executing the remote command twice might cause a problem, you should disable automatic compression. To disable automatic compression, run the COPY command with the COMPUPDATE option set to OFF. For more information, see Loading Tables with Automatic Compression.

For detailed procedures for using COPY from SSH, see Loading Data from Remote Hosts.
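
For illustration, a COPY from a remote host might resemble the following; the target table name is a placeholder, and the manifest path is the example path shown above.

COPY sales
FROM 's3://mybucket/ssh_manifest.txt'
CREDENTIALS 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
DELIMITER '|'
SSH;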

'dynamodb://table_name'

The name of the Amazon DynamoDB table that contains the data. For example, 'dynamodb://ProductCatalog'. For details about how Amazon DynamoDB attributes are mapped to Amazon Redshift columns, see Loading Data from an Amazon DynamoDB Table.

An Amazon DynamoDB table name is unique to an AWS account, which is identified by the AWS access credentials.

'emr://emr_cluster_id/hdfs_file_path'

The unique identifier for the Amazon EMR cluster and the Hadoop Distributed File System (HDFS) file path that references the data files for the COPY command. The HDFS data file names must not contain the wildcard characters asterisk (*) and question mark (?).

Note

The Amazon EMR cluster must continue running until the COPY operation completes. If any of the HDFS data files are changed or deleted before the COPY operation completes, you might have unexpected results, or the COPY operation might fail.

You can use the wildcard characters asterisk (*) and question mark (?) as part of the hdfs_file_path argument to specify multiple files to be loaded. For example, 'emr://j-SAMPLE2B500FC/myoutput/part*' identifies the files part-0000, part-0001, and so on. If the file path does not contain wildcard characters, it is treated as a string literal. If you specify only a folder name, COPY attempts to load all files in the folder.

Important

If you use wildcard characters or use only the folder name, verify that no unwanted files will be loaded. For example, some processes might write a log file to the output folder.

For more information, see Loading Data from Amazon EMR.
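
For illustration, a COPY from Amazon EMR might resemble the following; the target table name is a placeholder, and the cluster ID and file path reuse the example values above.

COPY sales
FROM 'emr://j-SAMPLE2B500FC/myoutput/part*'
CREDENTIALS 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
DELIMITER '|';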

WITH

This keyword is optional.

CREDENTIALS [AS] 'aws_access_credentials'

The AWS account access credentials for the Amazon S3 bucket, Amazon DynamoDB table, or Amazon EMR cluster that contains the data or manifest file. The access key and secret access key are required. If your data is encrypted, credentials must include a master symmetric key. If you are using temporary access credentials, you must include the temporary session token in the credentials string. For more information, see Temporary Security Credentials in the COPY usage notes.

The aws_access_credentials string is case sensitive and must not contain spaces.

The access credentials must belong to an AWS account user or an IAM user with the following permissions:

  • For COPY from Amazon S3, permission to LIST and GET the Amazon S3 objects that are being loaded, and the manifest file, if one is used.

  • For COPY from Amazon S3, Amazon EMR, and remote hosts (SSH) with JSON-formatted data, permission to LIST and GET the jsonpaths file on Amazon S3, if one is used.

  • For COPY from Amazon DynamoDB, permission to SCAN and DESCRIBE the Amazon DynamoDB table that is being loaded.

  • For COPY from an Amazon EMR cluster, permission for the ListInstances action on the Amazon EMR cluster.

If only an access key and secret access key are required, the aws_access_credentials string is in the following format:

'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'

You must replace the <access-key-id> and <secret-access-key> with valid AWS account credentials or IAM user credentials.

To use temporary token credentials, you must provide the temporary access key ID, the temporary secret access key, and the temporary token. The aws_access_credentials string is in the following format:

'aws_access_key_id=<temporary-access-key-id>;aws_secret_access_key=<temporary-secret-access-key>;token=<temporary-token>'

If the ENCRYPTED option is used, the aws_access_credentials string is in the following format:

'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>;master_symmetric_key=<master_key>'  

where <master_key> is the value of the master key that was used to encrypt the files.
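
For illustration, a complete command that loads encrypted files might resemble the following; the table name and object path are placeholders.

COPY sales
FROM 's3://mybucket/data/sales_encrypted.txt'
CREDENTIALS 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>;master_symmetric_key=<master_key>'
ENCRYPTED
DELIMITER '|';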

READRATIO [AS] ratio

Specify the percentage of the Amazon DynamoDB table's provisioned throughput to use for the data load. READRATIO is required for a COPY from Amazon DynamoDB. It cannot be used with a COPY from Amazon S3. We highly recommend setting the ratio to a value less than the average unused provisioned throughput. Valid values are integers 1-200.

Caution

Setting READRATIO to 100 or higher will enable Amazon Redshift to consume the entirety of the Amazon DynamoDB table's provisioned throughput, which will seriously degrade the performance of concurrent read operations against the same table during the COPY session. Write traffic will be unaffected. Values higher than 100 are allowed to troubleshoot rare scenarios when Amazon Redshift fails to fulfill the provisioned throughput of the table. If you load data from Amazon DynamoDB to Amazon Redshift on an ongoing basis, consider organizing your Amazon DynamoDB tables as a time series to separate live traffic from the COPY operation.
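
For illustration, a COPY from Amazon DynamoDB that uses half of the table's provisioned throughput might resemble the following; the target table name is a placeholder, and ProductCatalog is the example DynamoDB table name used earlier in this topic.

COPY favoritemovies
FROM 'dynamodb://ProductCatalog'
CREDENTIALS 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
READRATIO 50;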

'option'

Optional list of load options.

FIXEDWIDTH 'fixedwidth_spec'

Loads the data from a file where each column width is a fixed length, rather than separated by a delimiter. The fixedwidth_spec is a string that specifies a user-defined column label and column width. The column label can be either a text string or an integer, depending on what the user chooses. The column label has no relation to the column name. The order of the label/width pairs must match the order of the table columns exactly. FIXEDWIDTH cannot be used with DELIMITER. In Amazon Redshift, the length of CHAR and VARCHAR columns is expressed in bytes, so be sure that the column width that you specify accommodates the binary length of multibyte characters when preparing the file to be loaded. For more information, see Character Types.

The format for fixedwidth_spec is shown below:

'colLabel1:colWidth1,colLabel2:colWidth2, ...'
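
For illustration, a fixed-width load of a hypothetical venue table might resemble the following; the column labels and widths are placeholders chosen for this example.

COPY venue
FROM 's3://mybucket/data/venue_fw.txt'
CREDENTIALS 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
FIXEDWIDTH 'venueid:3,venuename:25,venuecity:12,venuestate:2,venueseats:6';
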
DELIMITER [AS] ['delimiter_char']

Single ASCII character that is used to separate fields in the input file, such as a pipe character ( | ), a comma ( , ), or a tab ( \t ). Non-printing ASCII characters are supported. ASCII characters can also be represented in octal, using the format '\ddd', where 'd' is an octal digit (0-7). The default delimiter is a pipe character ( | ), unless the CSV option is used, in which case the default delimiter is a comma ( , ). The AS keyword is optional. DELIMITER cannot be used with FIXEDWIDTH.

CSV

Enables use of CSV format in the input data. To automatically escape delimiters, newline characters, and carriage returns, enclose the field in the character specified by the QUOTE option. The default quote character is a double quote ( " ). When the quote character is used within the field, escape the character with an additional quote character. For example, if the quote character is a double quote, to insert the string A "quoted" word, the input file would include the string "A ""quoted"" word". When the CSV option is used, the default delimiter is a comma ( , ). You can specify a different delimiter by using the DELIMITER option.

When a field is enclosed in quotes, white space between the delimiters and the quote characters is ignored. If the delimiter is a whitespace character, such as a tab, the delimiter is not treated as whitespace.

CSV cannot be used with FIXEDWIDTH, REMOVEQUOTES, or ESCAPE.

QUOTE [AS] 'quote_character'

Specifies the character to be used as the quote character when using the CSV option. The default is a double quote ( " ). If you use the QUOTE option to define a quote character other than double quote, you don’t need to escape the double quotes within the field. The QUOTE option can be used only with the CSV option. The AS keyword is optional.
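
For illustration, a CSV load that uses a percent sign as the quote character might resemble the following; the table name and object path are placeholders.

COPY category
FROM 's3://mybucket/data/category_csv.txt'
CREDENTIALS 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
CSV QUOTE AS '%';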

JSON [AS] 'json_option'

Specifies that the source data is in JSON format.

JSON format is supported for COPY from Amazon S3, COPY from Amazon EMR, and COPY from SSH. JSON is not supported for COPY from Amazon DynamoDB.

Valid values:

  • 'auto'

  • 's3://jsonpaths_file'

The default is 'auto'.

'auto'

COPY maps the data elements in the JSON source data to the columns in the target table by matching object keys, or names, in the source name/value pairs to the names of columns in the target table. The 'auto' argument is case-sensitive and must be enclosed in single quotation marks.

's3://jsonpaths_file'

COPY uses the named JSONPaths file to map the data elements in the JSON source data to the columns in the target table. The s3://jsonpaths_file parameter must be an Amazon S3 object key that explicitly references a single file, such as 's3://mybucket/jsonpaths.txt'; it cannot be a key prefix. For more information about using a JSONPaths file, see JSONPaths file in this syntax reference.

Note

If the file specified by jsonpaths_file has the same prefix as the path specified by copy_from_s3_objectpath for the data files, COPY reads the JSONPaths file as a data file and returns errors. For example, if your data files use the object path s3://mybucket/my_data.json and your JSONPaths file is s3://mybucket/my_data.jsonpaths, COPY attempts to load my_data.jsonpaths as a data file.

JSON Data File

The JSON data file contains a set of either objects or arrays. COPY loads each JSON object or array into one row in the target table. Each object or array corresponding to a row must be a standalone, root-level structure; that is, it must not be a member of another JSON structure.

A JSON object begins and ends with braces, and contains an unordered collection of name/value pairs. Each paired name and value are separated by a colon, and the pairs are separated by commas. By default, the object key, or name, in the name/value pairs must match the name of the corresponding column in the table. If your column names and JSON object keys do not match, use a JSONPaths file to map columns to keys.

Order in a JSON object does not matter. Any names that don't match a column name are ignored. The following shows the structure of a simple JSON object.

{
  "column1": "value1",
  "column2": value2,
  "notacolumn" : "ignore this value"
}

A JSON array begins and ends with brackets, and contains an ordered collection of values separated by commas. If your data files use arrays, you must specify a JSONPaths file to match the values to columns. The following shows the structure of a simple JSON array.

["value1", value2]

The JSON must be well-formed. For example, the objects or arrays cannot be separated by commas or any other characters except white space.

The maximum size of a single JSON object or array, including braces or brackets, is 4 MB. This is distinct from the maximum row size. If the maximum size of a single JSON object or array is exceeded, even if the resulting row size is less than the 4 MB row-size limit, the COPY command fails.

COPY loads \n as a newline character and loads \t as a tab character. To load a backslash, escape it with a backslash ( \\ ).

COPY searches the specified JSON source for a well-formed, valid JSON object or array. If COPY encounters any non-white space characters before locating a usable JSON structure, or between valid JSON objects or arrays, COPY returns an error for each instance. These errors count toward the MAXERROR error count. When the error count equals or exceeds MAXERROR, COPY fails.

For each error, Amazon Redshift records a row in the STL_LOAD_ERRORS system table. The LINE_NUMBER column records the last line of the JSON object that caused the error.

If IGNOREHEADER is specified, COPY ignores the specified number of lines in the JSON data. Newline characters in the JSON data are always counted for IGNOREHEADER calculations.

COPY loads empty strings as empty fields by default. If EMPTYASNULL is specified, COPY loads empty strings for CHAR and VARCHAR fields as NULL. Empty strings for other data types, such as INT, are always loaded with NULL.

The following options are not supported with JSON:

  • CSV

  • DELIMITER

  • ESCAPE

  • FILLRECORD

  • FIXEDWIDTH

  • IGNOREBLANKLINES

  • NULL AS

  • READRATIO

  • REMOVEQUOTES

For more information, see COPY from JSON Format in the following usage notes. For more information about JSON data structures, go to www.json.org.
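
For illustration, a COPY that loads JSON data whose object keys match the table column names might resemble the following; the table name and object path are placeholders.

COPY venue
FROM 's3://mybucket/venue.json'
CREDENTIALS 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
JSON 'auto';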

JSONPaths File

If you are loading from JSON-formatted data, by default, COPY maps the data elements in the JSON source data to the columns in the target table by matching each name, or object key, in a JSON name/value pair, to the name of a column in the target table. For more information, see JSON.

If your column names and object keys do not match, you can use a JSONPaths file to explicitly map JSON data elements to columns. The JSONPaths file maps JSON data elements to columns by matching the column order in the target table.

The JSONPaths file must contain only a single JSON object (not an array). The JSON object is a name/value pair. The object key, which is the name in the name/value pair, must be "jsonpaths". The value in the name/value pair is an array of JSONPath expressions. Each JSONPath expression references a single element in the JSON data hierarchy, similarly to how an XPath expression refers to elements in an XML document. For more information, see JSONPath Expressions later in this topic.

To use a JSONPaths file, add the JSON keyword to the COPY command and specify the Amazon S3 bucket name and object path of the JSONPaths file, using the following format.

COPY tablename 
FROM 'data_source' 
CREDENTIALS 'credentials_string' 
FORMAT JSON 's3://jsonpaths_file';

The s3://jsonpaths_file parameter must be an Amazon S3 object key that explicitly references a single file, such as 's3://mybucket/jsonpaths.txt'; it cannot be a key prefix.

Note

If you are loading from Amazon S3 and the file specified by jsonpaths_file has the same prefix as the path specified by copy_from_s3_objectpath for the data files, COPY reads the JSONPaths file as a data file and returns errors. For example, if your data files use the object path s3://mybucket/my_data.json and your JSONPaths file is s3://mybucket/my_data.jsonpaths, COPY attempts to load my_data.jsonpaths as a data file.

Note

If the key name is any string other than "jsonpaths", the COPY command does not return an error, but it ignores jsonpaths_file and uses the 'auto' option instead.

If any of the following occurs, the COPY command fails:

  • The JSON is malformed

  • There is more than one JSON object

  • Any characters except white space exist outside the object

  • An array element is an empty string or is not a string

MAXERROR does not apply to the JSONPaths file.

If the ENCRYPTED option is used with COPY, the JSONPaths file is not encrypted.

JSONPath Expressions

Each JSONPath expression corresponds to one column in the Amazon Redshift target table. The order of the JSONPaths array elements must match the order of the columns in the target table or the column list, if a column list is used. If an element referenced by a JSONPath expression is not found in the JSON data, COPY attempts to load a NULL value.

The JSONPath expressions can use either bracket notation or dot notation, but you cannot mix notations. The following example shows JSONPath expressions using bracket notation.

{
    "jsonpaths": [
        "$['venuename']",
        "$['venuecity']",
        "$['venuestate']",
        "$['venueseats']"
    ]
}

The following example shows JSONPath expressions using dot notation.

{
    "jsonpaths": [
        "$.venuename",
        "$.venuecity",
        "$.venuestate",
        "$.venueseats"
    ]
}
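
For illustration, under the dot-notation JSONPaths file shown above, a source object such as the following (the values are placeholders) would be mapped, in order, to the venuename, venuecity, venuestate, and venueseats columns.

{
  "venuename": "Madison Square Garden",
  "venuecity": "New York",
  "venuestate": "NY",
  "venueseats": 20000
}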

In the context of Amazon Redshift COPY syntax, a JSONPath expression must specify the explicit path to a single name element in a JSON hierarchical data structure. Amazon Redshift does not support any JSONPath elements, such as wildcard characters or filter expressions, that might resolve to an ambiguous path or multiple name elements.

For more information, see COPY from JSON Format in the following usage notes.

REGION [AS] 'aws_region'

Specifies the AWS region where the source data is located. REGION is required for COPY from an Amazon S3 bucket or an Amazon DynamoDB table when the AWS resource that contains the data is not in the same region as the Amazon Redshift cluster. The value for aws_region must match a region listed in the tables on the Regions and Endpoints page. COPY does not support REGION for COPY from Amazon EMR.

If the REGION option is specified, all resources, including a manifest file or multiple Amazon S3 buckets, must be located in the specified region.

Note

Transferring data across regions incurs additional charges against the Amazon S3 bucket or the Amazon DynamoDB table that contains the data. For more information about pricing, go to Data Transfer OUT From Amazon S3 To Another AWS Region on the Amazon S3 Pricing page and Data Transfer OUT on the Amazon DynamoDB Pricing page.

By default, COPY assumes that the data is located in the same region as the Amazon Redshift cluster.
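
For illustration, a COPY that loads from a bucket in a different region than the cluster might resemble the following; the table name and object path are placeholders, and us-west-2 stands in for the region that contains the bucket.

COPY sales
FROM 's3://mybucket/data/sales.txt'
CREDENTIALS 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
DELIMITER '|'
REGION 'us-west-2';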

MANIFEST

Specifies that a manifest is used to identify the data files to be loaded from Amazon S3. If the MANIFEST option is used, COPY loads data from the files listed in the manifest referenced by 's3://copy_from_s3_manifest_file'. If the manifest file is not found, or is not properly formed, COPY fails.

ENCRYPTED

Specifies that the input files on Amazon S3 are encrypted. See Loading Encrypted Data Files from Amazon S3. If the encrypted files are in gzip format, add the GZIP option.

GZIP

Specifies that the input file or files are in compressed gzip format (.gz files). The COPY operation reads the compressed file and uncompresses the data as it loads.

LZOP

Specifies that the input file or files are in compressed lzop format (.lzo files). The COPY operation reads the compressed file and uncompresses the data as it loads.

Note

COPY does not support files that are compressed using the lzop '--filter' option.

REMOVEQUOTES

Surrounding quotation marks are removed from strings in the incoming data. All characters within the quotes, including delimiters, are retained. If a string has a beginning single or double quotation mark but no corresponding ending mark, the COPY command fails to load that row and returns an error. The following table shows some simple examples of strings that contain quotes and the resulting loaded values.

Input String                                 Loaded Value with REMOVEQUOTES Option
"The delimiter is a pipe (|) character"      The delimiter is a pipe (|) character
'Black'                                      Black
"White"                                      White
Blue'                                        Blue'
'Blue                                        Value not loaded: error condition
"Blue                                        Value not loaded: error condition
' 'Black' '                                  ' 'Black' '
' '                                          <white space>

EXPLICIT_IDS

Use EXPLICIT_IDS with tables that have IDENTITY columns if you want to override the auto-generated values with explicit values from the source data files for the tables. If the command includes a column list, that list must include the IDENTITY columns to use this option. The data format for EXPLICIT_IDS values must match the IDENTITY format specified by the CREATE TABLE definition.

ACCEPTINVCHARS [AS] ['replacement_char']

Enables loading of data into VARCHAR columns even if the data contains invalid UTF-8 characters. When ACCEPTINVCHARS is specified, COPY replaces each invalid UTF-8 character with a string of equal length consisting of the character specified by replacement_char. For example, if the replacement character is '^', an invalid three-byte character will be replaced with '^^^'.

The replacement character can be any ASCII character except NULL. The default is a question mark ( ? ). For information about invalid UTF-8 characters, see Multibyte Character Load Errors.

COPY returns the number of rows that contained invalid UTF-8 characters, and it adds an entry to the STL_REPLACEMENTS system table for each affected row, up to a maximum of 100 rows per node slice. Additional invalid UTF-8 characters are also replaced, but those replacement events are not recorded.

If ACCEPTINVCHARS is not specified, COPY returns an error whenever it encounters an invalid UTF-8 character.

ACCEPTINVCHARS is valid only for VARCHAR columns.
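
For illustration, a COPY that replaces each invalid UTF-8 character with a caret might resemble the following; the table name and object path are placeholders.

COPY venue
FROM 's3://mybucket/data/venue.txt'
CREDENTIALS 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
DELIMITER '|'
ACCEPTINVCHARS AS '^';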

MAXERROR [AS] error_count

If the load returns the error_count number of errors or greater, the load fails. If the load returns fewer errors, it continues and returns an INFO message that states the number of rows that could not be loaded. Use this option to allow loads to continue when certain rows fail to load into the table because of formatting errors or other inconsistencies in the data. Set this value to 0 or 1 if you want the load to fail as soon as the first error occurs. The AS keyword is optional.

The actual number of errors reported might be greater than the specified MAXERROR because of the parallel nature of Amazon Redshift. If any node in the Amazon Redshift cluster detects that MAXERROR has been exceeded, each node reports all of the errors it has encountered.

DATEFORMAT [AS] {'dateformat_string' | 'auto' }

If no DATEFORMAT is specified, the default format is 'YYYY-MM-DD'. For example, an alternative valid format is 'MM-DD-YYYY'.

If you want Amazon Redshift to automatically recognize and convert the date format in your source data, specify 'auto'. The 'auto' keyword is case sensitive. If the COPY command does not recognize the format of your date or time values, or if your date or time values use different formats, use the 'auto' option with the DATEFORMAT or TIMEFORMAT parameter. The 'auto' option recognizes several formats that are not supported when using a DATEFORMAT and TIMEFORMAT string. For more information, see Using Automatic Recognition with DATEFORMAT and TIMEFORMAT.

The date format can include time information (hour, minutes, seconds), but this information is ignored. The AS keyword is optional. For more information, see DATEFORMAT and TIMEFORMAT Strings .

TIMEFORMAT [AS] {'timeformat_string' | 'auto' | 'epochsecs' | 'epochmillisecs' }

If no TIMEFORMAT is specified, the default format is YYYY-MM-DD HH:MI:SS. For more information about timeformat_string, see DATEFORMAT and TIMEFORMAT Strings.

If you want Amazon Redshift to automatically recognize and convert the time format in your source data, specify 'auto'. If the COPY command does not recognize the format of your date or time values, or if your date or time values use different formats, use the 'auto' option with the DATEFORMAT or TIMEFORMAT parameter. The 'auto' option recognizes several formats that are not supported when using a DATEFORMAT and TIMEFORMAT string. For more information, see Using Automatic Recognition with DATEFORMAT and TIMEFORMAT.

If your source data is represented as epoch time, the number of seconds or milliseconds since Jan 1, 1970 00:00:00 UTC, specify 'epochsecs' or 'epochmillisecs'.

The 'auto', 'epochsecs', and 'epochmillisecs' keywords are case sensitive.

The AS keyword is optional.
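
For illustration, a COPY whose source dates use MM-DD-YYYY format and whose timestamps are expressed in epoch seconds might resemble the following; the table name and object path are placeholders.

COPY sales
FROM 's3://mybucket/data/sales.txt'
CREDENTIALS 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
DELIMITER '|'
DATEFORMAT 'MM-DD-YYYY'
TIMEFORMAT 'epochsecs';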

IGNOREHEADER [ AS ] number_rows

Treats the specified number_rows as a file header and does not load them. Use IGNOREHEADER to skip file headers in all files in a parallel load.

ACCEPTANYDATE

Allows any date format, including invalid formats such as 00/00/00 00:00:00, to be loaded without generating an error. Applies only to TIMESTAMP and DATE columns. Always use ACCEPTANYDATE with the DATEFORMAT option. If the date format for the data does not match the DATEFORMAT specification, Amazon Redshift inserts a NULL value into that field.

IGNOREBLANKLINES

Ignores blank lines that only contain a line feed in a data file and does not try to load them.

TRUNCATECOLUMNS

Truncates data in columns to the appropriate number of characters so that it fits the column specification. Applies only to columns with a VARCHAR or CHAR data type, and rows 4 MB or less in size.

FILLRECORD

Allows data files to be loaded when contiguous columns are missing at the end of some of the records. The missing columns are filled with either zero-length strings or NULLs, as appropriate for the data types of the columns in question. If the EMPTYASNULL option is present in the COPY command and the missing column is a VARCHAR column, NULLs are loaded; if EMPTYASNULL is not present and the column is a VARCHAR, zero-length strings are loaded. NULL substitution only works if the column definition allows NULLs.

For example, if the table definition contains four nullable CHAR columns, and a record contains the values apple, orange, banana, mango, the COPY command could load and fill in a record that contains only the values apple, orange. The missing CHAR values would be loaded as NULL values.

TRIMBLANKS

Removes the trailing whitespace characters from a VARCHAR string. Only applicable to columns with a VARCHAR data type.

NOLOAD

Checks the validity of the data file without actually loading the data. Use the NOLOAD option to make sure that your data file will load without any errors before running the actual data load. Running COPY with the NOLOAD option is much faster than loading the data since it only parses the files.

NULL AS 'null_string'

Loads fields that match null_string as NULL, where null_string can be any string. This option cannot be used with numeric columns. To load NULL into numeric columns, such as INT, use an empty field. If your data includes a null terminator, also referred to as NUL (UTF-8 0000) or binary zero (0x000), COPY treats it as an end of record (EOR) and terminates the record. If a field contains only NUL, you can use NULL AS to replace the null terminator with NULL by specifying '\0' or '\000'. For example, NULL AS '\0' or NULL AS '\000'. If a field contains a string that ends with NUL and NULL AS is specified, the string is inserted with NUL at the end. Do not use '\n' (newline) for the null_string value. Amazon Redshift reserves '\n' for use as a line delimiter. The default null_string is '\N'.

Note

If you attempt to load nulls into a column defined as NOT NULL, the COPY command will fail.
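
For illustration, a COPY that loads fields containing only the null terminator as NULL might resemble the following; the table name and object path are placeholders.

COPY sales
FROM 's3://mybucket/data/sales.txt'
CREDENTIALS 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
DELIMITER '|'
NULL AS '\0';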

EMPTYASNULL

Indicates that Amazon Redshift should load empty CHAR and VARCHAR fields as NULL. Empty fields for other data types, such as INT, are always loaded with NULL. Empty fields occur when data contains two delimiters in succession with no characters between the delimiters. EMPTYASNULL and NULL AS '' (empty string) produce the same behavior.

BLANKSASNULL

Loads blank fields, which consist of only white space characters, as NULL. This option applies only to CHAR and VARCHAR columns. Blank fields for other data types, such as INT, are always loaded with NULL. For example, a string that contains three space characters in succession (and no other characters) is loaded as a NULL. The default behavior, without this option, is to load the space characters as is.

COMPROWS numrows

Number of rows to be used as the sample size for compression analysis. The analysis is run on rows from each data slice. For example, if you specify COMPROWS 1000000 (1,000,000) and the system contains 4 total slices, no more than 250,000 rows per slice are read and analyzed.

If COMPROWS is not specified, the sample size defaults to 100,000 per slice. Values of COMPROWS lower than the default of 100,000 rows per slice are automatically upgraded to the default value. However, automatic compression will not take place if the amount of data being loaded is insufficient to produce a meaningful sample.

If the COMPROWS number is greater than the number of rows in the input file, the COPY command still proceeds and runs the compression analysis against all of the available rows. The accepted range for this option is a number between 1000 and 1000000000 (1,000,000,000).

COMPUPDATE [ { ON | TRUE } | { OFF | FALSE } ]

Controls whether compression encodings are automatically applied during a COPY.

The COPY command will automatically choose the optimal compression encodings for each column in the target table based on a sample of the input data. For more information, see Loading Tables with Automatic Compression.

If COMPUPDATE is omitted, COPY applies automatic compression only if the target table is empty and all the table columns either have RAW encoding or no encoding. This is the default behavior.

With COMPUPDATE ON (or TRUE), COPY applies automatic compression if the table is empty, even if the table columns already have encodings other than RAW. Existing encodings are replaced. If COMPUPDATE is specified without a value, ON is the default.

With COMPUPDATE OFF (or FALSE), automatic compression is disabled.
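
For illustration, a COPY that forces automatic compression analysis on an empty table, sampling up to 1,000,000 rows per slice, might resemble the following; the table name and object path are placeholders.

COPY sales
FROM 's3://mybucket/data/sales.txt'
CREDENTIALS 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
DELIMITER '|'
COMPROWS 1000000
COMPUPDATE ON;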

STATUPDATE [ { ON | TRUE } | { OFF | FALSE } ]

Governs automatic computation and refresh of optimizer statistics at the end of a successful COPY command. By default, if the STATUPDATE option is not used, statistics are updated automatically if the table is initially empty. See also Analyzing Tables.

Whenever ingesting data into a nonempty table significantly changes the size of the table, we recommend updating statistics either by running an ANALYZE command or by using the STATUPDATE ON option.

With STATUPDATE ON (or TRUE), statistics are updated automatically regardless of whether the table is initially empty. If STATUPDATE is used, the current user must be either the table owner or a superuser. If STATUPDATE is not specified, only INSERT permission is required.

With STATUPDATE OFF (or FALSE), statistics are never updated.

ESCAPE

When this option is specified, the backslash character (\) in input data is treated as an escape character. The character that immediately follows the backslash character is loaded into the table as part of the current column value, even if it is a character that normally serves a special purpose. For example, you can use this option to escape the delimiter character, a quote, an embedded newline, or the escape character itself when any of these characters is a legitimate part of a column value.

If you specify the ESCAPE option in combination with the REMOVEQUOTES option, you can escape and retain quotes (' or ") that might otherwise be removed. The default null string, \N, works as is, but can also be escaped in the input data as \\N. As long as you do not specify an alternative null string with the NULL AS option, \N and \\N produce the same results.

Note

The control character 0x00 (NUL) cannot be escaped and should be removed from the input data or converted. This character is treated as an end of record (EOR) marker, causing the remainder of the record to be truncated.

You cannot use the ESCAPE option for FIXEDWIDTH loads, and you cannot specify the escape character itself; the escape character is always the backslash character. Also, you must ensure that the input data contains the escape character in the appropriate places.

Here are some examples of input data and the resulting loaded data when the ESCAPE option is specified. The result for row 4 assumes that the REMOVEQUOTES option is also specified. The input data consists of two pipe-delimited fields:

1|The quick brown fox\[newline]
jumped over the lazy dog.
2| A\\B\\C
3| A \| B \| C
4| 'A Midsummer Night\'s Dream'

The data loaded into column 2 looks like this:

The quick brown fox
jumped over the lazy dog.
A\B\C
A|B|C
A Midsummer Night's Dream

Note

Applying the escape character to the input data for a load is the responsibility of the user. One exception to this requirement is when you reload data that was previously unloaded with the ESCAPE option. In this case, the data will already contain the necessary escape characters.

The ESCAPE option does not interpret octal, hex, Unicode, or other escape sequence notation. For example, if your source data contains the octal linefeed value (\012) and you try to load this data with the ESCAPE option, Amazon Redshift loads the value 012 into the table and does not interpret this value as a linefeed that is being escaped.

In order to escape newlines in data that originates from Windows platforms, you might need to use two escape characters: one for the carriage return and one for the linefeed. Alternatively, you can remove the carriage returns before loading the file (for example, by using the dos2unix utility).
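
For illustration, a COPY that loads pipe-delimited data containing escaped delimiters and quoted strings, like the input shown above, might resemble the following; the table name and object path are placeholders.

COPY category
FROM 's3://mybucket/data/category_escaped.txt'
CREDENTIALS 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
DELIMITER '|'
ESCAPE
REMOVEQUOTES;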

ROUNDEC

Rounds up numeric values when the scale of the input value is greater than the scale of the column. By default, COPY truncates values when necessary to fit the scale of the column. For example, if a value of 20.259 is loaded into a DECIMAL(8,2) column, COPY truncates the value to 20.25 by default. If ROUNDEC is specified, COPY rounds the value to 20.26. The INSERT command always rounds values when necessary to match the column's scale, so a COPY command with the ROUNDEC option behaves the same as an INSERT command.