Amazon Relational Database Service
User Guide (API Version 2014-10-31)

Loading Data into a DB Cluster from Text Files in an Amazon S3 Bucket

Loading data into a table from text files in an Amazon S3 bucket is available for Amazon Aurora version 1.8 and later. For more information on Aurora versions, see Amazon Aurora Database Engine Updates.

You can use the LOAD DATA FROM S3 or LOAD XML FROM S3 command to load data from files stored in an Amazon S3 bucket.

Before you can load data from an Amazon S3 bucket, you must give your Aurora DB cluster permission to access Amazon S3. To grant permission, create an IAM role with the necessary permissions, and then associate the role with your DB cluster. For details and instructions on how to permit your Aurora DB cluster to access Amazon S3 on your behalf, see Authorizing Amazon Aurora to Access Other AWS Services on Your Behalf.

The database user that issues the LOAD DATA FROM S3 or LOAD XML FROM S3 command must have the LOAD FROM S3 privilege. The master username for a DB cluster is granted this privilege by default. You can grant the privilege to another user by using the following command:

GRANT LOAD FROM S3 ON *.* TO user@domain-or-ip-address;

The LOAD FROM S3 privilege is specific to Amazon Aurora and is not available for MySQL databases or RDS MySQL DB instances. If you have set up replication between an Aurora DB cluster as the replication master and a MySQL database as the replication client, then the GRANT LOAD FROM S3 command will cause replication to stop with an error. The error can be safely skipped in order to resume replication. To skip the error on an RDS MySQL DB instance, use the mysql.rds_skip_repl_error command. To skip the error on an external MySQL database, use the SET GLOBAL sql_slave_skip_counter command.
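For example, assuming the GRANT statement has stopped replication, you might skip the error and resume as follows. The mysql.rds_skip_repl_error procedure and the sql_slave_skip_counter variable are standard, but the exact recovery steps depend on your replication setup.

-- On an RDS MySQL DB instance, skip the current replication error:
CALL mysql.rds_skip_repl_error;

-- On an external MySQL database, skip one event and restart replication:
STOP SLAVE;
SET GLOBAL sql_slave_skip_counter = 1;
START SLAVE;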

Specifying a Path to an Amazon S3 Bucket

The syntax for specifying a path to a file or files stored on an Amazon S3 bucket is as follows.

s3-region://bucket-name/file-name-or-prefix

The path includes the following values:

  • region (optional) – The AWS Region that contains the Amazon S3 bucket to load from. If you don't specify a region value, then Aurora loads your file from Amazon S3 in the same region as your DB cluster.

  • bucket-name – The name of the Amazon S3 bucket that contains the data to load. Object prefixes that identify a virtual folder path are supported.

  • file-name-or-prefix – The name of the Amazon S3 file to load or a prefix that identifies more than one file.
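For example, the following paths (with a hypothetical bucket name) identify a single file in the us-west-2 region and, by prefix, multiple files in the same region as the DB cluster:

s3-us-west-2://my-example-bucket/data/customerdata.csv
s3://my-example-bucket/data/employee_data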

LOAD DATA FROM S3

You can use the LOAD DATA FROM S3 command to load data from any text file format that is supported by the MySQL LOAD DATA INFILE command, such as text data that is comma-delimited. Compressed files are not supported.

Syntax

LOAD DATA FROM S3 [FILE | PREFIX] 'S3-URI'
    [REPLACE | IGNORE]
    INTO TABLE tbl_name
    [PARTITION (partition_name,...)]
    [CHARACTER SET charset_name]
    [{FIELDS | COLUMNS}
        [TERMINATED BY 'string']
        [[OPTIONALLY] ENCLOSED BY 'char']
        [ESCAPED BY 'char']
    ]
    [LINES
        [STARTING BY 'string']
        [TERMINATED BY 'string']
    ]
    [IGNORE number {LINES | ROWS}]
    [(col_name_or_user_var,...)]
    [SET col_name = expr,...]

Parameters

Following, you can find a list of the required and optional parameters used by the LOAD DATA FROM S3 command. You can find more details about some of these parameters in LOAD DATA INFILE Syntax in the MySQL documentation.

  • FILE | PREFIX – Identifies whether to load the data from a single file, or from all files that match a given prefix. FILE is the default.

  • REPLACE | IGNORE – Determines what action to take if an input row has the same unique key values as an existing row in the database table.

    • Specify REPLACE if you want the input row to replace the existing row in the table.

    • Specify IGNORE if you want to discard the input row. IGNORE is the default.

  • INTO TABLE – Identifies the name of the database table to load the input rows into.

  • PARTITION – Requires that all input rows be inserted into the partitions identified by the specified list of comma-separated partition names. If an input row cannot be inserted into one of the specified partitions, then the command fails and an error is returned.

  • CHARACTER SET – Identifies the character set of the data in the input file.

  • FIELDS | COLUMNS – Identifies how the fields or columns in the input file are delimited. Fields are tab-delimited by default.

  • LINES – Identifies how the lines in the input file are delimited. Lines are delimited by a newline character ('\n') by default.

  • IGNORE number LINES | ROWS – Specifies to ignore a certain number of lines or rows at the start of the input file. For example, you can use IGNORE 1 LINES to skip over an initial header line containing column names, or IGNORE 2 ROWS to skip over the first two rows of data in the input file.

  • col_name_or_user_var, ... – Specifies a comma-separated list of one or more column names or user variables that identify which columns to load by name. The name of a user variable used for this purpose must be prefixed with @. You can employ user variables to store the corresponding field values for subsequent reuse.

    For example, the following command loads the first column from the input file into the first column of table1, and sets the value of the table_column2 column in table1 to the input value of the second column divided by 100.

    LOAD DATA FROM S3 's3://mybucket/data.txt'
      INTO TABLE table1
      (column1, @var1)
      SET table_column2 = @var1/100;
  • SET – Specifies a comma-separated list of assignment operations that set the values of columns in the table to values not included in the input file.

    For example, the following command sets the first two columns of table1 to the values in the first two columns from the input file, and then sets the value of the column3 in table1 to the current time stamp.

    LOAD DATA FROM S3  's3://mybucket/data.txt'
      INTO TABLE table1
      (column1, column2)
      SET column3 = CURRENT_TIMESTAMP;

    You can use subqueries in the right side of SET assignments. For a subquery that returns a value to be assigned to a column, you can use only a scalar subquery. Also, you cannot use a subquery to select from the table that is being loaded.
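For example, the following sketch uses a scalar subquery to look up a value for each loaded row. The departments lookup table and the department_name column are hypothetical.

    LOAD DATA FROM S3 's3://mybucket/data.txt'
      INTO TABLE table1
      (column1, @dept_id)
      SET department_name = (SELECT dept_name
                             FROM departments
                             WHERE dept_id = @dept_id);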

The LOCAL keyword of the LOAD DATA INFILE statement is not supported when you load data from an Amazon S3 bucket.

Examples

The following command loads data from an Amazon S3 bucket that is in the same region as the Aurora DB cluster. The command reads the comma-delimited data in the file customerdata.csv in the dbbucket Amazon S3 bucket, and then loads the data into the table store-schema.customer-table.

LOAD DATA FROM S3 's3://dbbucket/customerdata.csv'
      INTO TABLE `store-schema`.`customer-table`
      FIELDS TERMINATED BY ','
      LINES TERMINATED BY '\n'
      (ID, FIRSTNAME, LASTNAME, ADDRESS, EMAIL, PHONE);

The following command loads data from an Amazon S3 bucket that is in a different region from the Aurora DB cluster. The command reads the comma-delimited data from all files that match the employee_data object prefix in the my-data Amazon S3 bucket in the us-west-2 region, and then loads the data into the employees table.

LOAD DATA FROM S3 PREFIX 's3-us-west-2://my-data/employee_data'
      INTO TABLE employees
      FIELDS TERMINATED BY ','
      LINES TERMINATED BY '\n'
      (ID, FIRSTNAME, LASTNAME, EMAIL, SALARY);

LOAD XML FROM S3

You can use the LOAD XML FROM S3 command to load data from XML files stored in an Amazon S3 bucket, in one of three different XML formats:

  • Column names as attributes of a <row> element. The attribute value identifies the contents of the table field.

    <row column1="value1" column2="value2" .../>
  • Column names as child elements of a <row> element. The value of the child element identifies the contents of the table field.

    <row>
      <column1>value1</column1>
      <column2>value2</column2>
    </row>
  • Column names in the name attribute of <field> elements in a <row> element. The value of the <field> element identifies the contents of the table field.

    <row>
      <field name='column1'>value1</field>
      <field name='column2'>value2</field>
    </row>

Syntax

LOAD XML FROM S3 [FILE | PREFIX] 'S3-URI'
    [REPLACE | IGNORE]
    INTO TABLE tbl_name
    [PARTITION (partition_name,...)]
    [CHARACTER SET charset_name]
    [ROWS IDENTIFIED BY '<element-name>']
    [IGNORE number {LINES | ROWS}]
    [(field_name_or_user_var,...)]
    [SET col_name = expr,...]

Parameters

Following, you can find a list of the required and optional parameters used by the LOAD XML FROM S3 command. You can find more details about some of these parameters in LOAD XML Syntax in the MySQL documentation.

  • FILE | PREFIX – Identifies whether to load the data from a single file, or from all files that match a given prefix. FILE is the default.

  • REPLACE | IGNORE – Determines what action to take if an input row has the same unique key values as an existing row in the database table.

    • Specify REPLACE if you want the input row to replace the existing row in the table.

    • Specify IGNORE if you want to discard the input row. IGNORE is the default.

  • INTO TABLE – Identifies the name of the database table to load the input rows into.

  • PARTITION – Requires that all input rows be inserted into the partitions identified by the specified list of comma-separated partition names. If an input row cannot be inserted into one of the specified partitions, then the command fails and an error is returned.

  • CHARACTER SET – Identifies the character set of the data in the input file.

  • ROWS IDENTIFIED BY – Specifies the name of the XML element that identifies a row in the input file. The default is <row>.

  • IGNORE number LINES | ROWS – Specifies to ignore a certain number of lines or rows at the start of the input file. For example, you can use IGNORE 1 LINES to skip over the first line in the text file, or IGNORE 2 ROWS to skip over the first two rows of data in the input XML.

  • field_name_or_user_var, ... – Specifies a comma-separated list of one or more XML element names or user variables that identify which elements to load by name. The name of a user variable used for this purpose must match the name of an element from the XML file, prefixed with @. You can employ user variables to store the corresponding field values for subsequent reuse.

    For example, the following command loads the first column from the input file into the first column of table1, and sets the value of the table_column2 column in table1 to the input value of the second column divided by 100.

    LOAD XML FROM S3 's3://mybucket/data.xml'
      INTO TABLE table1
      (column1, @var1)
      SET table_column2 = @var1/100;
  • SET – Specifies a comma-separated list of assignment operations that set the values of columns in the table to values not included in the input file.

    For example, the following command sets the first two columns of table1 to the values in the first two columns from the input file, and then sets the value of the column3 in table1 to the current time stamp.

    LOAD XML FROM S3 's3://mybucket/data.xml'
      INTO TABLE table1
      (column1, column2)
      SET column3 = CURRENT_TIMESTAMP;

    You can use subqueries in the right side of SET assignments. For a subquery that returns a value to be assigned to a column, you can use only a scalar subquery. Also, you cannot use a subquery to select from the table that is being loaded.
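As an end-to-end sketch (the bucket and file names are hypothetical), the following command loads XML in the <field name='...'> format shown earlier, using ROWS IDENTIFIED BY to match a custom <employee> row element:

LOAD XML FROM S3 's3-us-west-2://my-data/employee_data.xml'
      INTO TABLE employees
      ROWS IDENTIFIED BY '<employee>'
      (ID, FIRSTNAME, LASTNAME, EMAIL, SALARY);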