Using Amazon S3 as a target for AWS Database Migration Service - AWS Database Migration Service

Using Amazon S3 as a target for AWS Database Migration Service

You can migrate data to Amazon S3 using AWS DMS from any of the supported database sources. When using Amazon S3 as a target in an AWS DMS task, both full load and change data capture (CDC) data is written to comma-separated value (.csv) format by default. For more compact storage and faster query options, you also have the option to have the data written to Apache Parquet (.parquet) format.

AWS DMS names files created during a full load using an incremental hexadecimal counter—for example LOAD00001.csv, LOAD00002..., LOAD00009, LOAD0000A, and so on for .csv files. AWS DMS names CDC files using timestamps, for example 20141029-1134010000.csv. For each source table that contains records, AWS DMS creates a folder under the specified target folder (if the source table is not empty). AWS DMS writes all full load and CDC files to the specified Amazon S3 bucket. You can control the size of the files that AWS DMS creates by using the MaxFileSize endpoint setting.

The parameter bucketFolder contains the location where the .csv or .parquet files are stored before being uploaded to the S3 bucket. With .csv files, table data is stored in the following format in the S3 bucket, shown with full-load files.

database_schema_name/table_name/LOAD00000001.csv database_schema_name/table_name/LOAD00000002.csv ... database_schema_name/table_name/LOAD00000009.csv database_schema_name/table_name/LOAD0000000A.csv database_schema_name/table_name/LOAD0000000B.csv ...database_schema_name/table_name/LOAD0000000F.csv database_schema_name/table_name/LOAD00000010.csv ...

You can specify the column delimiter, row delimiter, and other parameters using the extra connection attributes. For more information on the extra connection attributes, see Endpoint settings when using Amazon S3 as a target for AWS DMS at the end of this section.

You can specify a bucket owner and prevent sniping by using the ExpectedBucketOwner Amazon S3 endpoint setting, as shown following. Then, when you make a request to test a connection or perform a migration, S3 checks the account ID of the bucket owner against the specified parameter.

--s3-settings='{"ExpectedBucketOwner": "AWS_Account_ID"}'

When you use AWS DMS to replicate data changes using a CDC task, the first column of the .csv or .parquet output file indicates how the row data was changed as shown for the following .csv file.

I,101,Smith,Bob,4-Jun-14,New York U,101,Smith,Bob,8-Oct-15,Los Angeles U,101,Smith,Bob,13-Mar-17,Dallas D,101,Smith,Bob,13-Mar-17,Dallas

For this example, suppose that there is an EMPLOYEE table in the source database. AWS DMS writes data to the .csv or .parquet file, in response to the following events:

  • A new employee (Bob Smith, employee ID 101) is hired on 4-Jun-14 at the New York office. In the .csv or .parquet file, the I in the first column indicates that a new row was INSERTed into the EMPLOYEE table at the source database.

  • On 8-Oct-15, Bob transfers to the Los Angeles office. In the .csv or .parquet file, the U indicates that the corresponding row in the EMPLOYEE table was UPDATEd to reflect Bob's new office location. The rest of the line reflects the row in the EMPLOYEE table as it appears after the UPDATE.

  • On 13-Mar,17, Bob transfers again to the Dallas office. In the .csv or .parquet file, the U indicates that this row was UPDATEd again. The rest of the line reflects the row in the EMPLOYEE table as it appears after the UPDATE.

  • After some time working in Dallas, Bob leaves the company. In the .csv or .parquet file, the D indicates that the row was DELETEd in the source table. The rest of the line reflects how the row in the EMPLOYEE table appeared before it was deleted.

Note that by default for CDC, AWS DMS stores the row changes for each database table without regard to transaction order. If you want to store the row changes in CDC files according to transaction order, you need to use S3 endpoint settings to specify this and the folder path where you want the CDC transaction files to be stored on the S3 target. For more information, see Capturing data changes (CDC) including transaction order on the S3 target.

To control the frequency of writes to an Amazon S3 target during a data replication task, you can configure the cdcMaxBatchInterval and cdcMinFileSize extra connection attributes. This can result in better performance when analyzing the data without any additional overhead operations. For more information, see Endpoint settings when using Amazon S3 as a target for AWS DMS

Prerequisites for using Amazon S3 as a target

Before using Amazon S3 as a target, check that the following are true:

  • The S3 bucket that you're using as a target is in the same AWS Region as the DMS replication instance you are using to migrate your data.

  • The AWS account that you use for the migration has an IAM role with write and delete access to the S3 bucket you are using as a target.

  • This role has tagging access so you can tag any S3 objects written to the target bucket.

  • The IAM role has DMS (dms.amazonaws.com) added as trusted entity.

To set up this account access, ensure that the role assigned to the user account used to create the migration task has the following set of permissions.

{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "s3:PutObject", "s3:DeleteObject", "s3:PutObjectTagging" ], "Resource": [ "arn:aws:s3:::buckettest2/*" ] }, { "Effect": "Allow", "Action": [ "s3:ListBucket" ], "Resource": [ "arn:aws:s3:::buckettest2" ] } ] }

For prerequisites for using validation with S3 as a target, see S3 target validation prerequisites.

Limitations to using Amazon S3 as a target

The following limitations apply when using Amazon S3 as a target:

  • Don’t enable versioning for S3. If you need S3 versioning, use lifecycle policies to actively delete old versions. Otherwise, you might encounter endpoint test connection failures because of an S3 list-object call timeout. To create a lifecycle policy for an S3 bucket, see Managing your storage lifecycle. To delete a version of an S3 object, see Deleting object versions from a versioning-enabled bucket.

  • A VPC-enabled (gateway VPC) S3 bucket is supported in versions 3.4.7 and higher.

  • The following data definition language (DDL) commands are supported for change data capture (CDC): Truncate Table, Drop Table, Create Table, Rename Table, Add Column, Drop Column, Rename Column, and Change Column Data Type. Note that when a column is added, dropped, or renamed on the source database, no ALTER statement is recorded in the target S3 bucket, and AWS DMS does not alter previously created records to match the new structure. After the change, AWS DMS creates any new records using the new table structure.

    Note

    A truncate DDL operation removes all files and corresponding table folders from an S3 bucket. You can use task settings to disable that behavior and configure the way DMS handles DDL behavior during change data capture (CDC). For more information, see Task settings for change processing DDL handling.

  • Full LOB mode is not supported.

  • Changes to the source table structure during full load are not supported. Changes to data are supported during full load.

  • Multiple tasks that replicate data from the same source table to the same target S3 endpoint bucket result in those tasks writing to the same file. We recommend that you specify different target endpoints (buckets) if your data source is from the same table.

  • BatchApply is not supported for an S3 endpoint. Using Batch Apply (for example, the BatchApplyEnabled target metadata task setting) for an S3 target might result in loss of data.

  • You can't use DatePartitionEnabled or addColumnName together with PreserveTransactions or CdcPath.

  • AWS DMS doesn't support renaming multiple source tables to the same target folder using transformation rules.

  • If there is intensive writing to the source table during the full load phase, DMS may write duplicate records to the S3 bucket or cached changes.

  • If you configure the task with a TargetTablePrepMode of DO_NOTHING, DMS may write duplicate records to the S3 bucket if the task stops and resumes abruptly during the full load phase.

  • If you configure the target endpoint with a PreserveTransactions setting of true, reloading a table doesn't clear previously generated CDC files. For more information, see Capturing data changes (CDC) including transaction order on the S3 target.

For limitations for using validation with S3 as a target, see Limitations for using S3 target validation.

Security

To use Amazon S3 as a target, the account used for the migration must have write and delete access to the Amazon S3 bucket that is used as the target. Specify the Amazon Resource Name (ARN) of an IAM role that has the permissions required to access Amazon S3.

AWS DMS supports a set of predefined grants for Amazon S3, known as canned access control lists (ACLs). Each canned ACL has a set of grantees and permissions that you can use to set permissions for the Amazon S3 bucket. You can specify a canned ACL using the cannedAclForObjects on the connection string attribute for your S3 target endpoint. For more information about using the extra connection attribute cannedAclForObjects, see Endpoint settings when using Amazon S3 as a target for AWS DMS. For more information about Amazon S3 canned ACLs, see Canned ACL.

The IAM role that you use for the migration must be able to perform the s3:PutObjectAcl API operation.

Using Apache Parquet to store Amazon S3 objects

The comma-separated value (.csv) format is the default storage format for Amazon S3 target objects. For more compact storage and faster queries, you can instead use Apache Parquet (.parquet) as the storage format.

Apache Parquet is an open-source file storage format originally designed for Hadoop. For more information on Apache Parquet, see https://parquet.apache.org/.

To set .parquet as the storage format for your migrated S3 target objects, you can use the following mechanisms:

Amazon S3 object tagging

You can tag Amazon S3 objects that a replication instance creates by specifying appropriate JSON objects as part of task-table mapping rules. For more information about requirements and options for S3 object tagging, including valid tag names, see Object tagging in the Amazon Simple Storage Service User Guide. For more information about table mapping using JSON, see Specifying table selection and transformations rules using JSON.

You tag S3 objects created for specified tables and schemas by using one or more JSON objects of the selection rule type. You then follow this selection object (or objects) by one or more JSON objects of the post-processing rule type with add-tag action. These post-processing rules identify the S3 objects that you want to tag and specify the names and values of the tags that you want to add to these S3 objects.

You can find the parameters to specify in JSON objects of the post-processing rule type in the following table.

Parameter Possible values Description
rule-type post-processing

A value that applies post-processing actions to the generated target objects. You can specify one or more post-processing rules to tag selected S3 objects.

rule-id A numeric value. A unique numeric value to identify the rule.
rule-name An alphanumeric value. A unique name to identify the rule.
rule-action add-tag The post-processing action that you want to apply to the S3 object. You can add one or more tags using a single JSON post-processing object for the add-tag action.
object-locator

schema-name – The name of the table schema.

table-name – The name of the table.

The name of each schema and table to which the rule applies. You can use the "%" percent sign as a wildcard for all or part of the value of each object-locator parameter. Thus, you can match these items:

  • A single table in a single schema

  • A single table in some or all schemas

  • Some or all tables in a single schema

  • Some or all tables in some or all schemas

tag-set

key – Any valid name for a single tag.

value – Any valid JSON value for this tag.

The names and values for one or more tags that you want to set on each created S3 object that matches the specified object-locator. You can specify up to 10 key-value pairs in a single tag-set parameter object. For more information on S3 object tagging, see Object tagging in the Amazon Simple Storage Service User Guide.

You can also specify a dynamic value for all or part of the value for both the key and value parameters of a tag using ${dyn-value}. Here, ${dyn-value} can be either ${schema-name} or ${table-name}. Thus, you can insert the name of the currently selected schema or table as the whole or any part of the parameter value.

Note
Important

If you insert a dynamic value for the key parameter, you can generate tags with duplicate names for an S3 object, depending on how you use it. In this case, only one of the duplicate tag settings is added to the object.

When you specify multiple post-processing rule types to tag a selection of S3 objects, each S3 object is tagged using only one tag-set object from one post-processing rule. The particular tag set used to tag a given S3 object is the one from the post-processing rule whose associated object locator best matches that S3 object.

For example, suppose that two post-processing rules identify the same S3 object. Suppose also that the object locator from one rule uses wildcards and the object locator from the other rule uses an exact match to identify the S3 object (without wildcards). In this case, the tag set associated with the post-processing rule with the exact match is used to tag the S3 object. If multiple post-processing rules match a given S3 object equally well, the tag set associated with the first such post-processing rule is used to tag the object.

Example Adding static tags to an S3 object created for a single table and schema

The following selection and post-processing rules add three tags (tag_1, tag_2, and tag_3 with corresponding static values value_1, value_2, and value_3) to a created S3 object. This S3 object corresponds to a single table in the source named STOCK with a schema named aat2.

{ "rules": [ { "rule-type": "selection", "rule-id": "5", "rule-name": "5", "object-locator": { "schema-name": "aat2", "table-name": "STOCK" }, "rule-action": "include" }, { "rule-type": "post-processing", "rule-id": "41", "rule-name": "41", "rule-action": "add-tag", "object-locator": { "schema-name": "aat2", "table-name": "STOCK" }, "tag-set": [ { "key": "tag_1", "value": "value_1" }, { "key": "tag_2", "value": "value_2" }, { "key": "tag_3", "value": "value_3" } ] } ] }
Example Adding static and dynamic tags to S3 objects created for multiple tables and schemas

The following example has one selection and two post-processing rules, where input from the source includes all tables and all of their schemas.

{ "rules": [ { "rule-type": "selection", "rule-id": "1", "rule-name": "1", "object-locator": { "schema-name": "%", "table-name": "%" }, "rule-action": "include" }, { "rule-type": "post-processing", "rule-id": "21", "rule-name": "21", "rule-action": "add-tag", "object-locator": { "schema-name": "%", "table-name": "%", }, "tag-set": [ { "key": "dw-schema-name", "value":"${schema-name}" }, { "key": "dw-schema-table", "value": "my_prefix_${table-name}" } ] }, { "rule-type": "post-processing", "rule-id": "41", "rule-name": "41", "rule-action": "add-tag", "object-locator": { "schema-name": "aat", "table-name": "ITEM", }, "tag-set": [ { "key": "tag_1", "value": "value_1" }, { "key": "tag_2", "value": "value_2" } ] } ] }

The first post-processing rule adds two tags (dw-schema-name and dw-schema-table) with corresponding dynamic values (${schema-name} and my_prefix_${table-name}) to almost all S3 objects created in the target. The exception is the S3 object identified and tagged with the second post-processing rule. Thus, each target S3 object identified by the wildcard object locator is created with tags that identify the schema and table to which it corresponds in the source.

The second post-processing rule adds tag_1 and tag_2 with corresponding static values value_1 and value_2 to a created S3 object that is identified by an exact-match object locator. This created S3 object thus corresponds to the single table in the source named ITEM with a schema named aat. Because of the exact match, these tags replace any tags on this object added from the first post-processing rule, which matches S3 objects by wildcard only.

Example Adding both dynamic tag names and values to S3 objects

The following example has two selection rules and one post-processing rule. Here, input from the source includes just the ITEM table in either the retail or wholesale schema.

{ "rules": [ { "rule-type": "selection", "rule-id": "1", "rule-name": "1", "object-locator": { "schema-name": "retail", "table-name": "ITEM" }, "rule-action": "include" }, { "rule-type": "selection", "rule-id": "1", "rule-name": "1", "object-locator": { "schema-name": "wholesale", "table-name": "ITEM" }, "rule-action": "include" }, { "rule-type": "post-processing", "rule-id": "21", "rule-name": "21", "rule-action": "add-tag", "object-locator": { "schema-name": "%", "table-name": "ITEM", }, "tag-set": [ { "key": "dw-schema-name", "value":"${schema-name}" }, { "key": "dw-schema-table", "value": "my_prefix_ITEM" }, { "key": "${schema-name}_ITEM_tag_1", "value": "value_1" }, { "key": "${schema-name}_ITEM_tag_2", "value": "value_2" } ] ] }

The tag set for the post-processing rule adds two tags (dw-schema-name and dw-schema-table) to all S3 objects created for the ITEM table in the target. The first tag has the dynamic value "${schema-name}" and the second tag has a static value, "my_prefix_ITEM". Thus, each target S3 object is created with tags that identify the schema and table to which it corresponds in the source.

In addition, the tag set adds two additional tags with dynamic names (${schema-name}_ITEM_tag_1 and "${schema-name}_ITEM_tag_2"). These have the corresponding static values value_1 and value_2. Thus, these tags are each named for the current schema, retail or wholesale. You can't create a duplicate dynamic tag name in this object, because each object is created for a single unique schema name. The schema name is used to create an otherwise unique tag name.

Creating AWS KMS keys to encrypt Amazon S3 target objects

You can create and use custom AWS KMS keys to encrypt your Amazon S3 target objects. After you create a KMS key, you can use it to encrypt objects using one of the following approaches when you create the S3 target endpoint:

  • Use the following options for S3 target objects (with the default .csv file storage format) when you run the create-endpoint command using the AWS CLI.

    --s3-settings '{"ServiceAccessRoleArn": "your-service-access-ARN", "CsvRowDelimiter": "\n", "CsvDelimiter": ",", "BucketFolder": "your-bucket-folder", "BucketName": "your-bucket-name", "EncryptionMode": "SSE_KMS", "ServerSideEncryptionKmsKeyId": "your-KMS-key-ARN"}'

    Here, your-KMS-key-ARN is the Amazon Resource Name (ARN) for your KMS key. For more information, see Using data encryption, parquet files, and CDC on your Amazon S3 target.

  • Set the extra connection attribute encryptionMode to the value SSE_KMS and the extra connection attribute serverSideEncryptionKmsKeyId to the ARN for your KMS key. For more information, see Endpoint settings when using Amazon S3 as a target for AWS DMS.

To encrypt Amazon S3 target objects using a KMS key, you need an IAM role that has permissions to access the Amazon S3 bucket. This IAM role is then accessed in a policy (a key policy) attached to the encryption key that you create. You can do this in your IAM console by creating the following:

  • A policy with permissions to access the Amazon S3 bucket.

  • An IAM role with this policy.

  • A KMS key encryption key with a key policy that references this role.

The following procedures describe how to do this.

To create an IAM policy with permissions to access the Amazon S3 bucket
  1. Open the IAM console at https://console.aws.amazon.com/iam/.

  2. In the navigation pane, choose Policies in the navigation pane. The Policies page opens.

  3. Choose Create policy. The Create policy page opens.

  4. Choose Service and choose S3. A list of action permissions appears.

  5. Choose Expand all to expand the list and choose the following permissions at a minimum:

    • ListBucket

    • PutObject

    • DeleteObject

    Choose any other permissions you need, and then choose Collapse all to collapse the list.

  6. Choose Resources to specify the resources that you want to access. At a minimum, choose All resources to provide general Amazon S3 resource access.

  7. Add any other conditions or permissions you need, then choose Review policy. Check your results on the Review policy page.

  8. If the settings are what you need, enter a name for the policy (for example, DMS-S3-endpoint-access), and any description, then choose Create policy. The Policies page opens with a message indicating that your policy has been created.

  9. Search for and choose the policy name in the Policies list. The Summary page appears displaying JSON for the policy similar to the following.

    { "Version": "2012-10-17", "Statement": [ { "Sid": "VisualEditor0", "Effect": "Allow", "Action": [ "s3:PutObject", "s3:ListBucket", "s3:DeleteObject" ], "Resource": "*" } ] }

You have now created the new policy to access Amazon S3 resources for encryption with a specified name, for example DMS-S3-endpoint-access.

To create an IAM role with this policy
  1. On your IAM console, choose Roles in the navigation pane. The Roles detail page opens.

  2. Choose Create role. The Create role page opens.

  3. With AWS service selected as the trusted entity, choose DMS as the service to use the IAM role.

  4. Choose Next: Permissions. The Attach permissions policies view appears in the Create role page.

  5. Find and select the IAM policy for the IAM role that you created in the previous procedure (DMS-S3-endpoint-access).

  6. Choose Next: Tags. The Add tags view appears in the Create role page. Here, you can add any tags you want.

  7. Choose Next: Review. The Review view appears in the Create role page. Here, you can verify the results.

  8. If the settings are what you need, enter a name for the role (required, for example, DMS-S3-endpoint-access-role), and any additional description, then choose Create role. The Roles detail page opens with a message indicating that your role has been created.

You have now created the new role to access Amazon S3 resources for encryption with a specified name, for example, DMS-S3-endpoint-access-role.

To create a KMS key encryption key with a key policy that references your IAM role
Note

For more information about how AWS DMS works with AWS KMS encryption keys, see Setting an encryption key and specifying AWS KMS permissions.

  1. Sign in to the AWS Management Console and open the AWS Key Management Service (AWS KMS) console at https://console.aws.amazon.com/kms.

  2. To change the AWS Region, use the Region selector in the upper-right corner of the page.

  3. In the navigation pane, choose Customer managed keys.

  4. Choose Create key. The Configure key page opens.

  5. For Key type, choose Symmetric.

    Note

    When you create this key, you can only create a symmetric key, because all AWS services, such as Amazon S3, only work with symmetric encryption keys.

  6. Choose Advanced Options. For Key material origin, make sure that KMS is chosen, then choose Next. The Add labels page opens.

  7. For Create alias and description, enter an alias for the key (for example, DMS-S3-endpoint-encryption-key) and any additional description.

  8. For Tags, add any tags that you want to help identify the key and track its usage, then choose Next. The Define key administrative permissions page opens showing a list of users and roles that you can choose from.

  9. Add the users and roles that you want to manage the key. Make sure that these users and roles have the required permissions to manage the key.

  10. For Key deletion, choose whether key administrators can delete the key, then choose Next. The Define key usage permissions page opens showing an additional list of users and roles that you can choose from.

  11. For This account, choose the available users you want to perform cryptographic operations on Amazon S3 targets. Also choose the role that you previously created in Roles to enable access to encrypt Amazon S3 target objects, for example DMS-S3-endpoint-access-role).

  12. If you want to add other accounts not listed to have this same access, for Other AWS accounts, choose Add another AWS account, then choose Next. The Review and edit key policy page opens, showing the JSON for the key policy that you can review and edit by typing into the existing JSON. Here, you can see where the key policy references the role and users (for example, Admin and User1) that you chose in the previous step. You can also see the different key actions permitted for the different principals (users and roles), as shown in the example following.

    { "Id": "key-consolepolicy-3", "Version": "2012-10-17", "Statement": [ { "Sid": "Enable IAM User Permissions", "Effect": "Allow", "Principal": { "AWS": [ "arn:aws:iam::111122223333:root" ] }, "Action": "kms:*", "Resource": "*" }, { "Sid": "Allow access for Key Administrators", "Effect": "Allow", "Principal": { "AWS": [ "arn:aws:iam::111122223333:role/Admin" ] }, "Action": [ "kms:Create*", "kms:Describe*", "kms:Enable*", "kms:List*", "kms:Put*", "kms:Update*", "kms:Revoke*", "kms:Disable*", "kms:Get*", "kms:Delete*", "kms:TagResource", "kms:UntagResource", "kms:ScheduleKeyDeletion", "kms:CancelKeyDeletion" ], "Resource": "*" }, { "Sid": "Allow use of the key", "Effect": "Allow", "Principal": { "AWS": [ "arn:aws:iam::111122223333:role/DMS-S3-endpoint-access-role", "arn:aws:iam::111122223333:role/Admin", "arn:aws:iam::111122223333:role/User1" ] }, "Action": [ "kms:Encrypt", "kms:Decrypt", "kms:ReEncrypt*", "kms:GenerateDataKey*", "kms:DescribeKey" ], "Resource": "*" }, { "Sid": "Allow attachment of persistent resources", "Effect": "Allow", "Principal": { "AWS": [ "arn:aws:iam::111122223333:role/DMS-S3-endpoint-access-role", "arn:aws:iam::111122223333:role/Admin", "arn:aws:iam::111122223333:role/User1" ] }, "Action": [ "kms:CreateGrant", "kms:ListGrants", "kms:RevokeGrant" ], "Resource": "*", "Condition": { "Bool": { "kms:GrantIsForAWSResource": true } } } ]
  13. Choose Finish. The Encryption keys page opens with a message indicating that your KMS key has been created.

You have now created a new KMS key with a specified alias (for example, DMS-S3-endpoint-encryption-key). This key enables AWS DMS to encrypt Amazon S3 target objects.

Using date-based folder partitioning

AWS DMS supports S3 folder partitions based on a transaction commit date when you use Amazon S3 as your target endpoint. Using date-based folder partitioning, you can write data from a single source table to a time-hierarchy folder structure in an S3 bucket. By partitioning folders when creating an S3 target endpoint, you can do the following:

  • Better manage your S3 objects

  • Limit the size of each S3 folder

  • Optimize data lake queries or other subsequent operations

You can enable date-based folder partitioning when you create an S3 target endpoint. You can enable it when you either migrate existing data and replicate ongoing changes (full load + CDC), or replicate data changes only (CDC only). Use the following target endpoint settings:

  • DatePartitionEnabled – Specifies partitioning based on dates. Set this Boolean option to true to partition S3 bucket folders based on transaction commit dates.

    You can't use this setting with PreserveTransactions or CdcPath.

    The default value is false.

  • DatePartitionSequence – Identifies the sequence of the date format to use during folder partitioning. Set this ENUM option to YYYYMMDD, YYYYMMDDHH, YYYYMM, MMYYYYDD, or DDMMYYYY. The default value is YYYYMMDD. Use this setting when DatePartitionEnabled is set to true.

  • DatePartitionDelimiter – Specifies a date separation delimiter to use during folder partitioning. Set this ENUM option to SLASH, DASH, UNDERSCORE, or NONE. The default value is SLASH. Use this setting when DatePartitionEnabled is set to true.

The following example shows how to enable date-based folder partitioning, with default values for the data partition sequence and the delimiter. It uses the --s3-settings '{json-settings}' option of the AWS CLI.create-endpoint command.

--s3-settings '{"DatePartitionEnabled": true,"DatePartitionSequence": "YYYYMMDD","DatePartitionDelimiter": "SLASH"}'

Parallel load of partitioned sources when using Amazon S3 as a target for AWS DMS

You can configure a parallel full load of partitioned data sources to Amazon S3 targets. This approach improves the load times for migrating partitioned data from supported source database engines to the S3 target. To improve the load times of partitioned source data, you create S3 target subfolders mapped to the partitions of every table in the source database. These partition-bound subfolders allow AWS DMS to run parallel processes to populate each subfolder on the target.

To configure a parallel full load of an S3 target, S3 supports three parallel-load rule types for the table-settings rule of table mapping:

  • partitions-auto

  • partitions-list

  • ranges

For more information on these parallel-load rule types, see Table and collection settings rules and operations.

For the partitions-auto and partitions-list rule types, AWS DMS uses each partition name from the source endpoint to identify the target subfolder structure, as follows.

bucket_name/bucket_folder/database_schema_name/table_name/partition_name/LOADseq_num.csv

Here, the subfolder path where data is migrated and stored on the S3 target includes an additional partition_name subfolder that corresponds to a source partition with the same name. This partition_name subfolder then stores one or more LOADseq_num.csv files containing data migrated from the specified source partition. Here, seq_num is the sequence number postfix on the .csv file name, such as 00000001 in the .csv file with the name, LOAD00000001.csv.

However, some database engines, such as MongoDB and DocumentDB, don't have the concept of partitions. For these database engines, AWS DMS adds the running source segment index as a prefix to the target .csv file name, as follows.

.../database_schema_name/table_name/SEGMENT1_LOAD00000001.csv .../database_schema_name/table_name/SEGMENT1_LOAD00000002.csv ... .../database_schema_name/table_name/SEGMENT2_LOAD00000009.csv .../database_schema_name/table_name/SEGMENT3_LOAD0000000A.csv

Here, the files SEGMENT1_LOAD00000001.csv and SEGMENT1_LOAD00000002.csv are named with the same running source segment index prefix, SEGMENT1. They're named as so because the migrated source data for these two .csv files is associated with the same running source segment index. On the other hand, the migrated data stored in each of the target SEGMENT2_LOAD00000009.csv and SEGMENT3_LOAD0000000A.csv files is associated with different running source segment indexes. Each file has its file name prefixed with the name of its running segment index, SEGMENT2 and SEGMENT3.

For the ranges parallel-load type, you define the column names and column values using the columns and boundaries settings of the table-settings rules. With these rules, you can specify partitions corresponding to segment names, as follows.

"parallel-load": { "type": "ranges", "columns": [ "region", "sale" ], "boundaries": [ [ "NORTH", "1000" ], [ "WEST", "3000" ] ], "segment-names": [ "custom_segment1", "custom_segment2", "custom_segment3" ] }

Here, the segment-names setting defines names for three partitions to migrate data in parallel on the S3 target. The migrated data is parallel-loaded and stored in .csv files under the partition subfolders in order, as follows.

.../database_schema_name/table_name/custom_segment1/LOAD[00000001...].csv .../database_schema_name/table_name/custom_segment2/LOAD[00000001...].csv .../database_schema_name/table_name/custom_segment3/LOAD[00000001...].csv

Here, AWS DMS stores a series of .csv files in each of the three partition subfolders. The series of .csv files in each partition subfolder is named incrementally starting from LOAD00000001.csv until all the data is migrated.

In some cases, you might not explicitly name partition subfolders for a ranges parallel-load type using the segment-names setting. In these case, AWS DMS applies the default of creating each series of .csv files under its table_name subfolder. Here, AWS DMS prefixes the file names of each series of .csv files with the name of the running source segment index, as follows.

.../database_schema_name/table_name/SEGMENT1_LOAD[00000001...].csv .../database_schema_name/table_name/SEGMENT2_LOAD[00000001...].csv .../database_schema_name/table_name/SEGMENT3_LOAD[00000001...].csv ... .../database_schema_name/table_name/SEGMENTZ_LOAD[00000001...].csv

Endpoint settings when using Amazon S3 as a target for AWS DMS

You can use endpoint settings to configure your Amazon S3 target database similar to using extra connection attributes. You specify the settings when you create the target endpoint using the AWS DMS console, or by using the create-endpoint command in the AWS CLI, with the --s3-settings '{"EndpointSetting": "value", ...}' JSON syntax.

The following table shows the endpoint settings that you can use with Amazon S3 as a target.

Option Description
CsvNullValue

An optional parameter that specifies how AWS DMS treats null values. While handling the null value, you can use this parameter to pass a user-defined string as null when writing to the target. For example, when target columns are nullable, you can use this option to differentiate between the empty string value and the null value. So, if you set this parameter value to the empty string (" " or ''), AWS DMS treats the empty string as the null value instead of NULL.

Default value: NULL

Valid values: any valid string

Example: --s3-settings '{"CsvNullValue": " "}'

AddColumnName

An optional parameter that when set to true or y you can use to add column name information to the .csv output file.

You can't use this parameter with PreserveTransactions or CdcPath.

Default value: false

Valid values: true, false, y, n

Example: --s3-settings '{"AddColumnName": true}'

AddTrailingPaddingCharacter

Use the S3 target endpoint setting AddTrailingPaddingCharacter to add padding on string data. The default value is false.

Type: Boolean

Example: --s3-settings '{"AddTrailingPaddingCharacter": true}'

BucketFolder

An optional parameter to set a folder name in the S3 bucket. If provided, target objects are created as .csv or .parquet files in the path BucketFolder/schema_name/table_name/. If this parameter isn't specified, then the path used is schema_name/table_name/.

Example: --s3-settings '{"BucketFolder": "testFolder"}'

BucketName

The name of the S3 bucket where S3 target objects are created as .csv or .parquet files.

Example: --s3-settings '{"BucketName": "buckettest"}'

CannedAclForObjects

A value that enables AWS DMS to specify a predefined (canned) access control list for objects created in the S3 bucket as .csv or .parquet files. For more information about Amazon S3 canned ACLs, see Canned ACL in the Amazon S3 Developer Guide.

Default value: NONE

Valid values for this attribute are: NONE; PRIVATE; PUBLIC_READ; PUBLIC_READ_WRITE; AUTHENTICATED_READ; AWS_EXEC_READ; BUCKET_OWNER_READ; BUCKET_OWNER_FULL_CONTROL.

Example: --s3-settings '{"CannedAclForObjects": "PUBLIC_READ"}'

CdcInsertsOnly

An optional parameter during a change data capture (CDC) load to write only INSERT operations to the comma-separated value (.csv) or columnar storage (.parquet) output files. By default (the false setting), the first field in a .csv or .parquet record contains the letter I (INSERT), U (UPDATE), or D (DELETE). This letter indicates whether the row was inserted, updated, or deleted at the source database for a CDC load to the target. If cdcInsertsOnly is set to true or y, only INSERTs from the source database are migrated to the .csv or .parquet file.

For .csv format only, how these INSERTS are recorded depends on the value of IncludeOpForFullLoad. If IncludeOpForFullLoad is set to true, the first field of every CDC record is set to I to indicate the INSERT operation at the source. If IncludeOpForFullLoad is set to false, every CDC record is written without a first field to indicate the INSERT operation at the source. For more information about how these parameters work together, see Indicating source DB operations in migrated S3 data.

Default value: false

Valid values: true, false, y, n

Example: --s3-settings '{"CdcInsertsOnly": true}'

CdcInsertsAndUpdates

Enables a change data capture (CDC) load to write INSERT and UPDATE operations to .csv or .parquet (columnar storage) output files. The default setting is false, but when cdcInsertsAndUpdates is set to true or y, INSERTs and UPDATEs from the source database are migrated to the .csv or .parquet file.

For .csv file format only, how these INSERTs and UPDATEs are recorded depends on the value of the includeOpForFullLoad parameter. If includeOpForFullLoad is set to true, the first field of every CDC record is set to either I or U to indicate INSERT and UPDATE operations at the source. But if includeOpForFullLoad is set to false, CDC records are written without an indication of INSERT or UPDATE operations at the source.

For more information about how these parameters work together, see Indicating source DB operations in migrated S3 data.

Note

CdcInsertsOnly and cdcInsertsAndUpdates can't both be set to true for the same endpoint. Set either cdcInsertsOnly or cdcInsertsAndUpdates to true for the same endpoint, but not both.

Default value: false

Valid values: true, false, y, n

Example: --s3-settings '{"CdcInsertsAndUpdates": true}'

CdcPath

Specifies the folder path of CDC files. For an S3 source, this setting is required if a task captures change data; otherwise, it's optional. If CdcPath is set, DMS reads CDC files from this path and replicates the data changes to the target endpoint. For an S3 target if you set PreserveTransactions to true, DMS verifies that you have set this parameter to a folder path on your S3 target where DMS can save the transaction order for the CDC load. DMS creates this CDC folder path in either your S3 target working directory or the S3 target location specified by BucketFolder and BucketName.

You can't use this parameter with DatePartitionEnabled or AddColumnName.

Type: String

For example, if you specify CdcPath as MyChangedData, and you specify BucketName as MyTargetBucket but do not specify BucketFolder, DMS creates the following CDC folder path: MyTargetBucket/MyChangedData.

If you specify the same CdcPath, and you specify BucketName as MyTargetBucket and BucketFolder as MyTargetData, DMS creates the following CDC folder path: MyTargetBucket/MyTargetData/MyChangedData.

Note

This setting is supported in AWS DMS versions 3.4.2 and higher.

When capturing data changes in transaction order, DMS always stores the row changes in .csv files regardless of the value of the DataFormat S3 setting on the target. DMS doesn't save data changes in transaction order using .parquet files.

CdcMaxBatchInterval

Maximum interval length condition, defined in seconds, to output a file to Amazon S3.

Default Value: 60 seconds

When CdcMaxBatchInterval is specified and CdcMinFileSize is specified, the file write is triggered by whichever parameter condition is met first.

CdcMinFileSize

Minimum file size condition as defined in kilobytes to output a file to Amazon S3.

Default Value: 32000 KB

When CdcMinFileSize is specified and CdcMaxBatchInterval is specified, the file write is triggered by whichever parameter condition is met first.

PreserveTransactions

If set to true, DMS saves the transaction order for change data capture (CDC) on the Amazon S3 target specified by CdcPath.

You can't use this parameter with DatePartitionEnabled or AddColumnName.

Type: Boolean

When capturing data changes in transaction order, DMS always stores the row changes in .csv files regardless of the value of the DataFormat S3 setting on the target. DMS doesn't save data changes in transaction order using .parquet files.

Note

This setting is supported in AWS DMS versions 3.4.2 and higher.

IncludeOpForFullLoad

An optional parameter during a full load to write the INSERT operations to the comma-separated value (.csv) output files only.

For full load, records can only be inserted. By default (the false setting), there is no information recorded in these output files for a full load to indicate that the rows were inserted at the source database. If IncludeOpForFullLoad is set to true or y, the INSERT is recorded as an I annotation in the first field of the .csv file.

Note

This parameter works together with CdcInsertsOnly or CdcInsertsAndUpdates for output to .csv files only. For more information about how these parameters work together, see Indicating source DB operations in migrated S3 data.

Default value: false

Valid values: true, false, y, n

Example: --s3-settings '{"IncludeOpForFullLoad": true}'

CompressionType

An optional parameter when set to GZIP uses GZIP to compress the target .csv or .parquet files. When this parameter is set to the default, it leaves the files uncompressed.

Default value: NONE

Valid values: GZIP or NONE

Example: --s3-settings '{"CompressionType": "GZIP"}'

CsvDelimiter

The delimiter used to separate columns in .csv source files. The default is a comma (,).

Example: --s3-settings '{"CsvDelimiter": ","}'

CsvRowDelimiter

The delimiter used to separate rows in the .csv source files. The default is a newline (\n).

Example: --s3-settings '{"CsvRowDelimiter": "\n"}'

MaxFileSize

A value that specifies the maximum size (in KB) of any .csv file to be created while migrating to an S3 target during full load.

Default value: 1,048,576 KB (1 GB)

Valid values: 1–1,048,576

Example: --s3-settings '{"MaxFileSize": 512}'

Rfc4180

An optional parameter used to set behavior to comply with RFC for data migrated to Amazon S3 using .csv file format only. When this value is set to true or y using Amazon S3 as a target, if the data has quotation marks, commas, or newline characters in it, AWS DMS encloses the entire column with an additional pair of double quotation marks ("). Every quotation mark within the data is repeated twice. This formatting complies with RFC 4180.

Default value: true

Valid values: true, false, y, n

Example: --s3-settings '{"Rfc4180": false}'

EncryptionMode

The server-side encryption mode that you want to encrypt your .csv or .parquet object files copied to S3. The valid values are SSE_S3 (S3 server-side encryption) or SSE_KMS (KMS key encryption). If you choose SSE_KMS, set the ServerSideEncryptionKmsKeyId parameter to the Amazon Resource Name (ARN) for the KMS key to be used for encryption.

Note

You can also use the CLI modify-endpoint command to change the value of the EncryptionMode attribute for an existing endpoint from SSE_KMS to SSE_S3. But you can’t change the EncryptionMode value from SSE_S3 to SSE_KMS.

Default value: SSE_S3

Valid values: SSE_S3 or SSE_KMS

Example: --s3-settings '{"EncryptionMode": SSE_S3}'

ServerSideEncryptionKmsKeyId

If you set EncryptionMode to SSE_KMS, set this parameter to the Amazon Resource Name (ARN) for the KMS key. You can find this ARN by selecting the key alias in the list of AWS KMS keys created for your account. When you create the key, you must associate specific policies and roles associated with this KMS key. For more information, see Creating AWS KMS keys to encrypt Amazon S3 target objects.

Example: --s3-settings '{"ServerSideEncryptionKmsKeyId":"arn:aws:kms:us-east-1:111122223333:key/11a1a1a1-aaaa-9999-abab-2bbbbbb222a2"}'

DataFormat

The output format for the files that AWS DMS uses to create S3 objects. For Amazon S3 targets, AWS DMS supports either .csv or .parquet files. The .parquet files have a binary columnar storage format with efficient compression options and faster query performance. For more information about .parquet files, see https://parquet.apache.org/.

Default value: csv

Valid values: csv or parquet

Example: --s3-settings '{"DataFormat": "parquet"}'

EncodingType

The Parquet encoding type. The encoding type options include the following:

  • rle-dictionary – This dictionary encoding uses a combination of bit-packing and run-length encoding to more efficiently store repeating values.

  • plain – No encoding.

  • plain-dictionary – This dictionary encoding builds a dictionary of values encountered in a given column. The dictionary is stored in a dictionary page for each column chunk.

Default value: rle-dictionary

Valid values: rle-dictionary, plain, or plain-dictionary

Example: --s3-settings '{"EncodingType": "plain-dictionary"}'

DictPageSizeLimit

The maximum allowed size, in bytes, for a dictionary page in a .parquet file. If a dictionary page exceeds this value, the page uses plain encoding.

Default value: 1,024,000 (1 MB)

Valid values: Any valid integer value

Example: --s3-settings '{"DictPageSizeLimit": 2,048,000}'

RowGroupLength

The number of rows in one row group of a .parquet file.

Default value: 10,024 (10 KB)

Valid values: Any valid integer

Example: --s3-settings '{"RowGroupLength": 20,048}'

DataPageSize

The maximum allowed size, in bytes, for a data page in a .parquet file.

Default value: 1,024,000 (1 MB)

Valid values: Any valid integer

Example: --s3-settings '{"DataPageSize": 2,048,000}'

ParquetVersion

The version of the .parquet file format.

Default value: PARQUET_1_0

Valid values: PARQUET_1_0 or PARQUET_2_0

Example: --s3-settings '{"ParquetVersion": "PARQUET_2_0"}'

EnableStatistics

Set to true or y to enable statistics about .parquet file pages and row groups.

Default value: true

Valid values: true, false, y, n

Example: --s3-settings '{"EnableStatistics": false}'

TimestampColumnName

An optional parameter to include a timestamp column in the S3 target endpoint data.

AWS DMS includes an additional STRING column in the .csv or .parquet object files of your migrated data when you set TimestampColumnName to a non blank value.

For a full load, each row of this timestamp column contains a timestamp for when the data was transferred from the source to the target by DMS.

For a CDC load, each row of the timestamp column contains the timestamp for the commit of that row in the source database.

The string format for this timestamp column value is yyyy-MM-dd HH:mm:ss.SSSSSS. By default, the precision of this value is in microseconds. For a CDC load, the rounding of the precision depends on the commit timestamp supported by DMS for the source database.

When the AddColumnName parameter is set to true, DMS also includes the name for the timestamp column that you set as the non blank value of TimestampColumnName.

Example: --s3-settings '{"TimestampColumnName": "TIMESTAMP"}'

UseTaskStartTimeForFullLoadTimestamp

When set to true, this parameter uses the task start time as the timestamp column value instead of the time data is written to target. For full load, when UseTaskStartTimeForFullLoadTimestamp is set to true, each row of the timestamp column contains the task start time. For CDC loads, each row of the timestamp column contains the transaction commit time.

When UseTaskStartTimeForFullLoadTimestamp is set to false, the full load timestamp in the timestamp column increments with the time data arrives at the target.

Default value: false

Valid values: true, false

Example: --s3-settings '{"UseTaskStartTimeForFullLoadTimestamp": true}'

UseTaskStartTimeForFullLoadTimestamp: true helps make the S3 target TimestampColumnName for a full load sortable with TimestampColumnName for a CDC load.

ParquetTimestampInMillisecond

An optional parameter that specifies the precision of any TIMESTAMP column values written to an S3 object file in .parquet format.

When this attribute is set to true or y, AWS DMS writes all TIMESTAMP columns in a .parquet formatted file with millisecond precision. Otherwise, DMS writes them with microsecond precision.

Currently, Amazon Athena and AWS Glue can handle only millisecond precision for TIMESTAMP values. Set this attribute to true for .parquet formatted S3 endpoint object files only if you plan to query or process the data with Athena or AWS Glue.

Note
  • AWS DMS writes any TIMESTAMP column values written to an S3 file in .csv format with microsecond precision.

  • The setting of this attribute has no effect on the string format of the timestamp column value inserted by setting the TimestampColumnName attribute.

Default value: false

Valid values: true, false, y, n

Example: --s3-settings '{"ParquetTimestampInMillisecond": true}'

GlueCatalogGeneration

To generate an AWS Glue Data Catalog, set this endpoint setting to true.

Default value: false

Valid values: true, false,

Example: --s3-settings '{"GlueCatalogGeneration": true}'

Note: Don't use GlueCatalogGeneration with PreserveTransactions and CdcPath.

Using AWS Glue Data Catalog with an Amazon S3 target for AWS DMS

AWS Glue is a service that provides simple ways to categorize data, and consists of a metadata repository known as AWS Glue Data Catalog. You can integrate AWS Glue Data Catalog with your Amazon S3 target endpoint and query Amazon S3 data through other AWS services such as Amazon Athena. Amazon Redshift works with AWS Glue but AWS DMS doesn't support that as a pre-built option.

To generate the data catalog, set the GlueCatalogGeneration endpoint setting to true, as shown in the following AWS CLI example.

aws dms create-endpoint --endpoint-identifier s3-target-endpoint --engine-name s3 --endpoint-type target--s3-settings '{"ServiceAccessRoleArn": "your-service-access-ARN", "BucketFolder": "your-bucket-folder", "BucketName": "your-bucket-name", "DataFormat": "parquet", "GlueCatalogGeneration": true}'

For a Full load replication task that includes csv type data, set IncludeOpForFullLoad to true.

Don't use GlueCatalogGeneration with PreserveTransactions and CdcPath. The AWS Glue crawler can't reconcile the different schemas of files stored under the specified CdcPath.

For Amazon Athena to index your Amazon S3 data, and for you to query your data using standard SQL queries through Amazon Athena, the IAM role attached to the endpoint must have the following policy:

{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "s3:GetBucketLocation", "s3:GetObject", "s3:ListBucket", "s3:ListBucketMultipartUploads", "s3:ListMultipartUploadParts", "s3:AbortMultipartUpload" ], "Resource": [ "arn:aws:s3:::bucket123", "arn:aws:s3:::bucket123/*" ] }, { "Effect": "Allow", "Action": [ "glue:CreateDatabase", "glue:GetDatabase", "glue:CreateTable", "glue:DeleteTable", "glue:UpdateTable", "glue:GetTable", "glue:BatchCreatePartition", "glue:CreatePartition", "glue:UpdatePartition", "glue:GetPartition", "glue:GetPartitions", "glue:BatchGetPartition" ], "Resource": [ "arn:aws:glue:*:111122223333:catalog", "arn:aws:glue:*:111122223333:database/*", "arn:aws:glue:*:111122223333:table/*" ] }, { "Effect": "Allow", "Action": [ "athena:StartQueryExecution", "athena:GetQueryExecution", "athena:CreateWorkGroup" ], "Resource": "arn:aws:athena:*:111122223333:workgroup/glue_catalog_generation_for_task_*" } ] }
References
  • For more information about AWS Glue, see Concepts in the AWS Glue Developer Guide .

  • For more information about AWS Glue Data Catalog see Components in the AWS Glue Developer Guide .

Using data encryption, parquet files, and CDC on your Amazon S3 target

You can use S3 target endpoint settings to configure the following:

  • A custom KMS key to encrypt your S3 target objects.

  • Parquet files as the storage format for S3 target objects.

  • Change data capture (CDC) including transaction order on the S3 target.

  • Integrate AWS Glue Data Catalog with your Amazon S3 target endpoint and query Amazon S3 data through other services such as Amazon Athena.

AWS KMS key settings for data encryption

The following examples show configuring a custom KMS key to encrypt your S3 target objects. To start, you might run the following create-endpoint CLI command.

aws dms create-endpoint --endpoint-identifier s3-target-endpoint --engine-name s3 --endpoint-type target --s3-settings '{"ServiceAccessRoleArn": "your-service-access-ARN", "CsvRowDelimiter": "\n", "CsvDelimiter": ",", "BucketFolder": "your-bucket-folder", "BucketName": "your-bucket-name", "EncryptionMode": "SSE_KMS", "ServerSideEncryptionKmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/72abb6fb-1e49-4ac1-9aed-c803dfcc0480"}'

Here, the JSON object specified by --s3-settings option defines two parameters. One is an EncryptionMode parameter with the value SSE_KMS. The other is an ServerSideEncryptionKmsKeyId parameter with the value of arn:aws:kms:us-east-1:111122223333:key/72abb6fb-1e49-4ac1-9aed-c803dfcc0480. This value is an Amazon Resource Name (ARN) for your custom KMS key. For an S3 target, you also specify additional settings. These identify the server access role, provide delimiters for the default CSV object storage format, and give the bucket location and name to store S3 target objects.

By default, S3 data encryption occurs using S3 server-side encryption. For the previous example's S3 target, this is also equivalent to specifying its endpoint settings as in the following example.

aws dms create-endpoint --endpoint-identifier s3-target-endpoint --engine-name s3 --endpoint-type target --s3-settings '{"ServiceAccessRoleArn": "your-service-access-ARN", "CsvRowDelimiter": "\n", "CsvDelimiter": ",", "BucketFolder": "your-bucket-folder", "BucketName": "your-bucket-name", "EncryptionMode": "SSE_S3"}'

For more information about working with S3 server-side encryption, see Protecting data using server-side encryption.

Note

You can also use the CLI modify-endpoint command to change the value of the EncryptionMode parameter for an existing endpoint from SSE_KMS to SSE_S3. But you can’t change the EncryptionMode value from SSE_S3 to SSE_KMS.

Settings for using .parquet files to store S3 target objects

The default format for creating S3 target objects is .csv files. The following examples show some endpoint settings for specifying .parquet files as the format for creating S3 target objects. You can specify the .parquet files format with all the defaults, as in the following example.

aws dms create-endpoint --endpoint-identifier s3-target-endpoint --engine-name s3 --endpoint-type target --s3-settings '{"ServiceAccessRoleArn": "your-service-access-ARN", "DataFormat": "parquet"}'

Here, the DataFormat parameter is set to parquet to enable the format with all the S3 defaults. These defaults include a dictionary encoding ("EncodingType: "rle-dictionary") that uses a combination of bit-packing and run-length encoding to more efficiently store repeating values.

You can add additional settings for options other than the defaults as in the following example.

aws dms create-endpoint --endpoint-identifier s3-target-endpoint --engine-name s3 --endpoint-type target --s3-settings '{"ServiceAccessRoleArn": "your-service-access-ARN", "BucketFolder": "your-bucket-folder", "BucketName": "your-bucket-name", "CompressionType": "GZIP", "DataFormat": "parquet", "EncodingType: "plain-dictionary", "DictPageSizeLimit": 3,072,000, "EnableStatistics": false }'

Here, in addition to parameters for several standard S3 bucket options and the DataFormat parameter, the following additional .parquet file parameters are set:

  • EncodingType – Set to a dictionary encoding (plain-dictionary) that stores values encountered in each column in a per-column chunk of the dictionary page.

  • DictPageSizeLimit – Set to a maximum dictionary page size of 3 MB.

  • EnableStatistics – Disables the default that enables the collection of statistics about Parquet file pages and row groups.

Capturing data changes (CDC) including transaction order on the S3 target

By default when AWS DMS runs a CDC task, it stores all the row changes logged in your source database (or databases) in one or more files for each table. Each set of files containing changes for the same table reside in a single target directory associated with that table. AWS DMS creates as many target directories as database tables migrated to the Amazon S3 target endpoint. The files are stored on the S3 target in these directories without regard to transaction order. For more information on the file naming conventions, data contents, and format, see Using Amazon S3 as a target for AWS Database Migration Service.

To capture source database changes in a manner that also captures the transaction order, you can specify S3 endpoint settings that direct AWS DMS to store the row changes for all database tables in one or more .csv files created depending on transaction size. These .csv transaction files contain all row changes listed sequentially in transaction order for all tables involved in each transaction. These transaction files reside together in a single transaction directory that you also specify on the S3 target. In each transaction file, the transaction operation and the identity of the database and source table for each row change is stored as part of the row data as follows.

operation,table_name,database_schema_name,field_value,...

Here, operation is the transaction operation on the changed row, table_name is the name of the database table where the row is changed, database_schema_name is the name of the database schema where the table resides, and field_value is the first of one or more field values that specify the data for the row.

The example following of a transaction file shows changed rows for one or more transactions that involve two tables.

I,Names_03cdcad11a,rdsTempsdb,13,Daniel U,Names_03cdcad11a,rdsTempsdb,23,Kathy D,Names_03cdcad11a,rdsTempsdb,13,Cathy I,Names_6d152ce62d,rdsTempsdb,15,Jane I,Names_6d152ce62d,rdsTempsdb,24,Chris I,Names_03cdcad11a,rdsTempsdb,16,Mike

Here, the transaction operation on each row is indicated by I (insert), U (update), or D (delete) in the first column. The table name is the second column value (for example, Names_03cdcad11a). The name of the database schema is the value of the third column (for example, rdsTempsdb). And the remaining columns are populated with your own row data (for example, 13,Daniel).

In addition, AWS DMS names the transaction files it creates on the Amazon S3 target using a time stamp according to the following naming convention.

CDC_TXN-timestamp.csv

Here, timestamp is the time when the transaction file was created, as in the following example.

CDC_TXN-20201117153046033.csv

This time stamp in the file name ensures that the transaction files are created and listed in transaction order when you list them in their transaction directory.

Note

When capturing data changes in transaction order, AWS DMS always stores the row changes in .csv files regardless of the value of the DataFormat S3 setting on the target. AWS DMS doesn't save data changes in transaction order using .parquet files.

To control the frequency of writes to an Amazon S3 target during a data replication task, you can configure the CdcMaxBatchInterval and CdcMinFileSize settings. This can result in better performance when analyzing the data without any additional overhead operations. For more information, see Endpoint settings when using Amazon S3 as a target for AWS DMS

To tell AWS DMS to store all row changes in transaction order
  1. Set the PreserveTransactions S3 setting on the target to true.

  2. Set the CdcPath S3 setting on the target to a relative folder path where you want AWS DMS to store the .csv transaction files.

    AWS DMS creates this path either under the default S3 target bucket and working directory or under the bucket and bucket folder that you specify using the BucketName and BucketFolder S3 settings on the target.

Indicating source DB operations in migrated S3 data

When AWS DMS migrates records to an S3 target, it can create an additional field in each migrated record. This additional field indicates the operation applied to the record at the source database. How AWS DMS creates and sets this first field depends on the migration task type and settings of includeOpForFullLoad, cdcInsertsOnly, and cdcInsertsAndUpdates.

For a full load when includeOpForFullLoad is true, AWS DMS always creates an additional first field in each .csv record. This field contains the letter I (INSERT) to indicate that the row was inserted at the source database. For a CDC load when cdcInsertsOnly is false (the default), AWS DMS also always creates an additional first field in each .csv or .parquet record. This field contains the letter I (INSERT), U (UPDATE), or D (DELETE) to indicate whether the row was inserted, updated, or deleted at the source database.

In the following table, you can see how the settings of the includeOpForFullLoad and cdcInsertsOnly attributes work together to affect the setting of migrated records.

With these parameter settings DMS sets target records as follows for .csv and .parquet output
includeOpForFullLoad cdcInsertsOnly For full load For CDC load
true true Added first field value set to I Added first field value set to I
false false No added field Added first field value set to I, U, or D
false true No added field No added field
true false Added first field value set to I Added first field value set to I, U, or D

When includeOpForFullLoad and cdcInsertsOnly are set to the same value, the target records are set according to the attribute that controls record settings for the current migration type. That attribute is includeOpForFullLoad for full load and cdcInsertsOnly for CDC load.

When includeOpForFullLoad and cdcInsertsOnly are set to different values, AWS DMS makes the target record settings consistent for both CDC and full load. It does this by making the record settings for a CDC load conform to the record settings for any earlier full load specified by includeOpForFullLoad.

In other words, suppose that a full load is set to add a first field to indicate an inserted record. In this case, a following CDC load is set to add a first field that indicates an inserted, updated, or deleted record as appropriate at the source. In contrast, suppose that a full load is set to not add a first field to indicate an inserted record. In this case, a CDC load is also set to not add a first field to each record regardless of its corresponding record operation at the source.

Similarly, how DMS creates and sets an additional first field depends on the settings of includeOpForFullLoad and cdcInsertsAndUpdates. In the following table, you can see how the settings of the includeOpForFullLoad and cdcInsertsAndUpdates attributes work together to affect the setting of migrated records in this format.

With these parameter settings DMS sets target records as follows for .csv output
includeOpForFullLoad cdcInsertsAndUpdates For full load For CDC load
true true Added first field value set to I Added first field value set to I or U
false false No added field Added first field value set to I, U, or D
false true No added field Added first field value set to I or U
true false Added first field value set to I Added first field value set to I, U, or D

Target data types for S3 Parquet

The following table shows the Parquet target data types that are supported when using AWS DMS and the default mapping from AWS DMS data types.

For additional information about AWS DMS data types, see Data types for AWS Database Migration Service.

AWS DMS data type

S3 parquet data type

BYTES BINARY
DATE DATE32
TIME TIME32
DATETIME TIMESTAMP
INT1 INT8
INT2 INT16
INT4 INT32
INT8 INT64
NUMERIC DECIMAL
REAL4 FLOAT
REAL8 DOUBLE
STRING STRING
UINT1 UINT8
UINT2 UINT16
UINT4 UINT32
UINT8 UINT64
WSTRING STRING
BLOB BINARY
NCLOB STRING
CLOB STRING
BOOLEAN BOOL