
Preparing Amazon Redshift as a target for AWS DMS

For the purposes of this whitepaper, assume that Amazon Redshift is the target for AWS DMS when migrating data from any supported source database. For the complete list of supported targets, refer to Targets for AWS DMS.

As mentioned before, the Amazon Redshift cluster must be in the same AWS account and the same AWS Region as the replication instance. This is because, during the data migration, AWS DMS copies data in the form of .csv files to an Amazon S3 bucket in your account before moving it to the tables in the Amazon Redshift data warehouse. For more information, refer to Prerequisites for using an Amazon Redshift database as a target for AWS Database Migration Service.

During the CDC phase of a task, AWS DMS uses a single thread to read change data from the source and apply changes to the target. Because the thread can handle only a certain number of transactions at a time, depending on the rate of changes on the source, the thread sometimes can't keep the target in sync with the source. This happens more often when Amazon Redshift is the target for AWS DMS because OLAP engines perform batch transactions more efficiently. Amazon Redshift is optimized to run complex analytic queries and aggregations on large data sets, so its performance can suffer when transactional changes from an OLTP database are applied one by one. You may therefore see high target latency while AWS DMS applies transactions in one-by-one mode.

Also, the level of parallelism applied depends on the ratio between the total batch size and the maximum file size used to transfer data. When using multithreaded CDC task settings with an Amazon Redshift target, you gain the most benefit when the batch size is large relative to the maximum file size.

Based on the information above, you can use the following extra connection attributes and task settings to tune for optimal CDC performance.

// Amazon Redshift endpoint settings
maxFileSize=250000;dateFormat=auto;timeFormat=auto;acceptAnyDate=true;fileTransferUploadStreams=20;compUpdate=false

maxFileSize (default value 32,000 KB, or 32 MB) specifies the maximum size (in KB) of any .csv file used to transfer data to Amazon Redshift. dateFormat defaults to “YYYY-MM-DD”; setting it to “auto” lets Amazon Redshift recognize several different date formats. Similarly, setting timeFormat to “auto” recognizes several different time formats. acceptAnyDate=true allows rows with invalid date values to load as NULL rather than fail. fileTransferUploadStreams specifies the number of threads used to upload a single file. compUpdate=false disables automatic compression analysis, so existing column encodings aren't changed.
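For illustration, the following AWS CLI sketch creates a target endpoint with these attributes. All identifiers, host names, and credentials are placeholders, not values from this whitepaper:

# Minimal sketch: create an Amazon Redshift target endpoint with the
# tuned extra connection attributes (all names and credentials are placeholders)
aws dms create-endpoint \
    --endpoint-identifier redshift-target \
    --endpoint-type target \
    --engine-name redshift \
    --server-name my-cluster.abc123.us-east-1.redshift.amazonaws.com \
    --port 5439 \
    --database-name mydb \
    --username awsuser \
    --password '<password>' \
    --extra-connection-attributes "maxFileSize=250000;dateFormat=auto;timeFormat=auto;acceptAnyDate=true;fileTransferUploadStreams=20;compUpdate=false"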

// AWS DMS task settings
BatchApplyEnabled = true
BatchSplitSize = 0
BatchApplyTimeoutMax = 1800
BatchApplyTimeoutMin = 1800
BatchApplyMemoryLimit = 1024
ParallelApplyThreads = 32
ParallelApplyBufferSize = 100

Regarding the batch apply task settings, you enable batch apply mode and set how long AWS DMS waits between each application of batch changes using the BatchApplyTimeoutMin and BatchApplyTimeoutMax settings. It is important to determine the replication latency requirements and set these values accordingly; set them to a lower value if you need the data to be refreshed more often than once every 30 minutes (1800 seconds). You also configure BatchSplitSize as 0, which means there is no limit on the maximum number of changes applied in a single batch. Increasing BatchApplyMemoryLimit to 1024 MB allows more records to be processed in each commit.

ParallelApplyThreads specifies the number of concurrent threads that AWS DMS uses during a CDC load to push data records to the Amazon Redshift target endpoint. ParallelApplyBufferSize specifies the maximum number of data records to store in the buffer that the parallel apply threads use when pushing to the Amazon Redshift target.
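In the task settings JSON, the batch apply switch and the parallel apply settings live under TargetMetadata, and the batch tuning values under ChangeProcessingTuning. The following AWS CLI sketch applies them to an existing task; the task ARN is a placeholder:

# Minimal sketch: apply the batch apply and parallel apply settings
# to an existing replication task (the ARN is a placeholder)
aws dms modify-replication-task \
    --replication-task-arn arn:aws:dms:us-east-1:123456789012:task:EXAMPLE \
    --replication-task-settings '{
      "TargetMetadata": {
        "BatchApplyEnabled": true,
        "ParallelApplyThreads": 32,
        "ParallelApplyBufferSize": 100
      },
      "ChangeProcessingTuning": {
        "BatchApplyTimeoutMin": 1800,
        "BatchApplyTimeoutMax": 1800,
        "BatchApplyMemoryLimit": 1024,
        "BatchSplitSize": 0
      }
    }'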

Using the settings above, customers with heavy transactional workloads can benefit from an unlimited batch buffer that fills up as much as possible within 1800 seconds, applied using 32 parallel threads with a 250 MB maximum file size.

Regarding the memory allocation guidelines for CDC, when ParallelApplyThreads > 0, configure the memory parameters based on the maxFileSize value mentioned previously. For example, if you are using ten parallel apply threads with maxFileSize at its default, AWS DMS generates ten .csv files of 32 MB each in memory, so memory consumption is 32 MB * 10 = 320 MB. Therefore, based on the available memory on the DMS replication instance, it is a good idea to allocate a higher value for BatchApplyMemoryLimit in the batch settings (at least 320 MB in this example) so that all ten threads can create full 32 MB files in memory rather than on disk, thereby improving performance.

To speed up the full load process, apart from creating table segments that can be loaded in parallel as mentioned in the preceding section, you can also define ParallelLoadThreads, which specifies the number of concurrent threads that AWS DMS uses during a full load to push data records to the Amazon Redshift target endpoint. When using parallel load, the memory requirement changes depending on the number of parallel threads configured. The formula below can be useful for computing the amount of memory required by the DMS replication instance.

Minimum memory consumed:

MaxFullLoadSubTasks * ParallelLoadThreads * maxFileSize

Recommendation for DMS replication instance memory:

2 * Minimum memory consumed

For example:

8 (MaxFullLoadSubTasks) * 32 (ParallelLoadThreads) * 250 MB = 64 GB (minimum)
2 * 64 GB = 128 GB (recommended)

When eight threads are used to pull data from the source (MaxFullLoadSubTasks) and 32 threads per subtask on the DMS end (ParallelLoadThreads) are configured to create one .csv file per thread for loading into Amazon Redshift, with the size of each .csv file set to 250 MB (maxFileSize), the minimum memory consumed by the DMS replication instance is 64 GB. It is recommended to select a DMS instance with 128 GB of memory in this scenario.
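For reference, MaxFullLoadSubTasks is set under FullLoadSettings and ParallelLoadThreads under TargetMetadata in the task settings JSON, while maxFileSize is the endpoint attribute shown earlier. A sketch of the task settings fragment for this sizing example:

// Task settings fragment for the full load sizing example
{
  "FullLoadSettings": {
    "MaxFullLoadSubTasks": 8
  },
  "TargetMetadata": {
    "ParallelLoadThreads": 32
  }
}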

If memory on the replication instance is not sufficient, you can set enableParallelBatchInMemoryCSVFiles to false so that the files are written to disk. Note that if disk is used, there may be a performance drop of about 20% compared to using memory.

The other aspect that can improve the performance of data loading into Amazon Redshift is the resource utilization of the Amazon Redshift cluster itself. Monitor the memory and CPU utilization metrics on the cluster, and identify outlier disk-spill queries, which cause concurrent ETL queries to slow down. You can query the system tables to identify Workload Management (WLM) queues where the disk-spill metric (query_temp_blocks_to_disk in the SVL_QUERY_METRICS_SUMMARY system view) is high, indicating that intermediate results are being written to disk due to insufficient memory. When most performance bottlenecks on the cluster are disk related, reducing commit overhead and paging to disk by rebalancing WLM gives the cluster a throughput increase. You can use automatic WLM instead of manual WLM to ensure that memory is allocated automatically to the Amazon Redshift queues based on usage.
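As a starting point, a query along the following lines surfaces disk-spilling queries per WLM queue (service class). The view and columns come from the Amazon Redshift system tables mentioned above; the threshold is purely illustrative:

-- Minimal sketch: find queries spilling intermediate results to disk,
-- grouped by WLM queue (service_class); values are reported in MB
SELECT service_class,
       query,
       query_temp_blocks_to_disk AS temp_spill_mb
FROM svl_query_metrics_summary
WHERE query_temp_blocks_to_disk > 1000   -- illustrative threshold
ORDER BY query_temp_blocks_to_disk DESC
LIMIT 20;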


A single Amazon Redshift cluster having multiple workloads

If the Amazon Redshift cluster is shared by multiple workloads and this is affecting AWS DMS performance, it's a good idea to separate out the workloads by using the Amazon Redshift data sharing functionality, which lets a consumer cluster query the producer's live data without copying it.


Separating workloads and improving Amazon Redshift performance using data sharing functionality
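A minimal sketch of that data sharing setup, assuming the DMS/ETL workload writes to the public schema on the producer cluster; the share name and namespace GUIDs are placeholders:

-- On the producer cluster: share the schema the DMS workload writes to
CREATE DATASHARE dms_share;
ALTER DATASHARE dms_share ADD SCHEMA public;
ALTER DATASHARE dms_share ADD ALL TABLES IN SCHEMA public;
GRANT USAGE ON DATASHARE dms_share TO NAMESPACE '<consumer-namespace-guid>';

-- On the consumer cluster: expose the shared data as a local database
CREATE DATABASE dms_share_db FROM DATASHARE dms_share OF NAMESPACE '<producer-namespace-guid>';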

Another aspect to consider is that when you load data into a table, Amazon Redshift distributes the rows of the table to each of the compute nodes according to the table's distribution style. For large tables with many updates and deletes, it's important to choose a distribution key that matches the primary key column with the highest cardinality, to make those operations less resource intensive.

AWS recommends pre-creating the target tables on Amazon Redshift with the right distribution keys (see the example definition after the following list). If you allow AWS DMS to create the target tables, it does not currently choose a distribution key. In particular:

  • Ensure the target tables have primary keys; AWS DMS uses them to apply updates and deletes during CDC.

  • Ensure you have distribution keys created for the target tables. Refer to Choose the best distribution style for more information.
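For example, the following sketch pre-creates a hypothetical orders table with its primary key as the distribution key and a sort key on the change timestamp (all table and column names are illustrative):

-- Minimal sketch: pre-create the target table with an explicit primary key
-- and distribution key before starting the AWS DMS task
CREATE TABLE public.orders (
    order_id     BIGINT NOT NULL,
    customer_id  BIGINT,
    order_status VARCHAR(20),
    updated_at   TIMESTAMP,
    PRIMARY KEY (order_id)
)
DISTSTYLE KEY
DISTKEY (order_id)
SORTKEY (updated_at);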