Preparing Amazon Redshift as a target for AWS DMS
For the purpose of this whitepaper, assume Amazon Redshift is the target for the AWS DMS service when migrating data from any supported source database. For the complete list of AWS DMS targets, refer to Targets for AWS DMS.
As mentioned before, the Amazon Redshift cluster must be in the same AWS account and the same AWS Region as the replication instance. This is because, during the data migration, AWS DMS copies data in the form of .csv files to an Amazon S3 bucket in your account before moving it to the tables in the Amazon Redshift data warehouse. For more information, refer to Prerequisites for using an Amazon Redshift database as a target for AWS Database Migration Service.
During the CDC phase of a task, AWS DMS uses a single thread to read change data from the source and apply changes to the target. Because the thread can handle only a certain number of transactions at a time, depending on the rate of changes on the source, the thread sometimes can't keep the target in sync with the source. This happens more often when Amazon Redshift is the target for AWS DMS because batch transactions are performed more efficiently in OLAP engines. Amazon Redshift is optimized to run complex analytic queries and aggregations on large data sets, so its performance can suffer when transactional changes from an OLTP database are applied one by one. You may see high target latency during the time that AWS DMS applies transactions in this one-by-one mode.
Also, the level of parallelism applied depends on the correlation between the total batch size and the maximum file size used to transfer data. When using multithreaded CDC task settings with an Amazon Redshift target, you gain the most benefit when the batch size is large relative to the maximum file size.
Based on the information above, you can use the following extra connection attributes and task settings to tune for optimal CDC performance.
// Redshift endpoint settings
maxFileSize=250000;dateFormat=auto;timeFormat=auto;acceptAnyDate=true;fileTransferUploadStreams=20;compUpdate=false
maxFileSize (default value 32 MB) specifies the maximum size (in KB) of any .csv file used to transfer data to Amazon Redshift. dateFormat defaults to “YYYY-MM-DD”; setting it to “auto” lets Amazon Redshift recognize several different date formats. Similarly, setting timeFormat to “auto” recognizes several different time formats. fileTransferUploadStreams specifies the number of threads used to upload a single file. compUpdate=false disables automatic compression analysis, so existing column encodings aren't changed.
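As a concrete illustration, the following is a minimal sketch of creating a Redshift target endpoint with these attributes using boto3. The endpoint identifier, connection details, and credentials are hypothetical placeholders; only the ExtraConnectionAttributes string comes from the settings above.

import boto3

dms = boto3.client("dms", region_name="us-east-1")

# Hypothetical identifiers and credentials -- replace with your own values.
response = dms.create_endpoint(
    EndpointIdentifier="redshift-target-endpoint",
    EndpointType="target",
    EngineName="redshift",
    ServerName="my-cluster.example.us-east-1.redshift.amazonaws.com",
    Port=5439,
    DatabaseName="dev",
    Username="dms_user",
    Password="********",
    # Extra connection attributes from the tuning discussion above.
    ExtraConnectionAttributes=(
        "maxFileSize=250000;dateFormat=auto;timeFormat=auto;"
        "acceptAnyDate=true;fileTransferUploadStreams=20;compUpdate=false"
    ),
)
print(response["Endpoint"]["EndpointArn"])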
// AWS DMS task settings
BatchApplyEnabled=true
BatchSplitSize=0
BatchApplyTimeoutMax=1800
BatchApplyTimeoutMin=1800
BatchApplyMemoryLimit=1024
ParallelApplyThreads=32
ParallelApplyBufferSize=100
Regarding the batch apply task settings, you enable batch apply mode and set the amount of time AWS DMS waits between each application of batch changes using the BatchApplyTimeoutMin and BatchApplyTimeoutMax settings. It is important to determine the replication requirements and set these values accordingly. You can set these parameters to a lower value if you need the data to be refreshed more often than once every 30 minutes. You also configure BatchSplitSize as 0, which means there is no limit on the maximum number of changes applied in a single batch. Increase BatchApplyMemoryLimit to 1024 so that more records can be processed in each commit. ParallelApplyThreads specifies the number of concurrent threads that AWS DMS uses during a CDC load to push data records to the Amazon Redshift target endpoint. ParallelApplyBufferSize specifies the maximum number of data record requests while using parallel apply threads with an Amazon Redshift target.
With the settings above, customers with heavy transactional workloads can benefit from an unlimited batch buffer that fills up as much as possible within 1800 seconds, using 32 parallel threads with a 250 MB maximum file size.
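One way to apply the batch apply and parallel apply settings is to pass them as task settings JSON through boto3, as in the sketch below. The task ARN is a placeholder, and the grouping of keys under TargetMetadata and ChangeProcessingTuning reflects the standard AWS DMS task settings structure; verify the exact keys against the current AWS DMS documentation for your engine version.

import json

import boto3

dms = boto3.client("dms")

# Batch apply and parallel apply settings from the discussion above.
task_settings = {
    "TargetMetadata": {
        "BatchApplyEnabled": True,
        "ParallelApplyThreads": 32,
        "ParallelApplyBufferSize": 100,
    },
    "ChangeProcessingTuning": {
        "BatchApplyTimeoutMin": 1800,
        "BatchApplyTimeoutMax": 1800,
        "BatchApplyMemoryLimit": 1024,
        "BatchSplitSize": 0,
    },
}

dms.modify_replication_task(
    ReplicationTaskArn="arn:aws:dms:us-east-1:123456789012:task:EXAMPLE",  # placeholder
    ReplicationTaskSettings=json.dumps(task_settings),
)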
Regarding the memory allocation guidelines for CDC, when ParallelApplyThreads > 0, configure the memory parameters based on the maxFileSize value mentioned previously. For example, if you are using ten parallel apply threads with maxFileSize at its default, AWS DMS generates ten .csv files of 32 MB each in memory, so memory consumption is 32 MB * 10 = 320 MB. Therefore, based on the available memory on the DMS replication instance, it is a good idea to allocate a higher value for BatchApplyMemoryLimit in the batch settings (at least 320 MB in this example) so that all ten threads can create full 32 MB files in memory rather than on disk, thereby improving performance.
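This back-of-the-envelope estimate can be captured in a few lines of Python; the variable names below are illustrative, not DMS parameters.

# Illustrative CDC memory estimate for parallel apply threads.
parallel_apply_threads = 10
max_file_size_mb = 32  # maxFileSize default (32 MB)

csv_footprint_mb = parallel_apply_threads * max_file_size_mb
print(f"In-memory .csv footprint: {csv_footprint_mb} MB")              # 320 MB
print(f"Set BatchApplyMemoryLimit to at least {csv_footprint_mb} MB")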
To speed up the full load process, apart from creating table segments that can be loaded in parallel as mentioned in the section above, you can also define ParallelLoadThreads, which specifies the number of concurrent threads that AWS DMS uses during a full load to push data records to an Amazon Redshift target endpoint. When using parallel load, the memory requirement changes depending on the number of parallel threads configured. The formula below can be useful for computing the amount of memory required by the DMS replication instance.
Minimum memory consumed:
MaxFullLoadSubTasks * ParallelLoadThreads * maxFileSize

Recommendation for DMS replication instance memory:
2 * minimum memory consumed

For example:
8 (MaxFullLoadSubTasks) * 32 (ParallelLoadThreads) * 250 MB = 64 GB (minimum)
64 GB * 2 = 128 GB (recommended)
When eight threads are used to pull data from the source (MaxFullLoadSubTasks) and 32 threads on the DMS end (ParallelLoadThreads) are configured to create one .csv file per thread for loading into Amazon Redshift, with the size of each .csv file set to 250 MB (maxFileSize), the minimum memory consumed by the DMS replication instance is 64 GB. It is recommended to select a DMS instance with 128 GB of memory in this scenario.
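A minimal helper to reproduce this calculation is sketched below; the function and variable names are illustrative, and it follows the whitepaper's arithmetic of 1 GB = 1,000 MB.

def full_load_memory_gb(max_full_load_sub_tasks, parallel_load_threads, max_file_size_mb):
    """Return (minimum, recommended) replication instance memory in GB."""
    minimum_gb = max_full_load_sub_tasks * parallel_load_threads * max_file_size_mb / 1000
    return minimum_gb, 2 * minimum_gb

minimum, recommended = full_load_memory_gb(8, 32, 250)
print(f"Minimum: {minimum:.0f} GB, recommended: {recommended:.0f} GB")  # 64 GB, 128 GB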
If memory on the replication instance is not sufficient, you can set enableParallelBatchInMemoryCSVFiles to false so that the files are written to disk. Note that if disk is used, there may be a performance drop of about 20% compared to using memory.
The other aspect that can improve the performance of data loading into Amazon Redshift is the resource utilization of the Amazon Redshift cluster. Monitor the memory and CPU utilization metrics on the Amazon Redshift cluster. It is recommended to identify outlier disk-spill queries, which cause concurrent ETL queries to slow down. You can use this query to identify them.
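The linked query isn't reproduced here, but a common pattern is to look for disk-based steps in the SVL_QUERY_SUMMARY system view. The sketch below runs such a check through the Amazon Redshift Data API; the cluster identifier, database, and user are placeholders, and the SQL is an illustrative example rather than the whitepaper's exact query.

import boto3

rsd = boto3.client("redshift-data")

# Illustrative check for query steps that spilled to disk (is_diskbased = 't').
resp = rsd.execute_statement(
    ClusterIdentifier="my-redshift-cluster",  # placeholder
    Database="dev",                           # placeholder
    DbUser="awsuser",                         # placeholder
    Sql=(
        "SELECT query, step, rows, workmem, label "
        "FROM svl_query_summary "
        "WHERE is_diskbased = 't' "
        "ORDER BY workmem DESC LIMIT 20;"
    ),
)
print(resp["Id"])  # statement ID; fetch results with get_statement_result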
If the Amazon Redshift cluster is used by multiple workloads that are affecting AWS DMS performance, it's a good idea to separate out the workloads by using the Amazon Redshift data sharing functionality.
Another aspect to consider is that when you load data into a table, Amazon Redshift distributes the rows of the table to each of the compute nodes according to the table's distribution style. For large tables with many updates and deletes, it's important to choose a distribution key that matches the primary key column with the highest cardinality. AWS recommends pre-creating the target tables on Amazon Redshift with the right distribution keys; if you allow AWS DMS to create the target tables, it does not currently choose a distribution key. In particular:
- Ensure the target tables have primary keys.
- Ensure you have distribution keys created for the target tables (a minimal DDL sketch follows this list). Refer to Choose the best distribution style for more information.
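As an illustration, a pre-created target table might look like the sketch below. The table and column names are hypothetical, the distribution key is placed on the high-cardinality primary key column as recommended above, and the statement is submitted through the same Redshift Data API pattern used earlier.

import boto3

rsd = boto3.client("redshift-data")

# Hypothetical table with an explicit distribution key and sort key.
ddl = """
    CREATE TABLE public.orders (
        order_id    BIGINT NOT NULL,
        customer_id BIGINT,
        order_ts    TIMESTAMP,
        amount      DECIMAL(12, 2),
        PRIMARY KEY (order_id)
    )
    DISTSTYLE KEY
    DISTKEY (order_id)
    SORTKEY (order_ts);
"""

rsd.execute_statement(
    ClusterIdentifier="my-redshift-cluster",  # placeholder
    Database="dev",                           # placeholder
    DbUser="awsuser",                         # placeholder
    Sql=ddl,
)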