Database migration considerations

A carefully planned data migration strategy prevents a sub-optimal experience that creates more problems than it solves. Understand the different phases of the migration process so that you can plan for and optimize the performance of each individual stage and avoid delays in the data migration. Accurately estimating storage needs and choosing the right network connectivity between components also helps prevent migration issues.

Migration phases

The data migration process typically includes the following phases:

Assess source database

As a first step in the data migration process, gather information about the tables in the source database that need to be migrated to the target database. Understand your organization's requirements: whether you need to perform a one-time migration (via full load) or migrate the data on an ongoing basis (via change data capture, or CDC). It is important to be aware of table sizes as well as database sizes, and to verify whether any large objects (LOBs) are present in the migration scope. LOBs are generally migrated one at a time, piece by piece; they can significantly slow down the migration and require more processing resources than standard objects. For more information on how AWS DMS handles LOBs, refer to Setting LOB support for source databases in an AWS DMS task.
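If LOB columns are in scope, limited LOB mode is usually the faster option. The following is a minimal sketch (Python with boto3) of creating a full-load task with limited LOB mode; the ARNs, identifiers, and the 32 KB threshold are placeholders you would replace with values sized from the largest LOB actually in scope.

```python
import json
import boto3

dms = boto3.client("dms")

# Limited LOB mode truncates LOBs larger than LobMaxSize (KB) but is much
# faster than full LOB mode, which migrates each LOB piece by piece.
task_settings = {
    "TargetMetadata": {
        "SupportLobs": True,
        "FullLobMode": False,        # full mode preserves every LOB, slowly
        "LimitedSizeLobMode": True,  # limited mode trades completeness for speed
        "LobMaxSize": 32,            # KB; placeholder -- size from your data
    }
}

# Minimal table mapping that includes every table (placeholder scope).
table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-all",
        "object-locator": {"schema-name": "%", "table-name": "%"},
        "rule-action": "include",
    }]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="full-load-with-lobs",            # placeholder
    SourceEndpointArn="arn:aws:dms:region:acct:endpoint:SRC",   # placeholder
    TargetEndpointArn="arn:aws:dms:region:acct:endpoint:TGT",   # placeholder
    ReplicationInstanceArn="arn:aws:dms:region:acct:rep:RI",    # placeholder
    MigrationType="full-load",
    TableMappings=json.dumps(table_mappings),
    ReplicationTaskSettings=json.dumps(task_settings),
)
```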

Also understand the nature of the transactions (Data Manipulation Language (DML), Data Query Language (DQL), Data Definition Language (DDL), and so on) happening on your database, and calculate the transactions per second (TPS). Based on this information, you can isolate busy tables that have a higher number of transactions to prevent them from impacting the migration of other tables, and you can allocate additional resources for migrating those tables. Be aware of indexes, triggers, and referential integrity constraints, which can affect migration performance. It's a best practice to use native monitoring for your source database (DB) engine to verify that the DB isn't experiencing a performance bottleneck; for example, CPU or memory contention, or input/output (I/O) saturation.
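As a rough illustration of the TPS analysis described above, the following sketch (with entirely made-up numbers) computes per-table TPS from observed transaction counts and flags tables that may deserve a dedicated migration task; the threshold is an assumption to tune for your workload.

```python
# Hypothetical transaction counts observed over a one-hour window.
SAMPLE_WINDOW_SECONDS = 3600
tx_counts = {"orders": 540_000, "order_items": 1_260_000, "ref_country": 12}

BUSY_TPS_THRESHOLD = 100  # assumed cutoff; tune to your environment

for table, count in tx_counts.items():
    tps = count / SAMPLE_WINDOW_SECONDS
    action = ("isolate into its own task" if tps > BUSY_TPS_THRESHOLD
              else "group with other tables")
    print(f"{table}: {tps:,.1f} TPS -> {action}")
```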

If the source database is already constrained on resources, the data migration will be further slowed down. It is important to understand the behavior of your source database engine to come up with an optimal migration plan. For a comprehensive list of valid sources supported by AWS DMS, refer to Sources for AWS DMS.

Because the target engine (Amazon Redshift) is different from the source database engine, you might need to convert the source database schema and a majority of the database code objects, including views, stored procedures, and functions, to a format compatible with the target. AWS Schema Conversion Tool (AWS SCT) can help with this requirement. When using AWS SCT, you can create mapping rules; any objects that cannot be converted automatically are clearly marked so that you can convert them manually to complete the migration. You can also review the database migration assessment report, which summarizes all of the action items for schemas that can't be converted automatically to the target engine. Once schema conversion is complete, AWS SCT can help migrate data from a range of data warehouses to Amazon Redshift using built-in data migration agents.

The following screenshot shows the list of action items from the database migration assessment report in AWS SCT.

The list of action items shown by the database migration assessment report in SCT

Determine the target

AWS DMS can use many of the most popular databases as a target for data replication. For a comprehensive list of valid targets, refer to Targets for AWS DMS. It's important to be aware of your organization's latency requirements and business use cases when choosing a target for migrating your data. For example, Amazon Redshift is optimized for online analytical processing (OLAP), meaning it is used to aggregate large quantities of data across long time periods and may not require real-time updates. To ensure minimal disruption to those processes, consider loading in batches with larger latencies, as in the sketch below.
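As one way to load in batches, AWS DMS supports batch-optimized apply through task settings. The following is a minimal sketch of the relevant setting; treat it as a starting point rather than a recommended configuration.

```python
import json

# Enabling batch apply makes DMS group changes and apply them to Redshift
# in larger, less frequent batches instead of row by row.
task_settings = {
    "TargetMetadata": {
        "BatchApplyEnabled": True
    }
}

# Pass this JSON as the ReplicationTaskSettings parameter when creating
# or modifying the replication task.
settings_json = json.dumps(task_settings)
print(settings_json)
```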

For real-time updates, consider using online transaction processing (OLTP) systems such as Amazon Relational Database Service (Amazon RDS) as a target.

Connectivity options

It is important to validate and monitor connectivity between the source database and AWS DMS, and between AWS DMS and the target, to avoid performance issues. Connectivity options vary depending on where the different components are deployed.

The location of the AWS DMS replication instances also varies based on the target chosen. For example, if an Amazon Redshift cluster is used as a target, it must be in the same AWS account and same AWS Region as the replication instance. If the source endpoint, the replication instance, and the target endpoint are in the same VPC, traffic flows within the VPC. If different VPCs are used for deployment, you can link these VPCs together by using VPC peering or AWS Transit Gateway. If the source endpoint is an on-premises database in your corporate data center, it can be connected to AWS resources using AWS Direct Connect, a VPN connection, or even via the public internet.
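To validate connectivity before starting a migration, you can run a DMS connection test from the replication instance to each endpoint. The following is a hedged sketch using Python and boto3; the ARNs are placeholders.

```python
import time
import boto3

dms = boto3.client("dms")

REPLICATION_INSTANCE_ARN = "arn:aws:dms:region:acct:rep:RI"  # placeholder
ENDPOINT_ARN = "arn:aws:dms:region:acct:endpoint:SRC"        # placeholder

# Kick off the connection test from the replication instance to the endpoint.
dms.test_connection(
    ReplicationInstanceArn=REPLICATION_INSTANCE_ARN,
    EndpointArn=ENDPOINT_ARN,
)

# Poll until the test completes; the status ends up "successful" or "failed".
while True:
    connections = dms.describe_connections(
        Filters=[{"Name": "endpoint-arn", "Values": [ENDPOINT_ARN]}]
    )["Connections"]
    status = connections[0]["Status"]
    if status in ("successful", "failed"):
        print(f"Connection test result: {status}")
        break
    time.sleep(10)
```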

It is highly recommended to choose the right options for connecting resources, to avoid latency issues. AWS Direct Connect provides the shortest path to AWS resources, because it is a dedicated connection between your corporate data centers and AWS. While in transit, your network traffic remains on the AWS global network and never goes over the internet. This reduces the chance of hitting bottlenecks or unexpected increases in latency compared to using a VPN or the public internet.

VPN connection going over the public internet

Using AWS Direct Connect as a dedicated connection between an on-premises database and AWS

Full load (initial load) phase

Full load is used for loading all the data available in the source tables to the target, and is typically used for one-time data migrations. During a full load task, AWS DMS performs a full table scan for each of the source tables in parallel. Because all of the data is being loaded from the tables, it is important to optimize the full load settings to accelerate the migration of large or partitioned tables. These configurations may vary depending on the type of target chosen. AWS SCT has agents that can help with the full load to Amazon Redshift. During a full load migration task, you can accelerate the migration of large or partitioned tables by splitting a table into segments and loading the segments in parallel in the same migration task, as shown in the sketch below. Table segmentation is discussed in more detail later in this whitepaper.
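The following sketch shows what such segmentation can look like in a DMS table-mapping rule: a parallel-load rule of type ranges splits a hypothetical app.sales table on its sale_id column into four segments loaded in parallel. The schema, table, column, and boundary values are all illustrative.

```python
# Table mapping with a parallel-load rule; pass json.dumps(table_mappings)
# as the TableMappings parameter of the migration task.
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-sales",
            "object-locator": {"schema-name": "app", "table-name": "sales"},
            "rule-action": "include",
        },
        {
            "rule-type": "table-settings",
            "rule-id": "2",
            "rule-name": "segment-sales",
            "object-locator": {"schema-name": "app", "table-name": "sales"},
            "parallel-load": {
                "type": "ranges",
                "columns": ["sale_id"],
                # three boundaries create four segments loaded in parallel
                "boundaries": [["1000000"], ["2000000"], ["3000000"]],
            },
        },
    ]
}
```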

CDC phase (insert, update, delete)

If an ongoing data migration is required from the source database to the target, you can leverage the CDC functionality provided by AWS DMS to capture incremental changes. This process works by collecting changes to the database logs using the database engine's native API. If the CDC process is not tuned properly, your workloads may face high source or target latencies.
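Much of that tuning lives in the ChangeProcessingTuning section of the task settings. The sketch below lists a few of the knobs with illustrative values; these are assumptions to adjust against observed latency, not recommendations.

```python
# Illustrative CDC tuning settings; merge into the task's
# ReplicationTaskSettings JSON alongside BatchApplyEnabled.
cdc_tuning = {
    "ChangeProcessingTuning": {
        "BatchApplyTimeoutMin": 1,     # min seconds to wait before a batch apply
        "BatchApplyTimeoutMax": 30,    # max seconds to wait before a batch apply
        "BatchApplyMemoryLimit": 500,  # MB available for batch pre-processing
        "MemoryLimitTotal": 1024,      # MB a transaction can use before spilling
        "MemoryKeepTime": 60,          # seconds a transaction stays in memory
    }
}
```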

Storage

It is a good idea to be aware of the amount of data that needs to be migrated from the source database, so that you can plan accordingly for the storage required by the AWS DMS replication instances as well as by Amazon Redshift (the target).

For example, if you are using an Oracle database as a source, be aware of the log switches and the amount of redo logs generated per day by the database to evaluate the storage needs.

Example showing the number of log switches and redo generation per day

In AWS DMS, most of the processing happens in memory. However, large transactions might require some buffering on disk. Cached transactions and log files are also written to disk. If your source system is busy or processes large transactions, you might need to increase the storage on DMS as well as on the target systems. If you're running multiple tasks on the replication server, you might also need a storage increase.
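If disk buffering does outgrow the allocated storage, the replication instance's storage can be increased in place. A minimal sketch with boto3 follows; the ARN and size are placeholders.

```python
import boto3

dms = boto3.client("dms")

# Grow the replication instance's storage to absorb cached transactions
# and logs; size the value from your observed change volume.
dms.modify_replication_instance(
    ReplicationInstanceArn="arn:aws:dms:region:acct:rep:RI",  # placeholder
    AllocatedStorage=200,   # GB; placeholder value
    ApplyImmediately=True,
)
```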

If you have detailed debug logging turned on for DMS, ensure that you turn it off when you don’t need it, because detailed debug logging requires a large amount of storage space and can potentially fill up the disk, leading to DMS task failures. For more information about storage issues, refer to Why is my AWS DMS replication DB instance in the storage-full status?.
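One way to dial logging back down is through the Logging section of the task settings. The following hedged sketch resets two components from detailed debug to the default severity; the task ARN is a placeholder, and the task generally needs to be stopped before its settings can be modified.

```python
import json
import boto3

dms = boto3.client("dms")

logging_settings = {
    "Logging": {
        "EnableLogging": True,
        "LogComponents": [
            # DETAILED_DEBUG fills the disk quickly; DEFAULT is sufficient
            # outside of active troubleshooting.
            {"Id": "SOURCE_CAPTURE", "Severity": "LOGGER_SEVERITY_DEFAULT"},
            {"Id": "TARGET_APPLY", "Severity": "LOGGER_SEVERITY_DEFAULT"},
        ],
    }
}

dms.modify_replication_task(
    ReplicationTaskArn="arn:aws:dms:region:acct:task:TASK",  # placeholder
    ReplicationTaskSettings=json.dumps(logging_settings),
)
```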

Based on the amount of data that needs to be migrated into your Amazon Redshift cluster, you can choose the right node types. You can use the Amazon Redshift sizing calculator (sign-in required) available in the Amazon Redshift console, which estimates your cluster configuration based on data size and data access frequency.

Networking bandwidth

If you are using a VPN tunnel between your on-premises source database and AWS DMS, ensure that the bandwidth is sufficient for your workload. Each AWS Site-to-Site VPN connection has two tunnels, and each tunnel supports a maximum throughput of up to 1.25 Gbps. For high availability and uninterrupted operation, ensure that you utilize both VPN tunnels. If you are using AWS Direct Connect, you can have either a dedicated connection supporting 1 Gbps, 10 Gbps, or 100 Gbps, or a hosted connection sourced from AWS Direct Connect Partners that have network links between themselves and AWS.

To optimize performance, ensure that your network provides sufficient throughput. When you investigate network performance, capture a network sniffer trace for analysis. In addition, use network monitoring tools to identify whether bandwidth limits are being exceeded on the network, the client, or the server, and to determine whether issues are related to throttling, bandwidth, utilization, or other configuration problems. Utilities such as netstat, traceroute, and ipconfig, as well as port scanners, are available for troubleshooting or checking the connection for anomalies. If possible, engage network or systems engineers early in the planning process.

Another simple way to measure network speed is to transfer a file from your source database server to an Amazon Elastic Compute Cloud (Amazon EC2) instance in AWS within the same virtual private cloud (VPC) as the AWS DMS replication instance, noting the amount of time the transfer takes. The transfer speed can be calculated by dividing the amount of data transferred by the time taken:

Source database statistics
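As a toy version of that calculation, with entirely made-up numbers:

```python
# Hypothetical measurements from the file-transfer test described above.
file_size_gb = 10         # size of the test file
transfer_seconds = 620    # observed time to copy it to the EC2 instance

# speed = data transferred / time taken, converted to megabits per second
throughput_mbps = (file_size_gb * 8 * 1000) / transfer_seconds
print(f"Approximate network throughput: {throughput_mbps:,.0f} Mb/s")
```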

Verify the NetworkReceiveThroughput metric on the DMS side, which indicates the incoming traffic on the replication instance.

An example of average network traffic of 11 MB/s on the DMS instance
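The same metric can be pulled programmatically from CloudWatch. Below is a hedged sketch using boto3; the replication instance identifier is a placeholder.

```python
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)

# Fetch the last hour of NetworkReceiveThroughput in 5-minute buckets.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/DMS",
    MetricName="NetworkReceiveThroughput",
    Dimensions=[{
        "Name": "ReplicationInstanceIdentifier",
        "Value": "my-replication-instance",  # placeholder
    }],
    StartTime=end - timedelta(hours=1),
    EndTime=end,
    Period=300,
    Statistics=["Average"],
)

# NetworkReceiveThroughput is reported in bytes/second.
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(f"{point['Timestamp']:%H:%M}  {point['Average'] / 1e6:.1f} MB/s")
```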

Pricing for migration

AWS DMS pricing varies based on the type of replication instance chosen and any additional log storage. Each database migration instance includes storage sufficient for swap space, replication logs, and data cache for most replications. All data transfer into AWS DMS is free, and data transferred between AWS DMS and Amazon RDS, Amazon Redshift, and Amazon EC2 instances in the same Availability Zone is also free. Standard AWS data transfer rates apply when you migrate your source database to a target database in a different Availability Zone, a different Region, or outside of AWS; these rates are listed on the EC2 instance pricing page.