How AWS DataSync Works

In this section, you can find information about components, terms, and how DataSync works.

AWS DataSync Architecture

The architectural diagrams show how DataSync transfers data between self-managed storage systems and AWS storage services, and between in-cloud storage systems and AWS storage services.

For a list of all DataSync supported source and destination endpoints, see Working with Locations.

Data Transfer Between Self-Managed Storage and AWS

The following diagram shows a high-level view of the DataSync architecture for transferring files between self-managed storage and AWS services.

Data Transfer Between AWS Storage Services

The following diagram provides a high-level view of the DataSync architecture for transferring files between AWS services within the same account. This architecture applies to both in-Region and cross-Region transfers.

Important

When you use DataSync to copy files or objects between AWS Regions, you pay for data transfer between Regions. This is billed as data transfer OUT from your source Region to your destination Region. For more information, see Data Transfer pricing.

Data Transfer Using a DataSync EC2 Agent Deployed in a Region

You can use DataSync to transfer data between AWS services in different accounts, or between self-managed file systems in AWS and Amazon S3, by deploying the DataSync EC2 agent in an AWS Region. For more information, see Using the DataSync EC2 Agent Deployed in AWS Regions.

Components and Terminology

The components of DataSync include the following:

  • Agent – A virtual machine (VM) that is used to read data from or write data to a self-managed location. An agent is not required when transferring between AWS storage services in the same account.

  • Location – Any source or destination location that is used in the data transfer (for example, Amazon S3, Amazon EFS, Amazon FSx for Windows File Server, NFS, SMB, or self-managed object storage).

  • Task – Consists of a source location, a destination location, and the configuration that defines how data is transferred. A task always transfers data from the source to the destination. Configuration can include options such as a task schedule and a bandwidth limit. A task is the complete definition of a data transfer.

  • Task execution – An individual run of a task, which includes information such as start time, end time, bytes written, and status.
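As a rough illustration of how these components relate, the following Python sketch models them as plain data classes. It is not part of any AWS SDK: the class names, field names, URIs, and the agent name are hypothetical and chosen only to mirror the definitions above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Location:
    uri: str                     # hypothetical URI form, for illustration only
    agent: Optional[str] = None  # set only when the storage is self-managed


@dataclass
class Task:
    source: Location
    destination: Location
    options: dict = field(default_factory=dict)  # schedule, bandwidth limit, ...


@dataclass
class TaskExecution:
    task: Task
    status: str
    bytes_written: int = 0


# A task that reads from self-managed NFS storage through an agent and
# writes to Amazon S3, which needs no agent.
task = Task(
    source=Location("nfs://fileserver.example.com/export", agent="on-prem-agent"),
    destination=Location("s3://example-bucket/backup"),
    options={"bandwidth_limit_bytes_per_second": 10 * 1024 * 1024},
)
run = TaskExecution(task, status="LAUNCHING")
```

One task can have many task executions over time, which is why the execution holds a reference to the task rather than the other way around.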

Agent

An agent is a VM that you own and that is used to read data from or write data to self-managed storage systems. You can deploy the agent on a VMware ESXi, KVM, or Microsoft Hyper-V hypervisor, or launch it as an Amazon EC2 instance. You use the AWS DataSync Management Console or the API to set up and activate your agent. The activation process associates your agent VM with your AWS account. For information about agents, see Working with Agents.

An agent that is functioning properly has the status ONLINE. If an agent is unable to communicate with AWS, it transitions to OFFLINE status. This transition can result from a network partition, a firewall misconfiguration, or any other event that makes the agent VM unable to connect to AWS. The status of an agent that is powered off also shows as OFFLINE.
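The following Python sketch shows one way to flag agents that are not ONLINE. It works on a plain dictionary that is assumed to have the same shape as the output of the DataSync ListAgents API operation (an "Agents" list whose entries carry "AgentArn", "Name", and "Status"). The sample data, including the ARNs and names, is made up for illustration.

```python
def offline_agents(list_agents_response):
    """Return the name (or ARN, if unnamed) of every agent that is not ONLINE."""
    return [
        agent.get("Name") or agent["AgentArn"]
        for agent in list_agents_response.get("Agents", [])
        if agent.get("Status") != "ONLINE"
    ]


# Hypothetical response, shaped like the assumed ListAgents output.
sample_response = {
    "Agents": [
        {
            "AgentArn": "arn:aws:datasync:us-east-1:111122223333:agent/agent-1111",
            "Name": "dc1-agent",
            "Status": "ONLINE",
        },
        {
            "AgentArn": "arn:aws:datasync:us-east-1:111122223333:agent/agent-2222",
            "Name": "dc2-agent",
            "Status": "OFFLINE",
        },
    ]
}

print(offline_agents(sample_response))
```

An OFFLINE result alone does not tell you whether the cause is a network issue or a powered-off VM, so treat it as a prompt to investigate.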

Location

A location is an endpoint of a task. Each task has two locations—a source location and a destination location. AWS DataSync supports Network File System (NFS), Server Message Block (SMB), self-managed object storage, Amazon EFS, Amazon FSx for Windows File Server, and Amazon S3 as location types. For more information, see Working with Locations.

Task

A task includes two locations (source and destination) and the configuration of how to transfer the data from one location to the other. Configuration settings can include options such as how to treat metadata, deleted files, and permissions. A task is the complete definition of a data transfer.

Task Execution

A task execution is an individual run of a task, which shows information such as start time, end time, number of transferred files, and status.

A task execution has five transition phases and two terminal statuses, as shown in the following diagram.

If the VerifyMode option is not enabled, a terminal status occurs after the TRANSFERRING phase. Otherwise, it occurs after the VERIFYING phase. The two terminal statuses are these:

  • SUCCESS

  • ERROR

For detailed information about these phases and statuses, see Understanding Task Execution Statuses.
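The relationship between verification and the last phase before a terminal status can be sketched in a few lines of Python. The sketch assumes that the five transition phases are QUEUED, LAUNCHING, PREPARING, TRANSFERRING, and VERIFYING. QUEUED in particular is an assumption that this section does not name, so check it against Understanding Task Execution Statuses.

```python
# Assumed phase order; QUEUED is not named in this section.
TRANSITION_PHASES = ["QUEUED", "LAUNCHING", "PREPARING", "TRANSFERRING", "VERIFYING"]
TERMINAL_STATUSES = {"SUCCESS", "ERROR"}


def expected_phases(verify_enabled):
    """List the phases a run passes through before it reaches a terminal status."""
    if verify_enabled:
        return list(TRANSITION_PHASES)
    # Without verification, the run ends after TRANSFERRING.
    return [phase for phase in TRANSITION_PHASES if phase != "VERIFYING"]


def is_terminal(status):
    """Return True if the status is one of the two terminal statuses."""
    return status in TERMINAL_STATUSES


print(expected_phases(verify_enabled=False)[-1])
print(expected_phases(verify_enabled=True)[-1])
```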

How DataSync Transfers Files

When a task starts, it goes through different statuses: LAUNCHING, PREPARING, TRANSFERRING, and VERIFYING. In the LAUNCHING status, DataSync initializes the task execution. In the PREPARING status, DataSync examines the source and destination file systems to determine which files to sync. It does so by recursively scanning the contents and metadata of files on the source and destination file systems for differences.

The time that DataSync spends in the PREPARING status depends on the number of files in both the source and destination file systems. It also depends on the performance of these file systems, and usually takes between a few minutes and a few hours. For more information, see Starting a Task.

After the scanning is done and the differences are calculated, DataSync transitions to the TRANSFERRING status. At this point, DataSync starts transferring files and metadata from the source file system to the destination. DataSync copies changes to files with contents or metadata that are different between the source and the destination. You can narrow down the copied files by filtering the data or by configuring DataSync to not overwrite files that are already present on the destination.
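The comparison described above can be illustrated with a small Python sketch. It is a simplified model, not DataSync's actual implementation: each file is reduced to a hypothetical FileInfo record (a checksum standing in for contents, and permission bits standing in for metadata), and the overwrite flag models the choice to not overwrite files that already exist on the destination.

```python
from typing import Dict, List, NamedTuple


class FileInfo(NamedTuple):
    checksum: str  # stand-in for the file contents
    mode: int      # stand-in for metadata such as permissions


def plan_transfer(
    source: Dict[str, FileInfo],
    destination: Dict[str, FileInfo],
    overwrite: bool = True,
) -> List[str]:
    """Return the paths whose contents or metadata must be copied."""
    to_copy = []
    for path, info in sorted(source.items()):
        existing = destination.get(path)
        if existing is None:
            to_copy.append(path)  # missing on the destination
        elif existing != info and overwrite:
            to_copy.append(path)  # contents or metadata differ
    return to_copy


source = {
    "a.txt": FileInfo("c1", 0o644),
    "b.txt": FileInfo("c2", 0o644),
    "c.txt": FileInfo("c3", 0o600),
}
destination = {
    "a.txt": FileInfo("c1", 0o644),   # identical, so not copied
    "b.txt": FileInfo("old", 0o644),  # contents differ
}

print(plan_transfer(source, destination))
print(plan_transfer(source, destination, overwrite=False))
```

With overwriting allowed, the plan contains b.txt and c.txt. With overwriting turned off, only c.txt is copied, because b.txt already exists on the destination.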

Note

By default, any changes to metadata on the source storage result in this metadata being copied to the destination storage.

After the TRANSFERRING phase is done, DataSync verifies consistency between the source and destination file systems. This is the VERIFYING phase.

When DataSync transfers data, it always performs data integrity checks during the transfer. You can enable additional verification to compare source and destination at the end of a transfer. This additional check can verify the entire dataset or only the files that were transferred as part of the task execution. For most use cases, we recommend verifying only the files transferred.
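The following Python sketch builds an options dictionary for the three verification choices described above. The key names and values (VerifyMode with NONE, ONLY_FILES_TRANSFERRED, and POINT_IN_TIME_CONSISTENT, and OverwriteMode with ALWAYS and NEVER) are assumed to match the Options structure of the DataSync API. The build_options helper and its short mode names are hypothetical. Confirm the names against the API reference before relying on them.

```python
# Hypothetical short names mapped to the assumed API values.
VERIFY_MODES = {
    "none": "NONE",                     # integrity checks during transfer only
    "transferred": "ONLY_FILES_TRANSFERRED",
    "all": "POINT_IN_TIME_CONSISTENT",  # verify the entire dataset
}


def build_options(verify="transferred", overwrite=True):
    """Build an options dictionary for a task or a task execution."""
    return {
        "VerifyMode": VERIFY_MODES[verify],
        "OverwriteMode": "ALWAYS" if overwrite else "NEVER",
    }


print(build_options())
print(build_options(verify="all", overwrite=False))
```

The default here verifies only the files transferred, in line with the recommendation for most use cases.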

How AWS DataSync Verifies Data Integrity

AWS DataSync locally calculates the checksum of every file in the source file system and the destination and compares them. Additionally, DataSync compares the metadata of every file in the source and destination. If there are differences in either one, verification fails with an error code that specifies precisely what failed. For example, you might see error codes such as Checksum failure, Metadata failure, Files were added, and Files were removed.
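To make the idea concrete, the following Python sketch runs a comparable check on two local directories. It is an illustration only. The checksum algorithm that DataSync uses is not stated here, so SHA-256 is an assumption, as are the function names, the metadata fields (size and permission bits), and the problem labels, which only loosely mirror the error codes above.

```python
import hashlib
from pathlib import Path


def file_checksum(path):
    """Compute a SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(root):
    """Map each relative file path to (checksum, size, permission bits)."""
    result = {}
    for path in root.rglob("*"):
        if path.is_file():
            stat = path.stat()
            key = path.relative_to(root).as_posix()
            result[key] = (file_checksum(path), stat.st_size, stat.st_mode & 0o777)
    return result


def verify(source_root, destination_root):
    """Return a sorted list of (path, problem) pairs; empty means consistent."""
    src, dst = snapshot(source_root), snapshot(destination_root)
    problems = []
    for rel in sorted(src.keys() | dst.keys()):
        if rel not in dst:
            problems.append((rel, "missing on destination"))
        elif rel not in src:
            problems.append((rel, "unexpected on destination"))
        elif src[rel][0] != dst[rel][0]:
            problems.append((rel, "checksum mismatch"))
        elif src[rel][1:] != dst[rel][1:]:
            problems.append((rel, "metadata mismatch"))
    return problems
```

Calling verify(Path("/mnt/source"), Path("/mnt/destination")), with paths of your own, returns an empty list when both trees match.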

For more information, see Understanding Task Creation Statuses and Enable verification in the Configuring Task Settings section.

How DataSync Handles Open and Locked Files

In general, DataSync can transfer open files without any limitations.

If a file is open and being written to during the transfer, DataSync detects the data inconsistency in the VERIFYING phase, when it checks whether the file on the source differs from the file on the destination.

If a file is locked and the server prevents DataSync from opening it, DataSync skips the file. DataSync logs an error during the TRANSFERRING phase and reports a verification error.
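The skip-and-log behavior for locked files can be sketched as follows. This is a simplified Python model, not DataSync's implementation. The transfer function and its read_file and write_file callables are hypothetical, and a lock is modeled as a PermissionError raised when the file is opened for reading.

```python
def transfer(paths, read_file, write_file):
    """Copy each path, skipping and recording any file that cannot be opened."""
    errors = []
    for path in paths:
        try:
            data = read_file(path)
        except PermissionError as exc:
            # The server refused to open the file: log the error and move on.
            errors.append((path, str(exc)))
            continue
        write_file(path, data)
    return errors


# In-memory stand-ins for the source and destination storage.
store = {"a.txt": b"A", "locked.db": b"L", "z.txt": b"Z"}
written = {}


def read_file(path):
    if path == "locked.db":
        raise PermissionError("file is locked")
    return store[path]


errors = transfer(sorted(store), read_file, written.__setitem__)
print(errors)
print(sorted(written))
```

The locked file appears in the error list and is absent from the destination, while the remaining files are still copied.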