How AWS DataSync works

In this section, you can find information about components, terms, and how DataSync works.

AWS DataSync architecture

The architectural diagrams show how DataSync transfers data between self-managed storage systems and AWS storage services, and between in-cloud storage systems and AWS storage services.

For a list of all DataSync supported source and destination endpoints, see Working with locations.

Data transfer between self-managed storage and AWS

The following diagram shows a high-level view of the DataSync architecture for transferring files between self-managed storage and AWS services.

Data transfer between AWS storage services

The following diagram provides a high-level view of the DataSync architecture for transferring files between AWS services within the same AWS account. This architecture applies to both in-Region and cross-Region transfers.

Important

When you use DataSync to copy files or objects between AWS Regions, you pay for data transfer between Regions. This is billed as data transfer OUT from your source Region to your destination Region. For more information, see Data transfer pricing.

Data transfer using a DataSync EC2 agent deployed in a Region

You can use DataSync to transfer data between AWS services in different AWS accounts, or between self-managed file systems in AWS and Amazon S3, by deploying the DataSync Amazon EC2 agent in an AWS Region. For more information, see Using the DataSync agent deployed in AWS Regions.

Components and terminology

The components of DataSync include the following:

  • Agent – A virtual machine (VM) that's used to read data from or write data to a self-managed location. An agent isn't required when transferring between AWS storage services in the same AWS account.

  • Location – Any source or destination that's used in the data transfer, such as Amazon S3, Amazon EFS, Amazon FSx for Windows File Server, Network File System (NFS), Server Message Block (SMB), Hadoop Distributed File System (HDFS), or self-managed object storage.

  • Task – Consists of a source location and a destination location, and a configuration that defines how data is transferred. A task always transfers data from the source to the destination. The configuration can include options such as task schedule, bandwidth limit, and so on. A task is the complete definition of a data transfer.

  • Task execution – An individual run of a task, which includes information such as the start time, end time, bytes written, and status.
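
The relationship between these components can be sketched as a small data model. This is an illustration only, not the DataSync API: the class names, field names, and example URIs below are hypothetical and were chosen to mirror the definitions above.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Location:
    """A source or destination endpoint (hypothetical model)."""
    kind: str                       # for example "NFS", "SMB", "HDFS", "S3", "EFS"
    uri: str                        # hypothetical identifier for the endpoint
    agent_id: Optional[str] = None  # set only when an agent reads or writes the storage


@dataclass
class Task:
    """Complete definition of a transfer: two locations plus configuration."""
    source: Location
    destination: Location
    options: dict = field(default_factory=dict)  # schedule, bandwidth limit, and so on


@dataclass
class TaskExecution:
    """One individual run of a task, with its own timing, counters, and status."""
    task: Task
    start_time: datetime
    end_time: Optional[datetime] = None
    bytes_written: int = 0
    status: str = "QUEUEING"


nfs = Location(kind="NFS", uri="nfs://fileserver.example/export", agent_id="agent-1")
s3 = Location(kind="S3", uri="s3://example-bucket/prefix")
task = Task(source=nfs, destination=s3, options={"bandwidth_limit_mbps": 100})

# The same task can run many times; each run is a separate task execution.
run1 = TaskExecution(task=task, start_time=datetime(2024, 1, 1))
run2 = TaskExecution(task=task, start_time=datetime(2024, 1, 2))
```

The point of the sketch is that a task is a reusable definition, while each task execution carries only per-run state.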

Agent

An agent is a VM that you own that's used to read data from or write data to self-managed storage systems. You can deploy the agent on a VMware ESXi, KVM, or Microsoft Hyper-V hypervisor, or launch it as an Amazon EC2 instance. You use the AWS DataSync console or the API to set up and activate your agent. The activation process associates your agent VM with your AWS account. For information about agents, see Working with agents.

An agent that's functioning properly has the status ONLINE. If an agent is unable to communicate with AWS, it transitions to OFFLINE status. This transition can result from a network partition, a firewall misconfiguration, or any other event that prevents the agent VM from connecting to AWS. The status of an agent that's powered off also shows as OFFLINE.

Location

A location is an endpoint of a task. Each task has two locations—a source location and a destination location. AWS DataSync supports Network File System (NFS), Server Message Block (SMB), Hadoop Distributed File System (HDFS), self-managed object storage, Amazon EFS, Amazon FSx for Windows File Server, and Amazon S3 as location types. For more information, see Working with locations.

Task

A task includes two locations (source and destination), and the configuration of how to transfer the data from one location to the other. The configuration settings can include options such as how to treat metadata, deleted files, and permissions. A task is the complete definition of a data transfer.

Task execution

A task execution is an individual run of a task, which shows information such as the start time, end time, number of transferred files, and status.

A task execution has five transition phases and two terminal statuses, as shown in the following diagram. These phases and statuses are:

  • QUEUEING – During this phase, the task execution waits in a queue because another task execution is running that uses the same agent.

  • LAUNCHING – During this phase, the task execution is initialized.

  • PREPARING – During this phase, DataSync computes which files need to be transferred.

  • TRANSFERRING – During this phase, DataSync transfers data from the source location to the destination location.

  • VERIFYING – During this optional phase, DataSync performs a full data and metadata integrity verification. This phase occurs only if the VerifyMode option is enabled during configuration.

  • SUCCESS or ERROR – When the task execution is finished, DataSync sets it to one of these terminal statuses, depending on whether it was successful.

If the VerifyMode option isn't enabled in the task configuration, the terminal status (SUCCESS or ERROR) is set after the TRANSFERRING phase. Otherwise, it is set after the VERIFYING phase.

For detailed information about these phases and statuses, see Understanding task execution statuses.
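
The ordering of phases described above can be sketched as a small function. This is a simplified model, not DataSync code: the function name and its parameters are hypothetical, and it assumes a run that passes through every applicable phase before reaching its terminal status.

```python
from enum import Enum
from typing import List


class Phase(Enum):
    QUEUEING = "QUEUEING"
    LAUNCHING = "LAUNCHING"
    PREPARING = "PREPARING"
    TRANSFERRING = "TRANSFERRING"
    VERIFYING = "VERIFYING"


def execution_path(verify_enabled: bool, succeeded: bool = True) -> List[str]:
    """Return the ordered phases of a run, followed by its terminal status.

    Assumption: the run reaches the end of its last phase before the
    terminal status is set.
    """
    phases = [Phase.QUEUEING, Phase.LAUNCHING, Phase.PREPARING, Phase.TRANSFERRING]
    if verify_enabled:
        # VERIFYING is optional and only runs when verification is enabled.
        phases.append(Phase.VERIFYING)
    terminal = "SUCCESS" if succeeded else "ERROR"
    return [p.value for p in phases] + [terminal]


print(execution_path(verify_enabled=False))
print(execution_path(verify_enabled=True, succeeded=False))
```

With verification disabled, the terminal status follows TRANSFERRING directly. With verification enabled, it follows VERIFYING.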

How DataSync transfers files

When a task starts, it goes through different phases: LAUNCHING, PREPARING, TRANSFERRING, and VERIFYING. In the LAUNCHING phase, DataSync initializes the task execution. In the PREPARING phase, DataSync examines the source and destination file systems to determine which files to sync. It does so by recursively scanning the contents and metadata of files on the source and destination file systems for differences.

The time that DataSync spends in the PREPARING phase depends on the number of files in both the source and destination file systems. It also depends on the performance of these file systems, and usually ranges from a few minutes to a few hours. For more information, see Starting a task.

After the scanning is done and the differences are calculated, DataSync transitions to the TRANSFERRING phase. At this point, DataSync starts transferring files and metadata from the source file system to the destination. DataSync copies changes to files with contents or metadata that are different between the source and the destination. You can narrow down the copied files by filtering the data or by configuring DataSync to not overwrite files that are already present in the destination.
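
The comparison step can be illustrated with a minimal sketch. It assumes that each side has already been scanned into a mapping from path to metadata, and it uses size, modification time, and permission mode as a stand-in for "contents or metadata". The names FileMeta and plan_transfer are hypothetical, and the real comparison that DataSync performs is not specified here.

```python
from typing import Dict, List, NamedTuple


class FileMeta(NamedTuple):
    size: int
    mtime: float
    mode: int


def plan_transfer(
    source: Dict[str, FileMeta],
    destination: Dict[str, FileMeta],
    overwrite: bool = True,
) -> List[str]:
    """Return the sorted paths that should be copied from source to destination."""
    to_copy = []
    for path, src_meta in source.items():
        dst_meta = destination.get(path)
        if dst_meta is None:
            to_copy.append(path)   # missing at the destination
        elif dst_meta != src_meta and overwrite:
            to_copy.append(path)   # differs, and overwriting is allowed
    return sorted(to_copy)


source = {
    "a.txt": FileMeta(10, 1000.0, 0o644),
    "b.txt": FileMeta(20, 2000.0, 0o644),
    "c.txt": FileMeta(30, 3000.0, 0o600),
}
destination = {
    "a.txt": FileMeta(10, 1000.0, 0o644),  # identical, so it is skipped
    "b.txt": FileMeta(20, 1500.0, 0o644),  # different mtime, so it is copied
}

print(plan_transfer(source, destination))                   # ['b.txt', 'c.txt']
print(plan_transfer(source, destination, overwrite=False))  # ['c.txt']
```

Setting overwrite to False models the option of not overwriting files that are already present in the destination: only files missing from the destination are planned for transfer.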

Note

By default, any changes to metadata on the source storage result in this metadata being copied to the destination storage.

After the TRANSFERRING phase is done, and if verification is enabled, DataSync verifies consistency between the source and destination file systems. This is the VERIFYING phase.

When DataSync transfers data, it always performs data integrity checks during the transfer. You can enable additional verification to compare the source and destination at the end of a transfer. This additional check can verify the entire dataset or only the files that were transferred as part of the task execution. For most use cases, we recommend verifying only the files transferred.

How AWS DataSync verifies data integrity

AWS DataSync locally calculates the checksum of every file in the source file system and in the destination, and compares them. DataSync also compares the metadata of every file in the source and destination. If there are differences in either one, verification fails with an error code that specifies precisely what failed. For example, you might see error codes such as Checksum failure, Metadata failure, Files were added, Files were removed, and so on.

For more information, see Understanding task creation statuses and Enable verification in the Configuring task settings section.
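
The checksum half of this comparison can be sketched for two local directory trees as follows. This is an illustration under stated assumptions, not the DataSync implementation: SHA-256 is an arbitrary choice of hash, the finding messages are made up for the example, the function names are hypothetical, and metadata comparison is left out.

```python
import hashlib
import os
import tempfile
from typing import Dict, List


def checksums(root: str) -> Dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    result = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            # Reads the whole file at once; fine for a sketch, not for large files.
            with open(full, "rb") as fh:
                digest = hashlib.sha256(fh.read()).hexdigest()
            result[os.path.relpath(full, root)] = digest
    return result


def verify(source_root: str, dest_root: str) -> List[str]:
    """Return findings in path order; an empty list means the trees match."""
    src, dst = checksums(source_root), checksums(dest_root)
    findings = []
    for path in sorted(src.keys() | dst.keys()):
        if path not in dst:
            findings.append(f"missing at destination: {path}")
        elif path not in src:
            findings.append(f"extra at destination: {path}")
        elif src[path] != dst[path]:
            findings.append(f"checksum mismatch: {path}")
    return findings


# Small demonstration with two temporary directories.
with tempfile.TemporaryDirectory() as src_dir, tempfile.TemporaryDirectory() as dst_dir:
    for root, body in ((src_dir, b"hello"), (dst_dir, b"hellp")):
        with open(os.path.join(root, "data.bin"), "wb") as fh:
            fh.write(body)
    with open(os.path.join(src_dir, "only_in_source.txt"), "wb") as fh:
        fh.write(b"x")
    demo_findings = verify(src_dir, dst_dir)

print(demo_findings)
```

In the demonstration, data.bin differs by one byte and only_in_source.txt was never copied, so both are reported.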

How DataSync handles open and locked files

In general, DataSync can transfer open files without any limitations.

If a file is open and it's being written to during the transfer, DataSync detects the resulting data inconsistency during the VERIFYING phase, which is when DataSync checks whether the file on the source is different from the file on the destination.

If a file is locked and the server prevents DataSync from opening it, DataSync skips transferring it. DataSync logs an error during the TRANSFERRING phase and sends a verification error.
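
The skip-and-log behavior for locked files can be sketched as a copy loop that records a failure and moves on instead of aborting. Everything here is hypothetical: the in-memory file store, the fake_read helper, and the use of PermissionError to represent a server that refuses to open a locked file.

```python
from typing import Callable, Dict, List, Tuple


def transfer_files(
    paths: List[str],
    read_file: Callable[[str], bytes],
) -> Tuple[Dict[str, bytes], List[str]]:
    """Copy each readable file; record files that cannot be opened as errors."""
    copied: Dict[str, bytes] = {}
    errors: List[str] = []
    for path in paths:
        try:
            copied[path] = read_file(path)
        except PermissionError as exc:
            # The server refused to open the file, so skip it and log the error.
            errors.append(f"{path}: {exc}")
    return copied, errors


# Hypothetical in-memory "server" on which one file is locked.
FILES = {"report.docx": b"draft", "notes.txt": b"todo"}
LOCKED = {"report.docx"}


def fake_read(path: str) -> bytes:
    if path in LOCKED:
        raise PermissionError("file is locked")
    return FILES[path]


copied, errors = transfer_files(sorted(FILES), fake_read)
print(copied)
print(errors)
```

The unlocked file is transferred, and the locked file appears only in the error list, which mirrors a transfer that continues past files it cannot open.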