How AWS DataSync Works

In this section, you can find information about components, terms, and how DataSync works.

AWS DataSync Architecture

The architectural diagrams show how DataSync transfers data between self-managed storage systems and AWS storage services, and between in-cloud storage systems and AWS storage services.

For a list of all DataSync supported source and destination endpoints, see Working with Locations.

Transfer Data from Self-Managed Storage to AWS

In the following diagram, you can see a high-level view of the DataSync architecture for transferring files between self-managed storage and AWS services.

Transfer Data from In-Cloud NFS to In-Cloud NFS or S3

DataSync can transfer data from an in-cloud file system to AWS. To perform this transfer, the DataSync agent must be located in the same AWS Region and same AWS account where the source file system is deployed. This type of transfer includes the following:

  • Transfers from Amazon EFS or Amazon FSx for Windows File Server file systems to Amazon EFS

  • Transfers from self-managed file systems to managed file systems

  • Transfers between in-cloud file systems and Amazon S3

Important

Deploy your agent so that it doesn't require network traffic between Availability Zones. Doing so avoids charges for that traffic.

  • To access your Amazon EFS or Amazon FSx for Windows File Server file system, deploy the agent in an Availability Zone that has a mount target to your file system.

  • For self-managed file systems, deploy the agent in the Availability Zone where your file system resides.

To learn more about data transfer prices for all AWS Regions, see Amazon EC2 On-Demand Pricing.

For example, the following diagram shows a high-level view of the DataSync architecture for transferring data from in-cloud NFS to in-cloud NFS or Amazon S3.

Note

Deploy the agent in the AWS Region and AWS account where the source EFS or self-managed NFS file system resides.

For detailed instructions on how to get started, see Getting Started with AWS DataSync.

Transfer from S3 to In-Cloud NFS

DataSync can transfer data from S3 to an in-cloud file system that is located in the same AWS account and AWS Region where the agent is deployed.

In the following diagram, you can see a high-level view of the DataSync architecture for transferring data from S3 to an in-cloud NFS file system.

Components and Terminology

The components of DataSync include the following:

  • Agent – A virtual machine used to read data from or write data to a self-managed location.

  • Location – Any source or destination location used in the data transfer (for example, Amazon S3, Amazon EFS, Amazon FSx for Windows File Server, NFS, SMB, or self-managed object storage).

  • Task – Two locations (source and destination), together with the configuration of how to transfer the data from one location to the other. Configuration settings can include options such as how to treat metadata, deleted files, and permissions. A task is the complete definition of a data transfer.

  • Task execution – An individual run of a task, which includes options such as start time, end time, bytes written, and status.
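The relationship among these components can be sketched as a small data model. This is an illustrative sketch only: the class names, field names, and example URIs are hypothetical and don't mirror the DataSync API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Location:
    """A source or destination endpoint (illustrative, not the DataSync API)."""
    kind: str                         # for example "NFS", "SMB", "S3", or "EFS"
    uri: str                          # hypothetical identifier for the endpoint
    agent_name: Optional[str] = None  # set only for self-managed locations


@dataclass
class TaskExecution:
    """One run of a task, with the kind of details a run reports."""
    status: str = "LAUNCHING"
    bytes_written: int = 0


@dataclass
class Task:
    """Two locations plus the settings that govern the transfer."""
    source: Location
    destination: Location
    options: dict = field(default_factory=dict)
    executions: list = field(default_factory=list)

    def start(self) -> TaskExecution:
        # Each call records a new, independent run of the same definition.
        execution = TaskExecution()
        self.executions.append(execution)
        return execution


task = Task(
    source=Location("NFS", "nfs://fileserver.example/export", agent_name="agent-1"),
    destination=Location("S3", "s3://example-bucket/prefix"),
    options={"preserve_deleted_files": True},  # hypothetical option name
)
first_run = task.start()
second_run = task.start()
print(len(task.executions))
```

The point of the sketch is the separation of concerns: a task is a reusable definition, and each task execution is a separate record of one run of that definition.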

Agent

An agent is a virtual machine (VM) that is owned by the user and is used to read data from or write data to self-managed storage systems. The agent can be deployed on VMware ESXi, KVM, and Microsoft Hyper-V hypervisors. You use the AWS DataSync Management Console or the API to set up and activate your agent. The activation process associates your agent VM with your AWS account. For information about agents, see Working with Agents.

An agent that is functioning properly has the status ONLINE. If an agent is unable to communicate with AWS, it transitions to OFFLINE status. This transition can result from issues with a network partition, firewall misconfiguration, and other events that make the agent VM unable to connect to AWS. The status of an agent that is powered off also shows as OFFLINE.

Location

A location is an endpoint of a task. Each task has two locations: a source location and a destination location. AWS DataSync supports Network File System (NFS), Server Message Block (SMB), self-managed object storage, Amazon EFS, Amazon FSx for Windows File Server, and Amazon S3 as location types. For more information, see Working with Locations.

Task

A task includes two locations (source and destination), and the configuration of how to transfer the data from one location to the other. Configuration settings can include options such as how to treat metadata, deleted files, and permissions. A task is the complete definition of a data transfer.

Task Execution

A task execution is an individual run of a task, which shows information such as start time, end time, number of transferred files, and status.

A task execution has five transition phases and two terminal statuses, as shown in the following diagram.

If the VerifyMode option is not enabled, a terminal status occurs after the TRANSFERRING phase. Otherwise, it occurs after the VERIFYING phase. The two terminal statuses are these:

  • SUCCESS

  • ERROR

For detailed information about these phases and statuses, see Understanding Task Execution Statuses.
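The effect of the VerifyMode option on the path to a terminal status can be sketched as follows. This is a simplified model, not DataSync code: it uses only the phase names that this section mentions, the function name and its parameters are hypothetical, and it assumes that a failure in any phase leads directly to ERROR.

```python
from typing import List, Optional

# Phase names as they appear in this section, in the order the text gives them.
NAMED_PHASES = ["LAUNCHING", "PREPARING", "TRANSFERRING", "VERIFYING"]
TERMINAL_STATUSES = ("SUCCESS", "ERROR")


def status_history(verify_enabled: bool, fails_in: Optional[str] = None) -> List[str]:
    """Return the statuses one run passes through, ending in a terminal status."""
    # Without verification, the run ends right after TRANSFERRING.
    phases = NAMED_PHASES if verify_enabled else NAMED_PHASES[:-1]
    history = []
    for phase in phases:
        history.append(phase)
        if phase == fails_in:
            # Assumption: a failure in any phase ends the run with ERROR.
            history.append("ERROR")
            return history
    history.append("SUCCESS")
    return history


print(status_history(verify_enabled=True))
print(status_history(verify_enabled=False))
print(status_history(verify_enabled=True, fails_in="VERIFYING"))
```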

How DataSync Transfers Files

When a task starts, it goes through several statuses: LAUNCHING, PREPARING, TRANSFERRING, and VERIFYING. In the LAUNCHING status, DataSync initializes the task execution. In the PREPARING status, DataSync examines the source and destination file systems to determine which files to sync. It does so by recursively scanning the contents of the source and destination file systems for differences. The time that DataSync spends in the PREPARING status depends on the number of files in both the source and destination file systems. It also depends on the performance of these file systems, and usually takes between a few minutes and a few hours. For more information, see Starting a Task.

After the scan is done and the differences are calculated, DataSync transitions to the TRANSFERRING status. At this point, DataSync starts transferring files from the source file system to the destination. Only the changes are applied: files that have been added, modified, or deleted.
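The scan-and-compare step can be illustrated with a short sketch that walks two directory trees and classifies the differences. This section doesn't say which attributes DataSync compares, so the sketch assumes file size and modification time. The function names and demo files are hypothetical.

```python
import os
import tempfile
from pathlib import Path
from typing import Dict, List, Tuple

Snapshot = Dict[str, Tuple[int, int]]


def scan(root: Path) -> Snapshot:
    """Recursively map each file's relative path to (size, mtime in ns)."""
    snapshot: Snapshot = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = Path(dirpath, name)
            info = full.stat()
            snapshot[full.relative_to(root).as_posix()] = (info.st_size, info.st_mtime_ns)
    return snapshot


def diff(source: Snapshot, destination: Snapshot) -> Dict[str, List[str]]:
    """Classify paths as added, modified, or deleted relative to the destination."""
    common = source.keys() & destination.keys()
    return {
        "added": sorted(source.keys() - destination.keys()),
        "modified": sorted(p for p in common if source[p] != destination[p]),
        "deleted": sorted(destination.keys() - source.keys()),
    }


def write_file(path: Path, text: str, mtime: int) -> None:
    """Create a demo file with a fixed modification time (in seconds)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text)
    os.utime(path, (mtime, mtime))


with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp, "src"), Path(tmp, "dst")
    write_file(src / "a.txt", "same", 1000)        # unchanged on both sides
    write_file(dst / "a.txt", "same", 1000)
    write_file(src / "b.txt", "new content", 2000)  # changed at the source
    write_file(dst / "b.txt", "old", 1500)
    write_file(src / "sub" / "d.txt", "added", 3000)  # exists only at the source
    write_file(dst / "c.txt", "stale", 1000)          # exists only at the destination
    changes = diff(scan(src), scan(dst))

print(changes)
```

Because both trees are scanned in full before anything is copied, the cost of this step grows with the number of files on both sides, which matches the behavior described for the PREPARING status.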

When creating or starting a task, you can configure options that determine which metadata from the source file system you want to preserve. You can also configure your task's settings to keep or delete files in the destination that aren't found in the source file system.

After the TRANSFERRING phase is done, DataSync verifies consistency between the source and destination file systems. This is the VERIFYING phase. By default, DataSync performs a full consistency verification of your source and destination. DataSync rescans the content of the source and destination for any differences. If no differences are found, the task succeeds. Otherwise, the task is marked with a verification failure. For information about DataSync status, see Understanding Task Creation Statuses.

How AWS DataSync Verifies Data Integrity

AWS DataSync locally calculates the checksum of every file in the source file system and in the destination, and compares them. DataSync also compares the metadata of every file in the source and destination. If there are differences in either one, verification fails with an error code that specifies precisely what failed. For example, you might see error codes such as Checksum failure, Metadata failure, Files were added, Files were removed, and so on.

For more information, see Understanding Task Creation Statuses and Enable verification in the Configuring Task Settings section.
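As an illustration of this kind of check, the following sketch compares checksums and metadata for two directory trees. It is a minimal example under stated assumptions, not the DataSync implementation: SHA-256 as the checksum, permission bits and modification time as the metadata, and the failure messages are all choices made for this sketch.

```python
import hashlib
import os
import stat
import tempfile
from pathlib import Path
from typing import List, Set, Tuple


def checksum(path: Path) -> str:
    """Hash a file in chunks so large files don't have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def metadata(path: Path) -> Tuple[int, int]:
    """Assumed metadata for this sketch: permission bits and mtime in ns."""
    info = path.stat()
    return stat.S_IMODE(info.st_mode), info.st_mtime_ns


def files_under(root: Path) -> Set[str]:
    return {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}


def verify(source: Path, destination: Path) -> List[str]:
    """Return a list of failures; an empty list means verification passed."""
    src, dst = files_under(source), files_under(destination)
    failures = [f"missing in destination: {p}" for p in sorted(src - dst)]
    failures += [f"unexpected in destination: {p}" for p in sorted(dst - src)]
    for rel in sorted(src & dst):
        if checksum(source / rel) != checksum(destination / rel):
            failures.append(f"checksum mismatch: {rel}")
        if metadata(source / rel) != metadata(destination / rel):
            failures.append(f"metadata mismatch: {rel}")
    return failures


def write_file(path: Path, data: bytes, mtime: int) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    os.utime(path, (mtime, mtime))


with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp, "src"), Path(tmp, "dst")
    write_file(src / "ok.txt", b"hello", 1000)
    write_file(dst / "ok.txt", b"hello", 1000)
    write_file(src / "data.bin", b"abc", 1000)   # same size, different bytes
    write_file(dst / "data.bin", b"abd", 1000)
    write_file(src / "meta.txt", b"same", 2000)  # same bytes, different mtime
    write_file(dst / "meta.txt", b"same", 1000)
    failures = verify(src, dst)

print(failures)
```

Reporting checksum and metadata mismatches separately mirrors the idea that a verification error should say precisely what failed.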

How DataSync Handles Open and Locked Files

In general, DataSync can transfer open files without any limitations.

If a file is open and is being written to during the transfer, the copy on the destination can differ from the file on the source. DataSync detects this kind of data inconsistency in the VERIFYING phase.

If a file is locked and the server prevents DataSync from opening it, DataSync skips transferring it. DataSync logs an error during the TRANSFERRING phase and sends a verification error.
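The skip-and-report behavior for locked files can be sketched as a transfer loop that continues past files it can't open. This is a simplified model under stated assumptions: the in-memory "server", the file names, and the use of PermissionError to represent a server-side lock are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def transfer(
    paths: Iterable[str], read_source: Callable[[str], bytes]
) -> Tuple[Dict[str, bytes], List[str]]:
    """Copy what can be opened; record an error for each file that can't."""
    copied: Dict[str, bytes] = {}
    errors: List[str] = []
    for path in paths:
        try:
            copied[path] = read_source(path)
        except PermissionError as exc:
            # Assumption: a server-side lock surfaces as PermissionError.
            errors.append(f"skipped {path}: {exc}")
    return copied, errors


# Hypothetical in-memory "server" with one locked file.
FILES = {"report.txt": b"q3 numbers", "ledger.db": b"binary data"}
LOCKED = {"ledger.db"}


def read_source(path: str) -> bytes:
    if path in LOCKED:
        raise PermissionError("file is locked by the server")
    return FILES[path]


copied, errors = transfer(sorted(FILES), read_source)
print(copied)
print(errors)
```

In this model, one locked file doesn't stop the rest of the transfer. The skipped file is recorded so that a later verification step can report it.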