AWS DataSync
User Guide

How AWS DataSync Works

In this section, you can find information about components, terms, and how DataSync works.

AWS DataSync Architecture

The architectural diagrams show how DataSync transfers data between on-premises storage systems and AWS storage services, and between in-cloud storage systems and AWS storage services.

Source (From)                    Destination (To)
NFS file system                  Amazon EFS file system
NFS file system                  Amazon S3
Amazon EFS                       NFS file system
NFS file system or Amazon EFS    Amazon S3
Amazon S3                        NFS file system or Amazon EFS

Note

When copying between two Amazon EFS file systems, we recommend using the NFS (source) to EFS (destination) use case.

Transfer Data from On-Premises to AWS

In the following diagram, you can see a high-level view of the DataSync architecture for transferring files between on-premises storage and AWS services.

Transfer Data from In-Cloud NFS to In-Cloud NFS or S3

DataSync can transfer data from an in-cloud NFS file system to AWS. To perform this transfer, the DataSync agent must be located in the same AWS Region and same AWS account where the file system is deployed. This type of transfer includes transfers from EFS to EFS, transfers from self-managed NFS to Amazon EFS, and transfers to S3.

In the following diagram, you can see a high-level view of the DataSync architecture for transferring data from in-cloud NFS to in-cloud NFS or S3.

Note

Deploy the agent in the AWS Region and AWS account where the source EFS or self-managed NFS file system resides.

For detailed instructions on how to get started, see Getting Started with AWS DataSync.

Transfer from S3 to In-Cloud NFS

DataSync can transfer data from S3 to an in-cloud NFS file system that is located in the same AWS account and AWS Region where the agent is deployed. This approach includes transfers from S3 to EFS, or from S3 to self-managed NFS.

In the following diagram, you can see a high-level view of the DataSync architecture for transferring data from S3 to an in-cloud NFS file system.

Components and Terminology

The components of DataSync include the following:

  • Agent – a virtual machine used to read data from or write data to an on-premises location.

  • Location – any source or destination location used in the data transfer (for example, Amazon S3 or Amazon EFS).

  • Task – a task includes two locations (source and destination), and also the configuration of how to transfer the data from one location to the other. Configuration settings can include options such as how to treat metadata, deleted files, and permissions. A task is the complete definition of a data transfer.

  • Task execution – an individual run of a task, which includes options such as start time, end time, bytes written, and status.

Agent

An agent is a virtual machine (VM) that is owned by the user, and is used to read data from or write data to an on-premises storage system. The agent is currently deployed on a VMware ESXi hypervisor. You use the AWS DataSync Management Console or the API to set up and activate your agent. The activation process associates your agent VM with your AWS account. For information about agents, see Working with Agents.

An agent that is functioning properly has the status ONLINE. If an agent is unable to communicate with AWS, it transitions to OFFLINE status. This transition can result from a network partition, a firewall misconfiguration, or other events that make the agent VM unable to connect to AWS. The status of an agent that is powered off also shows as OFFLINE.
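As a minimal sketch of how the ONLINE and OFFLINE statuses might be checked through the API, the following Python function collects the agents that report OFFLINE. It assumes that `client` is a boto3 DataSync client (for example, `boto3.client("datasync")`) and relies on its `list_agents` operation. The function name `offline_agents` is hypothetical and used only for illustration.

```python
def offline_agents(client):
    """Return the ARNs of agents whose reported status is OFFLINE.

    Assumption: `client` is a boto3 DataSync client, or any object
    that exposes a compatible list_agents() method.
    """
    offline = []
    kwargs = {}
    while True:
        page = client.list_agents(**kwargs)
        for agent in page.get("Agents", []):
            if agent.get("Status") == "OFFLINE":
                offline.append(agent["AgentArn"])
        # list_agents is paginated; follow NextToken until it is absent.
        token = page.get("NextToken")
        if not token:
            return offline
        kwargs = {"NextToken": token}
```

An OFFLINE result indicates only that the agent can't reach AWS. The cause (network partition, firewall misconfiguration, or a powered-off VM) still has to be investigated on the agent side.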

Location

A location is an endpoint of a task. Each task has two locations: a source location and a destination location. AWS DataSync supports Network File System (NFS), Amazon EFS, and Amazon S3 as location types. For more information, see Working with Locations.

Task

A task includes two locations (source and destination), and the configuration of how to transfer the data from one location to the other. Configuration settings can include options such as how to treat metadata, deleted files, and permissions. A task is the complete definition of a data transfer.
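To make the relationship between locations, options, and a task more concrete, the following Python sketch defines a task from two existing locations. It assumes that `client` is a boto3 DataSync client and that both location ARNs were already created. The function name `create_sync_task`, the `keep_deleted_files` parameter, and the chosen option values are illustrative assumptions, not recommended settings.

```python
def create_sync_task(client, source_location_arn, destination_location_arn,
                     name, keep_deleted_files=True):
    """Create a task from two existing locations and return its ARN.

    Assumption: `client` is a boto3 DataSync client, or any object
    that exposes a compatible create_task() method.
    """
    options = {
        # Verify source and destination after the transfer.
        "VerifyMode": "POINT_IN_TIME_CONSISTENT",
        # Keep or remove destination files that no longer exist in the source.
        "PreserveDeletedFiles": "PRESERVE" if keep_deleted_files else "REMOVE",
        # Carry POSIX permissions over to the destination.
        "PosixPermissions": "PRESERVE",
    }
    response = client.create_task(
        SourceLocationArn=source_location_arn,
        DestinationLocationArn=destination_location_arn,
        Name=name,
        Options=options,
    )
    return response["TaskArn"]
```

The returned task ARN identifies the complete definition of the transfer. Each run of that definition is a separate task execution.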

Task Execution

A task execution is an individual run of a task, which shows information such as start time, end time, number of transferred files, and status.

A task execution has four transition phases and two terminal statuses, as shown in the following diagram.

If the VerifyMode option is not enabled, a terminal status occurs after the TRANSFERRING phase. Otherwise, it occurs after the VERIFYING phase. The two terminal statuses are these:

  • SUCCESS

  • ERROR

For detailed information about these phases and statuses, see Understanding Task Creation Statuses.
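The progression from transition phases to a terminal status can be observed by polling a task execution. The following Python sketch assumes that `client` is a boto3 DataSync client and uses its `describe_task_execution` operation. The function name `wait_for_task_execution`, the polling interval, and the injectable `sleep` parameter are hypothetical choices made for illustration.

```python
import time

# The two terminal statuses described in this section.
TERMINAL_STATUSES = {"SUCCESS", "ERROR"}


def wait_for_task_execution(client, task_execution_arn,
                            poll_seconds=30, sleep=time.sleep):
    """Poll a task execution until it reaches a terminal status.

    Returns the distinct statuses observed, in order. Assumption:
    `client` is a boto3 DataSync client, or any object that exposes a
    compatible describe_task_execution() method.
    """
    seen = []
    while True:
        response = client.describe_task_execution(
            TaskExecutionArn=task_execution_arn
        )
        status = response["Status"]
        # Record a status only when it changes between polls.
        if not seen or seen[-1] != status:
            seen.append(status)
        if status in TERMINAL_STATUSES:
            return seen
        sleep(poll_seconds)
```

Because the function only samples the status at each poll, a short-lived phase can be missed between two polls. The returned list is therefore an observation, not a complete history.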

How DataSync Transfers Files

When a task starts, it goes through the following statuses: LAUNCHING, PREPARING, TRANSFERRING, and VERIFYING. In the LAUNCHING status, DataSync initializes the task execution. In the PREPARING status, DataSync examines the source and destination file systems to determine which files to sync. It does so by recursively scanning the contents of the source and destination file systems for differences. The time that DataSync spends in the PREPARING status depends on the number of files in both the source and destination file systems. It also depends on the performance of these file systems, and usually ranges from a few minutes to a few hours. For more information, see Starting a Task.

After the scanning is done and the differences are calculated, DataSync transitions to the TRANSFERRING status. At this point, DataSync starts transferring files from the source file system to the destination. Only the differences are processed: files that have been added or modified in the source are copied, and files that have been deleted from the source are handled according to the task's settings.

When creating or starting a task, you can configure options that determine which metadata in the source file system you want to preserve. You can also configure your task's settings to keep or delete files in the destination that aren't found in the source file system.
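The idea behind scanning both sides for differences can be sketched in a few lines of Python. This is not DataSync's actual implementation, which this guide doesn't describe. It is an illustration that assumes two locally mounted directory trees and treats a file as changed when its size or modification time differs. The function names `scan` and `plan_transfer` are hypothetical.

```python
import os


def scan(root):
    """Map each file path (relative to root) to (size, mtime in whole seconds)."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            info = os.stat(full_path)
            relative = os.path.relpath(full_path, root)
            index[relative] = (info.st_size, int(info.st_mtime))
    return index


def plan_transfer(source_root, destination_root):
    """Return (files to copy, files that exist only in the destination)."""
    source = scan(source_root)
    destination = scan(destination_root)
    # New files, plus files whose size or mtime differs (assumed change test).
    to_copy = sorted(
        path for path in source
        if path not in destination or source[path] != destination[path]
    )
    # Candidates to keep or delete, depending on the task's settings.
    only_in_destination = sorted(
        path for path in destination if path not in source
    )
    return to_copy, only_in_destination
```

Both trees are walked in full, so the cost of this step grows with the number of files on each side, which is consistent with the PREPARING time depending on file counts.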

After the TRANSFERRING phase is done, DataSync verifies consistency between the source and destination file systems. This is the VERIFYING phase. By default, DataSync performs a full consistency verification of your source and destination. DataSync rescans the content of the source and destination for any differences. If no differences are found, the task succeeds. Otherwise, the task is marked with a verification failure. For information about DataSync status, see Understanding Task Creation Statuses.

How AWS DataSync Verifies Data Integrity

AWS DataSync locally calculates the checksum of every file in the source file system and in the destination, and compares them. Additionally, DataSync compares the metadata of every file in the source and destination. If there are differences in either one, verification fails with an error code that specifies precisely what failed. For example, you might see error codes such as Checksum failure, Metadata failure, Files were added, Files were removed, and so on.
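The checksum comparison can be illustrated with a short Python sketch. This guide doesn't state which checksum algorithm DataSync uses, so SHA-256 is an assumption here, as are the locally mounted directory trees and the hypothetical function names `file_checksum`, `checksum_index`, and `compare_trees`. The metadata comparison is omitted for brevity.

```python
import hashlib
import os


def file_checksum(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_index(root):
    """Map each file path (relative to root) to its checksum."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            index[os.path.relpath(full_path, root)] = file_checksum(full_path)
    return index


def compare_trees(source_root, destination_root):
    """Report files that are missing, extra, or different in the destination."""
    source = checksum_index(source_root)
    destination = checksum_index(destination_root)
    common = source.keys() & destination.keys()
    return {
        "missing_in_destination": sorted(source.keys() - destination.keys()),
        "extra_in_destination": sorted(destination.keys() - source.keys()),
        "checksum_mismatch": sorted(
            path for path in common if source[path] != destination[path]
        ),
    }
```

An empty list under every key corresponds to a successful verification in this simplified model. Any non-empty list corresponds to a verification failure of the matching kind.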

For more information, see Understanding Task Creation Statuses and Enable verification in the Configuring Task Settings section.

How DataSync Handles Open and Locked Files

In general, DataSync can transfer open files without any limitations.

If a file is open and it's being written to during the transfer, DataSync detects data inconsistency (that is, the file on the source is different from the file on the destination) in the VERIFYING phase.

If a file is locked and the NFS server prevents DataSync from opening it, DataSync skips transferring it, logs the error during the TRANSFERRING phase, and reports a verification error.

Videos and Blogs

You can use this video and these blogs to learn more about AWS DataSync: