Configuring AWS DataSync transfers with Amazon S3

To transfer data to or from your S3 bucket, you must create an AWS DataSync transfer location. DataSync can use this location as a source or destination for transferring data.

Important

Before you create your location, make sure that you read the sections that follow, which cover bucket access, storage class considerations, and S3 request costs.

Accessing S3 buckets

DataSync requires access to your Amazon S3 bucket. To get this access, DataSync assumes an AWS Identity and Access Management (IAM) role with an IAM policy and an AWS Security Token Service (AWS STS) trust relationship. The policy determines which actions the role can perform.

DataSync can create this role for you, but there are situations where you may need to create a role manually. For more information, see Using IAM policies to access your S3 bucket.

Storage class considerations with Amazon S3 locations

DataSync can transfer objects directly into the Amazon S3 storage class that you specify when creating your Amazon S3 location. Some storage classes have behaviors that can affect your Amazon S3 storage costs. For more information, see Amazon S3 pricing.

Important

New objects copied to an S3 bucket are stored using the storage class that you specify when creating your Amazon S3 location. DataSync won't change the storage class of existing objects in the bucket (even if those objects were modified in the source location).

Amazon S3 storage class and considerations

S3 Standard

Choose S3 Standard to store your frequently accessed files redundantly in multiple Availability Zones that are geographically separated. This is the default if you don't specify a storage class.

S3 Intelligent-Tiering

Choose S3 Intelligent-Tiering to optimize storage costs by automatically moving data to the most cost-effective storage access tier.

You pay a monthly charge per object stored in the S3 Intelligent-Tiering storage class. This Amazon S3 charge includes monitoring data access patterns and moving objects between tiers.

S3 Standard-IA

Choose S3 Standard-IA to store your infrequently accessed objects redundantly in multiple Availability Zones that are geographically separated.

Objects stored in the S3 Standard-IA storage class can incur additional charges for overwriting, deleting, or retrieving. Consider how often these objects change, how long you plan to keep these objects, and how often you need to access them. Changes to object data or metadata are equivalent to deleting an object and creating a new one to replace it. This results in additional charges for objects stored in the S3 Standard-IA storage class.

Objects less than 128 KB are smaller than the minimum capacity charge per object in the S3 Standard-IA storage class. These objects are stored in the S3 Standard storage class.

S3 One Zone-IA

Choose S3 One Zone-IA to store your infrequently accessed objects in a single Availability Zone.

Objects stored in the S3 One Zone-IA storage class can incur additional charges for overwriting, deleting, or retrieving. Consider how often these objects change, how long you plan to keep these objects, and how often you need to access them. Changes to object data or metadata are equivalent to deleting an object and creating a new one to replace it. This results in additional charges for objects stored in the S3 One Zone-IA storage class.

Objects less than 128 KB are smaller than the minimum capacity charge per object in the S3 One Zone-IA storage class. These objects are stored in the S3 Standard storage class.

S3 Glacier Instant Retrieval

Choose S3 Glacier Instant Retrieval to archive objects that are rarely accessed but require retrieval in milliseconds.

Data stored in the S3 Glacier Instant Retrieval storage class offers cost savings compared to the S3 Standard-IA storage class with the same latency and throughput performance. However, S3 Glacier Instant Retrieval has higher data access costs than S3 Standard-IA.

Objects stored in S3 Glacier Instant Retrieval can incur additional charges for overwriting, deleting, or retrieving. Consider how often these objects change, how long you plan to keep these objects, and how often you need to access them. Changes to object data or metadata are equivalent to deleting an object and creating a new one to replace it. This results in additional charges for objects stored in the S3 Glacier Instant Retrieval storage class.

Objects less than 128 KB are smaller than the minimum capacity charge per object in the S3 Glacier Instant Retrieval storage class. These objects are stored in the S3 Standard storage class.

S3 Glacier Flexible Retrieval

Choose S3 Glacier Flexible Retrieval for more active archives.

Objects stored in S3 Glacier Flexible Retrieval can incur additional charges for overwriting, deleting, or retrieving. Consider how often these objects change, how long you plan to keep these objects, and how often you need to access them. Changes to object data or metadata are equivalent to deleting an object and creating a new one to replace it. This results in additional charges for objects stored in the S3 Glacier Flexible Retrieval storage class.

Objects less than 40 KB are smaller than the minimum capacity charge per object in the S3 Glacier Flexible Retrieval storage class. These objects are stored in the S3 Standard storage class.

You must restore objects archived in this storage class before DataSync can read them. For information, see Working with archived objects in the Amazon S3 User Guide.

When using S3 Glacier Flexible Retrieval, choose the Verify only the data transferred task option to compare data and metadata checksums at the end of the transfer. You can't use the Verify all data in the destination option for this storage class because it requires retrieving all existing objects from the destination.

S3 Glacier Deep Archive

Choose S3 Glacier Deep Archive to archive your objects for long-term data retention and digital preservation where data is accessed once or twice a year.

Objects stored in S3 Glacier Deep Archive can incur additional charges for overwriting, deleting, or retrieving. Consider how often these objects change, how long you plan to keep these objects, and how often you need to access them. Changes to object data or metadata are equivalent to deleting an object and creating a new one to replace it. This results in additional charges for objects stored in the S3 Glacier Deep Archive storage class.

Objects less than 40 KB are smaller than the minimum capacity charge per object in the S3 Glacier Deep Archive storage class. These objects are stored in the S3 Standard storage class.

You must restore objects archived in this storage class before DataSync can read them. For information, see Working with archived objects in the Amazon S3 User Guide.

When using S3 Glacier Deep Archive, choose the Verify only the data transferred task option to compare data and metadata checksums at the end of the transfer. You can't use the Verify all data in the destination option for this storage class because it requires retrieving all existing objects from the destination.

S3 Outposts

The storage class for Amazon S3 on Outposts.
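
The size thresholds and verification restrictions in the table above can be sketched as a small lookup. This is only an illustration under stated assumptions: the storage class and verify mode strings are assumed to match the enum values used by the DataSync API, 1 KB is taken as 1024 bytes, and the function names are hypothetical. Choosing POINT_IN_TIME_CONSISTENT for non-archive classes is this sketch's own choice, not a rule from the table.

```python
KB = 1024  # assumption: 1 KB = 1024 bytes

# Minimum object sizes quoted in the table above. Smaller objects are
# stored in S3 Standard instead of the requested class.
MIN_OBJECT_SIZE = {
    "STANDARD_IA": 128 * KB,
    "ONEZONE_IA": 128 * KB,
    "GLACIER_INSTANT_RETRIEVAL": 128 * KB,
    "GLACIER": 40 * KB,       # S3 Glacier Flexible Retrieval
    "DEEP_ARCHIVE": 40 * KB,  # S3 Glacier Deep Archive
}

# Classes for which "Verify all data in the destination" can't be used.
ARCHIVE_CLASSES = {"GLACIER", "DEEP_ARCHIVE"}


def effective_storage_class(requested, size_bytes):
    """Return the class an object of this size is expected to land in."""
    minimum = MIN_OBJECT_SIZE.get(requested)
    if minimum is not None and size_bytes < minimum:
        return "STANDARD"
    return requested


def verify_mode_for(requested):
    """Pick a verify mode that the requested storage class allows."""
    if requested in ARCHIVE_CLASSES:
        return "ONLY_FILES_TRANSFERRED"  # "Verify only the data transferred"
    return "POINT_IN_TIME_CONSISTENT"    # "Verify all data in the destination"


print(effective_storage_class("STANDARD_IA", 100 * KB))  # STANDARD
print(effective_storage_class("GLACIER", 50 * KB))       # GLACIER
print(verify_mode_for("DEEP_ARCHIVE"))                   # ONLY_FILES_TRANSFERRED
```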

Evaluating S3 request costs when using DataSync

With Amazon S3 locations, you incur costs related to S3 API requests made by DataSync. This section can help you understand how DataSync uses these requests and how they might affect your Amazon S3 costs.

S3 requests made by DataSync

The following table describes the S3 requests that DataSync can make when you’re copying data to or from an Amazon S3 location.

S3 request How DataSync uses it

ListObjectsV2

DataSync makes at least one LIST request for every object ending in a forward slash (/) to list the objects that start with that prefix. This request is called during a task’s preparing phase.

HeadObject

DataSync makes HEAD requests to retrieve object metadata during a task’s preparing and verifying phases. There can be multiple HEAD requests per object depending on how you want DataSync to verify the integrity of the data it transfers.

GetObject

DataSync makes GET requests to read data from an object during a task’s transferring phase. There can be multiple GET requests for large objects.

PutObject

DataSync makes PUT requests to create objects in a destination S3 bucket during a task’s transferring phase. Since DataSync uses the Amazon S3 multipart upload feature, there can be multiple PUT requests for large objects.

CopyObject

DataSync makes a COPY request to create a copy of an object only if that object’s metadata changes. This can happen if you originally copied data to the S3 bucket using another service or tool that didn’t carry over its metadata.

Cost considerations

DataSync makes S3 requests on S3 buckets every time you run your task. This can lead to charges adding up in certain situations. For example:

  • You’re frequently transferring objects to or from an S3 bucket.

  • You may not be transferring much data, but your S3 bucket has lots of objects in it. You can still see high charges in this scenario because DataSync makes S3 requests on each of the bucket's objects.

  • You're transferring between S3 buckets, so DataSync is making S3 requests on the source and destination.

To help minimize S3 request costs related to DataSync, consider the following:

What S3 storage classes am I using?

S3 request charges can vary based on the Amazon S3 storage class your objects are using, particularly for classes that archive objects (such as S3 Glacier Instant Retrieval, S3 Glacier Flexible Retrieval, and S3 Glacier Deep Archive).

Here are some scenarios in which storage classes can affect your S3 request charges when using DataSync:

  • Each time you run a task, DataSync makes HEAD requests to retrieve object metadata. These requests result in charges even if you aren’t moving any objects. How much these requests affect your bill depends on the storage class your objects are using along with the number of objects that DataSync scans.

  • If you moved objects into the S3 Glacier Instant Retrieval storage class (either directly or through a bucket lifecycle configuration), requests on objects in this class are more expensive than requests on objects in other storage classes.

  • If you configure your DataSync task to verify that your source and destination locations are fully synchronized, there will be GET requests for each object in all storage classes (except S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive).

  • In addition to GET requests, you incur data retrieval costs for objects in the S3 Standard-IA, S3 One Zone-IA, or S3 Glacier Instant Retrieval storage class.

For more information, see Amazon S3 pricing.
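
To get a feel for how per-object requests add up across task runs, a rough back-of-the-envelope estimate can help. The sketch below is not a pricing calculator: the function names are hypothetical, the one-request-per-object default is a simplification, and the price is a placeholder that you would replace with the current rate for your storage class and Region from Amazon S3 pricing.

```python
def estimate_requests(objects_scanned, runs, requests_per_object=1):
    """Rough count of per-object requests (for example, HEAD) over several task runs."""
    if min(objects_scanned, runs, requests_per_object) < 0:
        raise ValueError("inputs must be non-negative")
    return objects_scanned * runs * requests_per_object


def estimate_request_cost(requests, price_per_1000):
    """Cost of a number of requests, given a price per 1,000 requests."""
    return requests / 1000 * price_per_1000


# Example: a bucket with 1,000,000 objects scanned by a daily task for 30 days.
monthly_requests = estimate_requests(objects_scanned=1_000_000, runs=30)
print(monthly_requests)  # 30000000

# PLACEHOLDER_PRICE is a made-up figure for illustration, not a real S3 rate.
PLACEHOLDER_PRICE = 0.0004
print(round(estimate_request_cost(monthly_requests, PLACEHOLDER_PRICE), 2))
```

Even when no data changes between runs, the request count in this model scales with the number of objects scanned, which is why a bucket with many objects can still generate noticeable charges.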

How often do I need to transfer my data?

If you need to move data on a recurring basis, think about a schedule that doesn't run more tasks than you need.

You may also consider limiting the scope of your transfers. For example, you can configure DataSync to focus on objects in certain prefixes or filter what data gets transferred. These options can help reduce the number of S3 requests made each time you run your DataSync task.

Other considerations with Amazon S3 locations

When using Amazon S3 with DataSync, remember the following:

  • Changes to object data or metadata are equivalent to deleting and replacing an object. These changes result in additional charges in the following scenarios:

    • When using object versioning – Changes to object data or metadata create a new version of the object.

    • When using storage classes that can incur additional charges for overwriting, deleting, or retrieving objects – Changes to object data or metadata result in such charges. For more information, see Storage class considerations with Amazon S3 locations.

  • When using object versioning in Amazon S3, running a DataSync task once might create more than one version of an Amazon S3 object.

  • DataSync might not transfer an object if it has nonstandard characters in its name. For more information, see the object key naming guidelines in the Amazon S3 User Guide.

  • To help minimize your Amazon S3 storage costs, we recommend using a lifecycle configuration to stop incomplete multipart uploads. For more information, see the Amazon S3 User Guide.

  • After initially transferring data from an S3 bucket to a file system (for example, NFS or Amazon FSx), subsequent runs of the same DataSync task won't include objects that have been modified but are the same size they were during the first transfer.
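
As an illustration of the lifecycle recommendation above, the following sketch builds a lifecycle configuration that aborts incomplete multipart uploads. The shape follows the Amazon S3 lifecycle configuration structure; the rule ID, the 7-day window, and the function name are assumptions chosen for the example.

```python
import json


def abort_incomplete_uploads_config(days=7, prefix=""):
    """Build a lifecycle configuration that aborts stale multipart uploads.

    An empty prefix applies the rule to the whole bucket.
    """
    if days < 1:
        raise ValueError("days must be at least 1")
    return {
        "Rules": [
            {
                "ID": "abort-incomplete-multipart-uploads",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": days},
            }
        ]
    }


config = abort_incomplete_uploads_config()
print(json.dumps(config, indent=2))
```

A dictionary like this could be supplied as the lifecycle configuration when calling the S3 PutBucketLifecycleConfiguration operation through an SDK or the AWS CLI.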

Creating your Amazon S3 transfer location

To create the location, you need an existing S3 bucket. If you don't have one, see Getting started with Amazon S3 in the Amazon S3 User Guide.

Tip

If your S3 bucket has objects with different storage classes, learn how DataSync works with these storage classes and how it can affect your AWS bill.

To create an Amazon S3 location
  1. Open the AWS DataSync console at https://console.aws.amazon.com/datasync/.

  2. In the left navigation pane, expand Data transfer, then choose Locations and Create location.

  3. For Location type, choose Amazon S3.

  4. For S3 bucket, choose the bucket that you want to use as a location. (When creating your DataSync task later, you specify whether this location is a source or destination location.)

    If your S3 bucket is located on an AWS Outposts resource, you must specify an Amazon S3 access point. For more information, see Managing data access with Amazon S3 access points in the Amazon S3 User Guide.

  5. For S3 storage class, choose a storage class that you want your objects to use.

    For more information, see Storage class considerations with Amazon S3 locations. DataSync by default uses the S3 Outposts storage class for Amazon S3 on Outposts.

  6. (Amazon S3 on Outposts only) For Agents, specify the Amazon Resource Name (ARN) of the DataSync agent on your Outpost.

    For more information, see Deploy your agent on AWS Outposts.

  7. For Folder, enter a prefix in the S3 bucket that DataSync reads from or writes to (depending on whether the bucket is a source or destination location).

    Note

    The prefix can't begin with a slash (for example, /photos) or include consecutive slashes, such as photos//2006/January.

  8. For IAM role, do one of the following:

    • Choose Autogenerate for DataSync to automatically create an IAM role with the permissions required to access the S3 bucket.

      If DataSync previously created an IAM role for this S3 bucket, that role is chosen by default.

    • Choose a custom IAM role that you created. For more information, see Manually creating an IAM role to access your Amazon S3 bucket.

  9. (Optional) Choose Add tag to tag your Amazon S3 location.

    A tag is a key-value pair that helps you manage, filter, and search for your locations.

  10. Choose Create location.
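
The console steps above map roughly onto the parameters of the DataSync CreateLocationS3 API operation. The following sketch only assembles and validates the request parameters; it does not call AWS. The helper names and the example ARNs are hypothetical, and the prefix check covers only the two rules stated in the note in step 7.

```python
def validate_prefix(prefix):
    """Reject prefixes that begin with a slash or contain consecutive slashes."""
    if prefix.startswith("/"):
        raise ValueError("prefix can't begin with a slash: %r" % prefix)
    if "//" in prefix:
        raise ValueError("prefix can't include consecutive slashes: %r" % prefix)


def build_s3_location_request(bucket_arn, role_arn, prefix="",
                              storage_class="STANDARD",
                              agent_arns=None, tags=None):
    """Assemble keyword arguments for a CreateLocationS3 request."""
    validate_prefix(prefix)
    request = {
        "S3BucketArn": bucket_arn,
        "S3Config": {"BucketAccessRoleArn": role_arn},
        "S3StorageClass": storage_class,
    }
    if prefix:
        request["Subdirectory"] = prefix
    if agent_arns:  # Amazon S3 on Outposts only
        request["AgentArns"] = list(agent_arns)
    if tags:
        request["Tags"] = [{"Key": k, "Value": v} for k, v in tags.items()]
    return request


request = build_s3_location_request(
    bucket_arn="arn:aws:s3:::amzn-s3-demo-bucket",                    # example value
    role_arn="arn:aws:iam::123456789012:role/example-datasync-role",  # example value
    prefix="photos/2006/January",
    storage_class="STANDARD_IA",
    tags={"project": "archive"},
)
print(request)

# With boto3 installed and credentials configured, the request could be sent as:
#   boto3.client("datasync").create_location_s3(**request)
```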

Using IAM policies to access your S3 bucket

Depending on your S3 bucket's security settings, you may need to create a custom IAM policy that allows DataSync to access the bucket.

Manually creating an IAM role to access your Amazon S3 bucket

While DataSync can create an IAM role for you with the required S3 bucket permissions, you also can configure a role yourself.

To manually create an IAM role to access your Amazon S3 bucket
  1. Open the IAM console at https://console.aws.amazon.com/iam/.

  2. In the left navigation pane, under Access management, choose Roles, and then choose Create role.

  3. On the Select trusted entity page, for Trusted entity type, choose AWS service.

  4. For Use case, choose DataSync in the dropdown list and select DataSync - S3 Location. Choose Next.

  5. On the Add permissions page, choose AmazonS3FullAccess for S3 buckets in AWS Regions. Choose Next.

    You can manually create a more restrictive policy than AmazonS3FullAccess. Here's an example:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Action": [
                    "s3:GetBucketLocation",
                    "s3:ListBucket",
                    "s3:ListBucketMultipartUploads"
                ],
                "Effect": "Allow",
                "Resource": "YourS3BucketArn"
            },
            {
                "Action": [
                    "s3:AbortMultipartUpload",
                    "s3:DeleteObject",
                    "s3:GetObject",
                    "s3:ListMultipartUploadParts",
                    "s3:GetObjectTagging",
                    "s3:PutObjectTagging",
                    "s3:PutObject"
                ],
                "Effect": "Allow",
                "Resource": "YourS3BucketArn/*"
            }
        ]
    }

    For Amazon S3 on Outposts, use the following policy:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Action": [
                    "s3-outposts:ListBucket",
                    "s3-outposts:ListBucketMultipartUploads"
                ],
                "Effect": "Allow",
                "Resource": [
                    "s3OutpostsBucketArn",
                    "s3OutpostsAccessPointArn"
                ],
                "Condition": {
                    "StringLike": {
                        "s3-outposts:DataAccessPointArn": "s3OutpostsAccessPointArn"
                    }
                }
            },
            {
                "Action": [
                    "s3-outposts:AbortMultipartUpload",
                    "s3-outposts:DeleteObject",
                    "s3-outposts:GetObject",
                    "s3-outposts:ListMultipartUploadParts",
                    "s3-outposts:GetObjectTagging",
                    "s3-outposts:PutObjectTagging"
                ],
                "Effect": "Allow",
                "Resource": [
                    "s3OutpostsBucketArn/*",
                    "s3OutpostsAccessPointArn"
                ],
                "Condition": {
                    "StringLike": {
                        "s3-outposts:DataAccessPointArn": "s3OutpostsAccessPointArn"
                    }
                }
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3-outposts:GetAccessPoint"
                ],
                "Resource": "s3OutpostsAccessPointArn"
            }
        ]
    }

  6. Give your role a name and choose Create role.

  7. Open the AWS DataSync console at https://console.aws.amazon.com/datasync/.

  8. Select the refresh button next to the IAM role setting and then choose the role that you just created.
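
If you manage several buckets, you might generate the restrictive policy from step 5 rather than editing the ARN placeholders by hand. The following is a minimal sketch that reuses the actions from that example policy; the function name and the example bucket ARN are hypothetical.

```python
import json

# Actions from the example policy in step 5.
BUCKET_ACTIONS = [
    "s3:GetBucketLocation",
    "s3:ListBucket",
    "s3:ListBucketMultipartUploads",
]
OBJECT_ACTIONS = [
    "s3:AbortMultipartUpload",
    "s3:DeleteObject",
    "s3:GetObject",
    "s3:ListMultipartUploadParts",
    "s3:GetObjectTagging",
    "s3:PutObjectTagging",
    "s3:PutObject",
]


def datasync_s3_policy(bucket_arn):
    """Bucket-level actions target the bucket ARN; object-level actions target its objects."""
    if bucket_arn.endswith("/") or bucket_arn.endswith("/*"):
        raise ValueError("pass the bucket ARN itself, without a trailing / or /*")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": BUCKET_ACTIONS, "Resource": bucket_arn},
            {"Effect": "Allow", "Action": OBJECT_ACTIONS, "Resource": bucket_arn + "/*"},
        ],
    }


policy = datasync_s3_policy("arn:aws:s3:::amzn-s3-demo-bucket")  # example ARN
print(json.dumps(policy, indent=2))
```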

Preventing the cross-service confused deputy problem

To prevent the cross-service confused deputy problem, we recommend using the aws:SourceArn and aws:SourceAccount global condition context keys in your IAM role's trust policy.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "datasync.amazonaws.com"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "aws:SourceAccount": "123456789012"
                },
                "StringLike": {
                    "aws:SourceArn": "arn:aws:datasync:us-east-2:123456789012:*"
                }
            }
        }
    ]
}
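
The account ID and Region in the trust policy above must match your own environment. As a sketch, the same policy can be generated from those two values; the function name is hypothetical and the 12-digit check is a basic sanity check rather than full validation.

```python
import json


def datasync_trust_policy(account_id, region):
    """Trust policy that lets DataSync assume the role only on behalf of
    resources in the given account and Region."""
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("account_id must be a 12-digit string")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "datasync.amazonaws.com"},
                "Action": "sts:AssumeRole",
                "Condition": {
                    "StringEquals": {"aws:SourceAccount": account_id},
                    "StringLike": {
                        "aws:SourceArn": "arn:aws:datasync:%s:%s:*" % (region, account_id)
                    },
                },
            }
        ],
    }


trust_policy = datasync_trust_policy("123456789012", "us-east-2")
print(json.dumps(trust_policy, indent=2))
```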

Accessing S3 buckets using server-side encryption

DataSync can copy data to or from S3 buckets that use server-side encryption. The type of encryption key a bucket uses can determine if you need a custom policy allowing DataSync to access the bucket.

When using DataSync with S3 buckets that use server-side encryption, remember the following:

  • If your S3 bucket is encrypted with an AWS managed key – DataSync can access the bucket's objects by default if all your resources are in the same AWS account.

  • If your S3 bucket is encrypted with a customer-managed AWS Key Management Service (AWS KMS) key (SSE-KMS) – The key's policy must include the IAM role that DataSync uses to access the bucket.

  • If your S3 bucket is encrypted with a customer-managed SSE-KMS key and in a different AWS account – DataSync needs permission to access the bucket in the other AWS account. You can set this up by doing the following:

  • If your S3 bucket is encrypted with a customer-provided encryption key (SSE-C) – DataSync can't access this bucket.

The following example is a key policy for a customer-managed SSE-KMS key. The policy is associated with an S3 bucket that uses server-side encryption. The following values are specific to your setup:

  • your-account – Your AWS account.

  • your-admin-role – The IAM role that can administer the key.

  • your-datasync-role – The IAM role that allows DataSync to use the key when accessing the bucket.

{
    "Id": "key-consolepolicy-3",
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Enable IAM Permissions",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::your-account:root"
            },
            "Action": "kms:*",
            "Resource": "*"
        },
        {
            "Sid": "Allow access for Key Administrators",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::your-account:role/your-admin-role"
            },
            "Action": [
                "kms:Create*",
                "kms:Describe*",
                "kms:Enable*",
                "kms:List*",
                "kms:Put*",
                "kms:Update*",
                "kms:Revoke*",
                "kms:Disable*",
                "kms:Get*",
                "kms:Delete*",
                "kms:TagResource",
                "kms:UntagResource",
                "kms:ScheduleKeyDeletion",
                "kms:CancelKeyDeletion"
            ],
            "Resource": "*"
        },
        {
            "Sid": "Allow use of the key",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::your-account:role/your-datasync-role"
            },
            "Action": [
                "kms:Encrypt",
                "kms:Decrypt",
                "kms:ReEncrypt*",
                "kms:GenerateDataKey*"
            ],
            "Resource": "*"
        },
        {
            "Sid": "Allow attachment of persistent resources",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::your-account:role/your-datasync-role"
            },
            "Action": [
                "kms:CreateGrant",
                "kms:ListGrants",
                "kms:RevokeGrant"
            ],
            "Resource": "*",
            "Condition": {
                "Bool": {
                    "kms:GrantIsForAWSResource": "true"
                }
            }
        }
    ]
}
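
Before running a task, it can be useful to confirm that a key policy names the DataSync role. The sketch below does a simple, literal check of a policy document: it compares principal ARNs as plain strings and does not evaluate wildcards, conditions, or Deny statements. Treating the four actions from the "Allow use of the key" statement as the required set is an assumption based on the example policy, and the function names and ARNs are hypothetical.

```python
# Actions granted to the DataSync role in the example key policy above.
KEY_USE_ACTIONS = {
    "kms:Encrypt",
    "kms:Decrypt",
    "kms:ReEncrypt*",
    "kms:GenerateDataKey*",
}


def _as_list(value):
    """IAM policy fields may hold a single string or a list of strings."""
    return [value] if isinstance(value, str) else list(value)


def actions_allowed_for(policy, role_arn):
    """Collect actions from Allow statements that name role_arn as an AWS principal."""
    actions = set()
    for statement in policy.get("Statement", []):
        principal = statement.get("Principal", {})
        if statement.get("Effect") != "Allow" or not isinstance(principal, dict):
            continue
        if role_arn in _as_list(principal.get("AWS", [])):
            actions.update(_as_list(statement.get("Action", [])))
    return actions


def missing_key_actions(policy, role_arn):
    """Return the key-use actions that the policy doesn't list for the role."""
    return KEY_USE_ACTIONS - actions_allowed_for(policy, role_arn)


role = "arn:aws:iam::111122223333:role/example-datasync-role"  # example ARN
example_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": role},
            "Action": ["kms:Encrypt", "kms:Decrypt"],
            "Resource": "*",
        }
    ],
}
print(sorted(missing_key_actions(example_policy, role)))
# ['kms:GenerateDataKey*', 'kms:ReEncrypt*']
```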