Architecture overview

Deploying this solution with the default parameters builds the following environment in the AWS Cloud.

Figure 2: Amazon S3 Glacier Re:Freezer reference architecture diagram

The solution operates in the following four stages:

  • Stage one – Get Inventory: The solution obtains the latest Glacier vault inventory file (a minimal sketch of this step follows this list).

  • Stage two – Request Archives: The solution parses, partitions, and optimizes the Glacier vault inventory file, and then submits the optimized restore requests to Amazon S3 Glacier.

  • Stage three – Archive Copy: The solution begins the archive copy process to the staging Amazon S3 bucket and Amazon S3 Standard storage class. During the archive copy process, Amazon DynamoDB tracks the status of the archive copies and collects metrics visible in the provided Amazon CloudWatch dashboard.

  • Stage four – Archive Integrity Check: The solution computes a SHA256 Treehash of each copied object and matches it against the SHA256 Treehash recorded by Amazon S3 Glacier in the Glacier vault inventory list. After the SHA256 Treehash is validated, the object is moved from the staging S3 bucket to the destination S3 bucket and S3 storage class.
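The following is a minimal sketch of stage one, using boto3 to request and download the vault inventory. The vault name is hypothetical, and this is illustrative rather than the solution's actual Lambda code; in practice the inventory-retrieval job completes asynchronously (Glacier can signal completion through Amazon SNS) before the output is downloaded.

import boto3

glacier = boto3.client("glacier")

# Start an inventory-retrieval job; Glacier prepares the vault
# inventory asynchronously, typically over several hours.
response = glacier.initiate_job(
    vaultName="my-source-vault",  # hypothetical vault name
    jobParameters={
        "Type": "inventory-retrieval",
        "Format": "CSV",
    },
)
job_id = response["jobId"]

# After the job completes, download the inventory file for parsing
# and partitioning in stage two.
inventory = glacier.get_job_output(vaultName="my-source-vault", jobId=job_id)
print(inventory["body"].read()[:200])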

The AWS CloudFormation template automatically creates AWS Lambda functions that perform the following tasks:

  1. Request and download the inventory file for the Amazon S3 Glacier vault.

  2. Request archives from the Amazon S3 Glacier vault.

  3. Copy the restored archives to the staging Amazon S3 bucket.

  4. Calculate the SHA256 Treehash of copied objects (see the sketch after this list).

  5. Move the validated objects to the destination Amazon S3 bucket.

  6. Collect and post metrics to Amazon CloudWatch.

  7. Send anonymous statistics to the Solution Builder endpoint (if you elect to send anonymous statistics).
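As an illustration of task 4, the following sketch computes a Glacier-style SHA256 Treehash: the payload is hashed in 1 MiB chunks, and the chunk digests are combined pairwise until a single root hash remains. This follows the algorithm AWS documents for Amazon S3 Glacier; it is not the solution's actual Lambda code.

import hashlib

MIB = 1024 * 1024

def sha256_tree_hash(data: bytes) -> str:
    """Compute the Glacier-style SHA256 tree hash of a byte payload."""
    # Level 0: SHA-256 of each 1 MiB chunk.
    hashes = [
        hashlib.sha256(data[i:i + MIB]).digest()
        for i in range(0, max(len(data), 1), MIB)
    ]
    # Combine adjacent digests pairwise until one root remains; an
    # unpaired trailing digest carries up to the next level unchanged.
    while len(hashes) > 1:
        next_level = []
        for i in range(0, len(hashes), 2):
            pair = hashes[i:i + 2]
            if len(pair) == 2:
                next_level.append(hashlib.sha256(pair[0] + pair[1]).digest())
            else:
                next_level.append(pair[0])
        hashes = next_level
    return hashes[0].hex()

# The solution compares this value against the SHA256TreeHash recorded
# for the archive in the Glacier vault inventory before moving the
# object out of the staging bucket.
print(sha256_tree_hash(b"example payload"))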

The solution uses the following services:

  • Amazon Simple Notification Service (Amazon SNS) — receives notifications from Amazon S3 Glacier when retrieval jobs complete (see the sketch after this list).

  • Amazon Simple Queue Service (Amazon SQS) — decouples Lambda steps.

  • Amazon DynamoDB — keeps track of the archive copy processing state and collects progress metrics.

  • Amazon CloudWatch — stores the solution logs and metrics, and presents a custom dashboard to enable visibility of the archive copy operation progress and any encountered errors.

  • Amazon Athena — runs queries against your Glacier vault inventory list.

  • AWS Glue — reorders and splits the Amazon S3 Glacier vault inventory file into partitions to be processed by multiple AWS Lambda invocations.

  • AWS Step Functions — orchestrates partitioning the inventory with AWS Glue, updating the total archive count in the DynamoDB table, uploading anonymous statistics, and invoking AWS Lambda functions to request retrieval of the Glacier vault archives.

  • Amazon S3 — provides a staging bucket that temporarily stores the copied S3 Glacier vault archives.
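To illustrate how Amazon SNS and the restore requests fit together, the following sketch initiates a Bulk archive-retrieval job and asks Amazon S3 Glacier to publish a completion notification to an SNS topic. The vault name, archive ID, and topic ARN are hypothetical, and the sketch is not the solution's actual Lambda code.

import boto3

glacier = boto3.client("glacier")

response = glacier.initiate_job(
    vaultName="my-source-vault",            # hypothetical
    jobParameters={
        "Type": "archive-retrieval",
        "ArchiveId": "EXAMPLE_ARCHIVE_ID",  # taken from the vault inventory
        # Glacier publishes a message here on completion; an SQS queue
        # subscribed to the topic can then trigger the copy Lambda.
        "SNSTopic": "arn:aws:sns:us-east-1:111122223333:refreezer-notify",
        "Tier": "Bulk",                     # lowest-cost retrieval tier
    },
)
print("Retrieval job started:", response["jobId"])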

Note

AWS CloudFormation resources are created from AWS Cloud Development Kit (AWS CDK) constructs.

Translation of Glacier vault archive descriptions to S3 object names

This solution uses the value stored in the ArchiveDescription field for each ArchiveId listed in the Glacier vault inventory file as the key name for the new Amazon S3 object it creates as part of the copy process.

The following example scenarios describe how the solution translates the value stored in the ArchiveDescription field for each ArchiveId into an Amazon S3 object key name. A minimal sketch of the translation logic follows the scenarios.

  1. If the ArchiveDescription is a single string value such as data01, it translates to an S3 object key name in the destination Amazon S3 bucket (Figure 3).

Figure 3: Single value ArchiveDescription

  2. If the ArchiveDescription value is blank, the solution copies the archive, uses the ArchiveId as the S3 object key name, and stores the objects in a folder labeled 00undefined in the destination S3 bucket (Figure 4).

Figure 4: Blank ArchiveDescription value

  3. If multiple ArchiveId entries have the same value for the ArchiveDescription field (for example, duplicatefile02.txt), duplicate S3 object key names could be copied over one another. The solution resolves this by appending a timestamp suffix to the name of the original file (Figure 5).

Figure 5: Duplicate files with timestamps
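The following sketch illustrates the three translation scenarios above. The field names match the Glacier vault inventory, but the timestamp format is an assumption; the solution's actual suffix convention may differ.

from datetime import datetime, timezone

def to_object_key(archive_id: str, description: str, seen: set) -> str:
    """Derive a destination S3 object key for one inventory entry."""
    if not description:
        # Scenario 2: blank description -- fall back to the ArchiveId
        # under a 00undefined folder.
        return f"00undefined/{archive_id}"
    if description in seen:
        # Scenario 3: duplicate description -- append a timestamp
        # suffix (format assumed here) so copies do not overwrite
        # one another.
        stamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S%f")
        return f"{description}-{stamp}"
    # Scenario 1: use the description itself as the key.
    seen.add(description)
    return description

seen: set = set()
print(to_object_key("ARCHIVE1", "data01", seen))               # data01
print(to_object_key("ARCHIVE2", "", seen))                     # 00undefined/ARCHIVE2
print(to_object_key("ARCHIVE3", "duplicatefile02.txt", seen))  # duplicatefile02.txt
print(to_object_key("ARCHIVE4", "duplicatefile02.txt", seen))  # timestamp suffix added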