Downloading an Archive in Amazon Glacier
Retrieving an archive from Amazon Glacier is a two-step process:
Initiate an archive retrieval job.
When you initiate a job, Amazon Glacier returns a job ID in the response and executes the job asynchronously. After the job completes, you can download the job output.
You can initiate a job requesting Amazon Glacier to prepare an entire archive or a portion of the archive for subsequent download. When an archive is very large, you may find it cost effective to initiate several sequential jobs to prepare your archive.
For example, to retrieve a 1 GB archive, you may choose to send a series of four initiate archive-retrieval job requests, each time requesting Amazon Glacier to prepare only a 256 MB portion of the archive. You can send the series of initiate requests at any time; however, it is more cost effective to wait for the previous initiate request to complete before sending the next one. For more information about the benefits of range retrievals, see About Range Retrievals.
To initiate an archive retrieval job you must know the ID of the archive that you want to retrieve. You can get the archive ID from an inventory of the vault. For more information, see Downloading a Vault Inventory in Amazon Glacier.
A data retrieval policy can cause your initiate retrieval job request to fail with a PolicyEnforcedException exception. For more information about data retrieval policies, see Amazon Glacier Data Retrieval Policies. For more information about the PolicyEnforcedException exception, see Error Responses.
After the job completes, download the bytes.
You can download all the bytes or specify a byte range to download only a portion of the output. For larger output, downloading the output in chunks helps in the event of a download failure, such as a network failure. If you get job output in a single request and there is a network failure, you have to restart downloading the output from the beginning. However, if you download the output in chunks, in the event of any failure, you need only restart the download of the smaller portion and not the entire output.
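The chunked download described above could be sketched as follows. This is a minimal illustration, not a definitive implementation. It assumes `client` behaves like a boto3 Glacier client whose `get_job_output` accepts `vaultName`, `jobId`, and an HTTP-style `range` string and returns the data under `"body"`. The 8 MB chunk size, the retry count, and the helper names `chunk_ranges` and `download_job_output` are choices made for this example. Which exception types count as retryable is also an assumption; with boto3 you would likely pass botocore exception classes in `retry_on`.

```python
import time

CHUNK_SIZE = 8 * 1024 * 1024  # illustrative chunk size, not a Glacier requirement


def chunk_ranges(total_size, chunk_size=CHUNK_SIZE):
    """Yield inclusive (start, end) byte offsets that cover total_size bytes."""
    for start in range(0, total_size, chunk_size):
        yield start, min(start + chunk_size, total_size) - 1


def download_job_output(client, vault_name, job_id, total_size, out_file,
                        chunk_size=CHUNK_SIZE, max_attempts=3,
                        retry_on=(OSError,), sleep=time.sleep):
    """Download job output chunk by chunk, retrying only the chunk that failed."""
    for start, end in chunk_ranges(total_size, chunk_size):
        for attempt in range(1, max_attempts + 1):
            try:
                # Assumed call shape: boto3-style get_job_output with a byte range.
                response = client.get_job_output(
                    vaultName=vault_name,
                    jobId=job_id,
                    range=f"bytes={start}-{end}",
                )
                data = response["body"].read()
                break  # this chunk succeeded; move on to the next one
            except retry_on:
                if attempt == max_attempts:
                    raise  # give up after the last attempt
                sleep(2 ** attempt)  # simple exponential backoff
        out_file.write(data)
```

Because each request covers only one chunk, a failure costs at most one chunk of re-download rather than the entire output.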
Most Amazon Glacier jobs take about four hours to complete. Amazon Glacier must complete a job before you can get its output. A job will not expire for at least 24 hours after completion, which means you can download the output within the 24-hour period after the job is completed. To determine if your job is complete, check its status by using one of these options:
Wait for job completion notification – You can specify an Amazon Simple Notification Service (Amazon SNS) topic to which Amazon Glacier can post a notification after the job is completed. Amazon Glacier sends the notification only after it completes the job.
You can specify an Amazon SNS topic per job; that is, you can specify an Amazon SNS topic when you initiate a job. In addition, if your vault has a notification configuration set for archive retrieval events, Amazon Glacier also publishes a notification to the SNS topic in that configuration. For more information, see Configuring Vault Notifications in Amazon Glacier.
Request job information explicitly – Amazon Glacier provides a describe job operation (Describe Job (GET JobID)) that enables you to poll for job information. You can periodically send this request to obtain job information. However, using Amazon SNS notifications is the recommended option.
The information you get via SNS notification is the same as what you get by calling Describe Job.
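If you do poll instead of relying on Amazon SNS, the loop might look like the sketch below. It assumes `client` behaves like a boto3 Glacier client whose `describe_job` response includes a boolean `Completed` field and a `StatusCode` string. The helper name `wait_for_job`, the 15-minute interval, and the poll limit are hypothetical choices for illustration.

```python
import time


def wait_for_job(client, vault_name, job_id,
                 poll_seconds=900, max_polls=48, sleep=time.sleep):
    """Poll Describe Job until the job completes, then return its description."""
    for _ in range(max_polls):
        # Assumed call shape: boto3-style describe_job.
        job = client.describe_job(vaultName=vault_name, jobId=job_id)
        if job["Completed"]:
            if job["StatusCode"] != "Succeeded":
                raise RuntimeError(f"job {job_id} ended with {job['StatusCode']}")
            return job
        sleep(poll_seconds)  # jobs take hours, so poll infrequently
    raise TimeoutError(f"job {job_id} did not complete after {max_polls} polls")
```

Since most jobs take hours, a long polling interval keeps the request count low. A notification avoids the polling loop altogether, which is one reason it is the recommended option.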
About Range Retrievals
When you retrieve an archive from Amazon Glacier, you can optionally specify a range, or portion, of the archive to retrieve. The default is to retrieve the whole archive. Specifying a range of bytes can be helpful when you want to:
Control bandwidth costs – Each month, Amazon Glacier allows you to retrieve up to 5 percent of your average monthly storage (pro-rated daily) for free. Scheduling range retrievals can help you in two ways. First, it may help you meet the monthly free allowance of 5 percent by spreading out the data you request. Second, if the amount of data you want to retrieve is such that you don't meet the free allowance percentage, scheduling range retrievals enables you to reduce your peak retrieval rate, which determines your retrieval fees. For examples of retrieval fee calculations, go to Amazon Glacier FAQs.
Manage your data downloads – Amazon Glacier allows retrieved data to be downloaded for 24 hours after the retrieval request completes. Therefore, you might want to retrieve only portions of the archive so that you can manage the schedule of downloads within the given download window.
Retrieve a targeted part of a large archive – For example, suppose you have previously aggregated many files and uploaded them as a single archive, and now you want to retrieve a few of the files. In this case, you can specify a range of the archive that contains the files you are interested in by using one retrieval request. Or, you can initiate multiple retrieval requests, each with a range for one or more files.
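To retrieve a single file from an aggregated archive, you need a range that covers the file and also satisfies the megabyte-alignment rule described in this section. The sketch below assumes you recorded each file's byte offset and size when you built the archive; Amazon Glacier does not track this for you. The helper name `enclosing_aligned_range` is hypothetical, and 1 MB is taken to be 1,048,576 bytes.

```python
MB = 1024 * 1024  # 1 MB taken as 1,048,576 bytes


def enclosing_aligned_range(file_offset, file_size, archive_size):
    """Return the smallest megabyte-aligned inclusive (start, end) range
    that contains the bytes file_offset .. file_offset + file_size - 1."""
    if file_offset < 0 or file_size <= 0 or file_offset + file_size > archive_size:
        raise ValueError("file does not lie inside the archive")
    start = (file_offset // MB) * MB                      # round start down
    file_end = file_offset + file_size                    # exclusive end of file
    end_exclusive = min(-(-file_end // MB) * MB, archive_size)  # round end up
    return start, end_exclusive - 1


# Example: a 1 MB file stored 1.5 MB into a 10 MB archive.
start, end = enclosing_aligned_range(1_572_864, 1_048_576, 10 * MB)
byte_range = f"{start}-{end}"  # "1048576-3145727"
```

After downloading the job output, you would slice the file out of it at offset `file_offset - start`. Note that this range is megabyte aligned but not necessarily tree-hash aligned.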
When initiating a retrieval job using range retrievals, you must provide a range that is megabyte aligned. This means that the byte range can start at zero (the beginning of your archive) or at any 1 MB interval thereafter (1 MB, 2 MB, 3 MB, and so on). The end of the range can be either the end of your archive or any 1 MB interval greater than the beginning of your range. Furthermore, if you want to get checksum values when you download the data (after the retrieval job completes), the range you request in the job initiation must also be tree-hash aligned. Checksums let you verify that your data was not corrupted during transmission. For more information about megabyte alignment and tree-hash alignment, see Receiving Checksums When Downloading Data.
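As an illustration of megabyte alignment, the following sketch splits an archive into equal, megabyte-aligned portions, such as the four 256 MB portions of a 1 GB archive mentioned earlier. It then builds the parameters for each initiate-job request. The helper names are hypothetical. The parameter keys (`Type`, `ArchiveId`, `RetrievalByteRange`, `SNSTopic`) and the `"start-end"` range format are assumed to match what a boto3-style `initiate_job` call expects in `jobParameters`. The sketch checks megabyte alignment only, not tree-hash alignment.

```python
MB = 1024 * 1024  # 1 MB taken as 1,048,576 bytes


def split_into_aligned_ranges(archive_size, part_size):
    """Return "start-end" strings covering the archive in megabyte-aligned parts."""
    if part_size <= 0 or part_size % MB != 0:
        raise ValueError("part_size must be a positive multiple of 1 MB")
    ranges = []
    for start in range(0, archive_size, part_size):
        # The last part may end at the end of the archive instead of a 1 MB boundary.
        end = min(start + part_size, archive_size) - 1
        ranges.append(f"{start}-{end}")
    return ranges


def build_job_parameters(archive_id, byte_range, sns_topic=None):
    """Build the job parameters for one archive-retrieval range (assumed key names)."""
    params = {
        "Type": "archive-retrieval",
        "ArchiveId": archive_id,
        "RetrievalByteRange": byte_range,
    }
    if sns_topic is not None:
        params["SNSTopic"] = sns_topic  # optional per-job notification topic
    return params


# A 1 GB archive prepared in four 256 MB portions.
parts = split_into_aligned_ranges(1024 * MB, 256 * MB)
# parts[0] == "0-268435455", parts[3] == "805306368-1073741823"
```

Each entry in `parts` would be passed to its own initiate-job request. Sending the requests one after another, waiting for each job to complete, follows the cost guidance given earlier on this page.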