AWS Snowball Performance
Following, you can find information about AWS Snowball performance. Here, we discuss performance in general terms, because on-premises environments each have a different way of doing things: different network technologies, different hardware, different operating systems, different procedures, and so on. To provide meaningful guidance about data transfer performance, the following sections discuss how to determine when to use Snowball instead of data transfer over the internet, and how to speed up transfer from your data source to the Snowball.
We highly recommend that you use a powerful computer as your workstation. The workstation that you transfer data from or to is considered to be the bottleneck for transferring data, so it should be able to meet high demands in terms of processing, memory, and networking. For more information, see Workstation Specifications.
Speeding Up Data Transfer
In general, you can improve the transfer speed from your data source to the Snowball in the following ways, ordered from largest to smallest positive impact on performance:
Perform multiple copy operations at one time – If your workstation is powerful enough, you can perform multiple snowball cp commands at one time. You can do this by running each command from a separate terminal window, in separate instances of the Snowball client, all connected to the same Snowball.
Copy from multiple workstations – A single Snowball can be connected to multiple workstations. Each workstation can host a separate instance of the Snowball client.
Transfer large files or batch small files together – Each copy operation has some overhead because of encryption. Therefore, performing many snowball cp commands on individual files has slower overall performance than transferring the same number of files in a single command. To speed the process up, batch files together in a single snowball cp command. You can do this by copying entire directories of files, or by bundling the files together into larger archives. Because there is overhead for each snowball cp command, we don't recommend that you queue a large number of individual copy commands. Queuing many commands has a significant negative impact on your transfer performance.
For example, say you have a directory called C:\\MyFiles that only contains three files, file1.txt, file2.txt, and file3.txt. Suppose that you issue the following three commands.
snowball cp C:\\MyFiles\file1.txt s3://mybucket
snowball cp C:\\MyFiles\file2.txt s3://mybucket
snowball cp C:\\MyFiles\file3.txt s3://mybucket
In this scenario, you have three times as much overhead as if you transferred the entire directory with the following copy command.
snowball cp -r C:\\MyFiles\* s3://mybucket
Don't perform other operations on files during transfer – Renaming files during transfer, changing their metadata, or writing data to the files during a copy operation has a significant negative impact on transfer performance. We recommend that your files remain in a static state while you transfer them.
Reduce local network use – Because the Snowball communicates across your local network, reducing or eliminating other local network traffic between the Snowball, the switch it's connected to, and the workstation that hosts your data source can significantly improve data transfer speeds.
Eliminate unnecessary hops – If you set up your Snowball, your data source, and your workstation so that they're the only machines communicating across a single switch, data transfer speeds can improve significantly.
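The first and third techniques in the list above can be combined: issue one recursive copy command per directory, and spread those commands across several terminal windows. The following is a minimal planning sketch, not part of the Snowball client. The function name plan_copy_commands, the /data/... paths, and the size figures are hypothetical, and the command strings assume the same snowball cp -r form used in the example above. The sketch only builds the command strings; it doesn't run them.

```python
from typing import Dict, List


def plan_copy_commands(dir_sizes: Dict[str, int], workers: int, bucket: str) -> List[List[str]]:
    """Split directories across `workers` terminals, balancing total size.

    Returns one list of command strings per terminal. Nothing is executed.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    loads = [0] * workers
    plans: List[List[str]] = [[] for _ in range(workers)]
    # Assign the largest directories first, each to the lightest terminal so far.
    for path, size in sorted(dir_sizes.items(), key=lambda kv: kv[1], reverse=True):
        lightest = loads.index(min(loads))
        loads[lightest] += size
        # One recursive command per directory instead of one command per file.
        plans[lightest].append(f"snowball cp -r {path} s3://{bucket}")
    return plans


if __name__ == "__main__":
    # Hypothetical directories and sizes (in GB), for illustration only.
    sizes = {"/data/logs": 400, "/data/images": 300, "/data/docs": 200, "/data/misc": 100}
    for number, commands in enumerate(plan_copy_commands(sizes, workers=2, bucket="mybucket"), start=1):
        print(f"terminal {number}:")
        for command in commands:
            print("  " + command)
```

How many terminals to plan for depends on your workstation and network, so treat the workers value as something to tune by experiment rather than a fixed setting.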
Experimenting to Get Better Performance
Because your performance results will vary based on your hardware, your network, how many and how large your files are, and how they're stored, we suggest that you experiment with your performance metrics if you're not getting the performance that you'd like to see.
First, attempt multiple copy operations until you see a reduction in overall transfer performance. Performing multiple copy operations at once can have a significantly positive impact on your overall transfer performance. For example, say you have a single snowball cp command running in a terminal window, and you note that it's transferring data at 30 MB/second. Say you open a second terminal window, and run a second snowball cp command on another set of files that you want to transfer. Let's assume that you note that both commands are performing at 30 MB/second. In this case, your total transfer performance is 60 MB/second.
Now, connect to the Snowball from a separate workstation, and run the Snowball client from that workstation to execute a third snowball cp command on another set of files that you want to transfer. Now when you check the performance, you note that all three instances of the snowball cp command are operating at a performance of 25 MB/second, with a total performance of 75 MB/second. Even though the individual performance of each instance has decreased in this example, the overall performance has increased.
Experimenting in this way, using the techniques listed in Speeding Up Data Transfer, will help you optimize your data transfer performance.
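The arithmetic in this example can be written down as a small helper for comparing your own measurements. This is only a sketch: the function names are hypothetical, and the input is the per-command rate that you observed yourself for each number of concurrent snowball cp commands. The values below are the ones from the example in this section.

```python
from typing import Dict, Tuple


def total_throughput(per_command_mb_s: Dict[int, float]) -> Dict[int, float]:
    """Map each number of concurrent commands to its aggregate MB/second."""
    return {count: count * rate for count, rate in per_command_mb_s.items()}


def best_concurrency(per_command_mb_s: Dict[int, float]) -> Tuple[int, float]:
    """Return the command count with the highest aggregate rate, and that rate."""
    totals = total_throughput(per_command_mb_s)
    best = max(totals, key=totals.get)
    return best, totals[best]


if __name__ == "__main__":
    # Number of concurrent commands -> observed MB/second for each command.
    measurements = {1: 30.0, 2: 30.0, 3: 25.0}
    print(total_throughput(measurements))  # {1: 30.0, 2: 60.0, 3: 75.0}
    print(best_concurrency(measurements))  # (3, 75.0)
```

When adding another command makes the aggregate rate drop, you have passed the point of diminishing returns for your environment.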
Why AWS Snowball Has Such High Hardware Specifications for Workstations
As outlined in Workstation Specifications, Snowball has stringent hardware specifications for the workstations that are used to transfer data to and from a Snowball. These hardware specifications are mainly based on security requirements for the service. When data is transferred to a Snowball, a file is loaded into the workstation's memory. While in memory, that file is fully encrypted by either the Snowball client or the S3 SDK Adapter for Snowball. After the file has been encrypted, chunks of the encrypted file are sent to the Snowball. At no point is any data stored to disk. All data is kept in memory, and only encrypted data is sent to the Snowball. This loading into memory, encrypting, chunking, and sending to the Snowball is both CPU- and memory-intensive.
Performance Considerations for HDFS Data Transfers
When getting ready to transfer data from a Hadoop Distributed File System (HDFS) cluster (version 2.x) into a Snowball, we recommend that you follow the guidance in the previous section, and also the following tips:
Don't copy the entire cluster over in a single command – Transferring an entire cluster in a single command can cause performance issues, including slow transfers, "flipped" bits, and missing or corrupted data on the Snowball. We recommend that in this case you separate the data transfer into multiple parts.
Don't transfer a large number of small files – If you have a large number of files, say over a thousand, and those files are small, say under 1 MB each, then transferring them all at once has a negative impact on your performance. This performance degradation is due to per-file overhead associated with transferring data from HDFS clusters. If you must transfer a large number of small files, we recommend that you find a method of collecting them into larger archive files, and then transferring those. However, these archives are what is imported into Amazon S3. Thus, if you want the files in their original state, you need to extract them from the archives after the archives are in the cloud.
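One possible way to collect small files into a larger archive is sketched below, using the tarfile module from the Python standard library. This is an illustration under assumptions, not a Snowball or HDFS feature: it assumes the small files are reachable on a local file system path, the function name bundle_small_files is hypothetical, and the 1 MB default simply mirrors the threshold mentioned above. The archive path should be outside the source directory.

```python
import tarfile
from pathlib import Path
from typing import List

SMALL_FILE_LIMIT = 1024 * 1024  # 1 MB, mirroring the "small file" threshold in the text


def bundle_small_files(source_dir: Path, archive_path: Path,
                       limit: int = SMALL_FILE_LIMIT) -> List[Path]:
    """Add every file under source_dir smaller than `limit` bytes to one tar archive.

    Returns the files that were bundled. Larger files are left alone so that
    they can be transferred individually.
    """
    small = sorted(
        path for path in source_dir.rglob("*")
        if path.is_file() and path.stat().st_size < limit
    )
    # Mode "w" writes an uncompressed tar; the goal is fewer files, not smaller ones.
    with tarfile.open(archive_path, "w") as archive:
        for path in small:
            # Store paths relative to source_dir so that extraction restores the layout.
            archive.add(path, arcname=path.relative_to(source_dir).as_posix())
    return small
```

Because the archive is what is imported into Amazon S3, keep a record of which files went into which archive so that you can extract them later.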