AWS Snowball
User Guide

This guide is for the standard Snowball (50TB or 80TB of storage space). If you are looking for documentation for the Snowball Edge, see the AWS Snowball Edge Developer Guide.

Importing Data from HDFS

You can import data into Amazon S3 from your on-premises Hadoop Distributed File System (HDFS) through a Snowball. You perform this import by using the Snowball client; importing from HDFS is not supported with the Amazon S3 Adapter for Snowball. The following sections describe how to prepare for and perform an HDFS data transfer.

Note

Although you can write HDFS data to a Snowball, you can't write data from a Snowball back to your local HDFS. As a result, export jobs are not supported for HDFS.

Preparing for Transferring Your HDFS Data with the Snowball Client

Before you transfer your HDFS (version 2.x) data, you'll need to do the following:

  • Confirm the Kerberos authentication settings for your HDFS cluster – The Snowball client supports Kerberos authentication for communicating with your HDFS cluster in two ways: by using the Kerberos login already present on the host system, or by specifying a principal and keytab in the snowball cp command. Alternatively, you can copy from a nonsecured HDFS cluster.

  • Confirm that your workstation has the Hadoop client 2.x version installed on it – To use the Snowball client, your workstation needs to have the Hadoop client 2.x installed, running, and able to communicate with your HDFS 2.x cluster.

  • Confirm the location of your site-specific configuration files – If you are using site-specific configuration files, you need to use the --hdfsconfig parameter to pass the location of each XML file.

  • Confirm your Namenode URI – Each HDFS 2.x cluster has a NameNode. The cluster's core-site.xml file includes a property element with the name fs.defaultFS and a value of the form hdfs://hostname:port, for example hdfs://localhost:9000. You use this value, the Namenode URI, as a part of the source schema when you run Snowball client commands to perform operations on your HDFS cluster, as shown in the example following this list. For more information, see Sources for the Snowball Client Commands.
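
For reference, the fs.defaultFS property element in core-site.xml typically looks like the following. The hostname and port shown here are examples; use the values from your own cluster as the Namenode URI.

    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://localhost:9000</value>
    </property>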

Note

Currently, only HDFS 2.x clusters are supported with Snowball. You can still transfer data from an HDFS 1.x cluster by staging the data that you want to transfer on a workstation, and then copying that data to the Snowball with the standard snowball cp commands and options, as sketched in the following example.
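
The following is a minimal sketch of that staging approach. It assumes a workstation with an HDFS 1.x-compatible Hadoop client, a hypothetical /data/april directory on the cluster, and placeholder bucket and path names; the recursive option (-r) is assumed for the directory copy.

    # Stage the data on the workstation using the cluster's own Hadoop client
    hadoop fs -get hdfs://hdfs1-namenode:9000/data/april /staging/april

    # Copy the staged local data to the Snowball with the Snowball client
    snowball cp -r /staging/april s3://bucket-name/april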

When you have confirmed the information listed previously, identify the Amazon S3 bucket that you want your HDFS data imported into.

After your preparations for the HDFS import are complete, you can begin. If you haven't created your job yet, see Importing Data into Amazon S3 with Snowball until you reach Use the Snowball Client. At that point, return to this topic.

Before Transferring Data from HDFS

Before using the Snowball client to copy HDFS (version 2.x) data, take the following steps:

  1. To transfer data from an HDFS cluster, get the latest version of the Snowball client. You can download and install the Snowball client from the AWS Snowball Tools Download page. There you'll find the installation package for your operating system. Follow the instructions to install the Snowball client.

  2. Ensure that your HDFS cluster is running and accessible from the workstation on which you've installed the Snowball client. You can confirm this as shown following these steps.
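
One quick way to confirm accessibility is to list a directory on the cluster with the standard Hadoop client, using your Namenode URI (the URI and path here are examples):

    hadoop fs -ls hdfs://localhost:9000/

If the listing succeeds, the workstation can communicate with the cluster, and the Snowball client should be able to as well.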

Transferring Data from HDFS

Now you're ready to transfer data from your HDFS (version 2.x) cluster. For more information on all the Snowball client copy command options, including those specific to HDFS, see Options for the snowball cp Command.
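
As a sketch, a recursive copy from HDFS to the Snowball might look like the following. The Namenode URI, source path, bucket name, and destination path are placeholders, and the --hdfsconfig value assumes a site-specific core-site.xml at an example location; omit that option if you don't use site-specific configuration files.

    snowball cp -r --hdfsconfig /etc/hadoop/conf/core-site.xml \
        hdfs://localhost:9000/user/data s3://bucket-name/destination-path

If your cluster requires Kerberos authentication, either run the command from a host that already has a Kerberos login, or specify a principal and keytab as described in Options for the snowball cp Command.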

If you encounter performance issues while transferring data from your HDFS 2.x cluster to a Snowball, see Performance Considerations for HDFS Data Transfers.

After Transferring Data from HDFS

Once you've finished transferring data from your HDFS (version 2.x) cluster, you can validate the data on the Snowball with the following steps:

  1. Use the snowball validate command to verify the number of uploaded files and confirm that they were uploaded correctly. An example follows these steps.

  2. List all the files at the destination path or paths to confirm that the HDFS file or files were copied. For example, you can use the following command:

    snowball ls s3://bucket-name/destination-path
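
For step 1, the simplest invocation validates everything that was transferred to the Snowball. Depending on your Snowball client version, you may also be able to pass a path to validate specific files; check your client's help output to confirm.

    snowball validate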