Importing Data from HDFS
You can import data into Amazon S3 from your on-premises Hadoop Distributed File System (HDFS) through a Snowball. You perform this import process by using the Snowball client. Importing from HDFS is not supported with the Amazon S3 Adapter for Snowball. The following sections describe how to prepare for and perform an HDFS data transfer.
Although you can write HDFS data to a Snowball, you can't write Hadoop data from a Snowball to your local HDFS. As a result, export jobs are not supported for HDFS.
Preparing for Transferring Your HDFS Data with the Snowball Client
Before you transfer your HDFS (version 2.x) data, you'll need to do the following:
Confirm the Kerberos authentication settings for your HDFS cluster – The Snowball client supports Kerberos authentication for communicating with your HDFS in two ways: with the Kerberos login already on the host system and with authentication through specifying a principal and keytab in the snowball cp command. Alternatively, you can copy from a nonsecured HDFS cluster.
Confirm that your workstation has the Hadoop client 2.x version installed on it – To use the Snowball client, your workstation needs to have the Hadoop client 2.x installed, running, and able to communicate with your HDFS 2.x cluster.
Confirm the location of your site-specific configuration files – If you are using site-specific configuration files, you need to use the --hdfsconfig parameter to pass the location of each XML file.
Confirm your Namenode URI – Each HDFS 2.x cluster has a Namenode.core-site.xml file. This file includes a property element with the name fs.defaultFS and the Namenode URI as its value, for example hdfs://localhost:9000. You use this value as part of the source schema when you run Snowball client commands to perform operations on your HDFS cluster. For more information, see Sources for the Snowball Client Commands.
Currently, only HDFS 2.x clusters are supported with Snowball. You can still transfer data from an HDFS 1.x cluster by staging the data that you want to transfer on a workstation, and then copying that data to the Snowball with the standard snowball cp commands and options.
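The Namenode URI lookup described above can be sketched in Python as follows. This is only an illustration, not part of the Snowball client: the function name read_namenode_uri and the sample file contents are hypothetical, and the sketch assumes the standard Hadoop configuration layout in which each property element contains a name element and a value element.

```python
import xml.etree.ElementTree as ET


def read_namenode_uri(core_site_xml: str) -> str:
    """Return the fs.defaultFS value found in Hadoop configuration XML text."""
    root = ET.fromstring(core_site_xml)
    # Each <property> element is expected to hold a <name> and a <value> child.
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == "fs.defaultFS":
            value = (prop.findtext("value") or "").strip()
            if value:
                return value
    raise KeyError("fs.defaultFS is not set in this configuration")


# Hypothetical file contents, using the example URI from the text above.
SAMPLE_CORE_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
"""

if __name__ == "__main__":
    print(read_namenode_uri(SAMPLE_CORE_SITE))
```

To use it against a real cluster, read the text of your own configuration file and pass it to the function in place of the sample.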
When you have confirmed the information listed previously, identify the Amazon S3 bucket that you want your HDFS data imported into.
After your preparations for the HDFS import are complete, you can begin. If you haven't created your job yet, follow the steps in Importing Data into Amazon S3 with Snowball until you reach Use the Snowball Client. At that point, return to this topic.
Before Transferring Data from HDFS
Before using the Snowball client to copy HDFS (version 2.x) data, take the following steps:
To transfer data from an HDFS cluster, get the latest version of the Snowball client. You can download and install the Snowball client from the AWS Snowball Tools Download page. There you'll find the installation package for your operating system. Follow the instructions to install the Snowball client.
Ensure that your HDFS cluster is running, and accessible from the workstation that you've installed the Snowball client on.
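One quick way to check the second point from the workstation is to try opening a TCP connection to the Namenode address. The following Python sketch does only that, under the assumption that the Namenode URI has the form hdfs://host:port. The function names are hypothetical. A successful connection shows that the port is reachable, not that the cluster is healthy or that authentication will succeed.

```python
import socket
from typing import Tuple
from urllib.parse import urlsplit


def namenode_address(namenode_uri: str) -> Tuple[str, int]:
    """Split a URI such as hdfs://localhost:9000 into (host, port)."""
    parts = urlsplit(namenode_uri)
    if parts.scheme != "hdfs" or not parts.hostname or parts.port is None:
        raise ValueError(f"expected hdfs://host:port, got {namenode_uri!r}")
    return parts.hostname, parts.port


def namenode_reachable(namenode_uri: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the Namenode port can be opened."""
    host, port = namenode_address(namenode_uri)
    try:
        # The connection is closed again as soon as the with block exits.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    uri = "hdfs://localhost:9000"  # example URI from this topic
    print(uri, "reachable" if namenode_reachable(uri) else "not reachable")
```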
Transferring Data from HDFS
Now you're ready to transfer data from your HDFS (version 2.x) cluster. For more information on all the Snowball client copy command options, including those specific to HDFS, see Options for the snowball cp Command.
If you encounter performance issues while transferring data from your HDFS 2.x cluster to a Snowball, see Performance Considerations for HDFS Data Transfers.
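As a rough illustration of how the pieces from the preparation steps fit together, the following Python sketch assembles the argument list for a snowball cp call: the Namenode URI forms the start of the source, the Amazon S3 bucket forms the destination, and each site-specific XML file is passed with its own --hdfsconfig parameter. The function name, paths, and bucket name are hypothetical, and the placement of options before the source and destination is an assumption. Check Options for the snowball cp Command for the exact syntax and for any additional options your transfer needs.

```python
from typing import List, Sequence


def build_hdfs_copy_command(
    namenode_uri: str,
    hdfs_path: str,
    bucket: str,
    s3_prefix: str = "",
    hdfs_config_files: Sequence[str] = (),
) -> List[str]:
    """Assemble an argument list for copying an HDFS path to an S3 bucket."""
    # Source: the Namenode URI followed by the path inside HDFS.
    source = namenode_uri.rstrip("/") + "/" + hdfs_path.lstrip("/")
    # Destination: the bucket, plus an optional key prefix.
    destination = "s3://" + bucket
    if s3_prefix:
        destination += "/" + s3_prefix.strip("/")
    command = ["snowball", "cp"]
    # Assumption: one --hdfsconfig parameter per site-specific XML file.
    for config_file in hdfs_config_files:
        command += ["--hdfsconfig", config_file]
    command += [source, destination]
    return command


if __name__ == "__main__":
    # All values below are placeholders for illustration only.
    print(" ".join(build_hdfs_copy_command(
        "hdfs://localhost:9000",
        "/data/logs",
        "example-bucket",
        "logs",
        ["/etc/hadoop/conf/core-site.xml"],
    )))
```

Building the command as a list rather than a single string keeps paths with spaces intact if you later hand it to subprocess.run.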
After Transferring Data from HDFS
Once you've finished transferring data from your HDFS (version 2.x) cluster, you can validate the data on the Snowball with the following steps:
Use the snowball validate command to verify the number of uploaded files and confirm that they were uploaded correctly.
List all the files at the destination path or paths to confirm that the HDFS file or files were copied. For example, you can use the following command:
snowball ls s3://