Configuring AWS DataSync transfers with an HDFS cluster

With AWS DataSync, you can transfer data between your Hadoop Distributed File System (HDFS) cluster and a supported AWS storage service such as Amazon S3.

To set up this kind of transfer, you create a location for your HDFS cluster. You can use this location as a transfer source or destination.

Providing DataSync access to HDFS clusters

To connect to your HDFS cluster, DataSync uses an agent that you deploy as close as possible to your HDFS cluster. The DataSync agent acts as an HDFS client and communicates with the NameNodes and DataNodes in your cluster.

When you start a transfer task, DataSync queries the NameNode for locations of files and folders on the cluster. If you configure your HDFS location as a source location, DataSync reads files and folder data from the DataNodes in your cluster and copies that data to the destination. If you configure your HDFS location as a destination location, then DataSync writes files and folders from the source to the DataNodes in your cluster.

Authentication

When connecting to an HDFS cluster, DataSync supports simple authentication or Kerberos authentication. To use simple authentication, provide the user name of a user with rights to read and write to the HDFS cluster. To use Kerberos authentication, provide a Kerberos configuration file, a Kerberos key table (keytab) file, and a Kerberos principal name. The credentials of the Kerberos principal must be in the provided keytab file.
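For example, before creating the location you can confirm that the principal you plan to use is present in the keytab you intend to provide. The keytab path and principal name below are placeholders.

    # List the entries in the keytab to confirm that the principal is present.
    klist -kt /path/to/hdfs.keytab

    # Optionally confirm that the keytab can obtain a ticket for that principal.
    kinit -kt /path/to/hdfs.keytab datasync@EXAMPLE.COM
    klist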

Encryption

When using Kerberos authentication, DataSync supports encryption of data as it's transmitted between the DataSync agent and your HDFS cluster. Encrypt your data by using the Quality of Protection (QOP) configuration settings on your HDFS cluster and by specifying the QOP settings when creating your HDFS location. The QOP configuration includes settings for data transfer protection and Remote Procedure Call (RPC) protection.
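For example, you can check the QOP values that your cluster currently uses by running the following commands on a cluster node; the values you choose when creating the DataSync location must match them.

    # Data transfer protection setting (for example, authentication, integrity, or privacy).
    hdfs getconf -confKey dfs.data.transfer.protection

    # RPC protection setting (for example, authentication, integrity, or privacy).
    hdfs getconf -confKey hadoop.rpc.protection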

DataSync supports the following Kerberos encryption types:
  • des-cbc-crc

  • des-cbc-md4

  • des-cbc-md5

  • des3-cbc-sha1

  • arcfour-hmac

  • arcfour-hmac-exp

  • aes128-cts-hmac-sha1-96

  • aes256-cts-hmac-sha1-96

  • aes128-cts-hmac-sha256-128

  • aes256-cts-hmac-sha384-192

  • camellia128-cts-cmac

  • camellia256-cts-cmac

You can also configure HDFS clusters for encryption at rest using Transparent Data Encryption (TDE). When using simple authentication, DataSync reads and writes to TDE-enabled clusters. If you're using DataSync to copy data to a TDE-enabled cluster, first configure the encryption zones on the HDFS cluster. DataSync doesn't create encryption zones.
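For example, an HDFS administrator might set up an encryption zone ahead of a transfer along the following lines. This is a sketch that assumes a Hadoop KMS is already configured; the key name and path are placeholders.

    # Create an encryption key in the configured Hadoop KMS.
    hadoop key create datasync-key

    # Create the destination directory and make it an encryption zone.
    hdfs dfs -mkdir -p /data/encrypted
    hdfs crypto -createZone -keyName datasync-key -path /data/encrypted

    # Confirm that the zone was created.
    hdfs crypto -listZones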

Unsupported HDFS features

The following HDFS capabilities aren't currently supported by DataSync:

  • Transparent Data Encryption (TDE) when using Kerberos authentication

  • Configuring multiple NameNodes

  • Hadoop HDFS over HTTP (HttpFS)

  • POSIX access control lists (ACLs)

  • HDFS extended attributes (xattrs)

  • HDFS clusters using Apache HBase

Creating your HDFS transfer location

You can use your location as a source or destination for your DataSync transfer.

Before you begin: Verify network connectivity between your agent and your Hadoop cluster. The agent must be able to reach the TCP ports that your cluster's NameNode and DataNodes listen on.
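For example, you might check reachability from a machine in the same network as your agent. The hostnames and ports below are placeholders; NameNode and DataNode ports vary by cluster configuration.

    # Check that the NameNode RPC port is reachable (8020 is a common default).
    nc -vz namenode.example.com 8020

    # Check that a DataNode transfer port is reachable (9866 is the Hadoop 3 default;
    # older clusters often use 50010).
    nc -vz datanode1.example.com 9866

To create the location by using the DataSync console, do the following: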

  1. Open the AWS DataSync console at https://console.aws.amazon.com/datasync/.

  2. In the left navigation pane, expand Data transfer, then choose Locations and Create location.

  3. For Location type, choose Hadoop Distributed File System (HDFS).

    You can configure this location as a source or destination later.

  4. For Agents, choose the agent that can connect to your HDFS cluster.

    You can choose more than one agent. For more information, see Using multiple DataSync agents.

  5. For NameNode, provide the domain name or IP address of your HDFS cluster's primary NameNode.

  6. For Folder, enter a folder on your HDFS cluster that you want DataSync to use for the data transfer.

    If your HDFS location is a source, DataSync copies the files in this folder to the destination. If your location is a destination, DataSync writes files to this folder.

  7. To set the Block size or Replication factor, choose Additional settings.

    The default block size is 128 MiB. The block size that you provide must be a multiple of 512 bytes.

    The default replication factor is three DataNodes when transferring to the HDFS cluster.

  8. In the Security section, choose the Authentication type used on your HDFS cluster.

    • Simple – For User, specify the user name with the following permissions on the HDFS cluster (depending on your use case):

      • If you plan to use this location as a source location, specify a user that has read permissions on the HDFS cluster.

      • If you plan to use this location as a destination location, specify a user that has read and write permissions.

      Optionally, specify the URI of the Key Management Server (KMS) of your HDFS cluster.

    • Kerberos – Specify the Kerberos Principal with access to your HDFS cluster. Next, provide the KeyTab file that contains the credentials of that Kerberos principal. Then, provide the Kerberos configuration file. Finally, specify the type of in-transit encryption protection in the RPC protection and Data transfer protection drop-down lists.

  9. (Optional) Choose Add tag to tag your HDFS location.

    Tags are key-value pairs that help you manage, filter, and search for your locations. We recommend creating at least a name tag for your location.

  10. Choose Create location.
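After the location is created, you can optionally confirm its configuration with the AWS CLI. The location ARN below is a placeholder.

    aws datasync describe-location-hdfs \
      --location-arn arn:aws:datasync:us-east-1:123456789012:location/loc-01234567890example

Alternatively, you can create the location by using the AWS CLI instead of the console: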

  1. Copy the following create-location-hdfs command.

    aws datasync create-location-hdfs --name-nodes [{"Hostname":"host1", "Port": 8020}] \
      --authentication-type "SIMPLE|KERBEROS" \
      --agent-arns [arn:aws:datasync:us-east-1:123456789012:agent/agent-01234567890example] \
      --subdirectory "/path/to/my/data"
  2. For the --name-nodes parameter, specify the hostname or IP address of your HDFS cluster's primary NameNode and the TCP port that the NameNode is listening on.

  3. For the --authentication-type parameter, specify the type of authentication to use when connecting to the Hadoop cluster. You can specify SIMPLE or KERBEROS.

    If you use SIMPLE authentication, use the --simple-user parameter to specify the user name of the user. If you use KERBEROS authentication, use the --kerberos-principal, --kerberos-keytab, and --kerberos-krb5-conf parameters. For more information, see create-location-hdfs.

  4. For the --agent-arns parameter, specify the ARN of the DataSync agent that can connect to your HDFS cluster.

    You can choose more than one agent. For more information, see Using multiple DataSync agents.

  5. (Optional) For the --subdirectory parameter, specify a folder on your HDFS cluster that you want DataSync to use for the data transfer.

    If your HDFS location is a source, DataSync copies the files in this folder to the destination. If your location is a destination, DataSync writes files to this folder.

  6. If the command is successful, you get a response that shows you the ARN of the location that you created. For example:

    { "arn:aws:datasync:us-east-1:123456789012:location/loc-01234567890example" }