Amazon EMR
Developer Guide

Generate Test Data

To generate the test data on the master node

  1. Connect to the master node of the cluster using SSH and run the commands shown in the following steps. Your client operating system determines which steps to use to connect to the cluster. For more information, see Connect to the Cluster.

  2. In the SSH window, from the home directory, create and navigate to the directory that will contain the test data using the following commands:

    mkdir test
    cd test

  3. Download the JAR containing a program that automatically creates the test data using the following command:

  4. Launch the program to create the test data using the following command. In this example, the command-line parameters set the output path to /mnt/dbgen (-p) and the size of each of the books (-b), customers (-c), and transactions (-t) tables to 1 GB.

    java -cp dbgen-1.0-jar-with-dependencies.jar DBGen -p /mnt/dbgen -b 1 -c 1 -t 1

  5. Create a new directory in the cluster's HDFS, copy the test data from the master node's local file system into it, and list the result to confirm the copy, using the following commands:

    hadoop fs -mkdir /data/
    hadoop fs -put /mnt/dbgen/* /data/
    hadoop fs -ls -h -R /data/
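
Steps 4 and 5 can also be wrapped in small shell functions so that the output path and table size are passed as arguments. The following is a minimal sketch, not part of the original procedure: the function names (run, generate_data, load_into_hdfs) and the DRY_RUN variable are hypothetical helpers introduced for illustration, and the sketch assumes the JAR from step 3 is already in the current directory. The download in step 3 is intentionally left out.

```shell
# Sketch only: run, generate_data, load_into_hdfs, and DRY_RUN are
# hypothetical names, not part of the guide or of any AWS tooling.

# Print the command when DRY_RUN=1; otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Step 4: generate the books, customers, and transactions tables,
# each with the same size in GB, under the given local output path.
generate_data() {
  out_dir="$1"
  size_gb="$2"
  run java -cp dbgen-1.0-jar-with-dependencies.jar DBGen \
    -p "$out_dir" -b "$size_gb" -c "$size_gb" -t "$size_gb"
}

# Step 5: create the HDFS directory, copy the generated files into it,
# and list the result recursively with human-readable sizes.
load_into_hdfs() {
  src_dir="$1"
  hdfs_dir="$2"
  run hadoop fs -mkdir "$hdfs_dir"
  run hadoop fs -put "$src_dir"/* "$hdfs_dir"
  run hadoop fs -ls -h -R "$hdfs_dir"
}
```

With these functions loaded in the SSH session, `generate_data /mnt/dbgen 1` followed by `load_into_hdfs /mnt/dbgen /data/` corresponds to the commands in steps 4 and 5. Setting DRY_RUN=1 first prints each command instead of running it, which can be useful for checking the arguments before generating several gigabytes of data.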