Amazon Elastic MapReduce
Developer Guide (API Version 2009-03-31)

Configure Logging and Debugging (Optional)

One of the things to decide as you plan your cluster is how much debugging support you want to make available. When you are first developing your data processing application, we recommend testing the application on a cluster processing a small, but representative, subset of your data. When you do this, you will likely want to take advantage of all the debugging tools that Amazon EMR offers, such as archiving log files to Amazon S3.

When you've finished development and put your data processing application into full production, you may choose to scale back debugging. Doing so can save you the cost of storing log file archives in Amazon S3 and reduce processing load on the cluster as it no longer needs to write state to Amazon S3. The trade-off, of course, is that if something goes wrong, you'll have fewer tools available to investigate the issue.

Default Log Files

By default, each cluster writes log files on the master node. These are written to the /mnt/var/log/ directory. You can access them by using SSH to connect to the master node as described in Connect to the Master Node Using SSH. Because these logs exist on the master node, when the node terminates—either because the cluster was shut down or because an error occurred—these log files are no longer available.

You do not need to enable anything to have log files written on the master node. This is the default behavior of Amazon EMR and Hadoop.

A cluster generates several types of log files, including:

  • Step logs — These logs are generated by the Amazon EMR service and contain information about the cluster and the results of each step. The log files are stored in the /mnt/var/log/hadoop/steps/ directory on the master node. Each step logs its results in a separate numbered subdirectory: /mnt/var/log/hadoop/steps/1/ for the first step, /mnt/var/log/hadoop/steps/2/ for the second step, and so on.

  • Hadoop logs — These are the standard log files generated by Apache Hadoop. They contain information about Hadoop jobs, tasks, and task attempts. The log files are stored in /mnt/var/log/hadoop/ on the master node.

  • Bootstrap action logs — If your job uses bootstrap actions, the results of those actions are logged. The log files are stored in /mnt/var/log/bootstrap-actions/ on the master node. Each bootstrap action logs its results in a separate numbered subdirectory: /mnt/var/log/bootstrap-actions/1/ for the first bootstrap action, /mnt/var/log/bootstrap-actions/2/ for the second bootstrap action, and so on.

  • Instance state logs — These logs provide information about the CPU, memory state, and garbage collector threads of the node. The log files are stored in /mnt/var/log/instance-state/ on the master node.
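For example, once you have connected to the master node over SSH, you can browse these default log directories directly. The key pair file and public DNS name below are placeholders; substitute your own values.

```shell
# Connect to the master node (key file and hostname are placeholders).
ssh -i ~/mykeypair.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com

# Once connected, list the default log directories:
ls /mnt/var/log/hadoop/steps/      # step logs, one numbered subdirectory per step
ls /mnt/var/log/hadoop/            # standard Hadoop job, task, and attempt logs
ls /mnt/var/log/bootstrap-actions/ # bootstrap action logs, one subdirectory per action
ls /mnt/var/log/instance-state/    # CPU, memory, and garbage collector snapshots
```

Remember that these files live only on the master node, so anything you want to keep after termination must be archived to Amazon S3 as described in the next section.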

Archive Log Files to Amazon S3

You can configure a cluster to periodically archive the log files stored on the master node to Amazon S3. This ensures that the log files are available after the cluster terminates, whether through a normal shutdown or due to an error. Amazon EMR archives the log files to Amazon S3 at 5-minute intervals.

To have the log files archived to Amazon S3, you must enable this feature when you launch the cluster. You can do this using the console, the CLI, or the API.

To archive log files to Amazon S3 using the console

  1. Open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.

  2. Click Create cluster.

  3. In the Cluster Configuration section, in the Logging field, choose Enabled.

    This determines whether Amazon EMR captures detailed log data to Amazon S3. You can only set this when the cluster is created. For more information, see View Log Files.

  4. In the Log folder S3 location field, type an Amazon S3 path to store your logs. If you type the name of a folder that does not exist in the bucket, it is created for you.

    Log folder location

    When this value is set, Amazon EMR copies the log files from the EC2 instances in the cluster to Amazon S3. This prevents the log files from being lost when the cluster ends and the EC2 instances hosting the cluster are terminated. These logs are useful for troubleshooting purposes.

    For more information, see View Log Files.

  5. Proceed with creating the cluster as described in Plan an Amazon EMR Cluster.
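After the cluster launches with logging enabled, you can confirm that log files are arriving in the archive location. Amazon EMR writes them under a subfolder named for the cluster ID; the bucket, folder, and cluster ID below are placeholders.

```shell
# List the archived log files for a cluster (replace the bucket name,
# folder, and cluster ID with your own values).
aws s3 ls s3://mybucket/logs/j-XXXXXXXXXXXXX/ --recursive
```

Because logs are copied at 5-minute intervals, allow a few minutes after a step completes before expecting its files to appear.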

To archive log files to Amazon S3 using the AWS CLI

To archive log files to Amazon S3 using the AWS CLI, type the create-cluster command and specify the Amazon S3 log path using the --log-uri parameter.

  • To archive log files to Amazon S3, type the following command and replace myKey with the name of your EC2 key pair.

    • Linux, UNIX, and Mac OS X users:

      aws emr create-cluster --name "Test cluster" --ami-version 3.3 --log-uri s3://mybucket/logs/ \
      --applications Name=Hue Name=Hive Name=Pig \
      --use-default-roles --ec2-attributes KeyName=myKey \
      --instance-type m3.xlarge --instance-count 3
    • Windows users:

      aws emr create-cluster --name "Test cluster" --ami-version 3.3 --log-uri s3://mybucket/logs/ --applications Name=Hue Name=Hive Name=Pig --use-default-roles --ec2-attributes KeyName=myKey --instance-type m3.xlarge --instance-count 3

When you specify the instance count without using the --instance-groups parameter, a single master node is launched, and the remaining instances are launched as core nodes. All nodes use the instance type specified in the command.
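If you want different configurations for the master and core nodes, you can specify the groups explicitly with the --instance-groups parameter instead of --instance-type and --instance-count. A sketch of an equivalent cluster (the instance types and counts shown are illustrative):

```shell
# Equivalent cluster with explicit instance groups: one master node
# and two core nodes (types and counts are illustrative).
aws emr create-cluster --name "Test cluster" --ami-version 3.3 --log-uri s3://mybucket/logs/ \
--applications Name=Hue Name=Hive Name=Pig \
--use-default-roles --ec2-attributes KeyName=myKey \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge
```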

Note

If you have not previously created the default EMR service role and EC2 instance profile, type aws emr create-default-roles to create them before typing the create-cluster subcommand.

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To archive log files to Amazon S3 using the Amazon EMR CLI

Note

The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

  • Set the --log-uri argument when you launch the cluster and specify a location in Amazon S3. Alternatively, you can set this value in the credentials.json file that you configured for the CLI. This causes all of the clusters you launch with the CLI to archive log files to the specified Amazon S3 bucket. For more information about credentials.json, see "Configuring Credentials" in Install the Amazon EMR Command Line Interface. The following example illustrates creating a cluster that archives log files to Amazon S3. Replace mybucket with the name of your bucket.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --log-uri s3://mybucket
    • Windows users:

      ruby elastic-mapreduce --create --log-uri s3://mybucket

To aggregate logs in Amazon S3 using the AWS CLI

Log aggregation in Hadoop 2.x compiles logs from all containers for an individual application into a single file. This option is only available on Hadoop 2.x AMIs. To enable log aggregation to Amazon S3 using the AWS CLI, you use a bootstrap action at cluster launch to enable log aggregation and to specify the bucket to store the logs.

  • To enable log aggregation, type the following command and replace myKey with the name of your EC2 key pair.

    • Linux, UNIX, and Mac OS X users:

      aws emr create-cluster --name "Test cluster" --ami-version 3.3 \
      --applications Name=Hue Name=Hive Name=Pig \
      --use-default-roles --ec2-attributes KeyName=myKey \
      --bootstrap-action Path=s3://elasticmapreduce/bootstrap-actions/configure-hadoop,Name="aggregate logs",Args=["-y","yarn.log-aggregation-enable=true","-y","yarn.log-aggregation.retain-seconds=-1","-y","yarn.log-aggregation.retain-check-interval-seconds=3000","-y","yarn.nodemanager.remote-app-log-dir=s3://mybucket/logs"] \
      --instance-type m3.xlarge --instance-count 3
    • Windows users:

      aws emr create-cluster --name "Test cluster" --ami-version 3.3 --applications Name=Hue Name=Hive Name=Pig --use-default-roles --ec2-attributes KeyName=myKey --bootstrap-action Path=s3://elasticmapreduce/bootstrap-actions/configure-hadoop,Name="aggregate logs",Args=["-y","yarn.log-aggregation-enable=true","-y","yarn.log-aggregation.retain-seconds=-1","-y","yarn.log-aggregation.retain-check-interval-seconds=3000","-y","yarn.nodemanager.remote-app-log-dir=s3://mybucket/logs"] --instance-type m3.xlarge --instance-count 3

    When you specify the instance count without using the --instance-groups parameter, a single master node is launched, and the remaining instances are launched as core nodes. All nodes use the instance type specified in the command.

    Note

    If you have not previously created the default EMR service role and EC2 instance profile, type aws emr create-default-roles to create them before typing the create-cluster subcommand.

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To aggregate logs in Amazon S3 using the Amazon EMR CLI

Note

The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

  • Log aggregation in Hadoop 2.x compiles logs from all containers for an individual application into a single file. This option is only available on Hadoop 2.x AMIs. To enable log aggregation to Amazon S3 using the Amazon EMR CLI, you use a bootstrap action at cluster launch to enable log aggregation and to specify the bucket to store the logs.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --alive --master-instance-type m1.xlarge --slave-instance-type m1.xlarge \
        --num-instances 1 --ami-version 3.3 --bootstrap-action \
        s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args \
        "-y,yarn.log-aggregation-enable=true,-y,yarn.log-aggregation.retain-seconds=-1,-y,yarn.log-aggregation.retain-check-interval-seconds=3000,-y,yarn.nodemanager.remote-app-log-dir=s3://mybucket/logs" \
        --ssh --name "log aggregation sub-bucket name"
    • Windows users:

      ruby elastic-mapreduce --create --alive --master-instance-type m1.xlarge --slave-instance-type m1.xlarge --num-instances 1 --ami-version 3.3 --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-y,yarn.log-aggregation-enable=true,-y,yarn.log-aggregation.retain-seconds=-1,-y,yarn.log-aggregation.retain-check-interval-seconds=3000,-y,yarn.nodemanager.remote-app-log-dir=s3://mybucket/logs" --ssh --name "log aggregation sub-bucket name"
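With aggregation enabled, you can retrieve the combined log for a single application using Hadoop's yarn command on the master node. A sketch (the application ID shown is a placeholder):

```shell
# On the master node, print the aggregated log for one YARN application.
# The application ID is a placeholder; find real IDs with:
#   yarn application -list -appStates ALL
yarn logs -applicationId application_1410000000000_0001
```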

Enable the Debugging Tool

The debugging tool is a graphical user interface that you can use to browse the log files from the console. When you enable debugging on a cluster, Amazon EMR archives the log files to Amazon S3 and then indexes those files. You can then use the graphical interface to browse the step, job, task, and task attempt logs for the cluster in an intuitive way. An example of using the debugging tool to browse log files is shown in View the Results.

To be able to use the graphical debugging tool, you must enable debugging when you launch the cluster. You can do this using the console, the CLI, or the API.

To enable the debugging tool using the Amazon EMR console

  1. Open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.

  2. Click Create cluster.

  3. In the Cluster Configuration section, in the Logging field, choose Enabled. You cannot enable debugging without enabling logging.

  4. In the Log folder S3 location field, type an Amazon S3 path to store your logs.

  5. In the Debugging field, choose Enabled.

    The debugging option creates a debug log index in Amazon SimpleDB (additional charges apply) to enable detailed debugging in the Amazon EMR console. You can only set this when the cluster is created. For more information, go to the Amazon SimpleDB product description page.

    Note

    Debugging is only supported in regions where Amazon SimpleDB is available.

    Enable debugging in the console
  6. Proceed with creating the cluster as described in Plan an Amazon EMR Cluster.

To enable the debugging tool using the AWS CLI

To enable debugging using the AWS CLI, type the create-cluster subcommand with the --enable-debugging parameter. You must also specify the --log-uri parameter when enabling debugging.

  • To enable debugging using the AWS CLI, type the following command and replace myKey with the name of your EC2 key pair.

    • Linux, UNIX, and Mac OS X users:

      aws emr create-cluster --name "Test cluster" --ami-version 3.3 --log-uri s3://mybucket/logs/ \
      --enable-debugging --applications Name=Hue Name=Hive Name=Pig \
      --use-default-roles --ec2-attributes KeyName=myKey \
      --instance-type m3.xlarge --instance-count 3
    • Windows users:

      aws emr create-cluster --name "Test cluster" --ami-version 3.3 --log-uri s3://mybucket/logs/ --enable-debugging --applications Name=Hue Name=Hive Name=Pig --use-default-roles --ec2-attributes KeyName=myKey --instance-type m3.xlarge --instance-count 3

    When you specify the instance count without using the --instance-groups parameter, a single master node is launched, and the remaining instances are launched as core nodes. All nodes use the instance type specified in the command.

    Note

    If you have not previously created the default EMR service role and EC2 instance profile, type aws emr create-default-roles to create them before typing the create-cluster subcommand.

For more information on using Amazon EMR commands in the AWS CLI, see http://docs.aws.amazon.com/cli/latest/reference/emr.

To enable the debugging tool using the Amazon EMR CLI

Note

The Amazon EMR CLI is no longer under feature development. Customers are encouraged to use the Amazon EMR commands in the AWS CLI instead.

  • Use the --enable-debugging argument when you create the cluster. You must also set the --log-uri argument and specify a location in Amazon S3 because archiving the log files to Amazon S3 is a prerequisite of the debugging tool. Alternatively, you can set the --log-uri value in the credentials.json file that you configured for the CLI. For more information about credentials.json, see "Configuring Credentials" in Install the Amazon EMR Command Line Interface. The following example illustrates creating a cluster that archives log files to Amazon S3. Replace mybucket with the name of your bucket.

    • Linux, UNIX, and Mac OS X users:

      ./elastic-mapreduce --create --enable-debugging \
           --log-uri s3://mybucket
    • Windows users:

      ruby elastic-mapreduce --create --enable-debugging --log-uri s3://mybucket