One of the things to decide as you plan your cluster is how much debugging support you want to make available. When you are first developing your data processing application, we recommend testing the application on a cluster processing a small, but representative, subset of your data. When you do this, you will likely want to take advantage of all the debugging tools that Amazon EMR offers, such as archiving log files to Amazon S3 and indexing the log files in Amazon SimpleDB.
When you've finished development and put your data processing application into full production, you may choose to scale back debugging. Doing so can save you the cost of storing log file archives in Amazon S3 and reduce processing load on the cluster as it no longer needs to write state to Amazon S3. The trade-off, of course, is that if something goes wrong, you'll have fewer tools available to investigate the issue.
By default, each cluster writes log files on the master node. These are written to the /mnt/var/log/ directory. You can access them by using SSH to connect to the master node as described in Connect to the Master Node Using SSH. Because these logs exist on the master node, when the node terminates — either because the cluster was shut down or because an error occurred — these log files are no longer available.
You do not need to enable anything to have log files written on the master node. This is the default behavior of Amazon EMR and Hadoop.
A cluster generates several types of log files, including:
Step logs: These logs are generated by the Amazon EMR service and contain information about the cluster and the results of each step. The log files are stored in the /mnt/var/log/hadoop/steps/ directory on the master node. Each step logs its results in a separate numbered subdirectory: /mnt/var/log/hadoop/steps/1/ for the first step, /mnt/var/log/hadoop/steps/2/ for the second step, and so on.
Hadoop logs: These are the standard log files generated by Apache Hadoop. They contain information about Hadoop jobs, tasks, and task attempts. The log files are stored in /mnt/var/log/hadoop/ on the master node.
Bootstrap action logs: If your job uses bootstrap actions, the results of those actions are logged. The log files are stored in /mnt/var/log/bootstrap-actions/ on the master node. Each bootstrap action logs its results in a separate numbered subdirectory: /mnt/var/log/bootstrap-actions/1/ for the first bootstrap action, /mnt/var/log/bootstrap-actions/2/ for the second bootstrap action, and so on.
Instance state logs: These logs provide information about the CPU, memory state, and garbage collector threads of the node. The log files are stored in /mnt/var/log/instance-state/ on the master node.
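The numbered-subdirectory convention for step and bootstrap action logs can be sketched as a pair of small path helpers. This is only an illustration of the layout described above; the function names step_log_dir and bootstrap_action_log_dir are hypothetical and are not part of any Amazon EMR tool.

```python
# Hypothetical helpers that mirror the log layout on the master node.
# The directory names come from the list above; the function names are
# illustrative assumptions, not an Amazon EMR API.

LOG_ROOT = "/mnt/var/log"


def _check_index(number: int) -> None:
    # Subdirectories are numbered starting at 1 (first step, first action).
    if number < 1:
        raise ValueError("numbering starts at 1")


def step_log_dir(step_number: int) -> str:
    """Return the log directory for the Nth step of the cluster."""
    _check_index(step_number)
    return f"{LOG_ROOT}/hadoop/steps/{step_number}/"


def bootstrap_action_log_dir(action_number: int) -> str:
    """Return the log directory for the Nth bootstrap action."""
    _check_index(action_number)
    return f"{LOG_ROOT}/bootstrap-actions/{action_number}/"


if __name__ == "__main__":
    print(step_log_dir(1))
    print(bootstrap_action_log_dir(2))
```

A script running on the master node could use paths like these to locate the logs for a particular step before the node terminates.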
You can configure a cluster to periodically archive the log files stored on the master node to Amazon S3. This ensures that the log files are available after the cluster terminates, whether through normal shutdown or because of an error. Amazon EMR archives the log files to Amazon S3 at 5-minute intervals.
To have the log files archived to Amazon S3, you must enable this feature when you launch the cluster. You can do this using the console, the CLI, or the API, as shown in the following procedures.
To archive log files to Amazon S3 using the console
Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
Click the Create New Job Flow button to launch a new cluster. Then follow the instructions in the wizard. For more information about launching a cluster, see Plan an Amazon EMR Cluster.
On the ADVANCED OPTIONS pane, enter a value for Amazon S3 Log Path that indicates where you want Amazon EMR to copy the log files.

To archive log files to Amazon S3 using the CLI
Set the --log-uri argument when you launch the cluster and specify a location in Amazon S3. Alternatively, you can set this value in the credentials.json file that you configured for the CLI. This causes all of the clusters you launch with the CLI to archive log files to the specified S3 bucket. For more information about credentials.json, see "Configuring Credentials" in Install the Amazon EMR Command Line Interface.
The following example illustrates creating a cluster that archives log files to Amazon S3. Replace the value myawsbucket with the name of an S3 bucket that you own.
Linux, UNIX, and Mac OS X users:
./elastic-mapreduce --create --log-uri s3://myawsbucket
Windows users:
ruby elastic-mapreduce --create --log-uri s3://myawsbucket
To archive log files to Amazon S3 using the API
Specify a value for the LogUri parameter when you launch the cluster.
The following example illustrates creating a cluster that archives log files to Amazon S3. Replace the value myawsbucket with the name of an S3 bucket that you own. Replace each calculated value placeholder with the result of calculating the signature for the query request, as described in How to Generate a Signature for a Query Request in Amazon EMR.
Action=RunJobFlow&
Name=My%20Job%20Flow&
Timestamp=2010-05-26T11%3A25%3A40-07%3A00&
LogUri=s3%3A%2F%2Fmyawsbucket&
Instances.InstanceCount=1&
Instances.Ec2KeyName=MyKeyName&
Instances.MasterInstanceType=m1.small&
Instances.SlaveInstanceType=m1.small&
Instances.KeepJobFlowAliveWhenNoSteps=true&
AWSAccessKeyId=calculated value&
SignatureVersion=2&
SignatureMethod=HmacSHA1&
Signature=calculated value&
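The percent-encoded values in the request above, such as LogUri=s3%3A%2F%2Fmyawsbucket, can be produced with a standard URL-encoding routine. The following is a minimal sketch using Python's urllib.parse. The function names run_job_flow_params and encode_query are hypothetical, and the sketch deliberately leaves out AWSAccessKeyId and the signature parameters, which must be calculated separately.

```python
from urllib.parse import quote, urlencode


def run_job_flow_params(name, log_uri, timestamp):
    """Return the unsigned RunJobFlow parameters from the example above.

    Hypothetical helper: the parameter names and fixed values are copied
    from the example request; signing parameters are intentionally omitted.
    """
    return [
        ("Action", "RunJobFlow"),
        ("Name", name),
        ("Timestamp", timestamp),
        ("LogUri", log_uri),
        ("Instances.InstanceCount", "1"),
        ("Instances.Ec2KeyName", "MyKeyName"),
        ("Instances.MasterInstanceType", "m1.small"),
        ("Instances.SlaveInstanceType", "m1.small"),
        ("Instances.KeepJobFlowAliveWhenNoSteps", "true"),
    ]


def encode_query(params):
    # quote (rather than the default quote_plus) encodes spaces as %20,
    # and urlencode's default safe="" also encodes ":" and "/".
    return urlencode(params, quote_via=quote)


if __name__ == "__main__":
    query = encode_query(
        run_job_flow_params(
            "My Job Flow", "s3://myawsbucket", "2010-05-26T11:25:40-07:00"
        )
    )
    print(query.replace("&", "&\n"))
```

Building the parameters as a list of pairs keeps them in a predictable order and avoids hand-encoding mistakes such as a missing separator between two parameters.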
The debugging tool is a graphical user interface that you can use to browse the log files from the console. When you enable debugging on a cluster, Amazon EMR archives the log files to Amazon S3, indexes those files, and stores the index in Amazon SimpleDB. You can then use the graphical interface to browse the step, job, task, and task attempt logs for the cluster in an intuitive way. An example of using the debugging tool to browse log files is shown in Step 7: View the Results.
To be able to use the graphical debugging tool, you must enable debugging when you launch the cluster. You can do this using the console, the CLI, or the API.
To enable the debugging tool using the Amazon EMR console
Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
Click the Create New Job Flow button to launch a new cluster. Then follow the instructions in the wizard. For more information about launching a cluster, see Plan an Amazon EMR Cluster.
On the ADVANCED OPTIONS pane, enter a value for Amazon S3 Log Path that indicates where you want Amazon EMR to copy the log files. This is a prerequisite for enabling the debugging tool.

For Enable debugging, click Yes.
To enable the debugging tool using the CLI
Use the --enable-debugging argument when you create the cluster. You must also set the --log-uri argument and specify a location in Amazon S3 because archiving the log files to Amazon S3 is a prerequisite of the debugging tool. Alternatively, you can set the --log-uri value in the credentials.json file that you configured for the CLI. For more information about credentials.json, see "Configuring Credentials" in Install the Amazon EMR Command Line Interface.
The following example illustrates creating a cluster that enables debugging and archives log files to Amazon S3. Replace the value myawsbucket with the name of an S3 bucket that you own.
Linux, UNIX, and Mac OS X users:
./elastic-mapreduce --create --enable-debugging \
--log-uri s3://myawsbucket
Windows users:
ruby elastic-mapreduce --create --enable-debugging --log-uri s3://myawsbucket
To enable the debugging tool using the API
Add a step when you launch the cluster that runs s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar with the argument s3://us-east-1.elasticmapreduce/libs/state-pusher/0.1/fetch. This runs the Hadoop state pusher to push the state of Hadoop jobs in real time to Amazon SimpleDB. You must also set a value for the LogUri parameter because archiving the log files to Amazon S3 is a prerequisite of the debugging tool.
The following example illustrates creating a cluster that enables debugging. You would replace the value myawsbucket with the name of an S3 bucket that you own.
Action=RunJobFlow&
Name=My%20Job%20Flow&
Timestamp=2010-05-26T11%3A25%3A40-07%3A00&
LogUri=s3%3A%2F%2Fmyawsbucket&
Instances.InstanceCount=1&
Instances.Ec2KeyName=MyKeyName&
Instances.MasterInstanceType=m1.small&
Instances.SlaveInstanceType=m1.small&
Instances.KeepJobFlowAliveWhenNoSteps=true&
Steps.member.1.Name=Enable%20Debugging&
Steps.member.1.ActionOnFailure=TERMINATE_JOB_FLOW&
Steps.member.1.HadoopJarStep.Jar=s3%3A%2F%2Fus-east-1.elasticmapreduce%2Flibs%2Fscript-runner%2Fscript-runner.jar&
Steps.member.1.HadoopJarStep.Args.member.1=s3%3A%2F%2Fus-east-1.elasticmapreduce%2Flibs%2Fstate-pusher%2F0.1%2Ffetch&
AWSAccessKeyId=calculated value&
SignatureVersion=2&
SignatureMethod=HmacSHA1&
Signature=calculated value&
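The Steps.member.1 parameters that enable debugging can be assembled the same way as any other query parameters. The sketch below is one possible way to generate them with Python's urllib.parse. The helper name debugging_step_params is hypothetical, and the jar and argument locations are the ones given in this procedure.

```python
from urllib.parse import quote, urlencode

# Locations taken from the procedure above.
SCRIPT_RUNNER_JAR = (
    "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar"
)
STATE_PUSHER_FETCH = (
    "s3://us-east-1.elasticmapreduce/libs/state-pusher/0.1/fetch"
)


def debugging_step_params(index=1):
    """Return the query parameters for the step that enables debugging.

    Hypothetical helper; `index` is the position of the step in the
    request (Steps.member.<index>).
    """
    prefix = f"Steps.member.{index}"
    return [
        (f"{prefix}.Name", "Enable Debugging"),
        (f"{prefix}.ActionOnFailure", "TERMINATE_JOB_FLOW"),
        (f"{prefix}.HadoopJarStep.Jar", SCRIPT_RUNNER_JAR),
        (f"{prefix}.HadoopJarStep.Args.member.1", STATE_PUSHER_FETCH),
    ]


if __name__ == "__main__":
    # quote_via=quote encodes the space as %20; ":" and "/" are encoded
    # because urlencode's default safe set is empty.
    print(urlencode(debugging_step_params(), quote_via=quote).replace("&", "&\n"))
```

These parameters only take effect when the request also carries a LogUri value, because the debugging tool depends on the archived log files.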