Storing logs - Amazon EMR

Storing logs

To monitor your job progress on EMR Serverless and troubleshoot job failures, you can choose how EMR Serverless stores and serves application logs. When you submit a job run, you can specify managed storage, Amazon S3, and Amazon CloudWatch as your logging options.

With CloudWatch, you can specify the log types and log locations that you want to use, or accept the default types and locations. For more information on CloudWatch logs, see Logging for EMR Serverless with Amazon CloudWatch. For managed storage and Amazon S3 logging, the following table shows the log locations and application UI availability that you can expect if you choose managed storage, Amazon S3 buckets, or both.

| Option | Event logs | Container logs | Application UI |
| --- | --- | --- | --- |
| Managed storage | Stored in managed storage | Stored in managed storage | Supported |
| Both managed storage and S3 bucket | Stored in both places | Stored in S3 bucket | Supported |
| Amazon S3 bucket | Stored in S3 bucket | Stored in S3 bucket | Not supported¹ |

¹ We recommend that you keep the Managed storage option selected. Otherwise, you can't use the built-in application UIs.

Logging for EMR Serverless with managed storage

By default, EMR Serverless stores application logs securely in Amazon EMR managed storage for a maximum of 30 days.

Note

If you turn off the default option, Amazon EMR can't troubleshoot your jobs on your behalf.

To turn off this option from EMR Studio, deselect the Allow AWS to retain logs for 30 days check box in the Additional settings section of the Submit job page.

To turn off this option from the AWS CLI, use the managedPersistenceMonitoringConfiguration configuration when you submit a job run.

{
  "monitoringConfiguration": {
    "managedPersistenceMonitoringConfiguration": {
      "enabled": false
    }
  }
}

Logging for EMR Serverless with Amazon S3 buckets

Before your jobs can send log data to Amazon S3, you must include the following permissions in the permissions policy for the job runtime role. Replace amzn-s3-demo-logging-bucket with the name of your logging bucket.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject"
      ],
      "Resource": [
        "arn:aws:s3:::amzn-s3-demo-logging-bucket/*"
      ]
    }
  ]
}
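If you maintain more than one logging bucket, it can help to generate the policy above instead of editing it by hand. The following is a minimal Python sketch that only reproduces the statement shown above. The function name build_s3_logging_policy is hypothetical and not part of any AWS SDK.

```python
import json


def build_s3_logging_policy(bucket_name: str) -> dict:
    """Return the job runtime role policy shown above for one logging bucket.

    Hypothetical helper for illustration; not part of any AWS SDK.
    """
    if not bucket_name or "/" in bucket_name:
        raise ValueError("bucket_name must be a bare bucket name without slashes")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                # Grants write access to every object key in the bucket.
                "Resource": [f"arn:aws:s3:::{bucket_name}/*"],
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_s3_logging_policy("amzn-s3-demo-logging-bucket"), indent=2))
```

The printed JSON can be pasted into the permissions policy of the job runtime role.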

To set up an Amazon S3 bucket to store logs from the AWS CLI, use the s3MonitoringConfiguration configuration when you start a job run. To do this, pass the following JSON to the --configuration-overrides parameter.

{
  "monitoringConfiguration": {
    "s3MonitoringConfiguration": {
      "logUri": "s3://amzn-s3-demo-logging-bucket/logs/"
    }
  }
}
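The managed storage and Amazon S3 settings both live under monitoringConfiguration, so they can be assembled together. The following Python sketch shows one way to do that. The helper name build_monitoring_overrides and its parameters are hypothetical; only the JSON keys come from the examples above.

```python
import json
from typing import Optional


def build_monitoring_overrides(log_uri: Optional[str] = None,
                               managed_storage: bool = True) -> dict:
    """Build a configuration overrides document for managed storage and S3 logging.

    Hypothetical helper for illustration; only the JSON keys come from the
    examples in this section.
    """
    monitoring: dict = {}
    if not managed_storage:
        # Managed storage is on by default, so it is written only when disabled.
        monitoring["managedPersistenceMonitoringConfiguration"] = {"enabled": False}
    if log_uri is not None:
        if not log_uri.startswith("s3://"):
            raise ValueError("log_uri must start with s3://")
        monitoring["s3MonitoringConfiguration"] = {"logUri": log_uri}
    return {"monitoringConfiguration": monitoring}


if __name__ == "__main__":
    overrides = build_monitoring_overrides("s3://amzn-s3-demo-logging-bucket/logs/")
    # The serialized string is what you would pass to --configuration-overrides.
    print(json.dumps(overrides))
```

Keeping managed_storage at its default leaves the managed storage setting out of the document, which matches the default behavior described earlier.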

For batch jobs that don't have retries enabled, EMR Serverless sends the logs to the following path:

'/applications/<applicationId>/jobs/<jobId>'

EMR Serverless releases 7.1.0 and higher support retry attempts for streaming jobs and batch jobs. If you run a job with retries enabled, EMR Serverless automatically adds an attempt number to the log path prefix, so you can better distinguish and track logs.

'/applications/<applicationId>/jobs/<jobId>/attempts/<attemptNumber>/'
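To locate logs programmatically, the two path patterns above can be captured in one small function. This is a sketch only: the function name log_path_prefix and the sample IDs are hypothetical. It returns just the path component shown above and makes no assumption about how that path is joined to your configured logUri.

```python
from typing import Optional


def log_path_prefix(application_id: str, job_id: str,
                    attempt: Optional[int] = None) -> str:
    """Return the log path for a job run, with or without a retry attempt.

    Hypothetical helper; the path layout follows the two patterns above.
    """
    base = f"/applications/{application_id}/jobs/{job_id}"
    if attempt is None:
        # Batch job without retries: no attempt segment.
        return base
    # Retries enabled: the attempt number is added to the prefix.
    return f"{base}/attempts/{attempt}/"


if __name__ == "__main__":
    print(log_path_prefix("app-123", "job-456"))
    # /applications/app-123/jobs/job-456
    print(log_path_prefix("app-123", "job-456", attempt=2))
    # /applications/app-123/jobs/job-456/attempts/2/
```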

Logging for EMR Serverless with Amazon CloudWatch

When you submit a job to an EMR Serverless application, you can choose Amazon CloudWatch as an option to store your application logs. This allows you to use CloudWatch log analysis features such as CloudWatch Logs Insights and Live Tail. You can also stream logs from CloudWatch to other systems such as OpenSearch for further analysis.

EMR Serverless provides real-time logging for driver logs. You can view the logs in real time with the CloudWatch Live Tail capability, or through CloudWatch CLI tail commands.

By default, CloudWatch logging is disabled for EMR Serverless. To enable it, see the configuration in the AWS CLI section.

Note

Publishing logs to Amazon CloudWatch in real time consumes additional worker resources. If you choose a low worker capacity, the impact on your job run time might be greater. If you enable CloudWatch logging, we recommend that you choose a greater worker capacity. Log publication might also be throttled if the transactions per second (TPS) rate limit for PutLogEvents is too low. The CloudWatch throttling configuration is global to all services, including EMR Serverless. For more information, see How do I determine throttling in my CloudWatch logs? on AWS re:Post.

Required permissions for logging with CloudWatch

Before your jobs can send log data to Amazon CloudWatch, you must include the following permissions in the permissions policy for the job runtime role.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:DescribeLogGroups"
      ],
      "Resource": [
        "arn:aws:logs:AWS Region:111122223333:*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:PutLogEvents",
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:DescribeLogStreams"
      ],
      "Resource": [
        "arn:aws:logs:AWS Region:111122223333:log-group:my-log-group-name:*"
      ]
    }
  ]
}
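The policy above contains three placeholders: the AWS Region, the account ID, and the log group name. The following Python sketch fills them in. The function name build_cloudwatch_logging_policy is hypothetical, and the 12-digit account ID check is an illustrative safeguard rather than something this page requires.

```python
import json


def build_cloudwatch_logging_policy(region: str, account_id: str,
                                    log_group_name: str) -> dict:
    """Return the CloudWatch logging policy shown above with placeholders filled in.

    Hypothetical helper for illustration; not part of any AWS SDK.
    """
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("account_id must be a 12-digit string")
    arn_base = f"arn:aws:logs:{region}:{account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["logs:DescribeLogGroups"],
                "Resource": [f"{arn_base}:*"],
            },
            {
                "Effect": "Allow",
                "Action": [
                    "logs:PutLogEvents",
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:DescribeLogStreams",
                ],
                # Scoped to a single log group and its streams.
                "Resource": [f"{arn_base}:log-group:{log_group_name}:*"],
            },
        ],
    }


if __name__ == "__main__":
    policy = build_cloudwatch_logging_policy("us-east-1", "111122223333", "my-log-group-name")
    print(json.dumps(policy, indent=2))
```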

AWS CLI

To set up Amazon CloudWatch to store logs for EMR Serverless from the AWS CLI, use the cloudWatchLoggingConfiguration configuration when you start a job run. To do this, provide the following configuration overrides. Optionally, you can also provide a log group name, log stream prefix name, log types, and an encryption key ARN.

If you don't specify the optional values, then EMR Serverless publishes the logs to the default log group /aws/emr-serverless, with the default log stream /applications/<applicationId>/jobs/<jobId>/worker-type.

EMR Serverless releases 7.1.0 and higher support retry attempts for streaming jobs and batch jobs. If you enabled retries for a job, EMR Serverless automatically adds an attempt number to the log path prefix, so you can better distinguish and track logs.

'/applications/<applicationId>/jobs/<jobId>/attempts/<attemptNumber>/worker-type'

The following shows the minimum configuration that is required to turn on Amazon CloudWatch logging with the default settings for EMR Serverless:

{
  "monitoringConfiguration": {
    "cloudWatchLoggingConfiguration": {
      "enabled": true
    }
  }
}

The following example shows all of the required and optional configurations that you can specify when you turn on Amazon CloudWatch logging for EMR Serverless. The supported logTypes values are also listed below this example. The // comments are annotations only. Remove them before you use the JSON, because JSON doesn't allow comments.

{
  "monitoringConfiguration": {
    "cloudWatchLoggingConfiguration": {
      "enabled": true, // Required
      "logGroupName": "Example_logGroup", // Optional
      "logStreamNamePrefix": "Example_logStream", // Optional
      "encryptionKeyArn": "key-arn", // Optional
      "logTypes": {
        "SPARK_DRIVER": ["STDOUT", "STDERR"] // List of values
      }
    }
  }
}

By default, EMR Serverless publishes only the driver stdout and stderr logs to CloudWatch. If you want other logs, then you can specify a container role and corresponding log types with the logTypes field.

The following list shows the supported worker types that you can specify for the logTypes configuration:

Spark
  • SPARK_DRIVER : ["STDERR", "STDOUT"]

  • SPARK_EXECUTOR : ["STDERR", "STDOUT"]

Hive
  • HIVE_DRIVER : ["STDERR", "STDOUT", "HIVE_LOG", "TEZ_AM"]

  • TEZ_TASK : ["STDERR", "STDOUT", "SYSTEM_LOGS"]
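A mistyped worker type or log type is easier to catch before you submit a job run. The following Python sketch checks a logTypes mapping against the list above and builds the corresponding cloudWatchLoggingConfiguration. The names SUPPORTED_LOG_TYPES and build_cloudwatch_config are hypothetical. Treating the uppercase spellings in the list as the only valid values is an assumption of this sketch.

```python
from typing import Dict, List, Optional

# Worker types and log types copied from the list above.
SUPPORTED_LOG_TYPES = {
    "SPARK_DRIVER": {"STDERR", "STDOUT"},
    "SPARK_EXECUTOR": {"STDERR", "STDOUT"},
    "HIVE_DRIVER": {"STDERR", "STDOUT", "HIVE_LOG", "TEZ_AM"},
    "TEZ_TASK": {"STDERR", "STDOUT", "SYSTEM_LOGS"},
}


def build_cloudwatch_config(log_types: Optional[Dict[str, List[str]]] = None,
                            log_group_name: Optional[str] = None,
                            log_stream_name_prefix: Optional[str] = None,
                            encryption_key_arn: Optional[str] = None) -> dict:
    """Build a CloudWatch logging configuration, validating logTypes first.

    Hypothetical helper; the case-sensitive check is an assumption.
    """
    config: dict = {"enabled": True}  # The only required field.
    if log_group_name:
        config["logGroupName"] = log_group_name
    if log_stream_name_prefix:
        config["logStreamNamePrefix"] = log_stream_name_prefix
    if encryption_key_arn:
        config["encryptionKeyArn"] = encryption_key_arn
    if log_types:
        for worker, types in log_types.items():
            if worker not in SUPPORTED_LOG_TYPES:
                raise ValueError(f"unsupported worker type: {worker}")
            unknown = set(types) - SUPPORTED_LOG_TYPES[worker]
            if unknown:
                raise ValueError(f"unsupported log types for {worker}: {sorted(unknown)}")
        config["logTypes"] = {worker: list(types) for worker, types in log_types.items()}
    return {"monitoringConfiguration": {"cloudWatchLoggingConfiguration": config}}


if __name__ == "__main__":
    print(build_cloudwatch_config({"SPARK_EXECUTOR": ["STDERR"]}, log_group_name="Example_logGroup"))
```

Calling build_cloudwatch_config() with no arguments yields the minimum configuration, with only "enabled" set to true.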