Storing logs
To monitor your job progress on EMR Serverless and troubleshoot job failures, you can choose how EMR Serverless stores and serves application logs. When you submit a job run, you can specify managed storage, Amazon S3, and Amazon CloudWatch as your logging options.
With CloudWatch, you can specify the log types and log locations that you want to use, or accept the default types and locations. For more information on CloudWatch logs, see Logging for EMR Serverless with Amazon CloudWatch. The following table shows the log locations and UI availability that you can expect if you choose managed storage, Amazon S3 buckets, or both.
Option | Event logs | Container logs | Application UI |
---|---|---|---|
Managed storage | Stored in managed storage | Stored in managed storage | Supported |
Both managed storage and S3 bucket | Stored in both places | Stored in S3 bucket | Supported |
Amazon S3 bucket | Stored in S3 bucket | Stored in S3 bucket | Not supported¹ |

¹ We recommend that you keep the Managed storage option selected. Otherwise, you can't use the built-in application UIs.
Logging for EMR Serverless with managed storage
By default, EMR Serverless stores application logs securely in Amazon EMR managed storage for a maximum of 30 days.
Note
If you turn off the default option, Amazon EMR can't troubleshoot your jobs on your behalf.
To turn off this option from EMR Studio, deselect the Allow AWS to retain logs for 30 days check box in the Additional settings section of the Submit job page.
To turn off this option from the AWS CLI, use the managedPersistenceMonitoringConfiguration configuration when you submit a job run.
{ "monitoringConfiguration": { "managedPersistenceMonitoringConfiguration": { "enabled": false } } }
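The same setting can be assembled programmatically. The following is a minimal sketch, assuming you build the overrides document in Python before submitting the job; the helper name `managed_storage_overrides` is hypothetical, and only the key names come from the configuration above.

```python
import json


def managed_storage_overrides(enabled: bool) -> dict:
    """Return configuration overrides that toggle managed log storage.

    The helper name is illustrative; the key names mirror the JSON above.
    """
    return {
        "monitoringConfiguration": {
            "managedPersistenceMonitoringConfiguration": {"enabled": enabled}
        }
    }


# Serialize the overrides so they can be passed as a JSON document.
overrides_json = json.dumps(managed_storage_overrides(False))
print(overrides_json)
```

Building the document from a function rather than a hand-edited string keeps the nesting consistent when you combine it with other monitoring options.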
Logging for EMR Serverless with Amazon S3 buckets
Before your jobs can send log data to Amazon S3, you must include the following permissions in the permissions policy for the job runtime role. Replace amzn-s3-demo-logging-bucket with the name of your logging bucket.
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "s3:PutObject" ], "Resource": [ "arn:aws:s3:::amzn-s3-demo-logging-bucket/*" ] } ] }
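If you generate this policy for more than one bucket, a small helper can fill in the bucket name. The following is a sketch under that assumption; the function name `s3_logging_policy` and its input check are illustrative and not part of any AWS SDK.

```python
import json


def s3_logging_policy(bucket_name: str) -> dict:
    """Build the s3:PutObject policy shown above for one logging bucket."""
    # Expect a bare bucket name; a path would produce a malformed ARN.
    if not bucket_name or "/" in bucket_name:
        raise ValueError("bucket_name must be a bucket name without a path")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": [f"arn:aws:s3:::{bucket_name}/*"],
            }
        ],
    }


policy = s3_logging_policy("amzn-s3-demo-logging-bucket")
print(json.dumps(policy, indent=2))
```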
To set up an Amazon S3 bucket to store logs from the AWS CLI, use the s3MonitoringConfiguration configuration when you start a job run. To do this, provide the following configuration in the --configuration-overrides parameter.
{ "monitoringConfiguration": { "s3MonitoringConfiguration": { "logUri": "s3://amzn-s3-demo-logging-bucket/logs/" } } }
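For jobs that are submitted from Python rather than the AWS CLI, the same overrides can be passed to the `start_job_run` operation of the boto3 `emr-serverless` client. The following is a minimal sketch, assuming a Spark job; the wrapper name `start_job_with_s3_logs` is hypothetical, and the application ID, role ARN, and script location in the usage comment are placeholders.

```python
def start_job_with_s3_logs(client, application_id, execution_role_arn,
                           entry_point, log_uri):
    """Start a Spark job run whose logs are delivered to an S3 location.

    `client` is expected to be a boto3 "emr-serverless" client.
    """
    if not log_uri.startswith("s3://"):
        raise ValueError("log_uri must be an s3:// URI")
    overrides = {
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {"logUri": log_uri}
        }
    }
    return client.start_job_run(
        applicationId=application_id,
        executionRoleArn=execution_role_arn,
        jobDriver={"sparkSubmit": {"entryPoint": entry_point}},
        configurationOverrides=overrides,
    )


# Usage (requires boto3 and AWS credentials; every value is a placeholder):
# import boto3
# client = boto3.client("emr-serverless")
# start_job_with_s3_logs(
#     client,
#     "my-application-id",
#     "arn:aws:iam::111122223333:role/my-job-runtime-role",
#     "s3://amzn-s3-demo-bucket/scripts/job.py",
#     "s3://amzn-s3-demo-logging-bucket/logs/",
# )
```

Passing the client in as an argument keeps the helper easy to exercise with a stub before you run it against a real application.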
For batch jobs that don't have retries enabled, EMR Serverless sends the logs to the following path:
'/applications/<applicationId>/jobs/<jobId>'
EMR Serverless releases 7.1.0 and higher support retry attempts for streaming jobs and batch jobs. If you run a job with retries enabled, EMR Serverless automatically adds an attempt number to the log path prefix, so you can better distinguish and track logs.
'/applications/<applicationId>/jobs/<jobId>/attempts/<attemptNumber>/'
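When you need to locate logs from a script, the two path layouts above can be reproduced with a small helper. This is a sketch only; the function name `job_log_path` and the IDs in the example are hypothetical, and the output mirrors the paths exactly as written above.

```python
from typing import Optional


def job_log_path(application_id: str, job_id: str,
                 attempt: Optional[int] = None) -> str:
    """Return the log path for a job run.

    Pass `attempt` only for jobs that run with retries enabled.
    """
    base = f"/applications/{application_id}/jobs/{job_id}"
    if attempt is None:
        return base
    return f"{base}/attempts/{attempt}/"


print(job_log_path("my-app-id", "my-job-id"))
print(job_log_path("my-app-id", "my-job-id", attempt=2))
```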
Logging for EMR Serverless with Amazon CloudWatch
When you submit a job to an EMR Serverless application, you can choose Amazon CloudWatch as an option to store your application logs. This allows you to use CloudWatch log analysis features such as CloudWatch Logs Insights and Live Tail. You can also stream logs from CloudWatch to other systems such as OpenSearch for further analysis.
EMR Serverless provides real-time logging for driver logs. You can view the logs in real time with the CloudWatch live tail capability, or through CloudWatch CLI tail commands.
By default, CloudWatch logging is disabled for EMR Serverless. To enable it, use the configuration described in the AWS CLI section.
Note
Amazon CloudWatch publishes logs in real time, so it consumes more worker resources. If you choose a low worker capacity, the impact on your job run time might increase. If you enable CloudWatch logging, we recommend that you choose a greater worker capacity. Log publication can also be throttled if the transactions per second (TPS) rate for PutLogEvents is too low. The CloudWatch throttling configuration is global to all services, including EMR Serverless. For more information, see How do I determine throttling in my CloudWatch logs?
Required permissions for logging with CloudWatch
Before your jobs can send log data to Amazon CloudWatch, you must include the following permissions in the permissions policy for the job runtime role.
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "logs:DescribeLogGroups" ], "Resource": [ "arn:aws:logs:AWS Region:111122223333:*" ] }, { "Effect": "Allow", "Action": [ "logs:PutLogEvents", "logs:CreateLogGroup", "logs:CreateLogStream", "logs:DescribeLogStreams" ], "Resource": [ "arn:aws:logs:AWS Region:111122223333:log-group:my-log-group-name:*" ] } ] }
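The policy above contains three placeholders: the AWS Region, the account ID, and the log group name. The following sketch fills them in from function arguments; the helper name `cloudwatch_logging_policy` is hypothetical, and `us-east-1` is only an example Region.

```python
import json


def cloudwatch_logging_policy(region: str, account_id: str,
                              log_group_name: str) -> dict:
    """Build the CloudWatch Logs policy shown above for one log group."""
    account_arn = f"arn:aws:logs:{region}:{account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing log groups is granted across the account and Region.
                "Effect": "Allow",
                "Action": ["logs:DescribeLogGroups"],
                "Resource": [f"{account_arn}:*"],
            },
            {
                # Write access is limited to the named log group.
                "Effect": "Allow",
                "Action": [
                    "logs:PutLogEvents",
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:DescribeLogStreams",
                ],
                "Resource": [f"{account_arn}:log-group:{log_group_name}:*"],
            },
        ],
    }


policy = cloudwatch_logging_policy("us-east-1", "111122223333", "my-log-group-name")
print(json.dumps(policy, indent=2))
```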
AWS CLI
To set up Amazon CloudWatch to store logs for EMR Serverless from the AWS CLI, use the cloudWatchLoggingConfiguration configuration when you start a job run. To do this, provide the following configuration overrides. Optionally, you can also provide a log group name, log stream prefix name, log types, and an encryption key ARN.
If you don't specify the optional values, then CloudWatch publishes the logs to the default log group /aws/emr-serverless, with the default log stream /applications/applicationId/jobs/jobId/worker-type.
EMR Serverless releases 7.1.0 and higher support retry attempts for streaming jobs and batch jobs. If you enabled retries for a job, EMR Serverless automatically adds an attempt number to the log path prefix, so you can better distinguish and track logs.
'/applications/<applicationId>/jobs/<jobId>/attempts/<attemptNumber>/worker-type'
The following shows the minimum configuration that is required to turn on Amazon CloudWatch logging with the default settings for EMR Serverless:
{ "monitoringConfiguration": { "cloudWatchLoggingConfiguration": { "enabled": true } } }
The following example shows all of the required and optional configurations that you can specify when you turn on Amazon CloudWatch logging for EMR Serverless. Only enabled is required. The logGroupName, logStreamNamePrefix, encryptionKeyArn, and logTypes fields are optional, and logTypes maps each worker type to a list of log types. The supported logTypes values are listed after this example.
{ "monitoringConfiguration": { "cloudWatchLoggingConfiguration": { "enabled": true, "logGroupName": "Example_logGroup", "logStreamNamePrefix": "Example_logStream", "encryptionKeyArn": "key-arn", "logTypes": { "SPARK_DRIVER": ["stdout", "stderr"] } } } }
By default, EMR Serverless publishes only the driver stdout and stderr logs to CloudWatch. If you want other logs, then you can specify a container role and corresponding log types with the logTypes field.
The following list shows the supported worker types that you can specify for the logTypes configuration:
- Spark
  - SPARK_DRIVER : ["STDERR", "STDOUT"]
  - SPARK_EXECUTOR : ["STDERR", "STDOUT"]
- Hive
  - HIVE_DRIVER : ["STDERR", "STDOUT", "HIVE_LOG", "TEZ_AM"]
  - TEZ_TASK : ["STDERR", "STDOUT", "SYSTEM_LOGS"]
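The worker types and log types listed above can be checked before a job is submitted. The following is a minimal sketch of such a check combined with a builder for the CloudWatch overrides; the names `SUPPORTED_LOG_TYPES` and `cloudwatch_logging_overrides` are hypothetical. Because this page shows log type names in both lowercase and uppercase, the sketch compares them case-insensitively and passes them through unchanged, which is an assumption rather than documented behavior.

```python
import json
from typing import Dict, List, Optional

# Worker types and log types copied from the list above.
SUPPORTED_LOG_TYPES = {
    "SPARK_DRIVER": {"STDERR", "STDOUT"},
    "SPARK_EXECUTOR": {"STDERR", "STDOUT"},
    "HIVE_DRIVER": {"STDERR", "STDOUT", "HIVE_LOG", "TEZ_AM"},
    "TEZ_TASK": {"STDERR", "STDOUT", "SYSTEM_LOGS"},
}


def cloudwatch_logging_overrides(
    log_types: Optional[Dict[str, List[str]]] = None,
    log_group_name: Optional[str] = None,
    log_stream_name_prefix: Optional[str] = None,
    encryption_key_arn: Optional[str] = None,
) -> dict:
    """Build CloudWatch logging overrides, rejecting unlisted log types."""
    config: dict = {"enabled": True}  # the only required field
    if log_group_name:
        config["logGroupName"] = log_group_name
    if log_stream_name_prefix:
        config["logStreamNamePrefix"] = log_stream_name_prefix
    if encryption_key_arn:
        config["encryptionKeyArn"] = encryption_key_arn
    if log_types:
        for worker_type, names in log_types.items():
            allowed = SUPPORTED_LOG_TYPES.get(worker_type)
            if allowed is None:
                raise ValueError(f"unsupported worker type: {worker_type}")
            unknown = [n for n in names if n.upper() not in allowed]
            if unknown:
                raise ValueError(f"unsupported log types for {worker_type}: {unknown}")
        config["logTypes"] = log_types
    return {"monitoringConfiguration": {"cloudWatchLoggingConfiguration": config}}


overrides = cloudwatch_logging_overrides(
    log_types={"SPARK_DRIVER": ["STDOUT", "STDERR"], "SPARK_EXECUTOR": ["STDERR"]},
    log_group_name="Example_logGroup",
)
print(json.dumps(overrides, indent=2))
```

Calling the function with no arguments yields the minimum configuration, in which only enabled is set.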