
Monitoring jobs

Monitor jobs with Amazon CloudWatch Events

Amazon EMR on EKS emits an event whenever the state of a job run changes. Each event includes the date and time when it occurred, along with further details such as the virtual cluster ID and the ID of the affected job run.

You can use events to track the activity and health of the jobs that you run on a virtual cluster. You can also use Amazon CloudWatch Events to define an action to take when a job run generates an event that matches a pattern that you specify. Events are useful for monitoring a specific occurrence during the lifecycle of a job run. For example, you can monitor when a job run changes state from SUBMITTED to RUNNING. For more information about CloudWatch Events, see the Amazon CloudWatch Events User Guide.

The following table lists the Amazon EMR on EKS events, along with the state or state change that each event indicates, its severity, and its message. Each event is represented as a JSON object that is sent automatically to an event stream. The JSON object includes further details about the event. It matters most when you set up rules for event processing with CloudWatch Events, because rules match patterns against the fields of this JSON object. For more information, see Events and Event Patterns and Amazon EMR on EKS Events in the Amazon CloudWatch Events User Guide.
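As a rough illustration of how rules match events, the following Python sketch reimplements only the simplest form of pattern matching. Each pattern field lists the allowed values, nested objects are matched recursively, and event fields that the pattern does not mention are ignored. It is not the CloudWatch Events matching engine and omits its more advanced filter syntax. The sample event is hypothetical. Its "source" value (aws.emr-containers) and the field names under "detail" (id, virtualClusterId, state) are assumptions about the event's JSON shape.

```python
from typing import Any, Dict


def matches(pattern: Dict[str, Any], event: Dict[str, Any]) -> bool:
    """Return True if every field named in `pattern` matches `event`.

    Simplified sketch: a list in the pattern means "the event value must be
    one of these"; a dict means "recurse into the nested object". Event
    fields that the pattern does not mention are ignored.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Hypothetical event; "source" and the field names under "detail" are assumptions.
sample_event = {
    "detail-type": "EMR Job Run State Change",
    "source": "aws.emr-containers",
    "detail": {
        "id": "example-job-run-id",
        "virtualClusterId": "example-virtual-cluster-id",
        "state": "FAILED",
    },
}

all_state_changes = {"detail-type": ["EMR Job Run State Change"]}
failures_only = {
    "detail-type": ["EMR Job Run State Change"],
    "detail": {"state": ["FAILED"]},
}
completions_only = {
    "detail-type": ["EMR Job Run State Change"],
    "detail": {"state": ["COMPLETED"]},
}

print(matches(all_state_changes, sample_event))  # True
print(matches(failures_only, sample_event))      # True
print(matches(completions_only, sample_event))   # False
```

Adding a field to a pattern can only narrow the set of matching events. This is why a rule that lists only "detail-type" receives every job run state change.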

Job run state change events

State      Severity  Message
SUBMITTED  INFO      Job Run JobRunId (JobRunName) was successfully submitted to virtual cluster VirtualClusterId at Time UTC.
RUNNING    INFO      Job Run JobRunId (JobRunName) in virtual cluster VirtualClusterId started running at Time.
COMPLETED  INFO      Job Run JobRunId (JobRunName) in virtual cluster VirtualClusterId completed at Time. The Job Run started running at Time and took Num minutes to complete.
CANCELLED  WARN      Cancellation request has succeeded for Job Run JobRunId (JobRunName) in virtual cluster VirtualClusterId at Time and the Job Run is now cancelled.
FAILED     ERROR     Job Run JobRunId (JobRunName) in virtual cluster VirtualClusterId failed at Time.
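If you process these events in code, you can encode the table as data. The following Python sketch does that. SEVERITY_BY_STATE, TERMINAL_STATES, and the helper functions are hypothetical names. Treating COMPLETED, CANCELLED, and FAILED as terminal states is an assumption based on the state names, not something the table states.

```python
# Severity for each job run state, as listed in the table above.
SEVERITY_BY_STATE = {
    "SUBMITTED": "INFO",
    "RUNNING": "INFO",
    "COMPLETED": "INFO",
    "CANCELLED": "WARN",
    "FAILED": "ERROR",
}

# Assumption: no further state changes are expected after these states.
TERMINAL_STATES = frozenset({"COMPLETED", "CANCELLED", "FAILED"})


def severity_for(state: str) -> str:
    """Look up the severity for a state; raises KeyError for unknown states."""
    return SEVERITY_BY_STATE[state]


def is_terminal(state: str) -> bool:
    """True if the job run is not expected to change state again."""
    return state in TERMINAL_STATES


def needs_attention(state: str) -> bool:
    """True for states whose events are more severe than INFO."""
    return severity_for(state) != "INFO"


for state in SEVERITY_BY_STATE:
    print(state, severity_for(state), is_terminal(state), needs_attention(state))
```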

Automate Amazon EMR on EKS with CloudWatch Events

You can use Amazon CloudWatch Events to automate how your AWS services respond to system events, such as application availability issues or resource changes. Events from AWS services are delivered to CloudWatch Events in near real time. You can write simple rules that specify which events interest you and which automated actions to take when an event matches a rule. The actions that can be triggered automatically include the following:

  • Invoking an AWS Lambda function

  • Invoking Amazon EC2 Run Command

  • Relaying the event to Amazon Kinesis Data Streams

  • Activating an AWS Step Functions state machine

  • Notifying an Amazon Simple Notification Service (SNS) topic or an Amazon Simple Queue Service (SQS) queue

Some examples of using CloudWatch Events with Amazon EMR on EKS include the following:

  • Activating a Lambda function when a job run succeeds

  • Notifying an Amazon SNS topic when a job run fails

Amazon EMR on EKS generates CloudWatch Events with "detail-type": "EMR Job Run State Change" for the SUBMITTED, RUNNING, CANCELLED, FAILED, and COMPLETED state changes.
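To narrow a rule to particular transitions, such as FAILED for the SNS example above, you can add a filter on the event's "detail" object. The following Python sketch builds the --event-pattern argument for aws events put-rule with json.dumps and shlex.quote, which avoids hand-quoting mistakes. The put_rule_command helper is hypothetical. It only returns a command string and does not call AWS. The "state" field name under "detail" is an assumption about the event's JSON shape.

```python
import json
import shlex
from typing import Optional, Sequence


def put_rule_command(rule_name: str, states: Optional[Sequence[str]] = None) -> str:
    """Build an `aws events put-rule` command line for job run state changes.

    Hypothetical helper: it only returns a string. If `states` is given,
    the pattern also filters on detail.state (an assumed field name).
    """
    pattern = {"detail-type": ["EMR Job Run State Change"]}
    if states:
        pattern["detail"] = {"state": sorted(states)}
    return " ".join([
        "aws", "events", "put-rule",
        "--name", shlex.quote(rule_name),
        # json.dumps produces valid JSON; shlex.quote makes it a single shell word.
        "--event-pattern", shlex.quote(json.dumps(pattern)),
    ])


print(put_rule_command("cwe-test"))
print(put_rule_command("notify-on-failure", ["FAILED"]))
```

Called without states, the helper produces the same event pattern that the Lambda example below uses.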

Example: Set up a rule that invokes Lambda

Use the following steps to set up a CloudWatch Events rule that invokes a Lambda function whenever an "EMR Job Run State Change" event occurs. First, create the rule:

aws events put-rule \
    --name cwe-test \
    --event-pattern '{"detail-type": ["EMR Job Run State Change"]}'

Next, add a Lambda function that you own as the rule's target, and give CloudWatch Events permission to invoke the function. Replace 123456789012 with your account ID, us-east-1 with your Region, and MyFunction with the name of your function.

aws events put-targets \
    --rule cwe-test \
    --targets Id=1,Arn=arn:aws:lambda:us-east-1:123456789012:function:MyFunction

aws lambda add-permission \
    --function-name MyFunction \
    --statement-id MyId \
    --action 'lambda:InvokeFunction' \
    --principal events.amazonaws.com
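The target function itself can be small. The following is a minimal sketch of a handler for MyFunction, assuming the Python Lambda runtime and its conventional lambda_handler(event, context) entry point. The field names it reads from "detail" (id, virtualClusterId, state) are assumptions about the event's shape, not a documented contract. The failure branch is only a placeholder for your own logic.

```python
import json
import logging

# The Lambda Python runtime attaches a handler to the root logger.
logger = logging.getLogger()
logger.setLevel(logging.INFO)


def lambda_handler(event, context):
    """Handle an "EMR Job Run State Change" event from CloudWatch Events.

    The detail field names are assumptions; .get() keeps the handler
    from raising if one of them is absent.
    """
    detail = event.get("detail", {})
    job_run_id = detail.get("id", "unknown")
    cluster_id = detail.get("virtualClusterId", "unknown")
    state = detail.get("state", "UNKNOWN")

    logger.info("Job run %s in virtual cluster %s is now %s",
                job_run_id, cluster_id, state)

    if state == "FAILED":
        # Placeholder: react to the failure here, for example by alerting someone.
        logger.error("Job run %s failed: %s", job_run_id, json.dumps(detail))

    return {"jobRunId": job_run_id, "state": state}


if __name__ == "__main__":
    # Local smoke test with a hypothetical event; in Lambda the runtime calls lambda_handler.
    sample = {"detail": {"id": "example-job-run-id",
                         "virtualClusterId": "example-virtual-cluster-id",
                         "state": "FAILED"}}
    print(lambda_handler(sample, None))
```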

Don't write programs that depend on the order or existence of notification events. Events are emitted on a best-effort basis, so they might arrive out of sequence or not at all.
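Because events can arrive out of order, be repeated, or never arrive, a consumer that keeps its own view of job run state should tolerate all three cases. The following Python sketch keeps the furthest-along state seen for each job run and never moves a job run backward. JobRunTracker and STATE_RANK are hypothetical names. The lifecycle ordering (SUBMITTED, then RUNNING, then a terminal state) is an assumption based on the state names. When you need an authoritative answer, query the job run itself, for example with aws emr-containers describe-job-run, instead of relying on this view.

```python
from dataclasses import dataclass, field
from typing import Dict

# Assumed lifecycle order; the three terminal states share the highest rank.
STATE_RANK = {"SUBMITTED": 0, "RUNNING": 1, "COMPLETED": 2, "CANCELLED": 2, "FAILED": 2}


@dataclass
class JobRunTracker:
    """Track the furthest-along state seen for each job run.

    Stale or duplicate events are ignored, and the first terminal state
    recorded for a job run is kept. Unknown states raise KeyError.
    """
    states: Dict[str, str] = field(default_factory=dict)

    def apply(self, job_run_id: str, state: str) -> bool:
        """Record `state` if it advances the job run; return True if it did."""
        current = self.states.get(job_run_id)
        if current is not None and STATE_RANK[state] <= STATE_RANK[current]:
            return False  # stale, duplicate, or conflicting terminal event
        self.states[job_run_id] = state
        return True


tracker = JobRunTracker()
# RUNNING arrives after COMPLETED, and SUBMITTED never arrives at all.
for state in ["COMPLETED", "RUNNING"]:
    print(state, tracker.apply("job-1", state))
print(tracker.states["job-1"])
```

Ranking states instead of comparing event timestamps means that duplicates and late arrivals are handled the same way, and a missing intermediate event, such as RUNNING, does not block a later terminal state from being recorded.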