Appendix E: State Machine Error Handling - Media2Cloud

All state machines have built-in retry logic that uses the AWS Step Functions Retry and Catch fields to improve the resiliency of the workflow.

"Retry": [
    {
        "ErrorEquals": [ "States.ALL" ],
        "IntervalSeconds": 1,
        "MaxAttempts": 4,
        "BackoffRate": 1.2
    }
],
"Catch": [
    {
        "ErrorEquals": [ "States.ALL" ],
        "Next": "Labeling error"
    }
]
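To make the retry timing concrete, the following minimal Python sketch works out the wait before each retry for the values shown above. It assumes the standard Step Functions semantics, where IntervalSeconds is the wait before the first retry and each later wait is multiplied by BackoffRate. The helper name retry_delays is illustrative and not part of the solution.

```python
def retry_delays(interval_seconds, max_attempts, backoff_rate):
    """Return the wait, in seconds, before each retry attempt."""
    # Retry n (counting from 0) waits interval_seconds * backoff_rate ** n.
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


# Values taken from the Retry block above.
delays = retry_delays(interval_seconds=1, max_attempts=4, backoff_rate=1.2)

for attempt, delay in enumerate(delays, start=1):
    print(f"retry {attempt}: wait {delay:.3f}s")
print(f"total wait before Catch applies: {sum(delays):.3f}s")
```

With these settings the four retries wait roughly 1, 1.2, 1.44, and 1.73 seconds, so a persistently failing state spends a little over 5 seconds waiting (not counting the run time of each attempt) before the Catch block takes over.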

If all retries fail, the state machine generates an error and stops the execution. Amazon CloudWatch Events is configured to monitor for errors generated by the state machines. The CloudWatch Events pattern is defined as follows:

{
    "detail-type": [ "Step Functions Execution Status Change" ],
    "source": [ "aws.states" ],
    "detail": {
        "stateMachineArn": [
            "arn:aws:states:<region>:<account>:stateMachine:SO0050-e86000-ingest",
            "arn:aws:states:<region>:<account>:stateMachine:SO0050-e86000-analysis",
            "arn:aws:states:<region>:<account>:stateMachine:SO0050-e86000-gt-labeling"
        ],
        "status": [ "FAILED", "ABORTED", "TIMED_OUT" ]
    }
}
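As an illustration of how such a pattern selects events, here is a simplified Python matcher. It is only a sketch of exact-value matching (a list means "any of these values", a nested object means "match inside") and does not reproduce the full CloudWatch Events matching rules. The Region us-east-1 and account 111122223333 are placeholders, and the function name matches is hypothetical.

```python
def matches(pattern, event):
    """Simplified matcher: every key in the pattern must be present in the
    event; a list means "any of these exact values", a dict means recurse."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Placeholder Region and account ID; only one state machine shown for brevity.
INGEST_ARN = "arn:aws:states:us-east-1:111122223333:stateMachine:SO0050-e86000-ingest"

PATTERN = {
    "detail-type": ["Step Functions Execution Status Change"],
    "source": ["aws.states"],
    "detail": {
        "stateMachineArn": [INGEST_ARN],
        "status": ["FAILED", "ABORTED", "TIMED_OUT"],
    },
}

failed_event = {
    "detail-type": "Step Functions Execution Status Change",
    "source": "aws.states",
    "detail": {"stateMachineArn": INGEST_ARN, "status": "FAILED"},
}
# Same event, but for an execution that completed normally.
succeeded_event = {
    **failed_event,
    "detail": {"stateMachineArn": INGEST_ARN, "status": "SUCCEEDED"},
}

print(matches(PATTERN, failed_event))
print(matches(PATTERN, succeeded_event))
```

Only the failed execution matches, so successful runs do not trigger the error-handling path.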

An AWS Lambda function then uses the AWS Step Functions GetExecutionHistory API to retrieve the last error of the state machine execution and publishes the error message to an Amazon Simple Notification Service (Amazon SNS) topic.
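The flow described above could be sketched in Python roughly as follows. This is not the solution's actual Lambda code. It assumes the boto3 SDK is available in the Lambda runtime and that the topic ARN is supplied through a hypothetical ERROR_TOPIC_ARN environment variable. The names last_error and handler, the subject line, and the limit of 20 history events are likewise illustrative.

```python
import json
import os

# History event types that carry an error/cause pair, mapped to the key
# that holds their details in the GetExecutionHistory response.
FAILURE_DETAILS = {
    "ExecutionFailed": "executionFailedEventDetails",
    "ExecutionAborted": "executionAbortedEventDetails",
    "ExecutionTimedOut": "executionTimedOutEventDetails",
    "TaskFailed": "taskFailedEventDetails",
    "LambdaFunctionFailed": "lambdaFunctionFailedEventDetails",
}


def last_error(history_events):
    """Return (error, cause) from the first failure event found, or None.

    Expects events ordered newest first (reverseOrder=True)."""
    for ev in history_events:
        key = FAILURE_DETAILS.get(ev.get("type"))
        if key:
            details = ev.get(key, {})
            return details.get("error", "Unknown"), details.get("cause", "")
    return None


def handler(event, context):
    # Imported lazily so last_error can be exercised without the AWS SDK.
    import boto3

    execution_arn = event["detail"]["executionArn"]
    history = boto3.client("stepfunctions").get_execution_history(
        executionArn=execution_arn,
        reverseOrder=True,  # newest events first
        maxResults=20,      # illustrative limit
    )
    found = last_error(history["events"])
    error, cause = found if found else ("Unknown", "No failure event found")

    boto3.client("sns").publish(
        TopicArn=os.environ["ERROR_TOPIC_ARN"],  # hypothetical variable name
        Subject="State machine execution error",
        Message=json.dumps(
            {"executionArn": execution_arn, "error": error, "cause": cause}
        ),
    )
```

Requesting the history in reverse order keeps the call small, because the terminal failure event is among the first entries returned.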