Troubleshooting and Debugging Workflows

This section describes how to troubleshoot your workflow executions. It covers strategies for examining and replaying workflows, and lists some common causes of errors in workflow executions.

Examining Workflow Executions with the AWS Management Console

The first step in troubleshooting a workflow execution is to use the AWS Management Console to look at the workflow history. The workflow history is a complete and authoritative record of all the events that changed the execution state of the workflow execution. This history is maintained by Amazon SWF and is invaluable for diagnosing problems. The Amazon SWF console enables you to search for workflow executions and drill down into individual history events.

To learn more about using the AWS Management Console with Amazon SWF, see Managing Your Workflow Executions in the Amazon SWF Developer Guide.

Using the WorkflowReplayer Class

The AWS Flow Framework provides the Replayer::WorkflowReplayer class, which you can use to replay a workflow execution locally and debug it. With this class, you can debug both closed and running workflow executions. WorkflowReplayer relies on the history stored in Amazon SWF to perform the replay, so you point it to a workflow execution in your Amazon SWF account.

When you replay a workflow execution using WorkflowReplayer, it does not impact the workflow execution running in your account: the replay is done completely on the client. You can debug the workflow, create breakpoints, and step into code using your debugging tools as usual.

For example, the following code snippet can be used to replay a workflow execution:

require 'replayer'

# Create an instance of the replayer with the required options
replayer = AWS::Flow::Replayer::WorkflowReplayer.new(
  domain: '<domain_name>',
  execution: {
    workflow_id: "<workflow_id>",
    run_id: "<run_id>"
  },
  workflow_class: YourWorkflowClass
)

# Call the replay method with the event ID to replay up to (replay_upto)
decision = replayer.replay(20)

puts decision.inspect

Common Causes of Errors in Workflow Executions

Unknown Resource Fault

Amazon SWF returns an unknown resource fault when you try to perform an operation on a resource that is not available. The common causes for this fault are:

  • You configure a worker with a domain that does not exist. To fix this, first register the domain using the Amazon SWF console or with the Amazon SWF service API.
  • You try to create workflow executions or activity tasks of types that have not been registered. This can happen if you try to create the workflow execution before the workers have been run. Because workers register their types when they are run for the first time, you must run them at least once before attempting to start executions (or manually register the types using the AWS Management Console or the service API). Once types have been registered, you can create executions even if no worker is running.
  • A worker attempts to complete a task that has already timed out. For example, if a worker takes too long to process a task and exceeds a timeout, it will get an UnknownResource fault when it attempts to complete or fail the task. The AWS Flow Framework workers will continue to poll Amazon SWF and process additional tasks. However, you should consider adjusting the timeout. Adjusting the timeout requires that you register a new version of the activity type.
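
To spot tasks that are at risk of timing out before they complete, a worker can time its own activity body. The following is a minimal sketch in plain Ruby and is not part of the AWS Flow Framework: the helper name with_timeout_warning, the 60-second timeout, and the 80% threshold are all assumptions chosen for illustration. Because it reads the clock, it belongs in activity code, not in workflow code.

```ruby
# Hypothetical helper, not an AWS Flow Framework API: runs a block, measures
# how long it took, and reports whether it came close to a configured timeout.
def with_timeout_warning(timeout_seconds, threshold: 0.8)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  at_risk = elapsed >= timeout_seconds * threshold
  warn "Activity used #{elapsed.round(2)}s of a #{timeout_seconds}s timeout" if at_risk
  { result: result, elapsed: elapsed, at_risk: at_risk }
end

# Usage: the block stands in for the body of an activity method.
outcome = with_timeout_warning(60) { 21 * 2 }
puts outcome[:at_risk]
```

If the warning appears regularly, that is a hint that the timeout registered for the activity type may be too short for the work being done.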

Nondeterministic Workflows

The implementation of your workflow must be deterministic. Some common mistakes that can lead to nondeterminism are:

  • Use of the system clock
  • Use of random numbers
  • Generation of GUIDs

Because these constructs may return different values at different times, the control flow of your workflow may take a different path each time it is executed. If the framework detects nondeterminism while executing the workflow, it raises an exception.
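
The difference can be illustrated with a toy model of replay. This is a sketch in plain Ruby, not the framework's implementation: the method names, the history hash, and the :coin_flip key are hypothetical. It assumes that a value which varies between runs is produced once outside the decision logic (for example, as the result of an activity), recorded, and only read back afterwards.

```ruby
# Toy model of replay, not the framework's implementation: the same decision
# logic is run repeatedly and must take the same branch each time.

# Nondeterministic: the branch depends on a fresh random number on every run.
def flaky_decide
  rand < 0.5 ? :ship_express : :ship_standard
end

# Deterministic: the random value was produced once, stored in the recorded
# history, and is only read back here.
def stable_decide(history)
  history.fetch(:coin_flip) < 0.5 ? :ship_express : :ship_standard
end

history = { coin_flip: rand }   # recorded once, outside the decision logic

replays = Array.new(5) { stable_decide(history) }
puts replays.uniq.size          # always 1: every replay takes the same branch
```

Calling flaky_decide five times instead may yield a mix of both branches, which is the kind of divergence that leads to a nondeterminism error.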

Problems Due to Versioning

When you implement a new version of your workflow or activity (for instance, when you add a new feature), you should change the version string of the type by providing a new version when declaring your workflow or activity type.

When new versions of a workflow are deployed, you might have executions of the existing version that are still running. Therefore, you need to make sure that workers get tasks that match the correct version of your workflow and activities. One way to accomplish this is by using a different set of task lists for each version. For example, you can append the version string to the name of a task list. This ensures that tasks belonging to different versions of the workflow and activities are assigned to the appropriate workers.
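
The task list naming scheme described above can be sketched as a small helper. The helper name versioned_task_list, the "-" separator, and the sample names are assumptions for illustration only; any scheme that yields a distinct task list per version serves the same purpose.

```ruby
# Hypothetical helper: derives a task list name from a base name and the
# version string of the workflow or activity type.
def versioned_task_list(base_name, version)
  "#{base_name}-#{version}"
end

puts versioned_task_list('order_tasks', '1.0')   # prints "order_tasks-1.0"
puts versioned_task_list('order_tasks', '2.0')   # prints "order_tasks-2.0"
```

Workers for each version would then poll only the task list that carries their own version string.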

Lost Tasks

Sometimes you may shut down workers and start new ones in quick succession only to discover that tasks get delivered to the old workers. This can happen due to race conditions in the system, which is distributed across several processes. The problem can also appear when you are running unit tests in a tight loop.

To confirm that the problem is, in fact, due to old workers getting tasks, look at the workflow history to determine which process received the task that you expected the new worker to receive. For example, the DecisionTaskStarted event in the workflow history contains the identity of the workflow worker that received the task. The identity used by the AWS Flow Framework has the form {processId}@{host name}. Here is an example of the details for a DecisionTaskStarted event in the Amazon SWF console for a sample execution:

Event Timestamp: Mon Feb 20 11:52:40 GMT-800 2012
Identity: 2276@ip-0A6C1DF5
Scheduled Event Id: 33
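
An Identity value like the one shown above can be split into its process ID and host name to compare against a running worker. This is a minimal sketch that assumes the {processId}@{host name} form; parse_identity is a hypothetical helper, not a framework method.

```ruby
# Splits an identity of the form {processId}@{host name} into its parts.
# Raises ArgumentError if the process ID part is not an integer.
def parse_identity(identity)
  pid, host = identity.split('@', 2)
  { pid: Integer(pid), host: host }
end

worker = parse_identity('2276@ip-0A6C1DF5')
puts worker[:pid]    # prints 2276
puts worker[:host]   # prints ip-0A6C1DF5

# Compare against the current process to see whether this worker got the task.
puts worker[:pid] == Process.pid
```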

To avoid having old workers pick up tasks meant for new ones, use a different task list for each test. Also, consider adding a delay between shutting down old workers and starting new ones.
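
One way to get a different task list for each test is to generate the name during test setup. The sketch below is an assumption about how this could be done, with unique_task_list as a hypothetical helper. The random suffix is acceptable here because it is generated in test setup, not inside workflow code.

```ruby
require 'securerandom'

# Hypothetical helper for test setup: returns a task list name with a random
# suffix so that leftover workers from earlier tests are not polling it.
def unique_task_list(prefix)
  "#{prefix}-#{SecureRandom.hex(4)}"
end

first  = unique_task_list('test_tasks')
second = unique_task_list('test_tasks')
puts first
puts second
```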