Serverless Event Submission with Status Updates - Serverless Applications Lens

This whitepaper is in the process of being updated.

Serverless Event Submission with Status Updates

Suppose you run an ecommerce site where a customer submits an order that kicks off an inventory deduction and shipment process, or an enterprise application that submits a large query that may take minutes to return.

Completing this kind of transaction can require multiple service calls that together take a couple of minutes. Within those calls, you want to safeguard against potential failures by adding retries with exponential backoff. However, the added delay can cause a suboptimal user experience for whoever is waiting for the transaction to complete.
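The retry behavior described above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function name `call_with_backoff`, the default delays, and the use of full jitter are all assumptions made for the example.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying failures with exponential backoff and jitter.

    All names and defaults here are hypothetical, chosen for illustration.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # The upper bound doubles each attempt (0.5s, 1s, 2s, ...) up to max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: wait a random time in [0, cap] to spread out retries.
            sleep(random.uniform(0, cap))
    raise RuntimeError("unreachable")
```

Note how the waits add up: with these defaults, four failed attempts can add several seconds before the fifth attempt runs, which is the delay a synchronous caller would otherwise sit through.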

For long and complex workflows like this, you can integrate API Gateway or AWS AppSync with Step Functions so that each new authorized request starts the business workflow. Step Functions responds immediately with an execution ID to the caller (mobile app, SDK, web service, and so on).

For legacy systems, you can use the execution ID to poll Step Functions for the business workflow status via another REST API. With WebSockets, whether you’re using REST or GraphQL, you can receive the business workflow status in real time by publishing an update at every step of the workflow.
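As a rough sketch of the submit-then-poll flow, the two functions below wrap the Step Functions `StartExecution` and `DescribeExecution` operations. They assume `sfn` is a boto3 Step Functions client (for example, `boto3.client("stepfunctions")`) passed in by the caller. The function names and the response shape (`executionId`, `done`) are hypothetical.

```python
import json
from typing import Any

# Statuses after which the execution will not change again (assumed set for this sketch).
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "TIMED_OUT", "ABORTED"}


def submit_order(sfn: Any, state_machine_arn: str, order: dict) -> dict:
    """Start the business workflow and return immediately with an execution ID."""
    response = sfn.start_execution(
        stateMachineArn=state_machine_arn,
        input=json.dumps(order),  # Step Functions expects the input as a JSON string
    )
    # The execution ARN serves as the execution ID handed back to the caller.
    return {"executionId": response["executionArn"]}


def get_status(sfn: Any, execution_id: str) -> dict:
    """Look up the workflow status for a previously returned execution ID."""
    response = sfn.describe_execution(executionArn=execution_id)
    status = response["status"]
    return {"status": status, "done": status in TERMINAL_STATUSES}
```

A legacy client would call `submit_order` once, then call `get_status` through a second REST endpoint until `done` is true.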

Figure 24: Asynchronous workflow with Step Functions state machines

Another common scenario is integrating API Gateway directly with SQS or Kinesis as a scaling layer. A Lambda function would only be necessary if additional business information or a custom request ID format is expected from the caller.

Figure 25: Asynchronous workflow using a queue as a scaling layer

In this second example, SQS serves multiple purposes:

  1. Storing the request record durably is important because the client can confidently proceed through the workflow knowing that the request will eventually be processed.

  2. Upon a burst of events that may temporarily overwhelm the backend, the request can be polled for processing when resources become available.
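For the case where a Lambda function is used to attach a custom request ID, the handler might look like the sketch below. It assumes `sqs` is a boto3 SQS client supplied by the caller. The `req-` ID format, the message layout, and the 202 response body are hypothetical choices for illustration.

```python
import json
import uuid
from typing import Any


def accept_request(sqs: Any, queue_url: str, payload: dict) -> dict:
    """Durably enqueue the request, then acknowledge it without waiting for processing."""
    request_id = f"req-{uuid.uuid4()}"  # hypothetical custom request ID format
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps({"requestId": request_id, "payload": payload}),
    )
    # 202 Accepted: the request is stored and will be processed when capacity allows.
    return {"statusCode": 202, "body": json.dumps({"requestId": request_id})}
```

Because the function returns only after `send_message` succeeds, the client receives its request ID only once the record is stored in the queue.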

By contrast, in the first example, which has no queue, Step Functions stores the data durably, so no queue or state-tracking data source is needed. In both examples, the best practice is to pursue an asynchronous workflow after the client submits the request and to avoid blocking on the response if completion can take several minutes.

With WebSockets, AWS AppSync provides this capability out of the box via GraphQL subscriptions. With subscriptions, an authorized client could listen for data mutations they’re interested in. This is ideal for data that is streaming or may yield more than a single response.

With AWS AppSync, clients can subscribe to status updates as they are written to DynamoDB and automatically receive them as they occur. This pattern works well when data drives the user interface.

Figure 26: Asynchronous updates via WebSockets with AWS AppSync and GraphQL

Webhooks can be implemented with SNS topic HTTP subscriptions. Consumers can host an HTTP endpoint that SNS calls back with a POST request when an event occurs (for example, a data file arriving in Amazon S3). This pattern is ideal when the clients are configurable, such as another microservice that can host an endpoint. Alternatively, Step Functions supports callbacks, where a state machine blocks until it receives a response for a given task.
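On the consumer side, the endpoint receives JSON bodies from SNS whose `Type` field distinguishes a subscription handshake from an actual notification. The sketch below shows one way to route them. The function name and the injected callbacks are hypothetical, and it deliberately omits verification of the SNS message signature, which a real endpoint should perform before trusting the payload.

```python
import json
from typing import Callable


def handle_sns_post(
    body: str,
    confirm_subscription: Callable[[str], object],
    on_notification: Callable[[str], None],
) -> int:
    """Route one SNS POST body and return the HTTP status code to respond with."""
    message = json.loads(body)
    message_type = message.get("Type")
    if message_type == "SubscriptionConfirmation":
        # SNS sends this once; the endpoint confirms by visiting SubscribeURL.
        confirm_subscription(message["SubscribeURL"])
        return 200
    if message_type == "Notification":
        # The published event payload is carried as a string in "Message".
        on_notification(message["Message"])
        return 200
    return 400  # unrecognized message type
```

In practice, `confirm_subscription` could be a function that issues an HTTP GET to the given URL, and `on_notification` would hand the event to the consumer's own business logic.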

Figure 27: Asynchronous notification via Webhook with SNS

Lastly, polling can be costly in both money and resources because multiple clients constantly poll an API for status. If polling is the only option due to environment constraints, it’s a best practice to establish SLAs with the clients to limit the number of “empty polls”.

Figure 28: Client polling for updates on transaction recently made

For example, if a large data warehouse query takes an average of two minutes to respond, the client should first poll the API after two minutes, then retry with exponential backoff if the data is not yet available. There are two common patterns to ensure that clients aren’t polling more frequently than expected: throttling, and returning a timestamp for when it is safe to poll again.
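A client following this guidance could be sketched as below: wait for the expected completion time, then poll with exponentially growing gaps. The function name, the 5-second base delay, the 60-second cap, and the poll limit are assumptions for illustration. Only the two-minute initial wait comes from the example above.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def poll_until_ready(
    fetch: Callable[[], Optional[T]],
    expected_seconds: float = 120.0,
    base_delay: float = 5.0,
    max_delay: float = 60.0,
    max_polls: int = 6,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Poll `fetch` until it returns a result; `fetch` returns None while pending."""
    # Skip polls that are almost certain to be empty: wait for the average duration first.
    sleep(expected_seconds)
    for attempt in range(max_polls):
        result = fetch()
        if result is not None:
            return result
        if attempt < max_polls - 1:
            # Back off exponentially between polls: 5s, 10s, 20s, ... capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))
    return None  # caller decides what to do after giving up
```

With these defaults, a query that finishes slightly late costs the API two or three polls rather than dozens.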

For timestamps, the system being polled can return an extra field with a timestamp or time period indicating when it is safe for the consumer to poll again. This approach assumes an optimistic scenario in which the consumer respects the hint. To guard against abuse, you can also employ throttling for a more complete implementation.
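Combining the two patterns on the server side might look like the following sketch. The class name, the `nextPollAfter` field, the 30-second interval, and the use of HTTP 429 for early polls are all assumptions made for the example, not part of any AWS API.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class StatusEndpoint:
    """Hypothetical status API that tells clients when to poll again and throttles early polls."""

    min_interval: float = 30.0  # seconds a client is asked to wait between polls
    _next_allowed: Dict[str, float] = field(default_factory=dict)

    def poll(self, client_id: str, now: float, job_done: bool) -> dict:
        allowed_at = self._next_allowed.get(client_id, 0.0)
        if now < allowed_at:
            # The client ignored the hint: throttle and repeat when it may return.
            return {"statusCode": 429, "nextPollAfter": allowed_at}
        if job_done:
            return {"statusCode": 200, "status": "COMPLETE"}
        # Not ready yet: include the time after which polling is safe again.
        allowed_at = now + self.min_interval
        self._next_allowed[client_id] = allowed_at
        return {"statusCode": 200, "status": "PENDING", "nextPollAfter": allowed_at}
```

A well-behaved client reads `nextPollAfter` and sleeps until then. A misbehaving one receives throttled responses without triggering any backend work.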