Configure streaming response behavior - Amazon Bedrock

Configure streaming response behavior

The InvokeModelWithResponseStream API returns data in a streaming format. This allows you to access responses in chunks without waiting for the entire result. When using guardrails with a streaming response, there are two modes of operation: synchronous and asynchronous.

Synchronous mode

In the default synchronous mode, guardrails buffer one or more response chunks and apply the configured policies to them before the response is sent back to the user. Synchronous processing adds latency to the response chunks, because each chunk is held until the guardrail scan completes. In return, it provides better accuracy, because every response chunk is scanned by guardrails before being sent to the user.

Asynchronous mode

In asynchronous mode, guardrails send response chunks to the user as soon as they become available and apply the configured policies asynchronously in the background. The advantage is that response chunks are delivered immediately, with no added latency. The drawback is that response chunks may contain inappropriate content until the guardrail scan completes. As soon as inappropriate content is identified, guardrails block subsequent chunks.
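The difference between the two modes can be illustrated with a toy simulation. This is only a conceptual sketch, not how Amazon Bedrock implements guardrails: the `is_violation` check, the `window` size, the `scan_lag` delay, and the `BLOCKED` placeholder are all hypothetical names chosen for illustration.

```python
from typing import Iterable, Iterator, List

BLOCKED = "[blocked by guardrail]"  # hypothetical placeholder message


def is_violation(text: str) -> bool:
    """Hypothetical stand-in for a guardrail policy check."""
    return "forbidden" in text.lower()


def synchronous_stream(chunks: Iterable[str], window: int = 2) -> Iterator[str]:
    """Buffer up to `window` chunks, scan them, and only then release them."""
    buffer: List[str] = []
    for chunk in chunks:
        buffer.append(chunk)
        if len(buffer) == window:
            if is_violation("".join(buffer)):
                yield BLOCKED
                return
            yield from buffer
            buffer = []
    # Scan and release whatever is left when the stream ends.
    if buffer:
        if is_violation("".join(buffer)):
            yield BLOCKED
            return
        yield from buffer


def asynchronous_stream(chunks: Iterable[str], scan_lag: int = 1) -> Iterator[str]:
    """Release each chunk at once; scan results arrive `scan_lag` chunks late."""
    seen: List[str] = []
    for index, chunk in enumerate(chunks):
        # The background scan has only covered chunks up to this index so far.
        scanned_upto = index - scan_lag
        if scanned_upto >= 0 and is_violation("".join(seen[: scanned_upto + 1])):
            yield BLOCKED
            return
        seen.append(chunk)
        yield chunk
```

In this sketch, a stream such as "Hello ", "this is ", "forbidden ", "content " is cut off before the offending chunk in the synchronous case, while the asynchronous case delivers "forbidden " to the caller before the delayed scan blocks the rest of the stream.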

Warning

Masking of sensitive information in model responses may be severely impacted in asynchronous mode, because the original response may be returned to the user before the guardrail detects and masks any sensitive content. Asynchronous mode is therefore not recommended for such use cases.

Enabling asynchronous mode

To enable asynchronous mode, include the streamProcessingMode parameter in the amazon-bedrock-guardrailConfig object of your InvokeModelWithResponseStream request:

{
    "amazon-bedrock-guardrailConfig": {
        "streamProcessingMode": "ASYNCHRONOUS"
    }
}
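As a minimal sketch of how this setting might be added from Python, the helper below merges the configuration into an existing request body and streams the result. It assumes a boto3-style `bedrock-runtime` client is passed in. The function names are hypothetical, and the rest of the request body is model-specific and not covered here.

```python
import json
from typing import Any, Dict, Iterator


def enable_async_guardrail_streaming(body: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `body` with streamProcessingMode set to ASYNCHRONOUS."""
    updated = dict(body)
    # Preserve any guardrail settings already present in the body.
    config = dict(updated.get("amazon-bedrock-guardrailConfig", {}))
    config["streamProcessingMode"] = "ASYNCHRONOUS"
    updated["amazon-bedrock-guardrailConfig"] = config
    return updated


def stream_chunks(
    client: Any,
    model_id: str,
    body: Dict[str, Any],
    guardrail_id: str,
    guardrail_version: str,
) -> Iterator[bytes]:
    """Call InvokeModelWithResponseStream and yield the raw bytes of each chunk."""
    response = client.invoke_model_with_response_stream(
        modelId=model_id,
        guardrailIdentifier=guardrail_id,
        guardrailVersion=guardrail_version,
        contentType="application/json",
        accept="application/json",
        body=json.dumps(enable_async_guardrail_streaming(body)),
    )
    for event in response["body"]:
        chunk = event.get("chunk")
        if chunk:
            yield chunk["bytes"]
```

With boto3 installed and credentials configured, the client would typically be created with `boto3.client("bedrock-runtime")`. Each yielded chunk is a JSON document whose structure depends on the model being invoked.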

By understanding the trade-offs between the synchronous and asynchronous modes, you can choose the appropriate mode based on your application's requirements for latency and content moderation accuracy.