The Amazon Nova model family includes vision capabilities that enable the models to comprehend and analyze images and videos, unlocking opportunities for multimodal interaction. The following sections outline guidelines for working with images and videos in Amazon Nova, including best practices, code examples, and relevant limitations to consider.
The higher the quality of the images or videos you provide, the greater the chance that the models will accurately understand the information in the media file. Ensure that the images or videos are clear and free from excessive blurriness or pixelation. If the image or video frames contain important text, verify that the text is legible and not too small, but avoid cropping out key visual context solely to enlarge the text.
Amazon Nova models allow you to include a single video in the payload, provided either in base-64 format or through an Amazon S3 URI. With the base-64 method, the overall payload size must be less than 25MB. Alternatively, specifying an Amazon S3 URI lets you use the model with longer videos (up to 1GB in size) without being constrained by the overall payload size limitation. Amazon Nova can analyze the input video to answer questions, classify the video, and summarize its contents based on the provided instructions.
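For example, here is a minimal sketch that asks a question about a video stored in Amazon S3 using the Amazon Bedrock Converse API; the bucket, object key, and model ID are placeholder assumptions, so substitute your own:

```python
import boto3

# Create a Bedrock Runtime client (region is an assumption; use yours).
client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="us.amazon.nova-lite-v1:0",  # placeholder model ID
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "video": {
                        "format": "mp4",  # for 3GP files use "three_gp" (see note below)
                        "source": {
                            # Referencing the video by S3 URI keeps the
                            # request payload small and supports files up to 1GB.
                            "s3Location": {"uri": "s3://amzn-s3-demo-bucket/video.mp4"}
                        },
                    }
                },
                {"text": "Summarize the key events in this video."},
            ],
        }
    ],
)

print(response["output"]["message"]["content"][0]["text"])
```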
Amazon Nova models allow you to include multiple images in the payload, as long as the total payload size doesn't exceed 25MB. The models can analyze the provided images to answer questions, classify images, and summarize them based on the provided instructions.
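For example, a minimal sketch that compares two local images in a single request; the file names and model ID are illustrative placeholders:

```python
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

with open("photo1.png", "rb") as f1, open("photo2.png", "rb") as f2:
    image1, image2 = f1.read(), f2.read()

response = client.converse(
    modelId="us.amazon.nova-lite-v1:0",  # placeholder model ID
    messages=[
        {
            "role": "user",
            "content": [
                # boto3 base64-encodes the raw bytes on your behalf; the
                # combined request payload must stay under the 25MB limit.
                {"image": {"format": "png", "source": {"bytes": image1}}},
                {"image": {"format": "png", "source": {"bytes": image2}}},
                {"text": "What are the differences between these two images?"},
            ],
        }
    ],
)

print(response["output"]["message"]["content"][0]["text"])
```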
| Media File Type | File Formats Supported | Input Method |
| --- | --- | --- |
| Image | PNG, JPG, JPEG, GIF, WebP | Base-64 |
| Video | MKV, MOV, MP4, WEBM, FLV, MPEG, MPG, WMV, 3GPP | Base-64, Amazon S3 URI |
| Format | MIME Type | Video Encoding |
| --- | --- | --- |
| MKV | video/x-matroska | H.264 |
| MOV | video/quicktime | H.264, H.265, ProRES |
| MP4 | video/mp4 | DIVX/XVID, H.264, H.265, J2K (JPEG2000), MPEG-2, MPEG-4 Part 2, VP9 |
| WEBM | video/webm | VP8, VP9 |
| FLV | video/x-flv | FLV1 |
| MPEG | video/mpeg | MPEG-1 |
| MPG | video/mpg | MPEG-1 |
| WMV | video/wmv | MSMPEG4v3 (MP43) |
| 3GPP | video/3gpp | H.264 |
The video input token count is the same whether the video is passed as base-64 (as long as it fits within the size constraints) or via an Amazon S3 location.
Note that for the 3GP file format, the "format" field passed in the API request should be set to "three_gp".
When using Amazon S3, ensure that your "Content-Type" metadata is set to the correct MIME type for the video.
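For example, with boto3 you can set the Content-Type when uploading the file; the bucket and key below are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Upload the video with Content-Type metadata matching its MIME type
# (see the format table above).
s3.upload_file(
    Filename="video.mp4",
    Bucket="amzn-s3-demo-bucket",
    Key="videos/video.mp4",
    ExtraArgs={"ContentType": "video/mp4"},
)
```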
Long and high-motion videos
The model performs video understanding by sampling video frames at a base rate of 1 frame per second (FPS). This rate balances capturing details in the video against the number of input tokens consumed, which affects cost, latency, and maximum video length. While sampling one frame every second is sufficient for general use cases, some use cases on high-motion videos, such as sports videos, might not perform well.
To handle longer videos, the sampling rate is decreased for videos longer than 16 minutes: a fixed 960 frames is sampled, spaced evenly across the length of the video. This means that the longer a video runs past 16 minutes, the lower the effective FPS and the fewer details are captured. This still supports use cases such as summarizing longer videos, but it exacerbates the issues with high-motion videos where details are important.
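As a quick illustration of the effective sampling rate this implies, assuming a hard switch to the fixed 960-frame budget at the 16-minute mark:

```python
def effective_fps(duration_seconds: float) -> float:
    """Approximate sampling rate: 1 FPS up to 16 minutes,
    then a fixed 960 frames spread across the whole video."""
    if duration_seconds <= 16 * 60:
        return 1.0
    return 960 / duration_seconds

print(effective_fps(10 * 60))  # 1.0 FPS for a 10-minute video
print(effective_fps(40 * 60))  # 0.4 FPS: one frame every 2.5 seconds
```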
In many cases, you can recover 1 FPS sampling on longer videos by using pre-processing steps and multiple model calls. Split the video into smaller segments, analyze each segment using the multimodal capabilities of the model, then aggregate the responses with a final text-to-text call that generates the answer. Note that there can be loss of context when segmenting videos this way. This is akin to the tradeoffs in chunking for RAG use cases, and many of the same mitigation techniques, such as sliding windows, transfer well.
Segmenting the video might also decrease latency, because segments are analyzed in parallel, but it can generate significantly more input tokens, which increases cost.
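The following sketch outlines this segment-and-aggregate pattern, assuming ffmpeg is available for splitting; the segment length, file names, prompts, and model ID are illustrative, and the sliding-window overlap and parallel execution mentioned above are omitted for brevity:

```python
import glob
import subprocess
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")
MODEL_ID = "us.amazon.nova-lite-v1:0"  # placeholder model ID


def split_video(path: str, segment_seconds: int = 900) -> None:
    """Split the video into ~15-minute segments without re-encoding."""
    subprocess.run(
        ["ffmpeg", "-i", path, "-c", "copy", "-f", "segment",
         "-segment_time", str(segment_seconds), "-reset_timestamps", "1",
         "segment_%03d.mp4"],
        check=True,
    )


def summarize_segment(path: str) -> str:
    """One multimodal call per segment. Each base-64 segment must fit the
    25MB payload limit; use S3 URIs instead for larger segments."""
    with open(path, "rb") as f:
        video_bytes = f.read()
    response = client.converse(
        modelId=MODEL_ID,
        messages=[{
            "role": "user",
            "content": [
                {"video": {"format": "mp4", "source": {"bytes": video_bytes}}},
                {"text": "Summarize this video segment."},
            ],
        }],
    )
    return response["output"]["message"]["content"][0]["text"]


def aggregate(summaries: list[str]) -> str:
    """Final text-to-text call that merges the per-segment summaries."""
    joined = "\n\n".join(
        f"Segment {i + 1}: {s}" for i, s in enumerate(summaries)
    )
    response = client.converse(
        modelId=MODEL_ID,
        messages=[{
            "role": "user",
            "content": [{"text": "Combine these segment summaries into one "
                                 "summary of the full video:\n\n" + joined}],
        }],
    )
    return response["output"]["message"]["content"][0]["text"]


if __name__ == "__main__":
    split_video("long_video.mp4")  # hypothetical input file
    summaries = [summarize_segment(p) for p in sorted(glob.glob("segment_*.mp4"))]
    print(aggregate(summaries))
```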
Latency
Videos can be large in size. Although we provide means to handle up to 1GB files by uploading them to Amazon S3, making invocation payloads very lean, the model still needs to process a potentially large number of tokens. If you are using synchronous Amazon Bedrock calls such as InvokeModel or Converse, make sure your SDK is configured with an appropriate timeout.
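For example, with boto3 you can raise the read timeout via a botocore Config; the values shown are illustrative:

```python
import boto3
from botocore.config import Config

# Raise the read timeout so long-running synchronous video requests are
# not cut off by botocore's default of 60 seconds.
config = Config(
    connect_timeout=60,
    read_timeout=3600,
    retries={"max_attempts": 1},
)
client = boto3.client("bedrock-runtime", region_name="us-east-1", config=config)
```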
Regardless, an Amazon S3 URI is the preferred input method when latency is a factor. Segmenting videos as described in the previous section is another strategy. Down-scaling high-resolution, high-frame-rate videos as a pre-processing step can also save bandwidth and processing on the service side, lowering latency.