Vision understanding prompting best practices

The Amazon Nova model family is equipped with novel vision capabilities that enable the model to comprehend and analyze images and videos, thereby unlocking exciting opportunities for multimodal interaction. The following sections outline guidelines for working with images and videos in Amazon Nova. This includes best practices, code examples, and relevant limitations to consider.

The higher the quality of the images or videos you provide, the greater the chance that the models will accurately understand the information in the media file. Ensure the images or videos are clear and free from excessive blurriness or pixelation to guarantee more accurate results. If the image or video frames contain important text information, verify that the text is legible and not too small. Avoid cropping out key visual context solely to enlarge the text.

Amazon Nova models allow you to include a single video in the payload, provided either in base-64 format or through an Amazon S3 URI. When using the base-64 method, the overall payload size must be less than 25 MB. Alternatively, you can specify an Amazon S3 URI for video understanding; this lets you use the model with longer videos (up to 1 GB in size) without being constrained by the overall payload size limit. Amazon Nova can analyze the input video to answer questions, classify the video, and summarize its content based on the instructions you provide.
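
For example, a video stored in Amazon S3 can be referenced directly from the Amazon Bedrock Converse API. The following is a minimal sketch; the Region, model ID, bucket, and object key are placeholders, not prescribed values:

```python
import boto3

# Placeholder Region; use the Region where you have access to the model.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="us.amazon.nova-lite-v1:0",  # placeholder Nova model ID
    messages=[{
        "role": "user",
        "content": [
            {
                "video": {
                    "format": "mp4",
                    # Reference the video by S3 URI instead of embedding base-64 bytes,
                    # which avoids the 25 MB payload limit (videos up to 1 GB).
                    "source": {"s3Location": {"uri": "s3://amzn-s3-demo-bucket/inputs/clip.mp4"}},
                }
            },
            {"text": "Summarize the main events in this video."},
        ],
    }],
)
print(response["output"]["message"]["content"][0]["text"])
```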

Amazon Nova models also allow you to include multiple images in the payload; the total payload size can't exceed 25 MB. The models can analyze the supplied images to answer questions, classify an image, and summarize images based on the instructions you provide.
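
As an illustration, multiple images can be passed as raw bytes in a single Converse request (the SDK handles the encoding). The file paths and model ID below are placeholders:

```python
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Placeholder image paths; each image is passed as raw bytes in its own image block.
content = []
for path in ["chart_q1.png", "chart_q2.png"]:
    with open(path, "rb") as f:
        content.append({"image": {"format": "png", "source": {"bytes": f.read()}}})
content.append({"text": "Compare these two charts and summarize the key differences."})

response = client.converse(
    modelId="us.amazon.nova-lite-v1:0",  # placeholder Nova model ID
    messages=[{"role": "user", "content": content}],
)
print(response["output"]["message"]["content"][0]["text"])
```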

Image information

Media File Type | File Formats Supported    | Input Method
Image           | PNG, JPG, JPEG, GIF, WebP | Base-64

Video information

Format | MIME Type        | Video Encoding
MKV    | video/x-matroska | H.264
MOV    | video/quicktime  | H.264, H.265, ProRES
MP4    | video/mp4        | DIVX/XVID, H.264, H.265, J2K (JPEG2000), MPEG-2, MPEG-4 Part 2, VP9
WEBM   | video/webm       | VP8, VP9
FLV    | video/x-flv      | FLV1
MPEG   | video/mpeg       | MPEG-1
MPG    | video/mpg        | MPEG-1
WMV    | video/wmv        | MSMPEG4v3 (MP43)
3GPP   | video/3gpp       | H.264

There are no differences in the video input token count, regardless of whether the video is passed as base-64 (as long as it fits within the size constraints) or via an Amazon S3 location.

Note that for the 3GPP file format, the "format" field passed in the API request should be "three_gp".

When using Amazon S3, ensure that your "Content-Type" metadata is set to the correct MIME type for the video.
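
For instance, the MIME type can be set at upload time with the Python SDK; the bucket and key below are placeholders:

```python
import boto3

s3 = boto3.client("s3")
# Set Content-Type to the video's MIME type (see the table above) so the
# service can identify the format correctly. Bucket and key are placeholders.
s3.upload_file(
    Filename="clip.mov",
    Bucket="amzn-s3-demo-bucket",
    Key="inputs/clip.mov",
    ExtraArgs={"ContentType": "video/quicktime"},
)
```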

Long and high-motion videos

The model performs video understanding by sampling video frames at a base rate of 1 frame per second (FPS). This rate balances capturing detail in the video against the number of input tokens consumed, which affects cost, latency, and the maximum video length. While sampling one frame every second is sufficient for general use cases, some use cases involving high-motion videos, such as sports footage, might not perform well.

To handle longer videos, the sampling rate is decreased for videos longer than 16 minutes: a fixed 960 frames are sampled, spaced evenly across the length of the video. This means that the longer a video runs past 16 minutes, the lower the effective FPS and the fewer details that are captured. This enables use cases such as summarization of longer videos, but exacerbates issues with high-motion videos where details are important.

In many cases, you can get 1 FPS sampling on longer videos by using pre-processing steps and multiple calls. The video can be split into smaller segments, and each segment is analyzed using the multimodal capabilities of the model. The responses are then aggregated, and a final text-to-text step generates the final answer. Note that there can be loss of context when segmenting videos this way. This is akin to the tradeoffs in chunking for RAG use cases, and many of the same mitigation techniques transfer well, such as a sliding window.

Note that segmenting the video might also decrease latency, because the segments can be analyzed in parallel, but it can generate significantly more input tokens, which affects cost.
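
A minimal sketch of this split-analyze-aggregate pattern follows. It assumes ffmpeg is installed locally and that each segment fits within the 25 MB base-64 payload limit (otherwise, upload the segments to Amazon S3 and reference them by URI); the model ID, file names, and prompts are placeholders:

```python
import glob
import subprocess
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")
MODEL_ID = "us.amazon.nova-lite-v1:0"  # placeholder Nova model ID

def split_video(path, segment_seconds=120):
    """Split the source video into fixed-length segments with ffmpeg (assumed installed)."""
    subprocess.run(
        ["ffmpeg", "-i", path, "-c", "copy", "-f", "segment",
         "-segment_time", str(segment_seconds), "-reset_timestamps", "1",
         "segment_%03d.mp4"],
        check=True,
    )
    return sorted(glob.glob("segment_*.mp4"))

def describe_segment(segment_path):
    """Describe one segment; each segment must stay under the 25 MB payload limit."""
    with open(segment_path, "rb") as f:
        video_bytes = f.read()
    response = client.converse(
        modelId=MODEL_ID,
        messages=[{
            "role": "user",
            "content": [
                {"video": {"format": "mp4", "source": {"bytes": video_bytes}}},
                {"text": "Describe the key events in this video segment."},
            ],
        }],
    )
    return response["output"]["message"]["content"][0]["text"]

def summarize(descriptions):
    """Aggregate the per-segment descriptions with a final text-to-text call."""
    joined = "\n\n".join(descriptions)
    response = client.converse(
        modelId=MODEL_ID,
        messages=[{
            "role": "user",
            "content": [{"text": "These are descriptions of consecutive segments of one video:\n\n"
                                 f"{joined}\n\nWrite a single coherent summary of the full video."}],
        }],
    )
    return response["output"]["message"]["content"][0]["text"]

# Segments are processed sequentially here; analyzing them in parallel can reduce latency.
segments = split_video("game_footage.mp4")
print(summarize([describe_segment(s) for s in segments]))
```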

Latency

Videos can be large in size. Although we provide means to handle files up to 1 GB by uploading them to Amazon S3, which keeps invocation payloads very lean, the models still need to process a potentially large number of tokens. If you are using synchronous Amazon Bedrock calls such as Invoke or Converse, make sure your SDK is configured with an appropriate timeout.
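
For example, with the Python SDK the read timeout can be raised when creating the Bedrock Runtime client; the values shown are illustrative rather than prescriptive:

```python
import boto3
from botocore.config import Config

# Raise the read timeout so long-running video-understanding calls are not cut off client-side.
config = Config(connect_timeout=60, read_timeout=3600, retries={"max_attempts": 1})
client = boto3.client("bedrock-runtime", config=config)
```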

Regardless, an Amazon S3 URI is the preferred input method when latency is a factor. Segmenting videos as described in the previous section is another strategy. Downscaling high-resolution, high-frame-rate videos before upload can also save bandwidth and processing on the service side, lowering latency.
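
As one option, the video can be downscaled and its frame rate capped before upload; this sketch assumes ffmpeg is available and uses placeholder file names and settings:

```python
import subprocess

# Scale to 720p and cap the frame rate before upload. Because the model samples at
# roughly 1 FPS anyway, a lower source frame rate loses little while shrinking the file.
subprocess.run(
    ["ffmpeg", "-i", "input_4k60.mp4",
     "-vf", "scale=-2:720,fps=15",
     "-c:v", "libx264", "-preset", "fast",
     "output_720p.mp4"],
    check=True,
)
```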
