Video understanding - Amazon Nova

Video understanding

The Amazon Nova models allow you to include a single video in the payload, provided either as a base64-encoded string or through an Amazon S3 URI. With the base64 method, the overall payload size must remain within 25 MB. Specifying an Amazon S3 URI instead lets you use the model with longer videos (up to 1 GB in size) without being constrained by the overall payload size limit. Amazon Nova models can analyze the supplied video to answer questions, classify the video, and summarize its content based on the instructions you provide.

| Media File Type | File Formats supported | Input Method |
| --- | --- | --- |
| Video | MP4, MOV, MKV, WebM, FLV, MPEG, MPG, WMV, 3GP | Base64: recommended for payload sizes less than 25 MB. Amazon S3 URI: recommended for payloads greater than 25 MB, up to 2 GB; individual files must be 1 GB or smaller. |

There are no differences in the video input token count, regardless of whether the video is passed as base64 (as long as it fits within the size constraints) or via an Amazon S3 location.

Note that for the 3GP file format, the "format" field passed in the API request should be "three_gp".
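For example, the video content block for a 3GP clip would look like the following (a minimal sketch; the bucket and key are placeholders, and the surrounding message structure follows the examples later on this page):

```python
# Hypothetical content block for a 3GP video passed as an S3 URI.
# Note the format value is "three_gp", not "3gp".
video_block = {
    "video": {
        "format": "three_gp",  # required spelling for 3GP files
        "source": {
            "s3Location": {
                "uri": "s3://amzn-s3-demo-bucket/my_clip.3gp"  # placeholder location
            }
        },
    }
}
```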

When using Amazon S3, ensure that you set the "Content-Type" metadata to the correct MIME type for the video.
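One way to set that metadata is at upload time. The sketch below maps a file extension to a MIME type and passes it as `ContentType` to boto3's `put_object`; the extension-to-MIME mapping is illustrative, not exhaustive, and the bucket and key names are placeholders:

```python
# Common video MIME types (illustrative mapping, not an official list).
VIDEO_MIME_TYPES = {
    ".mp4": "video/mp4",
    ".mov": "video/quicktime",
    ".mkv": "video/x-matroska",
    ".webm": "video/webm",
    ".3gp": "video/3gpp",
}

def video_content_type(filename: str) -> str:
    """Return the MIME type for a video file based on its extension."""
    for ext, mime in VIDEO_MIME_TYPES.items():
        if filename.lower().endswith(ext):
            return mime
    raise ValueError(f"Unrecognized video extension: {filename}")

def upload_video(bucket: str, key: str, path: str) -> None:
    """Upload a video to S3 with its Content-Type metadata set."""
    import boto3  # imported here so the MIME helper has no AWS dependency

    s3 = boto3.client("s3")
    with open(path, "rb") as f:
        # ContentType is stored as the object's Content-Type metadata.
        s3.put_object(Bucket=bucket, Key=key, Body=f,
                      ContentType=video_content_type(key))
```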

Video size information

Amazon Nova video understanding capabilities support Multi-Aspect Ratio. All videos are resized, with distortion (up or down, based on the input), to square 672x672 dimensions before being fed to the model. The model uses a dynamic sampling strategy based on the length of the video. For Amazon Nova Lite and Amazon Nova Pro, videos of 16 minutes or less are sampled at 1 frame per second (FPS). For videos longer than 16 minutes, the sampling rate decreases so that a consistent 960 frames are sampled, with the frame sampling rate varying accordingly. This approach provides more accurate scene-level video understanding for shorter videos than for longer video content. We recommend keeping videos under 1 hour in length for low-motion content, and under 16 minutes for anything with higher motion. For Amazon Nova Premier, the 1 FPS sampling rate applies up to a limit of 3,200 frames.
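As a back-of-envelope sketch of the sampling rule described above (an idealized estimate derived from the stated frame budgets; the service's actual sampling rates can differ slightly from these numbers):

```python
def frames_to_sample(duration_s: float, model: str = "lite") -> tuple[int, float]:
    """Estimate (frames sampled, effective FPS) for a video of a given length.

    Idealized model of the documented behavior: 1 FPS up to a frame budget
    (960 frames for Nova Lite/Pro, 3,200 for Nova Premier), then a reduced
    rate that keeps the total at the budget.
    """
    budget = 3200 if model == "premier" else 960
    frames = min(int(duration_s), budget)
    return frames, frames / duration_s

# A 45-minute video on Nova Lite: 960 frames at roughly 0.356 FPS,
# while Nova Premier still samples it at the full 1 FPS.
lite_frames, lite_fps = frames_to_sample(45 * 60)
premier_frames, premier_fps = frames_to_sample(45 * 60, model="premier")
```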

There should be no difference when analyzing a 4K version of a video versus a Full HD version. Similarly, because the sampling rate is at most 1 FPS, a 60 FPS video should perform as well as a 30 FPS video. Because of the 1 GB limit on video size, using higher resolution or frame rate than required is not beneficial and limits the video length that fits within that size limit. You might want to pre-process videos larger than 1 GB.

Video tokens

The length of the video is the main factor affecting the number of tokens generated. To calculate the approximate cost, multiply the estimated number of video tokens by the per-token price of the specific model being used.
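The tables below imply roughly 288 tokens per sampled frame (for example, 2,880 tokens for 10 frames). Treating that as a working constant inferred from the tables, not an official figure, a rough estimate looks like this:

```python
TOKENS_PER_FRAME = 288  # inferred from the tables (2,880 tokens / 10 frames); not an official constant

def estimate_video_tokens(duration_s: float, frame_budget: int = 960) -> int:
    """Rough token estimate: 1 FPS sampling up to the model's frame budget
    (960 frames for Nova Lite/Pro, 3,200 for Nova Premier)."""
    frames = min(int(duration_s), frame_budget)
    return frames * TOKENS_PER_FRAME

def estimate_cost(duration_s: float, price_per_1k_tokens: float) -> float:
    """Multiply the token estimate by your model's per-token price."""
    return estimate_video_tokens(duration_s) / 1000 * price_per_1k_tokens

# A 16-minute video -> 960 frames -> 276,480 tokens, matching the table.
```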

The following table provides some approximations of frame sampling and token utilization per video length for Amazon Nova Lite and Amazon Nova Pro:

| video_duration | 10 sec | 30 sec | 16 min | 20 min | 30 min | 45 min | 1 hr | 1.5 hr |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| frames_to_sample | 10 | 30 | 960 | 960 | 960 | 960 | 960 | 960 |
| sample_rate_fps | 1 | 1 | 1 | 0.755 | 0.5 | 0.35556 | 0.14 | 0.096 |
| Estimated token count | 2,880 | 8,640 | 276,480 | 276,480 | 276,480 | 276,480 | 276,480 | 276,480 |

The following table provides some approximations of frame sampling and token utilization per video length for Amazon Nova Premier:

| video_duration | 10 sec | 30 sec | 16 min | 20 min | 30 min | 45 min |
| --- | --- | --- | --- | --- | --- |
| frames_to_sample | 10 | 30 | 960 | 1200 | 1800 | 2700 |
| sample_rate_fps | 1 | 1 | 1 | 1 | 1 | 1 |
| Estimated token count | 2,880 | 8,640 | 276,480 | 345,600 | 518,400 | 777,600 |

The following table provides some approximations of frame sampling and token utilization per video length for Amazon Nova Lite 1.5:

| video_duration | 10 sec | 30 sec | 16 min | 20 min | 30 min | 45 min |
| --- | --- | --- | --- | --- | --- |
| frames_to_sample | 10 | 30 | 960 | 1200 | 1800 | 2700 |
| sample_rate_fps | 1 | 1 | 1 | 1 | 1 | 1 |
| Estimated token count | 2,880 | 8,640 | 276,480 | 345,600 | 518,400 | 777,600 |

Video understanding limitations

The following are key model limitations, for which model accuracy and performance are not guaranteed.

  • One video per request: The model currently supports only one video per request. Note that some frameworks and libraries maintain memory of previous interactions, so a video added in an earlier turn might still be present in the request context.

  • No audio support: The models are currently trained to process and understand video content solely based on the visual information in the video. They do not possess the capability to analyze or comprehend any audio components that are present in the video.

  • Temporal causality: The model has limited understanding of event causality across the progression of a video. Although it answers point-in-time questions well, it does not perform as well on questions that depend on understanding a sequence of events.

  • Multilingual image understanding: The models have limited understanding of multilingual images and video frames, and might struggle or hallucinate on such tasks.

  • People identification: The Amazon Nova models do not support the capability to identify or name individuals in images, documents, or videos. The models will refuse to perform such tasks.

  • Spatial reasoning: The Amazon Nova models have limited spatial reasoning capabilities. They may struggle with tasks that require precise localization or layout analysis.

  • Small text in images or videos: If the text in the image or video is too small, consider increasing the relative size of the text by cropping to the relevant section while preserving necessary content.

  • Counting: The Amazon Nova models can provide approximate counts of objects in an image, but might not always be precisely accurate, especially when dealing with large numbers of small objects.

  • Inappropriate content: The Amazon Nova models will not process inappropriate or explicit images that violate the Acceptable Use Policy.

  • Healthcare applications: Due to the sensitive nature of these artifacts, even though Amazon Nova models can give a general analysis of healthcare images or videos, we do not recommend using them to interpret complex diagnostic scans. Amazon Nova's responses should never be considered a substitute for professional medical advice.

Video understanding examples

The following examples show how to send video prompts to Amazon Nova models using different input methods.

The following example shows how to send a video prompt to Amazon Nova Model with InvokeModel.

```python
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: Apache-2.0
import base64
import boto3
import json

# Create a Bedrock Runtime client in the AWS Region of your choice.
client = boto3.client(
    "bedrock-runtime",
    region_name="us-east-1",
)

MODEL_ID = "us.amazon.nova-lite-v1:0"

# Open the video you'd like to use and encode it as a Base64 string.
with open("media/cooking-quesadilla.mp4", "rb") as video_file:
    binary_data = video_file.read()
    base_64_encoded_data = base64.b64encode(binary_data)
    base64_string = base_64_encoded_data.decode("utf-8")

# Define your system prompt(s).
system_list = [
    {
        "text": "You are an expert media analyst. When the user provides you with a video, provide 3 potential video titles"
    }
]

# Define a "user" message including both the video and a text prompt.
message_list = [
    {
        "role": "user",
        "content": [
            {
                "video": {
                    "format": "mp4",
                    "source": {
                        # Base64-encoded string for the Invoke API
                        # (the Converse API takes a raw binary array instead).
                        "bytes": base64_string
                    },
                }
            },
            {"text": "Provide video titles for this clip."},
        ],
    }
]

# Configure the inference parameters.
inf_params = {"maxTokens": 300, "topP": 0.1, "topK": 20, "temperature": 0.3}

native_request = {
    "schemaVersion": "messages-v1",
    "messages": message_list,
    "system": system_list,
    "inferenceConfig": inf_params,
}

# Invoke the model and extract the response body.
response = client.invoke_model(modelId=MODEL_ID, body=json.dumps(native_request))
model_response = json.loads(response["body"].read())

# Pretty print the response JSON.
print("[Full Response]")
print(json.dumps(model_response, indent=2))

# Print the text content for easy readability.
content_text = model_response["output"]["message"]["content"][0]["text"]
print("\n[Response Content Text]")
print(content_text)
```

The following example shows how to send a video using an Amazon S3 location to Amazon Nova with InvokeModel.

```python
import boto3
import json

# Create a Bedrock Runtime client in the AWS Region of your choice.
client = boto3.client(
    "bedrock-runtime",
    region_name="us-east-1",
)

MODEL_ID = "us.amazon.nova-lite-v1:0"

# Define your system prompt(s).
system_list = [
    {
        "text": "You are an expert media analyst. When the user provides you with a video, provide 3 potential video titles"
    }
]

# Define a "user" message including both the video and a text prompt.
message_list = [
    {
        "role": "user",
        "content": [
            {
                "video": {
                    "format": "mp4",
                    "source": {
                        "s3Location": {
                            "uri": "s3://my_bucket/my_video.mp4",
                            "bucketOwner": "111122223333"
                        }
                    }
                }
            },
            {"text": "Provide video titles for this clip."}
        ]
    }
]

# Configure the inference parameters.
inf_params = {"maxTokens": 300, "topP": 0.1, "topK": 20, "temperature": 0.3}

native_request = {
    "schemaVersion": "messages-v1",
    "messages": message_list,
    "system": system_list,
    "inferenceConfig": inf_params,
}

# Invoke the model and extract the response body.
response = client.invoke_model(modelId=MODEL_ID, body=json.dumps(native_request))
model_response = json.loads(response["body"].read())

# Pretty print the response JSON.
print("[Full Response]")
print(json.dumps(model_response, indent=2))

# Print the text content for easy readability.
content_text = model_response["output"]["message"]["content"][0]["text"]
print("\n[Response Content Text]")
print(content_text)
```
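For comparison, the same kind of request can be sketched with the Converse API, which accepts the raw video bytes directly rather than a base64-encoded string. This is a minimal sketch; the file path, prompt, and Region are placeholders:

```python
MODEL_ID = "us.amazon.nova-lite-v1:0"

def build_video_message(video_bytes: bytes, prompt: str, video_format: str = "mp4") -> dict:
    """Assemble a Converse-API user message with one video and one text block."""
    return {
        "role": "user",
        "content": [
            {"video": {"format": video_format, "source": {"bytes": video_bytes}}},
            {"text": prompt},
        ],
    }

def ask_about_video(path: str, prompt: str) -> str:
    """Send a local video file and a prompt to the model via Converse."""
    import boto3  # imported here so the message builder has no AWS dependency

    client = boto3.client("bedrock-runtime", region_name="us-east-1")
    with open(path, "rb") as f:
        video_bytes = f.read()  # raw bytes; no base64 encoding for Converse
    response = client.converse(
        modelId=MODEL_ID,
        messages=[build_video_message(video_bytes, prompt)],
        inferenceConfig={"maxTokens": 300, "temperature": 0.3},
    )
    return response["output"]["message"]["content"][0]["text"]
```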