
How content chunking and parsing works for Amazon Bedrock knowledge bases

Amazon Bedrock first splits your documents or content into manageable chunks for efficient data retrieval. The chunks are then converted to embeddings and written to a vector index (vector representation of the data), while maintaining a mapping to the original document. The vector embeddings allow the texts to be mathematically compared for similarity.

Standard chunking

Amazon Bedrock supports the following standard approaches to chunking:

  • Fixed-size chunking: You can configure the desired chunk size by specifying the maximum number of tokens that a chunk must not exceed and the overlap percentage between consecutive chunks, providing flexibility to align with your specific requirements.

  • Default chunking: Splits content into text chunks of approximately 300 tokens. The chunking process honors sentence boundaries, ensuring that complete sentences are preserved within each chunk.

You can also choose no chunking for your documents. Each document is treated as a single text chunk. You might want to pre-process your documents by splitting them into separate files before choosing no chunking as your chunking strategy.

The following is an example of configuring fixed-size chunking:

Console
  1. Sign in to the AWS Management Console using an IAM role with Amazon Bedrock permissions, and open the Amazon Bedrock console at https://console.aws.amazon.com/bedrock/.

  2. From the left navigation pane, select Knowledge bases.

  3. In the Knowledge bases section, select Create knowledge base.

  4. Provide the knowledge base details such as the name, IAM role for the necessary access permissions, and any tags you want to assign to your knowledge base.

  5. Choose a supported data source and provide the connection configuration details.

  6. For chunking and parsing configurations, first choose the custom option and then choose fixed-size chunking as your chunking strategy.

  7. Continue the steps to complete creating your knowledge base.

API

{
    ...
    "vectorIngestionConfiguration": {
        "chunkingConfiguration": {
            "chunkingStrategy": "string",
            "fixedSizeChunkingConfiguration": {
                "maxTokens": "number",
                "overlapPercentage": "number"
            }
        }
    }
}
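For reference, the following is a minimal Python (boto3) sketch of creating a data source with fixed-size chunking. The knowledge base ID, data source name, bucket ARN, and token values are hypothetical placeholders, not values prescribed by the service.

import boto3

bedrock_agent = boto3.client("bedrock-agent")

# Hypothetical identifiers -- replace with your own knowledge base ID and bucket ARN.
response = bedrock_agent.create_data_source(
    knowledgeBaseId="KB12345678",
    name="docs-fixed-size",
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {"bucketArn": "arn:aws:s3:::amzn-s3-demo-bucket"},
    },
    vectorIngestionConfiguration={
        "chunkingConfiguration": {
            "chunkingStrategy": "FIXED_SIZE",
            "fixedSizeChunkingConfiguration": {
                "maxTokens": 300,         # maximum tokens per chunk
                "overlapPercentage": 20,  # overlap between consecutive chunks
            },
        }
    },
)
print(response["dataSource"]["dataSourceId"])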

Hierarchical chunking

Hierarchical chunking involves organizing information into nested structures of child and parent chunks. When creating a data source, you can define the parent chunk size, the child chunk size, and the number of tokens overlapping between each chunk. During retrieval, the system initially retrieves child chunks but replaces them with broader parent chunks to provide the model with more relevant context. This approach enhances efficiency and relevance by providing concise, higher-level summaries instead of granular details.

For hierarchical chunking, Amazon Bedrock knowledge bases support specifying two levels of chunking depth:

  • Parent: You set the maximum parent chunk token size.

  • Child: You set the maximum child chunk token size.

You also set the number of overlap tokens between chunks. This is the absolute number of tokens that overlap between consecutive parent chunks and between consecutive child chunks.

The following is an example of configuring hierarchical chunking:

Console
  1. Sign in to the AWS Management Console using an IAM role with Amazon Bedrock permissions, and open the Amazon Bedrock console at https://console.aws.amazon.com/bedrock/.

  2. From the left navigation pane, select Knowledge bases.

  3. In the Knowledge bases section, select Create knowledge base.

  4. Provide the knowledge base details such as the name, IAM role for the necessary access permissions, and any tags you want to assign to your knowledge base.

  5. Choose a supported data source and provide the connection configuration details.

  6. For chunking and parsing configurations, first choose the custom option and then choose hierarchical chunking as your chunking strategy.

  7. Enter the maximum parent chunk token size.

  8. Enter the maximum child chunk token size.

  9. Enter the overlap tokens between chunks. This is the absolute number of tokens that overlap between consecutive parent chunks and between consecutive child chunks.

  10. Continue the steps to complete creating your knowledge base.

API

{
    ...
    "vectorIngestionConfiguration": {
        "chunkingConfiguration": {
            "chunkingStrategy": "string",
            "fixedSizeChunkingConfiguration": {
                "maxTokens": "number",
                "overlapPercentage": "number"
            },
            "hierarchicalChunkingConfiguration": { // Hierarchical chunking
                "levelConfigurations": [
                    {
                        "maxTokens": "number"
                    }
                ],
                "overlapTokens": "number"
            }
        }
    }
}
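As a concrete illustration, the following Python snippet sketches a hierarchical configuration that could be passed as the vectorIngestionConfiguration parameter of the create_data_source call shown earlier. The token values are hypothetical defaults, and the sketch assumes the first level configuration defines the parent and the second defines the child.

# Hypothetical token sizes; assumes the first level is the parent, the second the child.
hierarchical_ingestion_config = {
    "chunkingConfiguration": {
        "chunkingStrategy": "HIERARCHICAL",
        "hierarchicalChunkingConfiguration": {
            "levelConfigurations": [
                {"maxTokens": 1500},  # maximum parent chunk token size
                {"maxTokens": 300},   # maximum child chunk token size
            ],
            "overlapTokens": 60,      # overlap between consecutive chunks
        },
    }
}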

Semantic chunking

Semantic chunking is a natural language processing technique that divides text into meaningful chunks to enhance understanding and information retrieval. It aims to improve retrieval accuracy by focusing on the semantic content rather than just syntactic structure. By doing so, it may facilitate more precise extraction and manipulation of relevant information. When configuring semantic chunking on your data source, you can specify the following hyperparameters:

  • Maximum tokens: The maximum number of tokens that should be included in a single chunk, while honoring sentence boundaries.

  • Buffer size: For a given sentence, the buffer size defines the number of surrounding sentences to be added for embedding creation. For example, a buffer size of 1 results in three sentences (the previous, current, and next sentence) being combined and embedded. This parameter influences how much text is examined together to determine the boundaries of each chunk, impacting the granularity and coherence of the resulting chunks. A larger buffer size might capture more context but can also introduce noise, while a smaller buffer size might miss important context but ensures more precise chunking.

  • Breakpoint percentile threshold: The breakpoint threshold is a parameter that determines where to divide the text into chunks based on semantic similarity. The threshold helps identify natural breaking points in the text to create coherent and meaningful chunks. Adjusting the breakpoint threshold can influence the size and content of each chunk, balancing between maintaining context and creating manageable units for processing.

The following is an example of configuring semantic chunking:

Console
  1. Sign in to the AWS Management Console using an IAM role with Amazon Bedrock permissions, and open the Amazon Bedrock console at https://console.aws.amazon.com/bedrock/.

  2. From the left navigation pane, select Knowledge bases.

  3. In the Knowledge bases section, select Create knowledge base.

  4. Provide the knowledge base details such as the name, IAM role for the necessary access permissions, and any tags you want to assign to your knowledge base.

  5. Choose a supported data source and provide the connection configuration details.

  6. For chunking and parsing configurations, first choose the custom option and then choose semantic chunking as your chunking strategy.

  7. Enter the buffer size, which is the number of sentences surrounding the target sentence to group together. For example, a buffer size of 1 groups the previous sentence, the target sentence, and the next sentence.

  8. Enter the maximum token size for a text chunk.

  9. Select the breakpoint percentile threshold for similarity between sentence groups. For example, a breakpoint threshold of 90% results in a new chunk being created when the embedding similarity between sentence groups falls below 90%.

  10. Continue the steps to complete creating your knowledge base.

API

{
    ...
    "vectorIngestionConfiguration": {
        "chunkingConfiguration": {
            "chunkingStrategy": "string",
            "fixedSizeChunkingConfiguration": {
                "maxTokens": "number",
                "overlapPercentage": "number"
            },
            "semanticChunkingConfiguration": { // Semantic chunking
                "maxTokens": "number",
                "bufferSize": "number",
                "breakpointPercentileThreshold": "number"
            }
        }
    }
}
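The following Python snippet sketches a semantic chunking configuration that could be passed as the vectorIngestionConfiguration parameter of a create_data_source call; the specific values are hypothetical examples.

# Hypothetical hyperparameter values for semantic chunking.
semantic_ingestion_config = {
    "chunkingConfiguration": {
        "chunkingStrategy": "SEMANTIC",
        "semanticChunkingConfiguration": {
            "maxTokens": 300,                    # maximum tokens per chunk
            "bufferSize": 1,                     # previous + target + next sentence
            "breakpointPercentileThreshold": 95, # split where similarity drops below this percentile
        },
    }
}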

Advanced parsing options

You can use advanced parsing techniques to parse non-textual information from supported file types, such as PDF. This feature lets you select a foundation model to parse complex data, such as tables and charts. Additionally, you can tailor parsing to your specific needs by overwriting the default prompts for data extraction, ensuring optimal performance across a diverse set of use cases. Currently, Claude 3 Sonnet and Claude 3 Haiku are supported.

The following is an example of configuring a foundation model to aid in advanced parsing:

Console
  • Sign in to the AWS Management Console using an IAM role with Amazon Bedrock permissions, and open the Amazon Bedrock console at https://console.aws.amazon.com/bedrock/.

  • From the left navigation pane, select Knowledge bases.

  • In the Knowledge bases section, select Create knowledge base.

  • Provide the knowledge base details such as the name, IAM role for the necessary access permissions, and any tags you want to assign to your knowledge base.

  • Choose a supported data source and provide the connection configuration details.

  • For chunking and parsing configurations, first choose the custom option and then enable Foundation model and select your preferred foundation model. You can also optionally overwrite the Instructions for the parser to suit your specific needs.

  • Continue the steps to complete creating your knowledge base.

API

{
    ...
    "vectorIngestionConfiguration": {
        "chunkingConfiguration": {
            "chunkingStrategy": "string",
            "fixedSizeChunkingConfiguration": {
                "maxTokens": "number",
                "overlapPercentage": "number"
            }
        },
        "parsingConfiguration": { // Parse tabular data within docs
            "parsingStrategy": "string", // enum of BEDROCK_FOUNDATION_MODEL
            "bedrockFoundationModelConfiguration": {
                "parsingPrompt": {
                    "parsingPromptText": "string"
                },
                "modelArn": "string"
            }
        }
    }
}
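For illustration, the following Python snippet sketches a vectorIngestionConfiguration that combines fixed-size chunking with foundation model parsing. The model ARN and prompt text are examples only; omit parsingPrompt to keep the default parsing instructions.

# Hypothetical configuration values; the model ARN and prompt text are illustrative.
parsing_ingestion_config = {
    "chunkingConfiguration": {
        "chunkingStrategy": "FIXED_SIZE",
        "fixedSizeChunkingConfiguration": {"maxTokens": 300, "overlapPercentage": 20},
    },
    "parsingConfiguration": {
        "parsingStrategy": "BEDROCK_FOUNDATION_MODEL",
        "bedrockFoundationModelConfiguration": {
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0",
            "parsingPrompt": {
                "parsingPromptText": "Transcribe the text content of the document, including tables and charts."
            },
        },
    },
}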

Metadata selection for CSVs

When ingesting CSV (comma-separated values) files, you can have the knowledge base treat certain columns as content fields and others as metadata fields. Instead of potentially having hundreds or thousands of content/metadata file pairs, you can now have a single CSV file and a corresponding metadata.json file that gives the knowledge base hints as to how to treat each column inside your CSV. To do this, ensure that:

  • Your CSV is in RFC4180 format.

  • The first row of your CSV includes header information.

  • Metadata fields provided in your metadata.json are present as columns in your CSV.

  • You provide a metadata.json file with the following format:

    { "metadataAttributes": { "${attribute1}": "${value1}", "${attribute2}": "${value2}", ... }, "documentStructureConfiguration": { "type": "RECORD_BASED_STRUCTURE_METADATA", "recordBasedStructureMetadata": { "contentFields": [ { "fieldName": "string" } ], "metadataFieldsSpecification": { "fieldsToInclude": [ { "fieldName": "string" } ], "fieldsToExclude": [ { "fieldName": "string" } ] } } } }

Note that:

  • Amazon Bedrock knowledge bases currently support one content field.

  • If no inclusion/exclusion fields are provided, all columns are treated as metadata columns, except the content column.

  • If only inclusion fields are provided, only the provided columns are treated as metadata.

  • If only exclusion fields are provided, all columns except the exclusion columns are treated as metadata.

  • The knowledge base will skip over and ignore any blank rows found inside a CSV.
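As an illustration of these rules, the following Python sketch writes a hypothetical metadata file for a CSV whose description column is the single content field and whose price and category columns are included as metadata. The file name, column names, and attribute values are examples only, not values required by the service.

import json

# Hypothetical CSV columns: "description" is the single content field; "price" and
# "category" are included as metadata; other columns are excluded by omission, since
# only inclusion fields are provided here.
metadata = {
    "metadataAttributes": {
        "source": "product-catalog"  # example file-level metadata attribute
    },
    "documentStructureConfiguration": {
        "type": "RECORD_BASED_STRUCTURE_METADATA",
        "recordBasedStructureMetadata": {
            "contentFields": [{"fieldName": "description"}],
            "metadataFieldsSpecification": {
                "fieldsToInclude": [
                    {"fieldName": "price"},
                    {"fieldName": "category"},
                ]
            },
        },
    },
}

# Write the metadata file alongside the CSV in your data source (name shown is illustrative).
with open("products.csv.metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)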

Custom transformation

You can define a custom transformation Lambda function to inject your own logic into the knowledge base ingestion process.

If you have specific chunking logic that isn't natively supported by Amazon Bedrock knowledge bases, select the No chunking strategy and specify a Lambda function that contains your chunking logic. You also need to specify an Amazon S3 bucket where the knowledge base writes the files to be chunked by your Lambda function. After chunking, your Lambda function writes the chunked files back to the same bucket and returns references for the knowledge base to process further. You can optionally provide your own AWS KMS key to encrypt the files stored in your S3 bucket.

Alternatively, you may want to add chunk-level metadata while having the knowledge base apply one of the natively supported chunking strategies. In this case, select one of the pre-defined chunking strategies (for example, Default or Fixed-size) and provide a reference to your Lambda function and S3 bucket. The knowledge base stores parsed, pre-chunked files in that S3 bucket before calling your Lambda function to add chunk-level metadata. After adding chunk-level metadata, your Lambda function writes the chunked files back to the same bucket and returns references for the knowledge base to process further. Note that chunk-level metadata takes precedence over, and overwrites, file-level metadata in case of any collisions.

For the API and file contracts, refer to the following structures:

API contract when adding a custom transformation using Lambda function

{
    ...
    "vectorIngestionConfiguration": {
        "customTransformationConfiguration": { // Custom transformation
            "intermediateStorage": {
                "s3Location": { // the location where input/output of the Lambda is expected
                    "uri": "string"
                }
            },
            "transformations": [
                {
                    "transformationFunction": {
                        "transformationLambdaConfiguration": {
                            "lambdaArn": "string"
                        }
                    },
                    "stepToApply": "string" // enum of POST_CHUNKING
                }
            ]
        },
        "chunkingConfiguration": {
            "chunkingStrategy": "string",
            "fixedSizeChunkingConfiguration": {
                "maxTokens": "number",
                "overlapPercentage": "number"
            }
        }
        ...
    }
}
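As a concrete sketch of this contract, the following Python snippet shows a vectorIngestionConfiguration that pairs the No chunking strategy with a custom transformation Lambda function. The bucket URI and Lambda ARN are hypothetical placeholders.

# Hypothetical bucket URI and Lambda ARN; POST_CHUNKING invokes the function after chunking.
custom_transform_ingestion_config = {
    "customTransformationConfiguration": {
        "intermediateStorage": {
            "s3Location": {"uri": "s3://amzn-s3-demo-bucket/intermediate/"}
        },
        "transformations": [
            {
                "transformationFunction": {
                    "transformationLambdaConfiguration": {
                        "lambdaArn": "arn:aws:lambda:us-east-1:111122223333:function:my-chunker"
                    }
                },
                "stepToApply": "POST_CHUNKING",
            }
        ],
    },
    "chunkingConfiguration": {"chunkingStrategy": "NONE"},  # custom chunking in the Lambda function
}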

Custom Lambda transformation input format

{ "version": "1.0", "knowledgeBaseId": "string", "dataSourceId": "string", "ingestionJobId": "string", "bucketName": "string", "priorTask": "string", "inputFiles": [ { "originalFileLocation": { "type": "S3", "s3_location": { "key": "string", "uri": "string" } }, "fileMetadata": { "key1": "value1", "key2": "value2" }, "contentBatches": [ { "key":"string" } ] } ] }

Custom Lambda transformation output format

{ "outputFiles": [ { "originalFileLocation": { "type": "S3", "s3_location": { "key": "string", "uri": "string" } } "fileMetadata": { "key1": "value1", "key2": "value2" }, "contentBatches": [ { "key": "string" } ] } ] }

File format for objects referenced in fileContents

{ "fileContents": [ { "contentBody": "...", "contentType": "string", // enum of TEXT, PDF, ... "contentMetadata": { "key1": "value1", "key2": "value2" } } ... ] }