Set up a data source for your knowledge base
A data source contains files with information that can be retrieved when your knowledge base is queried. You set up the data source for your knowledge base by uploading source document files to an Amazon S3 bucket.
Check that each source document file conforms to the following requirements:
- The file must be in one of the following supported formats:

  | Format | Extension |
  | --- | --- |
  | Plain text | .txt |
  | Markdown | .md |
  | HyperText Markup Language | .html |
  | Microsoft Word document | .doc/.docx |
  | Comma-separated values | .csv |
  | Microsoft Excel spreadsheet | .xls/.xlsx |
  | Portable Document Format | .pdf |

- The file size doesn't exceed the quota of 50 MB.
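The checks above can be sketched as a short script. The extension list and 50 MB quota come from the requirements; the helper name is illustrative, not part of any Amazon Bedrock API:

```python
from pathlib import Path

# Supported extensions and size quota from the requirements above.
SUPPORTED_EXTENSIONS = {".txt", ".md", ".html", ".doc", ".docx",
                        ".csv", ".xls", ".xlsx", ".pdf"}
MAX_SOURCE_BYTES = 50 * 1024 * 1024  # 50 MB quota

def validate_source_document(path: Path) -> list[str]:
    """Return a list of problems; an empty list means the file looks ingestible."""
    problems = []
    if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {path.suffix}")
    if path.stat().st_size > MAX_SOURCE_BYTES:
        problems.append("file exceeds the 50 MB quota")
    return problems
```

Running this over your local copy of the data source before uploading can catch files that a sync would later skip or reject.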
The following topics describe optional steps for preparing your data source.
Add metadata to your files to allow for filtering
You can optionally add metadata to files in your data source. Metadata allows your data to be filtered during knowledge base queries.
Metadata file requirements
To include metadata for a file in your data source, create a JSON file consisting of a metadataAttributes
field that maps to an object with a key-value pair for each metadata attribute. Then upload it to the same folder in your Amazon S3 bucket as the source document file. The following displays the general format of the metadata file:
{
    "metadataAttributes": {
        "${attribute1}": "${value1}",
        "${attribute2}": "${value2}",
        ...
    }
}
The following data types are supported for the values of the attributes:
- String
- Number
- Boolean
Check that each metadata file conforms to the following requirements:
- The file has the same name as its associated source document file.
- Append .metadata.json after the file extension (for example, if you have a file called A.txt, the metadata file must be named A.txt.metadata.json).
- The file size doesn't exceed the quota of 10 KB.
- The file is in the same folder in the Amazon S3 bucket as its associated source document file.
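The naming and size rules above can be expressed as a small helper. This is a sketch using only the Python standard library; the function names are illustrative, not part of any Amazon Bedrock API:

```python
import json
from pathlib import Path

MAX_METADATA_BYTES = 10 * 1024  # 10 KB quota for metadata files

def metadata_path_for(source: Path) -> Path:
    """Append .metadata.json after the source file's extension, per the rules above."""
    return source.with_name(source.name + ".metadata.json")

def write_metadata(source: Path, attributes: dict) -> Path:
    """Write a metadata file next to the source document and enforce the size quota."""
    meta_path = metadata_path_for(source)
    payload = json.dumps({"metadataAttributes": attributes}, indent=4)
    if len(payload.encode("utf-8")) > MAX_METADATA_BYTES:
        raise ValueError("metadata file would exceed the 10 KB quota")
    meta_path.write_text(payload, encoding="utf-8")
    return meta_path
```

Because the metadata file is written to the same directory as the source document, uploading that directory to Amazon S3 keeps the two files in the same folder, as required.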
Note
If you're adding metadata to an existing vector index in an Amazon OpenSearch Serverless vector store, check that the vector index is configured with the faiss engine to allow for filtering. If the vector index is configured with the nmslib engine, you'll have to do one of the following:

- Create a new knowledge base in the console and let Amazon Bedrock automatically create a vector index in Amazon OpenSearch Serverless for you.
- Create another vector index in the vector store and select faiss as the Engine. Then create a new knowledge base and specify the new vector index.
If you're adding metadata to an existing vector index in an Amazon Aurora database cluster, you must add a column to the table for each metadata attribute in your metadata files before starting ingestion. The metadata attribute values will be written to these columns.
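For the Aurora case, the column-per-attribute requirement can be sketched by generating DDL from a metadata file. The table name and the type mapping below are assumptions for illustration; match them to your actual Aurora schema before running any statements:

```python
import json

# Map metadata value types to plausible PostgreSQL column types.
# This mapping and the default table name are assumptions, not an
# Amazon Bedrock requirement; adjust them to your schema.
PG_TYPES = {str: "text", bool: "boolean", int: "numeric", float: "numeric"}

def alter_statements(metadata_json: str, table: str = "bedrock_kb") -> list[str]:
    """Build one ALTER TABLE statement per metadata attribute."""
    attributes = json.loads(metadata_json)["metadataAttributes"]
    return [
        f"ALTER TABLE {table} ADD COLUMN {name} {PG_TYPES[type(value)]};"
        for name, value in attributes.items()
    ]
```

Run the generated statements against the cluster before starting ingestion, so the attribute values have columns to be written to.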
After you sync your data source, you can filter results during knowledge base query.
Metadata file example
As an example, if you have a source document with the name oscars-coverage_20240310.pdf that contains news articles, you might want to categorize them by attributes such as year or genre. To create the metadata for this file, perform the following steps:

1. Create a file named oscars-coverage_20240310.pdf.metadata.json with the following contents:

   {
       "metadataAttributes": {
           "genre": "entertainment",
           "year": 2024
       }
   }

2. Upload oscars-coverage_20240310.pdf.metadata.json to the same folder as oscars-coverage_20240310.pdf in your Amazon S3 bucket.

3. Create a knowledge base if you haven't yet. Then, sync your data source.
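Steps 1 and 2 above can be sketched as follows. The prefix is a placeholder, and the actual upload calls (shown in comments) assume boto3, the AWS SDK for Python:

```python
from pathlib import Path

def s3_keys_for(source: Path, prefix: str = "kb-data/") -> tuple[str, str]:
    """Return the S3 keys for a source document and its metadata file.
    Both must land in the same folder (here, the same key prefix)."""
    doc_key = prefix + source.name
    meta_key = doc_key + ".metadata.json"
    return doc_key, meta_key

# With boto3 installed and credentials configured, the upload itself
# would look something like this (bucket name is a placeholder):
#   s3 = boto3.client("s3")
#   doc_key, meta_key = s3_keys_for(source)
#   s3.upload_file(str(source), "my-kb-bucket", doc_key)
#   s3.upload_file(str(source) + ".metadata.json", "my-kb-bucket", meta_key)
```

Deriving the metadata key from the document key guarantees the two objects share a folder, which is what lets the sync associate them.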
Source chunks
During ingestion of your data into a knowledge base, Amazon Bedrock splits each file into chunks. A chunk refers to an excerpt from a data source that is returned when the knowledge base that it belongs to is queried.
Amazon Bedrock offers chunking strategies that you can use to chunk your data. You can also pre-process your data by chunking your source files yourself. Consider which of the following chunking strategies you want to use for your data source:
- Default chunking – Amazon Bedrock automatically splits your source data into chunks, such that each chunk contains, at most, approximately 300 tokens. If a document contains fewer than 300 tokens, it is not split any further.
- Fixed size chunking – Amazon Bedrock splits your source data into chunks of the approximate size that you set.
- No chunking – Amazon Bedrock treats each file as one chunk. If you choose this option, you may want to pre-process your documents by splitting them into separate files before uploading them to an Amazon S3 bucket.
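If you choose no chunking and pre-process your documents yourself, a minimal sketch of fixed-size splitting might look like this. It approximates token counts with whitespace-delimited words, which is an assumption; real model tokenizers count sub-word units and will give somewhat different boundaries:

```python
def chunk_text(text: str, max_tokens: int = 300) -> list[str]:
    """Split text into chunks of at most max_tokens whitespace-delimited words.
    Word count only approximates model token count."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]
```

Each returned chunk would then be written to its own file before uploading to the Amazon S3 bucket, so that Amazon Bedrock treats each pre-split piece as one chunk.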