
Set up a data source for your knowledge base

A data source contains files with information that can be retrieved when your knowledge base is queried. You set up the data source for your knowledge base by uploading source document files to an Amazon S3 bucket.

Check that each source document file conforms to the following requirements:

  • The file must be in one of the following supported formats:

    Format                        Extension
    Plain text                    .txt
    Markdown                      .md
    HyperText Markup Language     .html
    Microsoft Word document       .doc/.docx
    Comma-separated values        .csv
    Microsoft Excel spreadsheet   .xls/.xlsx
    Portable Document Format      .pdf
  • The file size doesn't exceed the quota of 50 MB.
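The requirements above can be checked locally before you upload. The following sketch is illustrative only; `check_source_file` is a hypothetical helper, not part of any Amazon Bedrock SDK.

```python
import os

# Hypothetical helper: check a local file against the knowledge base
# source-document requirements before uploading it to Amazon S3.
SUPPORTED_EXTENSIONS = {
    ".txt", ".md", ".html", ".doc", ".docx",
    ".csv", ".xls", ".xlsx", ".pdf",
}
MAX_SOURCE_BYTES = 50 * 1024 * 1024  # 50 MB quota

def check_source_file(path, size_bytes):
    """Return a list of requirement violations (empty if the file is OK)."""
    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    if size_bytes > MAX_SOURCE_BYTES:
        problems.append(f"file exceeds the 50 MB quota ({size_bytes} bytes)")
    return problems

print(check_source_file("notes.txt", 1_024))   # []
print(check_source_file("image.png", 1_024))   # ['unsupported format: .png']
```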

The following topics describe optional steps for preparing your data source.

Add metadata to your files to allow for filtering

You can optionally add metadata to the files in your data source. Metadata lets you filter the results when you query the knowledge base.

Metadata file requirements

To include metadata for a file in your data source, create a JSON file consisting of a metadataAttributes field that maps to an object with a key-value pair for each metadata attribute. Then upload it to the same folder in your Amazon S3 bucket as the source document file. The following displays the general format of the metadata file:

{
    "metadataAttributes": {
        "${attribute1}": "${value1}",
        "${attribute2}": "${value2}",
        ...
    }
}

The following data types are supported for the values of the attributes:

  • String

  • Number

  • Boolean

Check that each metadata file conforms to the following requirements:

  • The file has the same name as its associated source document file, with .metadata.json appended after the file extension (for example, if you have a file called A.txt, the metadata file must be named A.txt.metadata.json).

  • The file size doesn't exceed the quota of 10 KB.

  • The file is in the same folder in the Amazon S3 bucket as its associated source document file.
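A minimal sketch of producing a metadata file that follows the naming and size rules above; `write_metadata` is a hypothetical helper, not part of any AWS SDK.

```python
import json

# Hypothetical helper: build the metadata file name and contents for a
# source document, following the naming and size rules for metadata files.
MAX_METADATA_BYTES = 10 * 1024  # 10 KB quota

def write_metadata(source_name, attributes):
    """Return (metadata_file_name, serialized_json) for a source document."""
    body = json.dumps({"metadataAttributes": attributes})
    if len(body.encode("utf-8")) > MAX_METADATA_BYTES:
        raise ValueError("metadata file would exceed the 10 KB quota")
    # .metadata.json is appended after the existing file extension.
    return source_name + ".metadata.json", body

name, body = write_metadata("A.txt", {"genre": "entertainment", "year": 2024})
print(name)  # A.txt.metadata.json
```

Upload the resulting file to the same Amazon S3 folder as the source document it describes.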

Note

If you're adding metadata to an existing vector index in an Amazon OpenSearch Serverless vector store, check that the vector index is configured with the faiss engine to allow for filtering. If the vector index is configured with the nmslib engine, you'll have to recreate the vector index with the faiss engine before filtering is available.

If you're adding metadata to an existing vector index in an Amazon Aurora database cluster, you must add a column to the table for each metadata attribute in your metadata files before starting ingestion. The metadata attribute values will be written to these columns.

After you sync your data source, you can filter results during knowledge base query.
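As a sketch, a query-time filter on these attributes can be expressed through the Retrieve API's `retrievalConfiguration`. The field names below follow the Bedrock Agent Runtime API's filter shape, but verify them against the current API reference before relying on them.

```python
# Sketch of a retrieval filter for a knowledge base query: return only
# chunks whose metadata matches both conditions. Verify the exact field
# names against the Bedrock Agent Runtime Retrieve API reference.
retrieval_configuration = {
    "vectorSearchConfiguration": {
        "filter": {
            "andAll": [
                {"equals": {"key": "genre", "value": "entertainment"}},
                {"greaterThanOrEquals": {"key": "year", "value": 2024}},
            ]
        }
    }
}
```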

Metadata file example

As an example, if you have a source document with the name oscars-coverage_20240310.pdf that contains news articles, you might want to categorize them by attributes such as year or genre. To create the metadata for this file, perform the following steps:

  1. Create a file named oscars-coverage_20240310.pdf.metadata.json with the following contents:

    {
        "metadataAttributes": {
            "genre": "entertainment",
            "year": 2024
        }
    }
  2. Upload oscars-coverage_20240310.pdf.metadata.json to the same folder as oscars-coverage_20240310.pdf in your Amazon S3 bucket.

  3. Create a knowledge base if you haven't yet. Then, sync your data source.
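The steps above can be sketched with boto3. The bucket name, knowledge base ID, and data source ID are placeholders, and `s3_keys_for` is a hypothetical helper; the actual AWS calls require credentials and are not executed here.

```python
import os

# Sketch of the upload-and-sync flow. Placeholder values only.

def s3_keys_for(source_name, prefix=""):
    """Both files must land in the same S3 folder, so derive the pair of keys."""
    source_key = os.path.join(prefix, source_name) if prefix else source_name
    return source_key, source_key + ".metadata.json"

def upload_and_sync(bucket, kb_id, ds_id, source_path):
    # Requires boto3 and AWS credentials; shown for illustration.
    import boto3
    source_key, metadata_key = s3_keys_for(os.path.basename(source_path))
    s3 = boto3.client("s3")
    s3.upload_file(source_path, bucket, source_key)
    s3.upload_file(source_path + ".metadata.json", bucket, metadata_key)
    # Start an ingestion job to sync the data source into the knowledge base.
    bedrock_agent = boto3.client("bedrock-agent")
    bedrock_agent.start_ingestion_job(knowledgeBaseId=kb_id, dataSourceId=ds_id)

print(s3_keys_for("oscars-coverage_20240310.pdf", prefix="articles"))
```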

Source chunks

During ingestion of your data into a knowledge base, Amazon Bedrock splits each file into chunks. A chunk is an excerpt from a data source that is returned when the knowledge base it belongs to is queried.

Amazon Bedrock offers chunking strategies that you can use to chunk your data. You can also pre-process your data by chunking your source files yourself. Consider which of the following chunking strategies you want to use for your data source:

  • Default chunking – By default, Amazon Bedrock automatically splits your source data into chunks, such that each chunk contains at most approximately 300 tokens. If a document contains fewer than 300 tokens, it isn't split further.

  • Fixed size chunking – Amazon Bedrock splits your source data into chunks of the approximate size that you set.

  • No chunking – Amazon Bedrock treats each file as one chunk. If you choose this option, you may want to pre-process your documents by splitting them into separate files before uploading them to an Amazon S3 bucket.
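As an illustration of the fixed-size idea, the sketch below splits text into chunks of at most a given number of whitespace-separated tokens. Amazon Bedrock's actual tokenizer and chunk boundaries differ; this only shows the concept.

```python
# Illustrative sketch of fixed-size chunking: split text into chunks of at
# most `max_tokens` whitespace-separated tokens. Not Bedrock's real splitter.

def chunk_fixed(text, max_tokens=300):
    tokens = text.split()
    return [
        " ".join(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]

doc = "word " * 700
chunks = chunk_fixed(doc, max_tokens=300)
print(len(chunks))  # 3 chunks: 300 + 300 + 100 tokens
```

A document shorter than `max_tokens` comes back as a single chunk, mirroring the default-chunking behavior described above.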