Crawl web pages for your Amazon Bedrock knowledge base
Crawling web URLs as your data source is in preview release and is subject to change.
The Amazon Bedrock provided Web Crawler connects to and crawls URLs you have selected for use in your Amazon Bedrock knowledge base.
You can crawl website pages in accordance with your set scope or limits for your selected URLs, using either the AWS Management Console for Amazon Bedrock or the CreateDataSource API (see Amazon Bedrock supported SDKs and AWS CLI).
When selecting websites to crawl, you must adhere to the
Amazon Acceptable Use Policy
and all other Amazon terms. Remember that you must only use the Web Crawler to
index your own web pages, or web pages that you have authorization to crawl.
The Web Crawler respects robots.txt in accordance with RFC 9309.
There are limits to how many web page content items and how many MB per content item can be crawled. See Quotas for knowledge bases.
Supported features
The Web Crawler connects to and crawls HTML pages starting from the seed URL, traversing all child links under the same top primary domain and path. If any of the HTML pages reference supported documents, the Web Crawler fetches those documents, regardless of whether they are within the same top primary domain. You can modify the crawling behavior by changing the crawling configuration; see Connection configuration.
The Web Crawler supports the following:
- Select multiple URLs to crawl
- Respect standard robots.txt directives like 'Allow' and 'Disallow'
- Limit the scope of the URLs to crawl and optionally exclude URLs that match a filter pattern
- Limit the rate of crawling URLs
- View the status of URLs visited while crawling in Amazon CloudWatch
Prerequisites
To use the Web Crawler, make sure you:
- Check that you are authorized to crawl your source URLs.
- Check that the robots.txt corresponding to your source URLs doesn't block the URLs from being crawled. The Web Crawler adheres to the standards of robots.txt: disallow by default if robots.txt is not found for the website. The Web Crawler respects robots.txt in accordance with RFC 9309.
- Check whether your source URL pages are dynamically generated with JavaScript, as crawling dynamically generated content is currently not supported. You can check this by entering view-source:https://examplesite.com/site/ in your browser. If the body element contains only a div element and few or no a href elements, then the page is likely generated dynamically. You can also disable JavaScript in your browser, reload the web page, and observe whether content is rendered properly and contains links to your web pages of interest.
- Enable CloudWatch Logs delivery to view the status of your data ingestion job for ingesting web content, and to see whether certain URLs could not be retrieved.
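The view-source check described above can be approximated with a short script. This is an illustrative heuristic, not part of the Web Crawler: it counts a href elements in the served HTML and flags pages that have few or none as likely generated dynamically.

```python
from html.parser import HTMLParser

class LinkCounter(HTMLParser):
    """Counts a href elements in an HTML document."""
    def __init__(self):
        super().__init__()
        self.links = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.links += 1

def looks_dynamic(html: str, min_links: int = 1) -> bool:
    """Heuristic: a page whose served HTML contains almost no links is
    likely rendered client-side by JavaScript."""
    counter = LinkCounter()
    counter.feed(html)
    return counter.links < min_links
```

A body containing only an empty div (a common pattern for JavaScript-rendered pages) would be flagged, while a page with real anchor links would not.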
Connection configuration
The following describes the sync scope for crawling URLs, inclusion/exclusion filters, URL access, incremental syncing, and how these work.
You can limit the scope of the URLs to crawl based on each page URL's relationship to the seed URLs. For faster crawls, you can limit URLs to those with the same host and initial URL path as the seed URL. For broader crawls, you can choose to crawl URLs with the same host or within any subdomain of the seed URL.
You can choose from the following options:
- Default: Limit crawling to web pages that belong to the same host and have the same initial URL path. For example, with a seed URL of "https://aws.amazon.com/bedrock/", only this path and web pages that extend from it are crawled, like "https://aws.amazon.com/bedrock/agents/". Sibling URLs like "https://aws.amazon.com/ec2/" are not crawled.
- Host only: Limit crawling to web pages that belong to the same host. For example, with a seed URL of "https://aws.amazon.com/bedrock/", web pages under "https://aws.amazon.com" are also crawled, like "https://aws.amazon.com/ec2".
- Subdomains: Include crawling of any web page that has the same primary domain as the seed URL. For example, with a seed URL of "https://aws.amazon.com/bedrock/", any web page whose domain contains "amazon.com" is crawled, like "https://www.amazon.com".
Make sure you are not crawling potentially excessive web pages. It's not recommended to crawl large websites, such as wikipedia.org, without filters or scope limits. Crawling large websites can take a very long time.
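The three scope options can be sketched as a small function. This is an illustrative approximation only; the scope labels and the naive primary-domain check are assumptions for the sketch, not the crawler's actual implementation.

```python
from urllib.parse import urlparse

def in_scope(seed_url: str, candidate_url: str, scope: str) -> bool:
    """Approximate the three crawl-scope options described above."""
    seed, cand = urlparse(seed_url), urlparse(candidate_url)
    if scope == "DEFAULT":
        # same host and the candidate path extends the seed's initial path
        return cand.hostname == seed.hostname and cand.path.startswith(seed.path)
    if scope == "HOST_ONLY":
        # same host, any path
        return cand.hostname == seed.hostname
    if scope == "SUBDOMAINS":
        # naive primary-domain check (last two labels); real registrable-domain
        # logic would need the public suffix list
        primary = ".".join(seed.hostname.split(".")[-2:])
        return cand.hostname == primary or cand.hostname.endswith("." + primary)
    raise ValueError(f"unknown scope: {scope}")
```

For example, with the seed "https://aws.amazon.com/bedrock/", the DEFAULT scope accepts "https://aws.amazon.com/bedrock/agents/" but rejects the sibling "https://aws.amazon.com/ec2/".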
You can include or exclude certain URLs in accordance with your scope. Supported file types are crawled regardless of scope, as long as there is no exclusion pattern for the file type. If you specify both an inclusion and an exclusion filter and both match a URL, the exclusion filter takes precedence and the web content isn't crawled.
An example of a regular expression filter pattern to exclude URLs that end with ".pdf", or PDF web page attachments: ".*\.pdf$"
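The ".*\.pdf$" exclusion pattern above can be exercised with Python's re module; the URLs here are illustrative only.

```python
import re

# Exclusion pattern from the text: URLs that end with ".pdf"
exclude = re.compile(r".*\.pdf$")

# Hypothetical URLs for illustration
urls = [
    "https://www.examplesite.com/docs/guide.html",
    "https://www.examplesite.com/files/report.pdf",
]

# URLs surviving the exclusion filter
crawlable = [u for u in urls if not exclude.match(u)]
```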
You can use the Web Crawler to crawl the pages of websites that you are authorized to crawl.
Each time the Web Crawler runs, it retrieves content for all URLs that are reachable from the source URLs and that match the scope and filters. For incremental syncs after the first sync of all content, Amazon Bedrock updates your knowledge base with new and modified content, and removes old content that is no longer present. Occasionally, the crawler may not be able to tell whether content was removed from the website; in this case, it errs on the side of preserving old content in your knowledge base.
To sync your data source with your knowledge base, use the StartIngestionJob API or select your knowledge
base in the console and select Sync within the data source overview section.
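The API route can be sketched as follows. This is a minimal sketch; the parameter names assume the StartIngestionJob request shape, and the IDs are placeholders.

```python
# Sketch of starting a sync programmatically; the parameter names assume the
# StartIngestionJob API, and the IDs passed in are placeholders.
def build_sync_request(knowledge_base_id: str, data_source_id: str) -> dict:
    """Build the parameters for a StartIngestionJob call."""
    return {
        "knowledgeBaseId": knowledge_base_id,
        "dataSourceId": data_source_id,
    }

# With boto3 installed and AWS credentials configured, you would pass these to
# the bedrock-agent client, e.g.:
#   boto3.client("bedrock-agent").start_ingestion_job(**build_sync_request("KB_ID", "DS_ID"))
```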
All data that you sync from your data source becomes available to anyone with
bedrock:Retrieve
permissions to retrieve the data. This can also include any data with controlled
data source permissions. For more
information, see Knowledge base permissions.
- Console
-
The following steps configure Web Crawler for your Amazon Bedrock knowledge base.
You configure Web Crawler as part of the knowledge base creation steps in the console.
-
Sign in to the AWS Management Console using an IAM role with Amazon Bedrock permissions, and open the Amazon Bedrock console at
https://console.aws.amazon.com/bedrock/.
-
From the left navigation pane, select Knowledge bases.
-
In the Knowledge bases section, select Create knowledge base.
-
Provide the knowledge base details.
-
Provide the knowledge base name and optional description.
-
Provide the AWS Identity and Access Management role with the necessary access permissions required to create a knowledge base.
The IAM role with all the required permissions can be created for you as part of the console steps for creating a knowledge base. After you have completed those steps, the IAM role with all the required permissions is applied to your specific knowledge base.
-
Create any tags you want to assign to your knowledge base.
Go to the next section to configure your data source.
-
Choose Web Crawler as your data source and provide the configuration details.
(Optional) Change the default Data source name and enter a Description.
-
Provide the Source URLs that you want to crawl.
You can add up to 9 additional URLs by selecting Add Source URLs. By providing a source URL, you confirm that you are authorized to crawl its domain.
-
Check the advanced settings. You can optionally change the default selected settings.
For KMS key settings, you can choose either a custom key or the default provided data encryption key.
While converting your data into embeddings, Amazon Bedrock encrypts your transient data with a key that AWS owns and manages by default. You can use your own KMS key instead. For more information, see Encryption of transient data storage during data ingestion.
For data deletion policy settings, you can choose either:
-
Delete: Deletes all data from your data source that’s converted
into vector embeddings upon deletion of a knowledge base or data source resource.
Note that the vector store itself is not deleted,
only the data. This flag is ignored if an AWS account is deleted.
-
Retain: Retains all data from your data source that’s converted
into vector embeddings upon deletion of a knowledge base or data source resource.
Note that the vector store itself is not deleted
if you delete a knowledge base or data source resource.
-
Select an option for the scope of crawling your source URLs.
- Default: Limit crawling to web pages that belong to the same host and have the same initial URL path. For example, with a seed URL of "https://aws.amazon.com/bedrock/", only this path and web pages that extend from it are crawled, like "https://aws.amazon.com/bedrock/agents/". Sibling URLs like "https://aws.amazon.com/ec2/" are not crawled.
- Host only: Limit crawling to web pages that belong to the same host. For example, with a seed URL of "https://aws.amazon.com/bedrock/", web pages under "https://aws.amazon.com" are also crawled, like "https://aws.amazon.com/ec2".
- Subdomains: Include crawling of any web page that has the same primary domain as the seed URL. For example, with a seed URL of "https://aws.amazon.com/bedrock/", any web page whose domain contains "amazon.com" is crawled, like "https://www.amazon.com".
Make sure you are not crawling potentially excessive web pages. It's not recommended to crawl large websites, such as wikipedia.org, without filters or scope limits. Crawling large websites can take a very long time.
- Enter Maximum throttling of crawling speed. The crawler ingests between 1 and 300 URLs per host per minute. A higher crawling speed increases the load but takes less time.
- For URL Regex patterns (optional), you can add Include patterns or Exclude patterns by entering the regular expression pattern in the box. You can add up to 25 include and 25 exclude filter patterns by selecting Add new pattern. The include and exclude patterns are applied in accordance with your scope. If there's a conflict, the exclude pattern takes precedence.
-
Choose either the default or customized chunking and parsing configurations.
-
If you choose custom settings, select one of the following chunking options:
- Fixed-size chunking: Content split into chunks of text of your set approximate token size. You can set the maximum number of tokens that a chunk must not exceed and the overlap percentage between consecutive chunks.
- Default chunking: Content split into chunks of text of up to 300 tokens. If a single document or piece of content contains fewer than 300 tokens, the document is not further split.
-
Hierarchical chunking: Content organized into nested structures
of parent-child chunks. You set the maximum parent chunk token size
and the maximum child chunk token size. You also set the absolute
number of overlap tokens between consecutive parent chunks and
consecutive child chunks.
-
Semantic chunking: Content organized into semantically similar text
chunks or groups of sentences. You set the maximum number of sentences
surrounding the target/current sentence to group together (buffer size).
You also set the breakpoint percentile threshold for dividing the text
into meaningful chunks. Semantic chunking uses a foundation model. View
Amazon Bedrock pricing
for information on the cost of foundation models.
-
No chunking: Each document is treated as a single text chunk. You might
want to pre-process your documents by splitting them into separate files.
You can’t change the chunking strategy after you have created the data source.
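If you configure the data source through the API instead of the console, the chunking options map to a vector ingestion configuration block. The following fixed-size example is a sketch that assumes the CreateDataSource request shape; the values are illustrative.

```python
# Sketch of a fixed-size chunking configuration, assuming the
# vectorIngestionConfiguration shape of the CreateDataSource request.
# Values are illustrative.
vector_ingestion_configuration = {
    "chunkingConfiguration": {
        "chunkingStrategy": "FIXED_SIZE",
        "fixedSizeChunkingConfiguration": {
            "maxTokens": 300,         # maximum tokens a chunk must not exceed
            "overlapPercentage": 20,  # overlap between consecutive chunks
        },
    }
}
```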
-
You can choose to use Amazon Bedrock’s foundation model for parsing documents to
parse more than standard text. You can parse tabular data within documents with their
structure intact, for example. View Amazon Bedrock pricing for information on the cost of foundation models.
-
You can choose to use an AWS Lambda function to customize your chunking strategy and
how your document metadata attributes/fields are treated and ingested. Provide the
Amazon S3 bucket location for the Lambda function input and output.
Go to the next section to configure your vector store.
-
Choose a model for converting your data into vector embeddings.
Create a vector store to allow Amazon Bedrock to store, update, and manage embeddings.
You can quick create a new vector store or select from a supported vector store
you have created. If you create a new vector store, an Amazon OpenSearch
Serverless vector search collection and index with the required fields is set
up for you. If you select from a supported vector store, you must map the vector
field names and metadata field names.
Go to the next section to review your knowledge base configurations.
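When you select an existing vector store through the API rather than the console, the vector and metadata field mapping goes in a storage configuration block. The following Amazon OpenSearch Serverless example is a sketch assuming the CreateKnowledgeBase request shape; the ARN, index name, and field names are placeholders.

```python
# Sketch of mapping vector and metadata field names for an existing
# Amazon OpenSearch Serverless collection, assuming the storageConfiguration
# shape of the CreateKnowledgeBase request. ARN, index, and field names
# are placeholders.
storage_configuration = {
    "type": "OPENSEARCH_SERVERLESS",
    "opensearchServerlessConfiguration": {
        "collectionArn": "arn:aws:aoss:us-east-1:111122223333:collection/EXAMPLE",
        "vectorIndexName": "bedrock-knowledge-base-index",
        "fieldMapping": {
            "vectorField": "bedrock-knowledge-base-vector",
            "textField": "AMAZON_BEDROCK_TEXT_CHUNK",
            "metadataField": "AMAZON_BEDROCK_METADATA",
        },
    },
}
```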
- Check the details of your knowledge base. You can edit any section before creating your knowledge base.
The time it takes to create the knowledge base depends on the amount of data you ingest and your specific configurations. When creation is complete, the status of the knowledge base changes to Ready.
Once your knowledge base is ready, sync your data source for the first time, and again whenever you want to keep your content up to date. Select your knowledge base in the console and select Sync within the data source overview section.
- CLI
-
The following is an example of a configuration of Web Crawler for your Amazon Bedrock
knowledge base.
{
  "webConfiguration": {
    "sourceConfiguration": {
      "urlConfiguration": {
        "seedUrls": [{
          "url": "https://www.examplesite.com"
        }]
      }
    },
    "crawlerConfiguration": {
      "crawlerLimits": {
        "rateLimit": 50
      },
      "scope": "HOST_ONLY",
      "inclusionFilters": [
        "https://www\\.examplesite\\.com/.*\\.html"
      ],
      "exclusionFilters": [
        "https://www\\.examplesite\\.com/contact-us\\.html"
      ]
    }
  },
  "type": "WEB"
}
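The same configuration can be passed programmatically. This is a minimal sketch, assuming the bedrock-agent boto3 client and CreateDataSource parameter names; the knowledge base ID and data source name are placeholders.

```python
# The JSON configuration above as a Python dict. With boto3 and credentials,
# you would pass it to the bedrock-agent client's create_data_source call;
# the knowledge base ID and name in the example call are placeholders.
data_source_configuration = {
    "type": "WEB",
    "webConfiguration": {
        "sourceConfiguration": {
            "urlConfiguration": {
                "seedUrls": [{"url": "https://www.examplesite.com"}]
            }
        },
        "crawlerConfiguration": {
            "crawlerLimits": {"rateLimit": 50},
            "scope": "HOST_ONLY",
            "inclusionFilters": [r"https://www\.examplesite\.com/.*\.html"],
            "exclusionFilters": [r"https://www\.examplesite\.com/contact-us\.html"],
        },
    },
}

# Example call (not executed here):
#   boto3.client("bedrock-agent").create_data_source(
#       knowledgeBaseId="KB_ID",
#       name="my-web-crawler-source",
#       dataSourceConfiguration=data_source_configuration,
#   )
```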