Amazon Kendra Web Crawler

You can use Amazon Kendra Web Crawler to crawl and index web pages.

You can crawl only public-facing websites and internal company websites that use Hypertext Transfer Protocol Secure (HTTPS). If you receive an error when crawling a website, the website might be blocking crawlers. To crawl internal websites, you can set up a web proxy, which must be public facing. You can also use authentication to access and crawl websites.
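
Because only HTTPS sites can be crawled, it can help to check a list of seed URLs before configuring the connector. The following is a minimal sketch using only the Python standard library; the function name `non_https_urls` and the example URLs are hypothetical and are not part of any Amazon Kendra API.

```python
from urllib.parse import urlparse


def non_https_urls(seed_urls):
    """Return the seed URLs whose scheme is not HTTPS."""
    return [url for url in seed_urls if urlparse(url).scheme.lower() != "https"]


# Hypothetical seed URLs, for illustration only.
seeds = ["https://example.com/docs", "http://intranet.example.com/wiki"]
print(non_https_urls(seeds))  # prints ['http://intranet.example.com/wiki']
```

Any URL this check returns would need to be served over HTTPS before it can be crawled.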

When selecting websites to index, you must adhere to the Amazon Acceptable Use Policy and all other Amazon terms. Use Amazon Kendra Web Crawler only to index your own web pages, or web pages that you have authorization to index. To learn how to stop Amazon Kendra Web Crawler from indexing your websites, see Configuring the robots.txt file for Amazon Kendra Web Crawler.
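
As an illustration of how robots.txt rules restrict a crawler, the sketch below evaluates a sample file offline with Python's standard `urllib.robotparser`. The user agent string `amazon-kendra`, the sample rules, and the helper name `is_allowed` are assumptions made for this example; check the robots.txt topic referenced above for the user agent that the crawler actually honors.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content; the user agent name is an assumption.
ROBOTS_TXT = """\
User-agent: amazon-kendra
Disallow: /private/

User-agent: *
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "amazon-kendra", "https://example.com/private/page.html"))
print(is_allowed(ROBOTS_TXT, "amazon-kendra", "https://example.com/public/page.html"))
```

With these sample rules, the first URL is disallowed for the assumed user agent and the second is allowed.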

Note

Abusing Amazon Kendra Web Crawler to aggressively crawl websites or web pages you don't own is not considered acceptable use.

Amazon Kendra has two versions of the web crawler connector. Supported features of each version include:

Amazon Kendra Web Crawler connector v1.0 / WebCrawlerConfiguration API

Web proxy
Inclusion/exclusion filters
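
A v1.0 configuration that uses both features might be assembled as in the sketch below. The field names follow the WebCrawlerConfiguration API as the author of this example understands it; the helper name `build_v1_configuration`, the hosts, and the patterns are hypothetical. The code only builds a dictionary and does not call AWS.

```python
def build_v1_configuration(seed_urls, proxy_host=None, proxy_port=443,
                           include=None, exclude=None):
    """Build the Configuration argument for a v1.0 web crawler data source."""
    web_crawler = {
        "Urls": {
            "SeedUrlConfiguration": {
                "SeedUrls": list(seed_urls),
                # HOST_ONLY limits the crawl to the host names of the seed URLs.
                "WebCrawlerMode": "HOST_ONLY",
            }
        }
    }
    if include:
        # Regular expression patterns for URLs to crawl.
        web_crawler["UrlInclusionPatterns"] = list(include)
    if exclude:
        # Regular expression patterns for URLs to skip.
        web_crawler["UrlExclusionPatterns"] = list(exclude)
    if proxy_host:
        web_crawler["ProxyConfiguration"] = {"Host": proxy_host, "Port": proxy_port}
    return {"WebCrawlerConfiguration": web_crawler}


config = build_v1_configuration(
    ["https://example.com"],
    proxy_host="proxy.example.com",
    include=[r".*/docs/.*"],
    exclude=[r".*\.pdf$"],
)

# With boto3 installed and credentials configured, the result could be
# passed along these lines (index ID and role ARN are placeholders):
#   import boto3
#   boto3.client("kendra").create_data_source(
#       IndexId="<index-id>", Name="my-web-crawler", Type="WEBCRAWLER",
#       Configuration=config, RoleArn="<role-arn>")
```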

Amazon Kendra Web Crawler connector v2.0 / TemplateConfiguration API

Field mappings
Inclusion/exclusion filters
Full and incremental content syncs
Web proxy
Basic, NTLM/Kerberos, SAML, and form authentication for your websites
Virtual private cloud (VPC)
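
Unlike v1.0, the v2.0 connector is configured through a JSON template passed in a TemplateConfiguration. The sketch below shows only the wrapping step; the helper name `build_v2_configuration` is hypothetical, and the template content is a placeholder rather than a valid Web Crawler v2.0 schema, which is defined in the v2.0 connector documentation.

```python
import json


def build_v2_configuration(template_json):
    """Wrap a JSON template string in the TemplateConfiguration structure."""
    template = json.loads(template_json)
    if not isinstance(template, dict):
        raise ValueError("The template must be a JSON object.")
    return {"TemplateConfiguration": {"Template": template}}


# Placeholder template; a real one must follow the v2.0 template schema.
placeholder_template = '{"placeholderKey": "placeholderValue"}'
config_v2 = build_v2_configuration(placeholder_template)

# With boto3, a template-based data source is created with Type="TEMPLATE":
#   boto3.client("kendra").create_data_source(
#       IndexId="<index-id>", Name="my-web-crawler-v2", Type="TEMPLATE",
#       Configuration=config_v2, RoleArn="<role-arn>")
```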

Important

Web Crawler v2.0 connector creation is not supported by CloudFormation. Use the Web Crawler v1.0 connector if you need CloudFormation support.

To troubleshoot your Amazon Kendra Web Crawler data source connector, see Troubleshooting data sources.
