Web Crawler integration
With Web Crawler integration in Amazon Quick Suite, you can create knowledge bases from website content by crawling and indexing web pages. This integration supports data ingestion capabilities with different authentication options based on your user tier.
What you can do
Web Crawler users can ask questions about content stored on websites and web pages. For example, users can ask about documentation sites and knowledge bases, or search for specific information across multiple web pages. The integration enables users to quickly access and understand information from web content, regardless of location or type, while providing contextual details such as publication dates, modification history, and page ownership, all of which contributes to more efficient information discovery and better-informed decision making.
Note
Web Crawler integration supports data ingestion only. It doesn't provide action capabilities for managing websites or web services.
Before you begin
Before you set up Web Crawler integration, make sure you have the following:
- Website URLs to crawl and index.
- Amazon Quick Suite Enterprise subscription.
- The website you want to crawl needs to be public and can't be behind a firewall or require special browser plugins to connect.
Prepare website access and authentication
Before setting up the integration in Amazon Quick Suite, prepare your website access credentials. Web Crawler integration supports different authentication methods based on your user role:
- No authentication
Available for all users. Use for crawling public websites that don't require authentication.
- Basic authentication
Standard HTTP Basic Authentication for secured websites. HTTP Basic Authentication is a simple way to protect web resources by requiring a username and password. When you visit a protected site using Basic Authentication, your browser will show a pop-up dialog box asking for your credentials.
Required credentials:
- Login page URL - The URL of the login page
- Username - Basic auth username
- Password - Basic auth password
- Form authentication
For websites that use HTML form-based login pages.
The form is set up for you to specify XPath expressions. XPath (XML Path Language) is a query language used to navigate through elements and attributes in an HTML or XML document. To identify the XPath for a web page element, use your browser's developer tools, typically accessed by right-clicking the desired element and selecting "Inspect" or pressing F12. Once the element is highlighted in the developer tools, right-click the corresponding HTML code, select "Copy," and then choose "Copy XPath" from the submenu. This generates a unique path that identifies the element's exact location in the document structure. The resulting XPath might look something like //input[@id='username'] or //button[@type='submit'], where the double forward slashes (//) indicate the path can start anywhere in the document, and the square brackets contain attributes that help identify the specific element.
Required information:
- Login page URL - URL of the login form (e.g., https://example.com/login)
- Username - Login username
- Password - Login password
- Username field XPath - XPath to the username input field (e.g., //input[@id='username'])
- Username button XPath (Optional) - XPath to the username button (e.g., //input[@id='username_button'])
- Password field XPath - XPath to the password input field (e.g., //input[@id='password'])
- Password button XPath - XPath to the password submit button (e.g., //button[@type='submit'])
- SAML authentication
For websites that use SAML-based single sign-on authentication.
SAML (Security Assertion Markup Language) authentication is a federated identity standard that enables single sign-on (SSO) by allowing users to authenticate through a centralized identity provider rather than entering credentials directly into each application. Unlike traditional form authentication, where users type their username and password into fields on the application's login page, SAML redirects users to their organization's identity provider (such as Microsoft Azure AD or Okta) to authenticate, then passes a secure token back to the application to grant access. This approach provides a seamless user experience across multiple applications, centralized user management for IT administrators, and enhanced security through features like multi-factor authentication, while form authentication requires separate credential management for each individual application.
Required information:
- Login page URL - URL of the SAML login page
- Username - SAML username
- Password - SAML password
- Username field XPath - XPath to the username input field (e.g., //input[@id='username'])
- Username button XPath (Optional) - XPath to the username button (e.g., //input[@id='username_button'])
- Password field XPath - XPath to the password input field (e.g., //input[@id='password'])
- Password button XPath - XPath to the password submit button (e.g., //button[@type='submit'])
XPath configuration examples
Use these XPath examples to configure form and SAML authentication:
Username field examples:
//input[@id='username']
//input[@name='user']
//input[@class='username-field']
Password field examples:
//input[@id='password']
//input[@name='pass']
//input[@type='password']
Submit button examples:
//button[@type='submit']
//input[@type='submit']
//button[contains(text(), 'Login')]
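If you want to confirm that your XPath expressions match the intended login form elements before you configure the integration, you can test them against the page's HTML. The following is a minimal sketch in Python, assuming the third-party lxml library is installed and that you have saved the login page's HTML locally; the file name and XPath values are placeholders you would replace with your own.

# test_xpaths.py - verify login-form XPath expressions before configuring the integration.
# Assumes lxml is installed (pip install lxml) and the login page's HTML was saved locally,
# for example via "View page source" in the browser. All values below are placeholders.
from lxml import html

with open("login_page.html", encoding="utf-8") as f:
    page = html.fromstring(f.read())

xpaths = {
    "Username field": "//input[@id='username']",
    "Password field": "//input[@id='password']",
    "Submit button": "//button[@type='submit']",
}

for label, expression in xpaths.items():
    matches = page.xpath(expression)
    # Exactly one match means the expression unambiguously identifies the element.
    print(f"{label}: {expression} -> {len(matches)} match(es)")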
Set up Web Crawler integration
After preparing your website access requirements, create the Web Crawler integration in Amazon Quick Suite.
- In the Amazon Quick Suite console, choose Integrations.
- Choose Web Crawler from the integration options, and then choose the Add (+) button.
- Choose Access data from Web Crawler. Web Crawler integration supports data access only; action execution is not available for web crawling.
- Configure the integration details and authentication method, then create knowledge bases as needed.
- Select the authentication type for your web crawler integration.
- Fill in the required details based on your selected authentication method.
- Select Create and continue.
- Fill in the Name and description for your knowledge base.
- Add the content URLs you want to crawl.
- Select Create.
- After you choose Create, the data sync starts automatically.
Configure crawling
You can configure which websites and pages to crawl and how to filter the content.
Configure URLs and content sources
Configure which websites and pages to crawl:
Direct URLs
Specify individual URLs to crawl:
https://example.com/docs
https://example.com/blog
https://example.com/support
Limit: Maximum 10 URLs per dataset
Content filters and crawl settings
Crawl scope settings
To view these settings, you must first set up a knowledge base and then open the advanced settings option.
- Crawl depth
Range: 0-10 (default: 1)
0 = crawl only the specified URLs
1 = include linked pages one level deep
Higher values follow links deeper into the site (see the sketch after these settings)
- Maximum links per page
Default: 1,000
Maximum: 1,000
Controls how many links to follow from each page
- Wait time
Default: 1
The amount of time the web crawler waits on each page after the page reaches the "page ready" state. This is useful for pages with dynamic JavaScript load behavior, where content blocks load after the main template. Increase the wait time if you have visually rich content or anticipate high load times.
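The following is a conceptual sketch, in Python, of how the crawl depth and maximum links per page settings bound a crawl together. It is not the Web Crawler's actual implementation; it walks a small, made-up in-memory link graph so the effect of each setting is easy to see.

# Conceptual illustration of crawl depth and maximum links per page.
# This is not the service's implementation; the link graph below is invented for the example.
from collections import deque

# Hypothetical site structure: each URL maps to the links found on that page.
link_graph = {
    "https://example.com/docs": ["https://example.com/docs/a", "https://example.com/docs/b"],
    "https://example.com/docs/a": ["https://example.com/docs/a/1"],
    "https://example.com/docs/b": [],
    "https://example.com/docs/a/1": [],
}

def crawl(seed_urls, max_depth=1, max_links_per_page=1000):
    """Breadth-first walk: depth 0 crawls only the seeds, depth 1 adds pages they link to."""
    visited = set()
    queue = deque((url, 0) for url in seed_urls)
    while queue:
        url, depth = queue.popleft()
        if url in visited:
            continue  # duplicate handling: each URL is crawled once
        visited.add(url)
        if depth < max_depth:
            # Follow at most max_links_per_page links from this page, one level deeper.
            for link in link_graph.get(url, [])[:max_links_per_page]:
                queue.append((link, depth + 1))
    return visited

print(crawl(["https://example.com/docs"], max_depth=0))  # seed URLs only
print(crawl(["https://example.com/docs"], max_depth=1))  # seeds plus directly linked pages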
Manage knowledge bases
After setting up your Web Crawler integration, you can create and manage knowledge bases from your crawled website content.
Edit existing knowledge bases
You can modify your existing Web Crawler knowledge bases:
- In the Amazon Quick Suite console, choose Knowledge bases.
- Select your Web Crawler knowledge base from the list.
- Choose the three-dot icon under Actions, then choose Edit knowledge base.
- Update your configuration settings as needed and choose Save.
Create additional knowledge bases
You can create multiple knowledge bases from the same Web Crawler integration:
- In the Amazon Quick Suite console, choose Integrations, and then select the Data tab.
- Choose your existing Web Crawler integration from the list.
- Choose the three-dot icon under Actions, then choose Create knowledge base.
- Configure your knowledge base settings and choose Create.
For detailed information about knowledge base configuration options, see Common configuration settings.
Attachments and file crawling
Control whether the system processes files and attachments linked from web pages:
- Enable file attachment crawling - Select this option to crawl and index files and attachments found on web pages, such as PDFs, documents, and media files.
Crawling behavior and sync configuration
Your Web Crawler integration follows these crawling practices:
Incremental sync model: First sync performs full crawl, subsequent syncs capture changes only
Automatic retry: Built-in retry logic for failed requests
Duplicate handling: Automatic detection and handling of duplicate URLs
Crawler identification: Identifies itself with user-agent string "aws-quick-on-behalf-of-<UUID>" in request headers
Robots.txt compliance
Web Crawler respects the robots.txt protocol and honors user-agent and allow/disallow directives. This enables you to control how the crawler accesses your site.
How robots.txt checking works
Host-level checking: Web Crawler reads robots.txt files at the host level (for example, example.com/robots.txt)
Multiple host support: For domains with multiple hosts, Web Crawler honors robots rules for each host separately
Fallback behavior: If Web Crawler cannot fetch robots.txt due to blocking, parsing errors, or timeouts, it will behave as if robots.txt does not exist and will crawl the site
Supported robots.txt fields
Web Crawler recognizes these robots.txt fields (field names are case-insensitive, values are case-sensitive):
user-agent - Identifies which crawler the rules apply to
allow - A URL path that may be crawled
disallow - A URL path that may not be crawled
sitemap - The complete URL of a sitemap
crawl-delay - The amount of time (in seconds) to wait between requests to your website
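To preview how a robots.txt file would be interpreted for particular URLs, you can evaluate it with Python's standard urllib.robotparser module. The robots.txt content, host, and paths below are examples only; the user-agent string is a placeholder based on the crawler identification described earlier.

# Preview how robots.txt directives would be interpreted, using only the standard library.
# The robots.txt content and URLs below are examples, not rules from a real site.
from urllib.robotparser import RobotFileParser

robots_txt = """
User-agent: *
Allow: /docs/
Disallow: /internal/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# The crawler identifies itself with a user-agent beginning "aws-quick-on-behalf-of";
# the suffix here is a placeholder for the UUID it appends.
user_agent = "aws-quick-on-behalf-of-example"

for url in ("https://example.com/docs/start", "https://example.com/internal/admin"):
    print(url, "->", "allowed" if parser.can_fetch(user_agent, url) else "disallowed")

print("crawl delay (seconds):", parser.crawl_delay(user_agent))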
Meta tag support
Web Crawler supports page-level robots meta tags that you can use to control how your data is used. You can specify page-level settings by including a meta tag on HTML pages or in an HTTP header.
Supported meta tags
noindex - Do not index the page. If you don't specify this rule, the page may be indexed and eligible to appear in experiences
nofollow - Do not follow the links on this page. If you don't specify this rule, Web Crawler may use the links on the page to discover those linked pages
You can combine multiple values using a comma (for example, "noindex, nofollow").
Note
To detect meta tags, Web Crawler needs to access your page, so don't block the page with robots.txt; blocking it prevents the page from being recrawled and its meta tags from being detected.
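To check which page-level directives a page currently exposes, you can inspect both the X-Robots-Tag HTTP header and the robots meta tag yourself. The sketch below uses Python's standard library only; the URL is a placeholder, and the simple pattern match is for illustration rather than a complete HTML parser.

# Check a page for robots directives in the X-Robots-Tag header and the robots meta tag.
# The URL is a placeholder; the meta-tag check is a simplified illustration.
import re
import urllib.request

url = "https://example.com/docs/start"
with urllib.request.urlopen(url) as response:
    header_value = response.headers.get("X-Robots-Tag")  # e.g. "noindex, nofollow"
    body = response.read().decode("utf-8", errors="replace")

# Looks for tags such as: <meta name="robots" content="noindex, nofollow">
meta = re.search(r'<meta[^>]+name=["\']robots["\'][^>]+content=["\']([^"\']+)["\']',
                 body, re.IGNORECASE)

print("X-Robots-Tag header:", header_value)
print("robots meta tag:", meta.group(1) if meta else None)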
Troubleshooting
Use this section to resolve common issues with Web Crawler integration.
Authentication failures
Symptoms:
"Unable to authenticate" error messages
401/403 HTTP responses
Login page redirect loops
Session timeout errors
Resolution steps:
- Verify the site is reachable from the AWS Region where your Amazon Quick Suite instance is set up
- Verify credentials are accurate and ensure they haven't expired
- Check authentication endpoint availability and accessibility
- Validate XPath configurations by testing them in your browser's developer tools
- Review browser network logs to understand the authentication flow
- Ensure the login page URL is correct and accessible
- Test authentication manually using the same credentials (see the sketch after this list)
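One way to test authentication manually, as suggested in the last step, is to send a single authenticated request from outside the integration and confirm that you receive a 200 response rather than a 401 or 403. The sketch below assumes HTTP Basic Authentication and the third-party requests library; the URL and credentials are placeholders.

# Manually verify HTTP Basic Authentication credentials against a protected URL.
# Assumes the requests package is installed (pip install requests); values are placeholders.
import requests

url = "https://example.com/protected/docs"
username = "crawler-user"
password = "example-password"

response = requests.get(url, auth=(username, password), timeout=30)

# 200 means the credentials work; 401/403 points to an authentication problem,
# and a redirect chain back to the login page can indicate a login-flow or session issue.
print(response.status_code)
print(response.url)  # final URL after any redirects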
Access and connectivity issues
Symptoms:
Connection timeouts and network errors
Network unreachable errors
DNS resolution failures
Resolution steps:
- Verify network connectivity to target websites
- Validate site accessibility (a connectivity check sketch follows this list):
Check DNS resolution for target domains
Verify SSL/TLS configuration and certificates
Test access from different networks if possible
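To work through the DNS and SSL/TLS checks above outside the integration, a short script can confirm that the hostname resolves and that its certificate validates against the system trust store. The sketch below uses only Python's standard library; the hostname is a placeholder.

# Check DNS resolution and TLS certificate validation for a target host.
# The hostname is a placeholder; run this from a network comparable to the one the crawler uses.
import socket
import ssl

hostname = "example.com"

# DNS resolution: raises socket.gaierror if the name cannot be resolved.
addresses = {info[4][0] for info in socket.getaddrinfo(hostname, 443)}
print("resolved addresses:", addresses)

# TLS handshake with certificate verification against the system trust store.
context = ssl.create_default_context()
with socket.create_connection((hostname, 443), timeout=10) as sock:
    with context.wrap_socket(sock, server_hostname=hostname) as tls:
        cert = tls.getpeercert()
        print("TLS version:", tls.version())
        print("certificate subject:", cert.get("subject"))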
Crawl and content issues
Symptoms:
Missing or incomplete content
Incomplete crawls or early termination
Rate limiting errors (429 responses)
Content not being indexed properly
Resolution steps:
- Review robots.txt restrictions:
Check the robots.txt file for crawl restrictions
Verify the crawler is allowed to access target paths
Ensure robots.txt compliance isn't blocking content
- Check rate limiting and throttling:
Monitor response headers for rate limit information
Implement appropriate crawl delays
- Verify URL patterns and filters:
Test regex patterns for accuracy
Check URL formatting and structure
Validate include/exclude pattern logic
- Review content restrictions:
Check for noindex meta tags on pages
Verify content type support
Ensure content size is within limits
- Update the Wait time to an appropriate value so the content loads on the page before the crawler tries to crawl it
Known limitations
Web Crawler integration has the following limitations:
URL limits: Maximum of 10 URLs; sitemaps are not supported
Crawl depth: Maximum crawl depth of 10 levels
Security requirements: HTTPS required for web proxy configurations