Connecting Amazon Q Business to Web Crawler using APIs
To connect Amazon Q Business to Web Crawler using the Amazon Q API, call CreateDataSource. Use this API to provide:

- a name and tags for your data source
- the Amazon Resource Name (ARN) of an IAM role with permission to access the data source and required resources
- a sync schedule for Amazon Q to check the documents in your data source
- an Amazon VPC configuration

For more information on available parameters, see CreateDataSource in the Amazon Q API reference.
Provide the seed or starting point URLs, or the sitemap URLs, as part of the connection configuration or repository endpoint details. If your websites require authentication, also specify the authentication type and the website authentication credentials, along with any other necessary configuration.
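As a sketch of how these pieces fit together, the following builds a Web Crawler configuration document whose keys follow the JSON schema shown later in this topic. The seed URL, field mapping names, and the commented-out boto3 call details (IDs, role ARN) are placeholder assumptions, not real resources; check the CreateDataSource API reference for the authoritative parameter list.

```python
import json

# A minimal Web Crawler configuration document. Key names follow the
# JSON schema in this topic; the values below are placeholders.
configuration = {
    "type": "WEBCRAWLERV2",
    "syncMode": "FULL_CRAWL",
    "connectionConfiguration": {
        "repositoryEndpointMetadata": {
            "seedUrlConnections": [{"seedUrl": "https://example.com"}],
            "authentication": "NoAuthentication",
        }
    },
    "repositoryConfigurations": {
        "webPage": {
            "fieldMappings": [
                {
                    # Map the HTML title of a page to the reserved
                    # _document_title index field.
                    "indexFieldName": "_document_title",
                    "indexFieldType": "STRING",
                    "dataSourceFieldName": "page_title",
                }
            ]
        }
    },
    "additionalProperties": {
        "rateLimit": "300",
        "maxFileSize": "50",
        "crawlDepth": "2",
        "maxLinksPerUrl": "100",
        "crawlSubDomain": False,
        "crawlAllDomain": False,
        "honorRobots": True,
    },
}

# The document is then passed to CreateDataSource, for example via boto3
# (IDs and ARNs below are placeholders):
#
# boto3.client("qbusiness").create_data_source(
#     applicationId="my-app-id",
#     indexId="my-index-id",
#     displayName="my-web-crawler",
#     roleArn="arn:aws:iam::123456789012:role/my-datasource-role",
#     configuration=configuration,
# )

print(json.dumps(configuration, indent=2))
```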
Web Crawler JSON schema
The following is the Web Crawler JSON schema:
```json
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "properties": {
    "connectionConfiguration": {
      "type": "object",
      "properties": {
        "repositoryEndpointMetadata": {
          "type": "object",
          "properties": {
            "siteMapUrls": {
              "type": "array",
              "items": {
                "type": "string",
                "pattern": "https://.*"
              }
            },
            "s3SeedUrl": {
              "type": ["string", "null"],
              "pattern": "s3:.*"
            },
            "s3SiteMapUrl": {
              "type": ["string", "null"],
              "pattern": "s3:.*"
            },
            "seedUrlConnections": {
              "type": "array",
              "items": [
                {
                  "type": "object",
                  "properties": {
                    "seedUrl": {
                      "type": "string",
                      "pattern": "https://.*"
                    }
                  },
                  "required": ["seedUrl"]
                }
              ]
            },
            "authentication": {
              "type": "string",
              "enum": [
                "NoAuthentication",
                "BasicAuth",
                "NTLM_Kerberos",
                "Form",
                "SAML"
              ]
            }
          }
        }
      },
      "required": ["repositoryEndpointMetadata"]
    },
    "repositoryConfigurations": {
      "type": "object",
      "properties": {
        "webPage": {
          "type": "object",
          "properties": {
            "fieldMappings": {
              "type": "array",
              "items": [
                {
                  "type": "object",
                  "properties": {
                    "indexFieldName": { "type": "string" },
                    "indexFieldType": {
                      "type": "string",
                      "enum": ["STRING", "DATE", "LONG"]
                    },
                    "dataSourceFieldName": { "type": "string" },
                    "dateFieldFormat": {
                      "type": "string",
                      "pattern": "yyyy-MM-dd'T'HH:mm:ss'Z'"
                    }
                  },
                  "required": [
                    "indexFieldName",
                    "indexFieldType",
                    "dataSourceFieldName"
                  ]
                }
              ]
            }
          },
          "required": ["fieldMappings"]
        },
        "attachment": {
          "type": "object",
          "properties": {
            "fieldMappings": {
              "type": "array",
              "items": [
                {
                  "type": "object",
                  "properties": {
                    "indexFieldName": { "type": "string" },
                    "indexFieldType": {
                      "type": "string",
                      "enum": ["STRING", "DATE", "LONG"]
                    },
                    "dataSourceFieldName": { "type": "string" },
                    "dateFieldFormat": {
                      "type": "string",
                      "pattern": "yyyy-MM-dd'T'HH:mm:ss'Z'"
                    }
                  },
                  "required": [
                    "indexFieldName",
                    "indexFieldType",
                    "dataSourceFieldName"
                  ]
                }
              ]
            }
          },
          "required": ["fieldMappings"]
        }
      }
    },
    "syncMode": {
      "type": "string",
      "enum": ["FORCED_FULL_CRAWL", "FULL_CRAWL"]
    },
    "additionalProperties": {
      "type": "object",
      "properties": {
        "rateLimit": { "type": "string", "default": "300" },
        "maxFileSize": { "type": "string", "default": "50" },
        "maxFileSizeInMegaBytes": { "type": "string" },
        "crawlDepth": { "type": "string", "default": "2" },
        "maxLinksPerUrl": { "type": "string", "default": "100" },
        "crawlSubDomain": { "type": "boolean", "default": false },
        "crawlAllDomain": { "type": "boolean", "default": false },
        "honorRobots": { "type": "boolean", "default": false },
        "crawlAttachments": { "type": "boolean", "default": false },
        "inclusionURLCrawlPatterns": { "type": "array", "items": { "type": "string" } },
        "exclusionURLCrawlPatterns": { "type": "array", "items": { "type": "string" } },
        "inclusionURLIndexPatterns": { "type": "array", "items": { "type": "string" } },
        "exclusionURLIndexPatterns": { "type": "array", "items": { "type": "string" } },
        "inclusionFileIndexPatterns": { "type": "array", "items": { "type": "string" } },
        "exclusionFileIndexPatterns": { "type": "array", "items": { "type": "string" } },
        "proxy": {
          "type": "object",
          "properties": {
            "host": { "type": "string" },
            "port": { "type": "string" },
            "secretArn": { "type": "string", "minLength": 20, "maxLength": 2048 }
          }
        }
      },
      "required": [
        "rateLimit",
        "maxFileSize",
        "crawlDepth",
        "crawlSubDomain",
        "crawlAllDomain",
        "maxLinksPerUrl",
        "honorRobots"
      ]
    },
    "type": {
      "type": "string",
      "enum": ["WEBCRAWLERV2", "WEBCRAWLER"]
    },
    "secretArn": {
      "type": "string",
      "minLength": 20,
      "maxLength": 2048
    }
  },
  "version": {
    "type": "string",
    "anyOf": [{ "pattern": "1.0.0" }]
  },
  "required": [
    "connectionConfiguration",
    "repositoryConfigurations",
    "syncMode",
    "type",
    "additionalProperties"
  ]
}
```
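Before calling CreateDataSource, it can be useful to sanity-check a configuration document against the schema's top-level requirements. The following is a stdlib-only sketch of such a check (a full validation would use a JSON Schema draft-04 validator); the required keys and enum values are taken from the schema above.

```python
# Minimal shape check of a Web Crawler configuration document against the
# top-level "required", "type", and "syncMode" rules of the schema above.

REQUIRED_TOP_LEVEL = [
    "connectionConfiguration",
    "repositoryConfigurations",
    "syncMode",
    "type",
    "additionalProperties",
]
VALID_TYPES = {"WEBCRAWLERV2", "WEBCRAWLER"}
VALID_SYNC_MODES = {"FORCED_FULL_CRAWL", "FULL_CRAWL"}

def check_config(config: dict) -> list[str]:
    """Return a list of problems; an empty list means the basic shape is OK."""
    problems = [f"missing required key: {k}"
                for k in REQUIRED_TOP_LEVEL if k not in config]
    if config.get("type") not in VALID_TYPES:
        problems.append("type must be one of: " + ", ".join(sorted(VALID_TYPES)))
    if config.get("syncMode") not in VALID_SYNC_MODES:
        problems.append("syncMode must be one of: " + ", ".join(sorted(VALID_SYNC_MODES)))
    return problems

# An empty document fails all five required-key checks plus both enum checks:
assert len(check_config({})) == 7
```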
The following provides information about important JSON keys to configure.
| Configuration | Description |
|---|---|
| connectionConfiguration | Configuration information for the endpoint for the data source. |
| repositoryEndpointMetadata | The endpoint information for the data source. |
| siteMapUrls | The list of sitemap URLs for the websites that you want to crawl. You can list up to three sitemap URLs. |
| s3SeedUrl | The S3 path to the text file that stores the list of seed or starting point URLs, for example, s3://bucket-name/directory/. Each URL in the text file must be on a separate line. You can list up to 100 seed URLs in a file. |
| s3SiteMapUrl | The S3 path to the sitemap XML files, for example, s3://bucket-name/directory/. You can list up to three sitemap XML files. You can combine multiple sitemap files into a .zip file and store the .zip file in your Amazon S3 bucket. |
| seedUrlConnections | The list of seed or starting point URLs for the websites that you want to crawl. You can list up to 100 seed URLs. |
| seedUrl | The seed or starting point URL. |
| authentication | The authentication type if your websites require authentication; otherwise, specify NoAuthentication. |
| repositoryConfigurations | Configuration information for the content of the data source, for example, configuring specific types of content and field mappings. |
| fieldMappings | A list of objects that map the attributes or field names of your webpages and webpage files to Amazon Q index field names. For example, the HTML webpage title tag can be mapped to the _document_title index field. |
| syncMode | Specify whether Amazon Q should update your index by syncing all documents or only new, modified, and deleted documents. You can choose between FORCED_FULL_CRAWL, to freshly re-crawl all content and replace existing content each time your data source syncs with your index, and FULL_CRAWL, to incrementally sync only new, modified, and deleted content each time your data source syncs with your index. |
| additionalProperties | Additional configuration options for your content in your data source. |
| rateLimit | The maximum number of URLs crawled per website host per minute. |
| maxFileSize | The maximum size (in MB) of a webpage or attachment to crawl. |
| maxFileSizeInMegaBytes | The maximum single file size limit, in MB, that Amazon Q will crawl. Amazon Q crawls only the files within the size limit you define. The default file size is 50 MB. The maximum file size must be greater than 0 MB and less than or equal to 50 MB. |
| crawlDepth | The number of levels from the seed URL to crawl. For example, the seed URL page is depth 1, and any hyperlinks on this page that are also crawled are depth 2. |
| maxLinksPerUrl | The maximum number of URLs on a webpage to include when crawling a website. This number is per webpage. As a website's webpages are crawled, any URLs that the webpages link to are also crawled. URLs on a webpage are crawled in order of appearance. |
| crawlSubDomain | true to crawl the website domains with subdomains only. For example, if the seed URL is "abc.example.com", then "a.abc.example.com" and "b.abc.example.com" are also crawled. If you don't set crawlSubDomain or crawlAllDomain to true, then Amazon Q crawls only the domains of the websites that you want to crawl. |
| crawlAllDomain | true to crawl the website domains with subdomains and other domains that the webpages link to. If you don't set crawlSubDomain or crawlAllDomain to true, then Amazon Q crawls only the domains of the websites that you want to crawl. |
| honorRobots | true to respect the robots.txt directives of the websites that you want to crawl. These directives control how Amazon Q Web Crawler crawls the websites, and whether Amazon Q can crawl only specific content or not crawl any content. |
| crawlAttachments | true to crawl files that the webpages link to. |
| inclusionURLCrawlPatterns | A list of regular expression patterns to include crawling certain URLs and indexing any hyperlinks on these URL webpages. URLs that match the patterns are included in the index. URLs that don't match the patterns are excluded from the index. If a URL matches both an inclusion and an exclusion pattern, the exclusion pattern takes precedence, and the URL and the website's webpages aren't included in the index. |
| exclusionURLCrawlPatterns | A list of regular expression patterns to exclude crawling certain URLs and indexing any hyperlinks on these URL webpages. URLs that match the patterns are excluded from the index. URLs that don't match the patterns are included in the index. If a URL matches both an inclusion and an exclusion pattern, the exclusion pattern takes precedence, and the URL and the website's webpages aren't included in the index. |
| inclusionFileIndexPatterns | A list of regular expression patterns to include certain webpage files. Files that match the patterns are included in the index. Files that don't match the patterns are excluded from the index. If a file matches both an inclusion and an exclusion pattern, the exclusion pattern takes precedence, and the file isn't included in the index. |
| exclusionFileIndexPatterns | A list of regular expression patterns to exclude certain webpage files. Files that match the patterns are excluded from the index. Files that don't match the patterns are included in the index. If a file matches both an inclusion and an exclusion pattern, the exclusion pattern takes precedence, and the file isn't included in the index. |
| proxy | Configuration information required to connect to your internal websites through a web proxy. |
| host | The host name of the proxy server that you want to use to connect to internal websites. For example, the host name of https://a.example.com/page1.html is "a.example.com". |
| port | The port number of the proxy server that you want to use to connect to internal websites. For example, 443 is the standard port for HTTPS. |
| secretArn (proxy) | If web proxy credentials are required to connect to a website host, you can create an AWS Secrets Manager secret that stores the credentials. Provide the Amazon Resource Name (ARN) of the secret. |
| type | The type of data source. Specify WEBCRAWLERV2 as your data source type. |
| secretArn | The Amazon Resource Name (ARN) of an AWS Secrets Manager secret that's used if your websites require authentication. Store the authentication credentials for the website in the secret as JSON key-value pairs. If you use basic or NTLM/Kerberos authentication, the secret must contain JSON keys for the user name and password. If you use SAML or form authentication, the secret must contain JSON keys for the user name and password, the XPath of the user name field (and of the user name button, if using SAML), the XPaths of the password field and button, and the login page URL. Amazon Q also checks that the endpoint information (seed URLs) included in the secret is the same as the endpoint information specified in your data source endpoint configuration details. |
| version | The version of this template that's currently supported. |
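The inclusion/exclusion pattern semantics described above, where an exclusion pattern takes precedence when a URL matches both, can be sketched in Python. The pattern lists below are illustrative assumptions, not connector defaults:

```python
import re

def should_index(url: str,
                 inclusion_patterns: list[str],
                 exclusion_patterns: list[str]) -> bool:
    """Mimic the documented pattern semantics: a URL matching any exclusion
    pattern is dropped even if it also matches an inclusion pattern; when
    inclusion patterns are given, a URL must match at least one of them."""
    if any(re.search(p, url) for p in exclusion_patterns):
        return False  # exclusion takes precedence
    if inclusion_patterns:
        return any(re.search(p, url) for p in inclusion_patterns)
    return True  # no patterns configured: nothing is filtered out

# Illustrative patterns (assumptions for the example):
include = [r"https://docs\.example\.com/.*"]
exclude = [r".*\.pdf$"]

assert should_index("https://docs.example.com/guide.html", include, exclude)
# Matches both an inclusion and an exclusion pattern; exclusion wins:
assert not should_index("https://docs.example.com/guide.pdf", include, exclude)
# Matches no inclusion pattern:
assert not should_index("https://blog.example.com/post", include, exclude)
```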