Web Crawler configuration properties Web Crawler JSON schema Web Crawler YAML schema

Connecting Amazon Q Business to Web Crawler using AWS CloudFormation

You use the AWS::QBusiness::DataSource resource to connect a data source to your Amazon Q application.

Use the configuration property to provide a JSON or YAML schema with the necessary configuration details specific to your data source connector.

Topics

Web Crawler configuration properties
Web Crawler JSON schema for using the configuration property with AWS CloudFormation
Web Crawler YAML schema for using the configuration property with AWS CloudFormation

Web Crawler configuration properties

The following provides information about important configuration properties required in the schema.

Configuration	Description	Type	Required
`type`	The type of data source.	`string` The only allowed values are `WEBCRAWLERV2` (Recommended) `WEBCRAWLER`	Yes
`syncMode`	Specify whether Amazon Q should update your index by syncing all documents or only new, modified, and deleted documents.	`string` You can choose between the following options: Use `FORCED_FULL_CRAWL` to freshly re-crawl all content and replace existing content each time your data source syncs with your index. Use `FULL_CRAWL` to incrementally crawl only new, modified, and deleted content each time your data source syncs with your index.	Yes
`connectionConfiguration`	Configuration information for the endpoint for the data source.	`object` This property has the sub-property `repositoryEndpointMetadata`.	Yes
`repositoryEndpointMetadata`	The endpoint information for the data source. This is a sub-property for the `connectionConfiguration`.	`object` This property has the following sub-properties `authentication` `seedUrlConnections` `s3SeedUrl` `siteMapUrls` `s3SiteMapUrl`	Yes
`authentication`	The authentication type if your websites require the same authentication. This is a sub-property for the `repositoryEndpointMetadata`.	`string` Depending on your connection type, the allowed values are `NoAuthentication`, `BasicAuth`, `NTLM_Kerberos`, `Form`, `SAML`.	Yes
`seedUrlConnections`	The list of seed or starting point URLs for the websites that you want to crawl. You can list up to 100 seed URLs. This is a sub-property for the `repositoryEndpointMetadata`.	`array` This is an array of `seedUrl`s. Use the pattern: [`https://.*`].	No
`seedUrl`	The seed or starting point URL for the websites that you want to crawl. This is a sub-property for the `seedUrlConnections`.	`string` Use the pattern: [`https://.*`].	Yes
`s3SeedUrl`	The S3 path to the text file that stores the list of seed or starting point URLs. This is a sub-property for the `repositoryEndpointMetadata`.	`string` Use the pattern `s3://bucket-name/directory/` . Each URL in the text file must be formatted on a separate line. You can list up to 100 seed URLs in a file.	No
`siteMapUrls`	The list of sitemap URLs for the websites that you want to crawl. This is a sub-property for the `repositoryEndpointMetadata`.	`array` This is an array of `siteMapUrls`. You can list up to three sitemap URLs. Use the pattern: [`https://.*`].	No
`s3SiteMapUrl`	The S3 path to the sitemap XML files. This is a sub-property for the `repositoryEndpointMetadata`.	`string` Use the pattern, s3://bucket-name/directory/. You can list up to three sitemap XML files. You can club together multiple sitemap files into a .zip file and store the .zip file in your Amazon S3 bucket.	No
`repositoryConfigurations`	Configuration information for the content of the data source. For example, configuring specific types of content and field mappings.	`object` This property has the following sub-properties `webPage` `attachment`	Yes
`webPage` `attachments`	A list of objects that map the attributes or field names of your webpages and attachments to Amazon Q index field names.	`object` These properties has the following sub-properties `indexFieldName` `indexFieldType` `dataSourceFieldName` `dateFieldFormat`	No
`indexFieldName`	The name of the index field. This is a sub-property for `webPage` and `attachments`.	`string`	Yes
`indexFieldType`	The type of the index field. This is a sub-property for `webPage` and `attachments`.	`string` The only allowed value are `STRING` `DATE` `LONG`	Yes
`dataSourceFieldName`	The field name of the data source. This is a sub-property for `webPage` and `attachments`.	`string`	Yes
`dateFieldFormat`	The field date of the data source. This is a sub-property for `webPage` and `attachments`.	`string` Use the pattern `yyyy-MM-dd'T'HH:mm:ss'Z'`	No
`additionalProperties`	Additional configuration options for your content in your data source.	`object` This property has the following sub-properties that are not required `rateLimit` `metadataFilesPrefix`. `maxFileSize` `maxFileSizeInMegaBytes` `crawlDepth` `maxLinksPerUrl` `honorRobots` `crawlSubDomain` `crawlAllDomain` `crawlAttachments` `maxFileSizeInMegaBytes` `inclusionURLCrawlPatterns` `exclusionURLCrawlPatterns` `inclusionURLIndexPatterns` `exclusionURLIndexPatterns` `inclusionFileIndexPatterns` `exclusionFileIndexPatterns` `proxy`	Yes
`rateLimit`	The maximum number of URLs crawled per website host per minute. This is a sub-property of `additionalProperties`.	`string` The default value is `300`	Yes
`maxFileSize`	The maximum size (in MB) of a webpage or attachment to crawl. This is a sub-property of `additionalProperties`.	`string` The default value is `50`	Yes
`crawlDepth`	The number of levels from the seed URL to crawl. This is a sub-property of `additionalProperties`.	`string` The seed URL page is depth `1` and any hyperlinks on this page that are also crawled are depth `2`. The default value is `2`.	Yes
`maxLinksPerUrl`	The maximum number of URLs on a webpage to include when crawling a website. This number is per webpage. As a website's webpages are crawled, any URLs that the webpages link to also are crawled. URLs on a webpage are crawled in order of appearance. This is a sub-property of `additionalProperties`.	`string` The default value is `100`.	Yes
`honorRobots`	`true` to respect the robots.txt directives of the websites that you want to crawl. These directives control how Amazon Q Web Crawler crawls the websites, and whether Amazon Q can crawl only specific content or not crawl any content. This is a sub-property of `additionalProperties`. Note The `honorRobots` feature is currently only available if you use the API.	`boolean`	Yes
`crawlSubDomain`	`true` to crawl the website domains with subdomains only. This is a sub-property of `additionalProperties`.	`boolean` If the seed URL is "abc.example.com", then "a.abc.example.com" and "b.abc.example.com" are also crawled. If you don't set `crawlSubDomain` or `crawlAllDomain` to `true`, then Amazon Q only crawls the domains of the websites that you want to crawl.	Yes
`crawlAllDomain`	Crawl the website domains with subdomains and other domains the web pages link to. This is a sub-property of `additionalProperties`.	`boolean` If you don't set `crawlSubDomain` or `crawlAllDomain` to `true`, then Amazon Q only crawls the domains of the websites that you want to crawl.	Yes
`crawlAttachments`	`true` to crawl files that the webpages link to. This is a sub-property of `additionalProperties`.	`boolean`	Yes
`maxFileSizeInMegaBytes`	Specify the maximum single file size limit in MBs that Amazon Q will crawl. Amazon Q will crawl only the files within the size limit you define. This is a sub-property of `additionalProperties`.	`string` The default value is `50`. The maximum file size should be greater than `0` and less than or equal to `50`.	No
`inclusionURLCrawlPatterns` `inclusionURLIndexPatterns`	These are sub-properties of `additionalProperties`. A list of regular expression patterns to include crawling certain URLs and indexing any hyperlinks on these URL webpages. URLs that match the patterns are included in the index. URLs that don't match the patterns are excluded from the index. If a URL matches both an inclusion and exclusion pattern, the exclusion pattern takes precedence, and the URL and website's webpages aren't included in the index.	`array`(`string`)	No
`exclusionURLCrawlPatterns` `exclusionURLIndexPatterns`	These are sub-properties of `additionalProperties`. A list of regular expression patterns to exclude crawling certain URLs and indexing any hyperlinks on these URL webpages. URLs that match the patterns are excluded from the index. URLs that don't match the patterns are included in the index. If a URL matches both an inclusion and exclusion pattern, the exclusion pattern takes precedence, and the URL/website's webpages aren't included in the index.	`array`(`string`)	No
`inclusionFileIndexPatterns`	This is a sub-property of `additionalProperties`. A list of regular expression patterns to include certain web page files. Files that match the patterns are included in the index. Files that don't match the patterns are excluded from the index. If a file matches both an inclusion and exclusion pattern, the exclusion pattern takes precedence, and the file isn't included in the index.	`array`(`string`)	No
`exclusionFileIndexPatterns`	This is a sub-property of `additionalProperties`. A list of regular expression patterns to exclude certain webpage files. Files that match the patterns are excluded from the index. Files that don't match the patterns are included in the index. If a file matches both an inclusion and exclusion pattern, the exclusion pattern takes precedence, and the file isn't included in the index.	`array`(`string`)	No
`proxy`	This is a sub-property of `additionalProperties`. Configuration information required to connect to your internal websites through a web proxy.	`object`	No
`host`	This is a sub-property of `proxy`. The host name of the proxy server that you want to use to connect to internal websites. For example, the host name of https://a.example.com/page1.html is `a.example.com`.	`string`	No
`port`	This is a sub-property of `proxy`. The port number of the proxy server that you want to use to connect to internal websites. For example, the port 443 would be `443`.	`string`	No
`secretArn`	This is a sub-property of `proxy`. If web proxy credentials are required to connect to a website host, you can create an AWS Secrets Manager secret that stores the credentials. Provide the Amazon Resource Name (ARN) of the secret.	`string` The minimum length is 20and the maximum length is 2,048 characters The JSON structure for this is `{ "username": string, "password": string }`	No
`secretArn`	The Amazon Resource Name (ARN) of an AWS Secrets Manager secret that's used if your websites require authentication to access the websites. You store the authentication credentials for the website in the secret that contains JSON key-value pairs. If you use basic, or NTLM/Kerberos, enter the username and password. The JSON keys in the secret must be `userName` and `password`. NTLM authentication protocol includes password hashing, and Kerberos authentication protocol includes password encryption. If you use SAML or form authentication, enter the username and password, XPath for the username field (and username button if using SAML), XPaths for the password field and button, and the login page URL. The JSON keys in the secret must be `userName`, `password`, `userNameFieldXpath`, `userNameButtonXpath`, `passwordFieldXpath`, `passwordButtonXpath`, and `loginPageUrl`. You can find the XPaths (XML Path Language) of elements using your web browser's developer tools. XPaths usually follow this format: `//tagname[@Attribute='Value']`. Amazon Q also checks if the endpoint information (seed URLs) included in the secret is the same the endpoint information specified in your data source endpoint configuration details.	If you use `seedUrlConnections` or `s3SeedUrl` as the connection type, you can choose from all authentication values (`NoAuthentication`, `BasicAuth`, `NTLM_Kerberos`, `Form`, `SAML`) depending on the use case. If you use `siteMapUrls` or `s3SiteMapUrl` as connection type, the `authentication` should be `NoAuthentication`. You must use the following JSON structure for your `authentication` values. For `BasicAuth`/`NTLM_Kerberos` `{ "username": String, "password": String }` For `Form`/`SAML` `{ "loginPageUrl": String, "userName": String, "password": String, "userNameFieldXpath": String, "passwordFieldXpath": String, "userNameButtonXpath": String, "passwordButtonXpath": String }` Note XML Path Language (XPaths) of elements can be found using the web browser's developer tools. XPaths usually follow this format: `//tagname[@Attribute='Value'`].	No
`version`	The version of this template that's currently supported.	`string` The default value is `1.0.0`.	No

Web Crawler JSON schema for using the configuration property with AWS CloudFormation

The following is the Web Crawler JSON schema and examples for the configuration property for AWS CloudFormation.

Topics

Web Crawler JSON schema for using the configuration property with AWS CloudFormation
Web Crawler JSON schema example for using the configuration property with AWS CloudFormation

Web Crawler JSON schema for using the configuration property with AWS CloudFormation

The following is the Web Crawler JSON schema for the configuration property for AWS CloudFormation


{
  "type": "object",
  "properties": {
    "type": {
      "type": "string",
      "enum": ["WEBCRAWLERV2", "WEBCRAWLER"]
    },
    "syncMode": {
      "type": "string",
      "enum": ["FORCED_FULL_CRAWL", "FULL_CRAWL"]
    },
    "secretArn": {
      "type": "string",
      "minLength": 20,
      "maxLength": 2048
    },
    "connectionConfiguration": {
      "type": "object",
      "properties": {
        "repositoryEndpointMetadata": {
          "type": "object",
          "properties": {
            "siteMapUrls": {
              "type": "array",
              "items": {
                "type": "string",
                "pattern": "https://.*"
              }
            },
            "s3SeedUrl": {
              "type": ["string", "null"],
              "pattern": "s3:.*"
            },
            "s3SiteMapUrl": {
              "type": ["string", "null"],
              "pattern": "s3:.*"
            },
            "seedUrlConnections": {
              "type": "array",
              "items": [
                {
                  "type": "object",
                  "properties": {
                    "seedUrl": {
                      "type": "string",
                      "pattern": "https://.*"
                    }
                  },
                  "required": ["seedUrl"]
                }
              ]
            },
            "authentication": {
              "type": "string",
              "enum": [
                "NoAuthentication",
                "BasicAuth",
                "NTLM_Kerberos",
                "Form",
                "SAML"
              ]
            }
          }
        }
      },
      "required": ["repositoryEndpointMetadata"]
    },
    "repositoryConfigurations": {
      "type": "object",
      "properties": {
        "webPage": {
          "type": "object",
          "properties": {
            "fieldMappings": {
              "type": "array",
              "items": [
                {
                  "type": "object",
                  "properties": {
                    "indexFieldName": {
                      "type": "string"
                    },
                    "indexFieldType": {
                      "type": "string",
                      "enum": ["STRING", "DATE", "LONG"]
                    },
                    "dataSourceFieldName": {
                      "type": "string"
                    },
                    "dateFieldFormat": {
                      "type": "string",
                      "pattern": "yyyy-MM-dd'T'HH:mm:ss'Z'"
                    }
                  },
                  "required": [
                    "indexFieldName",
                    "indexFieldType",
                    "dataSourceFieldName"
                  ]
                }
              ]
            }
          },
          "required": ["fieldMappings"]
        },
        "attachment": {
          "type": "object",
          "properties": {
            "fieldMappings": {
              "type": "array",
              "items": [
                {
                  "type": "object",
                  "properties": {
                    "indexFieldName": {
                      "type": "string"
                    },
                    "indexFieldType": {
                      "type": "string",
                      "enum": ["STRING", "DATE", "LONG"]
                    },
                    "dataSourceFieldName": {
                      "type": "string"
                    },
                    "dateFieldFormat": {
                      "type": "string",
                      "pattern": "yyyy-MM-dd'T'HH:mm:ss'Z'"
                    }
                  },
                  "required": [
                    "indexFieldName",
                    "indexFieldType",
                    "dataSourceFieldName"
                  ]
                }
              ]
            }
          },
          "required": ["fieldMappings"]
        }
      }
    },
    "additionalProperties": {
      "type": "object",
      "properties": {
        "rateLimit": {
          "type": "string",
          "default": "300"
        },
        "maxFileSize": {
          "type": "string",
          "default": "50"
        },
        "maxFileSizeInMegaBytes": {
          "type": "string"
        },
        "crawlDepth": {
          "type": "string",
          "default": "2"
        },
        "maxLinksPerUrl": {
          "type": "string",
          "default": "100"
        },
        "crawlSubDomain": {
          "anyOf": [
            {
              "type": "boolean"
            },
            {
              "type": "string",
              "enum": ["true", "false"]
            }
          ],
          "default": false
        },
        "crawlAllDomain": {
          "anyOf": [
            {
              "type": "boolean"
            },
            {
              "type": "string",
              "enum": ["true", "false"]
            }
          ],
          "default": false
        },
        "honorRobots": {
          "anyOf": [
            {
              "type": "boolean"
            },
            {
              "type": "string",
              "enum": ["true", "false"]
            }
          ],
          "default": false
        },
        "crawlAttachments": {
          "anyOf": [
            {
              "type": "boolean"
            },
            {
              "type": "string",
              "enum": ["true", "false"]
            }
          ],
          "default": false
        },
        "inclusionURLCrawlPatterns": {
          "type": "array",
          "items": {
            "type": "string"
          }
        },
        "exclusionURLCrawlPatterns": {
          "type": "array",
          "items": {
            "type": "string"
          }
        },
        "inclusionURLIndexPatterns": {
          "type": "array",
          "items": {
            "type": "string"
          }
        },
        "exclusionURLIndexPatterns": {
          "type": "array",
          "items": {
            "type": "string"
          }
        },
        "inclusionFileIndexPatterns": {
          "type": "array",
          "items": {
            "type": "string"
          }
        },
        "exclusionFileIndexPatterns": {
          "type": "array",
          "items": {
            "type": "string"
          }
        },
        "proxy": {
          "type": "object",
          "properties": {
            "host": {
              "type": "string"
            },
            "port": {
              "type": "string"
            },
            "secretArn": {
              "type": "string",
              "minLength": 20,
              "maxLength": 2048
            }
          }
        },
        "enableDeletionProtection": {
          "anyOf": [
            {
              "type": "boolean"
            },
            {
              "type": "string",
              "enum": ["true", "false"]
            }
          ],
          "default": false
        },
        "deletionProtectionThreshold": {
          "type": "string",
          "default": "15"
        }
      },
      "required": [
        "rateLimit",
        "maxFileSize",
        "crawlDepth",
        "crawlSubDomain",
        "crawlAllDomain",
        "maxLinksPerUrl",
        "honorRobots"
      ]
    }
  },
  "version": {
    "type": "string",
    "anyOf": [
      {
        "pattern": "1.0.0"
      }
    ]
  },
  "required": [
    "type",
    "syncMode",
    "connectionConfiguration",
    "repositoryConfigurations",
    "additionalProperties"
  ]
}

Show less

Web Crawler JSON schema example for using the configuration property with AWS CloudFormation

The following is the Web Crawler JSON schema example for the configuration property for AWS CloudFormation


{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "CloudFormation Web Crawler Data Source Template",
  "Resources": {
    "DataSourceWebCrawler": {
      "Type": "AWS::QBusiness::DataSource",
      "Properties": {
        "ApplicationId": "app12345-1234-1234-1234-123456789012",
        "IndexId": "indx1234-1234-1234-1234-123456789012",
        "DisplayName": "MyWebCrawlerDataSource",
        "RoleArn": "arn:aws:iam::123456789012:role/qbusiness-data-source-role",
        "Configuration": {
          "type": "WEBCRAWLERV2",
          "syncMode": "FULL_CRAWL",
          "secretArn": "arn:aws:secretsmanager:us-west-2:0123456789012:secret",
          "connectionConfiguration": {
            "repositoryEndpointMetadata": {
              "siteMapUrls": ["https://example.com/sitemap.xml"],
              "s3SeedUrl": "s3://bucket/seed-url",
              "s3SiteMapUrl": "s3://bucket/sitemap-url",
              "seedUrlConnections": [
                {
                  "seedUrl": "https://example.com"
                }
              ],
              "authentication": "BasicAuth"
            }
          },
          "repositoryConfigurations": {
            "webPage": {
              "fieldMappings": [
                {
                  "indexFieldName": "title",
                  "indexFieldType": "STRING",
                  "dataSourceFieldName": "page_title",
                  "dateFieldFormat": "yyyy-MM-dd'T'HH:mm:ss'Z'"
                }
              ]
            },
            "attachment": {
              "fieldMappings": [
                {
                  "indexFieldName": "attachment_title",
                  "indexFieldType": "STRING",
                  "dataSourceFieldName": "attachment_name",
                  "dateFieldFormat": "yyyy-MM-dd'T'HH:mm:ss'Z'"
                }
              ]
            }
          },
          "additionalProperties": {
            "rateLimit": "300",
            "maxFileSize": "50",
            "crawlDepth": "2",
            "maxLinksPerUrl": "100",
            "crawlSubDomain": "true",
            "crawlAllDomain": "true",
            "honorRobots": "true"
          }
        }
      }
    }
  }
}

Show less

Web Crawler YAML schema for using the configuration property with AWS CloudFormation

The following is the Web Crawler YAML schema and examples for the configuration property for AWS CloudFormation:

Topics

Web Crawler YAML schema for using the configuration property with AWS CloudFormation
Web Crawler YAML schema example for using the configuration property with AWS CloudFormation

Web Crawler YAML schema for using the configuration property with AWS CloudFormation

The following is the Web Crawler YAML schema for the configuration property for AWS CloudFormation.


type: object
properties:
  type:
    type: string
    enum:
      - WEBCRAWLERV2
      - WEBCRAWLER
  syncMode:
    type: string
    enum:
      - FORCED_FULL_CRAWL
      - FULL_CRAWL
  secretArn:
    type: string
    minLength: 20
    maxLength: 2048
  connectionConfiguration:
    type: object
    properties:
      repositoryEndpointMetadata:
        type: object
        properties:
          siteMapUrls:
            type: array
            items:
              type: string
              pattern: https://.*
          s3SeedUrl:
            type:
              - string
              - null
            pattern: s3:.*
          s3SiteMapUrl:
            type:
              - string
              - null
            pattern: s3:.*
          seedUrlConnections:
            type: array
            items:
              type: object
              properties:
                seedUrl:
                  type: string
                  pattern: https://.*
              required:
                - seedUrl
          authentication:
            type: string
            enum:
              - NoAuthentication
              - BasicAuth
              - NTLM_Kerberos
              - Form
              - SAML
    required:
      - repositoryEndpointMetadata
  repositoryConfigurations:
    type: object
    properties:
      webPage:
        type: object
        properties:
          fieldMappings:
            type: array
            items:
              type: object
              properties:
                indexFieldName:
                  type: string
                indexFieldType:
                  type: string
                  enum:
                    - STRING
                    - DATE
                    - LONG
                dataSourceFieldName:
                  type: string
                dateFieldFormat:
                  type: string
                  pattern: yyyy-MM-dd'T'HH:mm:ss'Z'
              required:
                - indexFieldName
                - indexFieldType
                - dataSourceFieldName
        required:
          - fieldMappings
      attachment:
        type: object
        properties:
          fieldMappings:
            type: array
            items:
              type: object
              properties:
                indexFieldName:
                  type: string
                indexFieldType:
                  type: string
                  enum:
                    - STRING
                    - DATE
                    - LONG
                dataSourceFieldName:
                  type: string
                dateFieldFormat:
                  type: string
                  pattern: yyyy-MM-dd'T'HH:mm:ss'Z'
              required:
                - indexFieldName
                - indexFieldType
                - dataSourceFieldName
        required:
          - fieldMappings
  additionalProperties:
    type: object
    properties:
      rateLimit:
        type: string
        default: "300"
      maxFileSize:
        type: string
        default: "50"
      maxFileSizeInMegaBytes:
        type: string
      crawlDepth:
        type: string
        default: "2"
      maxLinksPerUrl:
        type: string
        default: "100"
      crawlSubDomain:
        anyOf:
          - type: boolean
          - type: string
            enum:
              - "true"
              - "false"
        default: false
      crawlAllDomain:
        anyOf:
          - type: boolean
          - type: string
            enum:
              - "true"
              - "false"
        default: false
      honorRobots:
        anyOf:
          - type: boolean
          - type: string
            enum:
              - "true"
              - "false"
        default: false
      crawlAttachments:
        anyOf:
          - type: boolean
          - type: string
            enum:
              - "true"
              - "false"
        default: false
      inclusionURLCrawlPatterns:
        type: array
        items:
          type: string
      exclusionURLCrawlPatterns:
        type: array
        items:
          type: string
      inclusionURLIndexPatterns:
        type: array
        items:
          type: string
      exclusionURLIndexPatterns:
        type: array
        items:
          type: string
      inclusionFileIndexPatterns:
        type: array
        items:
          type: string
      exclusionFileIndexPatterns:
        type: array
        items:
          type: string
      proxy:
        type: object
        properties:
          host:
            type: string
          port:
            type: string
          secretArn:
            type: string
            minLength: 20
            maxLength: 2048
      enableDeletionProtection:
        anyOf:
          - type: boolean
          - type: string
            enum:
              - "true"
              - "false"
        default: false
      deletionProtectionThreshold:
        type: string
        default: "15"
    required:
      - rateLimit
      - maxFileSize
      - crawlDepth
      - crawlSubDomain
      - crawlAllDomain
      - maxLinksPerUrl
      - honorRobots
version:
  type: string
  anyOf:
    - pattern: 1.0.0
required:
  - type
  - syncMode
  - connectionConfiguration
  - repositoryConfigurations
  - additionalProperties

Show less

Web Crawler YAML schema example for using the configuration property with AWS CloudFormation

The following is the Web Crawler YAML example for the Configuration property for AWS CloudFormation:


AWSTemplateFormatVersion: 2010-09-09
Description: CloudFormation Web Crawler Data Source Template
Resources:
  DataSourceWebCrawler:
    Type: AWS::QBusiness::DataSource
    Properties:
      ApplicationId: app12345-1234-1234-1234-123456789012
      IndexId: indx1234-1234-1234-1234-123456789012
      DisplayName: MyWebCrawlerDataSource
      RoleArn: arn:aws:iam::123456789012:role/qbusiness-data-source-role
      Configuration:
        type: WEBCRAWLERV2
        syncMode: FULL_CRAWL
        secretArn: arn:aws:secretsmanager:us-west-2:0123456789012:my-webcrawler-secret
        connectionConfiguration:
          repositoryEndpointMetadata:
            siteMapUrls:
              - https://example.com/sitemap.xml
            s3SeedUrl: s3://bucket/seed-url
            s3SiteMapUrl: s3://bucket/sitemap-url
            seedUrlConnections:
              - seedUrl: https://example.com
            authentication: BasicAuth
        repositoryConfigurations:
          webPage:
            fieldMappings:
              - indexFieldName: title
                indexFieldType: STRING
                dataSourceFieldName: page_title
                dateFieldFormat: yyyy-MM-dd'T'HH:mm:ss'Z'
          attachment:
            fieldMappings:
              - indexFieldName: attachment_title
                indexFieldType: STRING
                dataSourceFieldName: attachment_name
                dateFieldFormat: yyyy-MM-dd'T'HH:mm:ss'Z'
        additionalProperties:
          rateLimit: "300"
          maxFileSize: "50"
          crawlDepth: "2"
          maxLinksPerUrl: "100"
          crawlSubDomain: "true"
          crawlAllDomain: "true"
          honorRobots: "true"

Show less

Warning Javascript is disabled or is unavailable in your browser.

To use the Amazon Web Services Documentation, Javascript must be enabled. Please refer to your browser's Help pages for instructions.

Document Conventions

Using the API

Field mappings

Connecting Amazon Q Business to Web Crawler using AWS CloudFormation

Topics

Web Crawler configuration properties

Note

Note

Web Crawler JSON schema for using the configuration property with AWS CloudFormation

Topics

Web Crawler JSON schema for using the configuration property with AWS CloudFormation

Web Crawler JSON schema example for using the configuration property with AWS CloudFormation

Web Crawler YAML schema for using the configuration property with AWS CloudFormation

Topics

Web Crawler YAML schema for using the configuration property with AWS CloudFormation

Web Crawler YAML schema example for using the configuration property with AWS CloudFormation