Connecting a custom data source connector to Amazon Q Business
Use a custom data source when you have a repository that Amazon Q Business doesn’t yet provide a data source connector for. When you create a custom data source, you have complete control over how the documents to index are selected. Amazon Q only provides metric information that you can use to monitor your data source sync jobs. You must create and run the crawler that determines the documents your data source indexes.
You can use a custom data source connector to:
- See the same run history metrics that Amazon Q data sources provide even when you can't use Amazon Q data sources to sync your repositories.
- Create a consistent sync monitoring experience between Amazon Q data sources and custom data sources.
- See sync metrics for a data source connector that you created using the BatchPutDocument and BatchDeleteDocument API operations.
You can create an Amazon Q custom data source connector using either the AWS Management Console or the CreateDataSource API operation.
When you create a custom data source using the CreateDataSource API operation:
- The action returns an ID to use when you synchronize the data source.
- You must set the Configuration parameter as follows: "configuration": { "type": "CUSTOM", "version": "1.0.0" }
- You must specify the main title of your documents using the Document object, and _source_uri in DocumentAttribute. The main title is required so that DocumentTitle and DocumentURI are included in the ChatSync or Chat response.
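As an illustrative sketch, the request body for creating a custom data source can be assembled in code. Only the configuration value below is prescribed by this page; the display name and the idea of holding the request in a dict first are placeholders, not part of the documented API contract:

```python
# A minimal sketch of a CreateDataSource request body for a custom data
# source. Only the configuration value is prescribed by the documentation;
# the display name is a placeholder.
custom_configuration = {
    "type": "CUSTOM",
    "version": "1.0.0",
}

create_data_source_request = {
    "displayName": "my-custom-source",   # placeholder name
    "configuration": custom_configuration,
}
```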
When you create a custom data source using the console:
- Give your data source a name, and optionally a description and resource tags.
- After the data source is created, the console shows a data source ID. Copy this ID to use when you synchronize the data source with the index.
Creating an Amazon Q custom connector
To use a custom data source, create an application environment that is responsible for updating your Amazon Q index. The application environment depends on a crawler that you create. The crawler reads the documents in your repository and determines which documents should be sent to Amazon Q. Your application environment should perform the following steps:
- Crawl your repository and make a list of the documents in your repository that are added, updated, or deleted.
- Call the StartDataSourceSyncJob API operation to signal that a sync job is starting. You provide a data source ID to identify the data source that is synchronizing. Amazon Q returns an execution ID to identify a particular sync job.

  Note: After you end a sync job, you can start a new sync job. There can be a period of time before all of the submitted documents are added to the index. To see the status of the sync job, use the ListDataSourceSyncJobs operation. If the Status returned for the sync job is SYNCING_INDEXING, some documents are still being indexed. You can start a new sync job when the status of the previous job is FAILED or SUCCEEDED.
- To add or update documents in the index, use the BatchPutDocument operation. You provide the data source ID and execution ID to identify the data source that is synchronizing and the job that this update is associated with.
- To remove documents from the index, use the BatchDeleteDocument operation. You provide the data source ID and execution ID to identify the data source that is synchronizing and the job that this update is associated with.
- To signal the end of the sync job, use the StopDataSourceSyncJob operation. After you call the StopDataSourceSyncJob operation, the associated execution ID is no longer valid.

  Note: After you call the StopDataSourceSyncJob operation, you can't use a sync job identifier in a call to the BatchPutDocument or BatchDeleteDocument operations. If you do, all of the documents submitted are returned in the FailedDocuments response message from the API.
- To list the sync jobs for the data source and to see metrics for the sync jobs, use the ListDataSourceSyncJobs operation with the index and data source identifiers.
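The steps above can be sketched as a crawler loop. The diff step is plain Python; the AWS calls assume boto3's qbusiness client and the executionId response field, and are an outline under those assumptions rather than a drop-in implementation. All IDs are placeholders:

```python
def diff_documents(previous, current):
    """Compare two {doc_id: checksum} snapshots of the repository and
    return the documents to add/update and the documents to delete."""
    to_put = [doc_id for doc_id, checksum in current.items()
              if previous.get(doc_id) != checksum]
    to_delete = [doc_id for doc_id in previous if doc_id not in current]
    return to_put, to_delete


def run_sync(application_id, index_id, data_source_id, previous, current):
    """Outline of one sync run; assumes boto3's qbusiness client."""
    import boto3  # imported here so the diff helper stays dependency-free
    client = boto3.client("qbusiness")

    # 1. Signal that a sync job is starting; Amazon Q returns an
    #    execution ID that identifies this particular sync job.
    execution_id = client.start_data_source_sync_job(
        applicationId=application_id,
        indexId=index_id,
        dataSourceId=data_source_id,
    )["executionId"]

    to_put, to_delete = diff_documents(previous, current)

    # 2. Submit added/updated documents with BatchPutDocument and remove
    #    deleted ones with BatchDeleteDocument, tagging each call with
    #    the data source ID and execution ID (see "Required attributes").
    # ... BatchPutDocument / BatchDeleteDocument calls go here ...

    # 3. End the sync job; the execution ID is no longer valid afterward.
    client.stop_data_source_sync_job(
        applicationId=application_id,
        indexId=index_id,
        dataSourceId=data_source_id,
    )
    return execution_id
```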
Required attributes
When you submit a document to Amazon Q using the BatchPutDocument API operation, you must provide the following two attributes for each document:
- _data_source_id – The identifier of the data source. This is returned when you create the data source with either the console or the CreateDataSource API operation.
- _data_source_sync_job_execution_id – The identifier of the sync run. This is returned when you start the index synchronization with the StartDataSourceSyncJob operation.
The following is the JSON required to index a document using a custom data source.
{
    "Documents": [
        {
            "Attributes": [
                {
                    "Key": "_data_source_id",
                    "Value": {
                        "StringValue": "data source identifier"
                    }
                },
                {
                    "Key": "_data_source_sync_job_execution_id",
                    "Value": {
                        "StringValue": "sync job identifier"
                    }
                }
            ],
            "Blob": "document content",
            "ContentType": "content type",
            "Id": "document identifier",
            "Title": "document title"
        }
    ],
    "IndexId": "index identifier",
    "RoleArn": "IAM role ARN"
}
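The same request body can be assembled programmatically. This helper is a sketch (the function name and parameters are illustrative); the keys mirror the Documents entry shown above:

```python
# Sketch: build one entry for the Documents list of a BatchPutDocument
# request, attaching the two required sync attributes. Function name and
# parameters are illustrative.
def build_document(doc_id, title, content, content_type,
                   data_source_id, execution_id):
    return {
        "Attributes": [
            {"Key": "_data_source_id",
             "Value": {"StringValue": data_source_id}},
            {"Key": "_data_source_sync_job_execution_id",
             "Value": {"StringValue": execution_id}},
        ],
        "Blob": content,
        "ContentType": content_type,
        "Id": doc_id,
        "Title": title,
    }
```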
When you remove a document from the index using the BatchDeleteDocument API operation, you must specify the following two fields in the DataSourceSyncJobMetricTarget parameter:
- DataSourceId – The identifier of the data source. This is returned when you create the data source with either the console or the CreateDataSource API operation.
- DataSourceSyncJobId – The identifier of the sync run. This is returned when you start the index synchronization with the StartDataSourceSyncJob operation.
The following is the JSON required to delete a document from the index using the BatchDeleteDocument operation.
{
    "DataSourceSyncJobMetricTarget": {
        "DataSourceId": "data source identifier",
        "DataSourceSyncJobId": "sync job identifier"
    },
    "DocumentIdList": [
        "document identifier"
    ],
    "IndexId": "index identifier"
}
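As with the put request, this body can be built in code. The helper name below is illustrative; the keys mirror the documented JSON:

```python
# Sketch: build the BatchDeleteDocument request body shown above, with
# the DataSourceSyncJobMetricTarget fields set. Helper name is illustrative.
def build_delete_request(index_id, data_source_id, sync_job_id, document_ids):
    return {
        "DataSourceSyncJobMetricTarget": {
            "DataSourceId": data_source_id,
            "DataSourceSyncJobId": sync_job_id,
        },
        "DocumentIdList": list(document_ids),
        "IndexId": index_id,
    }
```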
Viewing metrics
After a sync job is finished, you can use the ListDataSourceSyncJobs API operation to get the metrics (DataSourceSyncJobMetrics) associated with the sync job. Use these metrics to monitor your custom data source syncs.
You can submit the same document multiple times, as part of the BatchPutDocument operation, as part of the BatchDeleteDocument operation, or for both addition and deletion. Regardless of how you submit the document, it is only counted once in the metrics.
- DocumentsAdded – The number of documents submitted using the BatchPutDocument operation associated with this sync job that are added to the index for the first time. If a document is submitted for addition more than once in a sync, the document is only counted once in the metrics.
- DocumentsDeleted – The number of documents submitted using the BatchDeleteDocument operation associated with this sync job that are deleted from the index. If a document is submitted for deletion more than once in a sync, the document is only counted once in the metrics.
- DocumentsFailed – The number of documents associated with this sync job that failed indexing. These documents were accepted by Amazon Q for indexing but could not be indexed or deleted. If a document isn't accepted by Amazon Q, the identifier for the document is returned in the FailedDocuments response property of the BatchPutDocument and BatchDeleteDocument operations.
- DocumentsModified – The number of modified documents submitted using the BatchPutDocument operation associated with this sync job that were modified in the Amazon Q index.
Amazon Q also emits Amazon CloudWatch metrics while indexing documents. For more information, see Monitoring Amazon Q with Amazon CloudWatch.
Amazon Q doesn't return the DocumentsScanned metric for custom data sources.