Supported document formats in Amazon Q Business
You can add documents to an Amazon Q Business application in three ways:
- Using direct document upload – If you use an Amazon Q retriever, you can directly upload documents into your application.
- Using data source connectors – You can add documents to an Amazon Q application using both the console and the API.
- Using Upload files and chat – As an end user using an Amazon Q web experience, you can directly upload up to 5 files during a conversation.
When you add documents to an Amazon Q application (directly or through data source connectors) using the console or the API, Amazon Q extracts the document content and parses it internally to optimize chat responses. A single document must be 50 MB or less in file size. The maximum amount of text that can be extracted from a single document is 5 MB.
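The two limits above can be checked before you add a document. The following is a minimal sketch, not part of any Amazon Q SDK: `check_index_limits` is a hypothetical helper, 1 MB is assumed to mean 1024 * 1024 bytes (the limits above do not say whether they are binary or decimal), and the text is assumed to be something you extracted yourself, so Amazon Q's own extraction may yield a different amount.

```python
from typing import List

MB = 1024 * 1024  # assumption: binary megabytes
MAX_FILE_BYTES = 50 * MB            # maximum size of a single document
MAX_EXTRACTED_TEXT_BYTES = 5 * MB   # maximum text extracted per document


def check_index_limits(file_size_bytes: int, extracted_text: str) -> List[str]:
    """Return a list of limit violations; an empty list means none were found."""
    problems = []
    if file_size_bytes > MAX_FILE_BYTES:
        problems.append("file exceeds 50 MB")
    # Measure the text as UTF-8 bytes (an assumption about how size is counted).
    if len(extracted_text.encode("utf-8")) > MAX_EXTRACTED_TEXT_BYTES:
        problems.append("extracted text exceeds 5 MB")
    return problems
```

For example, `check_index_limits(10 * MB, "hello")` returns an empty list, while a 51 MB file is reported as exceeding the file size limit.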
When you upload documents directly into chat using the Upload files and chat feature, each file must be 10 MB or less. The total parsed content for all files combined must be under 30,000 tokens, or roughly 20,000 words, because one word corresponds to roughly 1.5 tokens.
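The chat upload limits can be approximated the same way. The sketch below applies the 1.5 tokens-per-word rule of thumb to whitespace-separated words; the helper names are hypothetical, 1 MB is assumed to be 1024 * 1024 bytes, and the real token count depends on how Amazon Q parses and tokenizes the files, so treat the result as an estimate only.

```python
import math
from typing import List, Tuple

MB = 1024 * 1024  # assumption: binary megabytes
MAX_FILES = 5                 # files per conversation
MAX_FILE_BYTES = 10 * MB      # per-file size limit for chat uploads
MAX_TOKENS = 30_000           # combined parsed content must stay under this
TOKENS_PER_WORD = 1.5         # rough ratio: one word is about 1.5 tokens


def estimate_tokens(text: str) -> int:
    """Estimate tokens from a whitespace word count, rounded up."""
    return math.ceil(len(text.split()) * TOKENS_PER_WORD)


def check_chat_upload(files: List[Tuple[int, str]]) -> List[str]:
    """Check (size_in_bytes, parsed_text) pairs against the chat upload limits."""
    problems = []
    if len(files) > MAX_FILES:
        problems.append("more than 5 files")
    for index, (size, _text) in enumerate(files):
        if size > MAX_FILE_BYTES:
            problems.append(f"file {index} exceeds 10 MB")
    total_tokens = sum(estimate_tokens(text) for _size, text in files)
    if total_tokens >= MAX_TOKENS:  # the total must be strictly under the limit
        problems.append("combined content is not under 30,000 tokens")
    return problems
```

Under this estimate, exactly 20,000 words maps to 30,000 tokens, which is not under the limit, so the check flags it.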
Additionally, if you’re uploading Comma Separated Values (CSV) or Microsoft Excel (XLS and XLSX) documents directly into chat, Amazon Q performs best for tables with approximately 4 columns and 10 rows. Files indexed by an Amazon Q data source connector or uploaded directly have no such restrictions.
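As a rough pre-check for CSV files you plan to upload into chat, you can measure the table's shape. This is a sketch with hypothetical helper names. It assumes the header counts as a row and treats "approximately 4 columns and 10 rows" as soft upper bounds, which the guidance above does not state explicitly.

```python
import csv
import io
from typing import Tuple

RECOMMENDED_MAX_COLUMNS = 4   # "approximately 4 columns"
RECOMMENDED_MAX_ROWS = 10     # "approximately 10 rows" (header included: assumption)


def table_shape(csv_text: str) -> Tuple[int, int]:
    """Return (row_count, column_count), ignoring completely blank lines."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    column_count = max((len(row) for row in rows), default=0)
    return len(rows), column_count


def within_recommended_shape(csv_text: str) -> bool:
    """True if the table is no larger than the recommended size for chat uploads."""
    row_count, column_count = table_shape(csv_text)
    return row_count <= RECOMMENDED_MAX_ROWS and column_count <= RECOMMENDED_MAX_COLUMNS
```

A larger table is still accepted by the upload. The check only indicates that answer quality may drop.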
In addition to specific file formats such as PDF and Word, each enterprise data source has its own entities that it considers documents. To learn about supported entity types for each data source, see What is a document?.
Supported document types
The following table shows the document formats that Amazon Q Business supports.
Document format | How document is treated |
---|---|
Portable Document Format (PDF) | Converted to HTML, then plain text is extracted. Scanned PDFs aren't supported as they are images. |
HyperText Markup Language (HTML) | HTML tags are filtered out to extract plain text. Content must be between the main HTML start and closing tags (<HTML>content</HTML>). |
Extensible Markup Language (XML) | XML tags are filtered out and plain text is extracted. |
Extensible Stylesheet Language Transformations (XSLT) | Tags are filtered out to extract plain text. |
Markdown (MD) | Content is extracted as plain text with Markdown syntax retained. |
Comma Separated Values (CSV) | Content is extracted as plain text from each cell, with a single file treated as a single document result. Amazon Q doesn't support analytics questions for CSVs; it supports only qualitative questions. |
Microsoft Excel (XLS and XLSX) | Content is extracted as plain text from each cell, with a single row treated as a single document result. Amazon Q doesn't support analytics questions for Excel files; it supports only qualitative questions. |
JavaScript Object Notation (JSON) | Content is extracted as plain text with JSON syntax retained. |
Rich Text Format (RTF) | RTF syntax is filtered out to extract plain text content. |
Microsoft PowerPoint (PPT, PPTX) | Only plain text content is extracted from PowerPoint slides for ingestion. Images and other content aren't extracted. |
Microsoft Word (DOCX) | Only plain text content is extracted from Word pages for ingestion. Images and other content aren't extracted. |
Plain text (TXT) | All text in the text document is extracted. |
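The table above can be turned into a simple lookup for screening files before you add them. This is an illustrative sketch only: the helper name is hypothetical, and it assumes that a file's extension matches the abbreviation shown in the table. Amazon Q may identify formats differently, for example by content type.

```python
from pathlib import PurePath
from typing import Optional

# Extensions taken from the abbreviations in the supported formats table.
SUPPORTED_EXTENSIONS = {
    ".pdf": "Portable Document Format (PDF)",
    ".html": "HyperText Markup Language (HTML)",
    ".xml": "Extensible Markup Language (XML)",
    ".xslt": "Extensible Stylesheet Language Transformations (XSLT)",
    ".md": "Markdown (MD)",
    ".csv": "Comma Separated Values (CSV)",
    ".xls": "Microsoft Excel (XLS and XLSX)",
    ".xlsx": "Microsoft Excel (XLS and XLSX)",
    ".json": "JavaScript Object Notation (JSON)",
    ".rtf": "Rich Text Format (RTF)",
    ".ppt": "Microsoft PowerPoint (PPT, PPTX)",
    ".pptx": "Microsoft PowerPoint (PPT, PPTX)",
    ".docx": "Microsoft Word (DOCX)",
    ".txt": "Plain text (TXT)",
}


def document_format(filename: str) -> Optional[str]:
    """Return the table's format name for a file, or None if it is not listed."""
    extension = PurePath(filename).suffix.lower()
    return SUPPORTED_EXTENSIONS.get(extension)
```

For example, `document_format("report.PDF")` returns the PDF entry, while `document_format("notes.doc")` returns `None` because the table lists only DOCX for Word.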
What is a document?
When you add files directly to Amazon Q Business, using either direct document upload or the Upload files and chat feature, Amazon Q considers each file you add a document. When you connect Amazon Q to a data source, what Amazon Q considers, and crawls, as a document varies by connector.
The following table outlines what each connector crawls as a document.
Data source connector | Supports crawling | Document definition |
---|---|---|
Adobe Experience Manager (Cloud and Server) | | |
Alfresco (Cloud and Server) | | |
Amazon FSx (Windows) | Files | Each File is considered a single document. |
Amazon S3 | Objects | Each Object is considered a single document. Any |
Amazon Q Web Crawler | | |
Amazon WorkDocs | | |
Box | | |
Confluence (Cloud and Server) | | |
Database data sources | | Each row in a table and view is considered a single document. |
Dropbox | | |
Drupal | | |
GitHub (Cloud and Server) | | |
Gmail | | |
Google Drive | | |
Jira | | |
Microsoft Exchange | | |
Microsoft OneDrive | | |
Microsoft SharePoint (Online and Server) | | |
Microsoft Teams | | |
Microsoft Yammer | | |
Quip | | |
Salesforce | | |
ServiceNow | | |
Slack | | |
Zendesk | | |