Supported document formats in Amazon Q Business

When you add documents to an Amazon Q Business application environment (directly or through data source connectors) using the console or the API, Amazon Q Business extracts document content and internally parses these to optimize chat responses. The maximum file size of a single document must be 50 MB or less. The maximum amount of text that can be extracted from a single document is 5 MB.

When you upload documents directly into chat using the Upload files and chat feature, the size of each file you upload must be 10 MB or less. The total parsed content for all files combined has to be under 30,000 tokens or 20,000 words. One word corresponds roughly to 1.5 tokens.

For images, chat now supports direct uploads of JPEG, JPG, and PNG files. These images can be used for summarizing information and answering questions.

Additionally, if you’re uploading Comma Separated Values (CSV) or Microsoft Excel (XLS and XLSX) documents directly into chat, Amazon Q Business performs best for tables with approximately 4 columns and 10 rows. Files indexed by an Amazon Q Business data source connector or uploaded directly have no such restrictions.

When you directly add files to Amazon Q Business using the Using direct document upload or the Upload files and chat feature, it considers each file you add a document. When you connect Amazon Q Business to a data source, what Amazon Q Business considers—and crawls—as a document varies by connector.

Along with specific formats like PDF, Word, for example, each enterprise data source also has different entities that it considers documents. To learn about supported entity types for each data source, see What is a document?.

For some file types (indicated in the table below), Amazon Q Business can extract semantic information and insights from various types of images embedded in the documents. You can enable content extraction when you add or update a data connector, or when you import a file directly. For more information see Extracting semantic meaning from embedded visual content with Amazon Q Business.

Topics

Supported document types

Supported document types

The following table shows the document formats that Amazon Q Business supports.

Note

Provide a description of the file as metadata to improve retrieval accuracy. For more information, see Configuring metadata controls in Amazon Q Business.

Document format	How document is treated
Portable Document Format (PDF)	If you're using a data source connector, or directly uploading documents into your application, Amazon Q can use optical character recognition (OCR) to convert images of typed, handwritten, or printed text into machine-readable text. If enabled, Amazon Q Business can extract semantic information and insights from images and other visuals. The maximum number of pages per PDF for optical character recognition (OCR) or insight extraction from embedded visual elements is 3000 pages. If your PDF is more than 3000 pages, only plaintext is extracted. If you're directly uploading documents into chat, PDFs are converted to HTML, and only plaintext is extracted.
HyperText Markup Language (HTML)	HTML tags are filtered out to extract plaintext. Content must be between the main `HTML` start and closing tags (`<HTML>content</HTML>`).
Extensible Markup Language (XML)	XML tags are filtered out and plaintext is extracted.
Extensible Stylesheet Language Transformations (XSLT)	Tags are filtered out to extract plaintext.
Markdown (MD)	Content is extracted as plaintext with Markdown syntax retained.
Comma Separated Values (CSV)	Content is extracted as plaintext from each cell, with a single file treated as a single document result. You can improve answer accuracy on CSVs by providing a metadata file. Example: `"Attributes":{ "description": "A description your CSV table, a short paragraph providing details about what the use case is and insights you can derive from it.", "_document_title": "document title", "_file_type": "CSV" }`
Microsoft Excel (XLS and XLSX)	Content is extracted as plaintext from each cell, with a single file treated as a single document result. Amazon Q Business doesn't support analytics questions for Excel files; it supports only qualitative questions.
JavaScript Object Notation (JSON)	Content is extracted as plaintext with JSON syntax retained.
Rich Text Format (RTF)	RTF syntax is filtered out to extract plaintext content.
Microsoft PowerPoint (PPT, PPTX)	By default, only plaintext content is extracted from PowerPoint slides for ingestion. If enabled, Amazon Q Business can extract semantic information and insights from images and other visuals. Otherwise, images and other content aren't extracted.
Microsoft Word (DOCX)	By default, only plaintext content is extracted from Word pages for ingestion. If enabled, Amazon Q Business can extract semantic information and insights from images and other visuals. Otherwise, images and other content aren't extracted.
Plain text (TXT)	All text in the text document is extracted.
Google Slides	By default, only plaintext content is extracted from Google Slides pages for ingestion. If enabled, Amazon Q Business can extract semantic information and insights from images and other visuals. Otherwise, images and other content aren't extracted.
Google Docs	By default, only plaintext content is extracted from Google Docs pages for ingestion. If enabled, Amazon Q Business can extract semantic information and insights from images and other visuals. Otherwise, images and other content aren't extracted.

Warning Javascript is disabled or is unavailable in your browser.

To use the Amazon Web Services Documentation, Javascript must be enabled. Please refer to your browser's Help pages for instructions.

Document Conventions

Subscription tiers and index types

Document attributes and types