Documentation best practices for RAG applications

Developing a successful Retrieval-Augmented Generation (RAG) application requires careful attention to document-related factors that affect its performance. The best practices in this section are drawn from experience building RAG systems with many organizations. The following key best practices for documents can enhance the effectiveness of your RAG application. Code sketches after the list illustrate several of these practices in Python:

  • Use headings and subheadings properly – Organizing your content with clear headings and subheadings improves readability and helps RAG models understand the structure of your documents. This practice enables the models to better navigate and extract information from the documents, which enhances the quality of the generated responses. The first sketch after this list shows heading-aware splitting in code.

  • Ensure numbering is sequential – When you use numbered lists, maintain proper numbering to avoid confusion. Make sure that each list item is numbered sequentially without skipping numbers. This helps maintain clarity and coherence in your content. A validation sketch after this list shows one way to check numbering automatically.

  • Add transitions between list items – Providing transitions between items in a bulleted or numbered list helps guide the LLM through the content. For example, you can use phrases like "After completing step 2, do..." to connect ideas and improve the flow of information.

  • Replace tables – Avoid using tables. Format the information instead as multi-level bulleted lists or in flat-level syntax, which arranges elements or items at the same hierarchical level, without nested levels of subordination. These structures help LLMs digest the information. Because most indexed documents are read from left to right, flat-level syntax lets information flow more coherently without requiring the model to reference an additional dimension. This format is more conducive to RAG applications because it presents information in a structured and easily digestible manner. A sketch after this list converts table rows into flat-level text.

  • Preprocess graphical information for efficiency – Multi-modal LLMs can ingest both images and text. Reduce the resolution of images, remove redundant images, and describe the content of graphical elements in text format. These measures improve meaningful context, avoid consuming tokens unnecessarily, and improve accessibility for RAG models. A preprocessing sketch after this list illustrates this practice.

  • Add session starters for common queries – When addressing common questions or tasks, such as "How do I order software?", add a session starter that transitions the reader into the process. For example, you might add "If you are looking to order software, follow the steps below…". This creates strong semantic matching between user queries and your content, which helps the LLM construct a cohesive response.

  • Add summarization to each section – After each heading or subheading, add a brief, concise summary of the content in that section. This can increase semantic coverage and reinforce key points, which improves the accuracy of similarity search within the embedding space and thus the performance of the RAG application. This is particularly helpful if the document is intended for both LLM and human consumption or if table and graphical elements are necessary. A sketch after this list shows where such summaries fit.

  • Disambiguate content – Documents should be concise and focused. LLMs generate responses based on retrieved excerpts, so disambiguation helps the model work from clear and relevant information. This results in more accurate and informative responses.

  • Define abbreviations and set context – LLMs are trained on large amounts of internet data, and most of the time they do not have the context of an enterprise's internal documents. Therefore, setting context, defining abbreviations, and avoiding or defining company-specific terminology helps the LLM understand your enterprise data. This helps the LLM answer questions more accurately and can help prevent hallucinations. A glossary-expansion sketch after this list shows one approach.

  • Restructure large documents into smaller documents for efficient tagging and indexing – Avoid indexing a large document that contains multiple subtopics. Instead, divide the large document into smaller, self-contained documents that have clear titles. This improves indexing and tagging. The final sketch after this list splits a document by its top-level headings.
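The following sketches illustrate several of the preceding practices. All function names, file paths, and sample data are illustrative assumptions, not part of any specific library or AWS service. First, a minimal sketch of heading-aware splitting, assuming Markdown-style headings:

```python
import re

def split_by_headings(text: str) -> list[dict]:
    """Split a document into sections, one per heading."""
    sections = []
    current = {"heading": "", "body": []}
    for line in text.splitlines():
        if re.match(r"^#{1,6}\s", line):   # a Markdown heading starts a new section
            if current["heading"] or current["body"]:
                sections.append(current)
            current = {"heading": line.lstrip("#").strip(), "body": []}
        else:
            current["body"].append(line)
    sections.append(current)
    return [{"heading": s["heading"], "text": "\n".join(s["body"]).strip()}
            for s in sections]
```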
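For the numbering practice, a small validation sketch that flags skipped or repeated numbers in a plain-text numbered list. Treating a blank line as the end of a list is an assumption about how your lists are separated:

```python
import re

def check_numbering(lines: list[str]) -> list[str]:
    """Flag numbered list items that skip or repeat numbers."""
    problems, expected = [], 1
    for i, line in enumerate(lines, start=1):
        match = re.match(r"^\s*(\d+)[.)]\s", line)
        if match:
            number = int(match.group(1))
            if number != expected:
                problems.append(f"line {i}: found {number}, expected {expected}")
            expected = number + 1
        elif not line.strip():
            expected = 1  # assume a blank line ends the current list
    return problems
```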
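For the table-replacement practice, a sketch that rewrites each row of a simple CSV table as a self-contained, flat-level line. The sample data is made up:

```python
import csv
import io

def table_to_flat_text(csv_text: str) -> str:
    """Rewrite each table row as one flat-level bullet."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    return "\n".join(
        "- " + ", ".join(f"{h}: {v}" for h, v in zip(header, row)) + "."
        for row in data
    )

sample = "Region,Service,Limit\nus-east-1,Lambda,1000\neu-west-1,Lambda,500"
print(table_to_flat_text(sample))
# - Region: us-east-1, Service: Lambda, Limit: 1000.
# - Region: eu-west-1, Service: Lambda, Limit: 500.
```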
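For the graphical-information practice, a sketch that downscales an image and pairs it with a text description so that the description can be indexed alongside, or instead of, the image. Pillow is one library choice, not a requirement; the file name and description are hypothetical:

```python
from PIL import Image  # Pillow

def preprocess_image(path: str, description: str, max_size=(512, 512)) -> dict:
    """Reduce image resolution and attach a text description for the retriever."""
    image = Image.open(path)
    image.thumbnail(max_size)   # downscale in place, preserving aspect ratio
    small_path = path.rsplit(".", 1)[0] + "_small.png"
    image.save(small_path)
    # The text description is what similarity search actually matches on.
    return {"image": small_path, "text": description}

record = preprocess_image(
    "architecture-diagram.png",  # hypothetical file
    "Diagram: API Gateway invokes a Lambda function that reads from DynamoDB.",
)
```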
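For the summarization practice, a sketch that inserts a summary line directly under each heading, reusing the sections produced by split_by_headings above. The summarize function is a hypothetical hook for whichever model you use, and emitting every heading at one level is a simplification:

```python
def summarize(text: str) -> str:
    # Hypothetical hook: call your summarization model of choice here.
    raise NotImplementedError

def add_section_summaries(sections: list[dict]) -> str:
    """Emit each section with a one-line summary under its heading."""
    parts = []
    for section in sections:   # output of split_by_headings
        parts.append(f"## {section['heading']}")
        parts.append(f"Summary: {summarize(section['text'])}")
        parts.append(section["text"])
    return "\n\n".join(parts)
```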
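For the abbreviations practice, a sketch that expands each known abbreviation on its first use so that retrieved excerpts carry their own context. The glossary entries are made up:

```python
import re

GLOSSARY = {  # illustrative entries; use your organization's terms
    "CAB": "change advisory board (CAB)",
    "TPS": "transactions per second (TPS)",
}

def expand_abbreviations(text: str) -> str:
    """Expand the first occurrence of each known abbreviation."""
    for abbreviation, expansion in GLOSSARY.items():
        text = re.sub(rf"\b{re.escape(abbreviation)}\b", expansion, text, count=1)
    return text

print(expand_abbreviations("The CAB meets when TPS exceeds the agreed limit."))
# The change advisory board (CAB) meets when transactions per second (TPS)
# exceeds the agreed limit.
```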
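Finally, for the restructuring practice, a sketch that splits a large Markdown document into smaller, self-contained files, one per top-level heading. Output file names are derived from the headings:

```python
import re
from pathlib import Path

def split_into_documents(text: str, out_dir: str) -> list[Path]:
    """Write each top-level section (lines starting with '# ') to its own file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for chunk in re.split(r"(?m)^(?=# )", text):  # split before each top-level heading
        if not chunk.strip():
            continue
        title = chunk.splitlines()[0].lstrip("# ").strip() or "untitled"
        slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
        path = out / f"{slug}.md"
        path.write_text(chunk.strip() + "\n")
        paths.append(path)
    return paths
```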