Evaluating generative AI solutions for healthcare
Evaluating the healthcare AI solutions you build is critical to making sure that they are effective, reliable, and scalable in real-world medical environments. Use a systematic approach to evaluate the performance of each component of the solution. The following is a summary of the methodologies and metrics that you can use to evaluate your solution.
Topics
- Evaluating the extraction of information
- Evaluating RAG solutions with multiple retrievers
- Evaluating a solution by using an LLM
Evaluating the extraction of information
Evaluate the performance of information extraction solutions, such as the intelligent resume parser and the custom entity extractor. You can measure how well these solutions' responses align with ground truth by using a test dataset. If you don't have a dataset that covers diverse healthcare talent profiles and patient medical records, you can create a custom test dataset by using the reasoning capability of an LLM. For example, you could use a large-parameter model, such as an Anthropic Claude model, to generate a test dataset.
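The following is a minimal sketch of synthetic test data generation that uses the Amazon Bedrock Converse API with an Anthropic Claude model. The model ID, the prompt, and the JSON fields are illustrative assumptions; adapt them to your actual extraction schema.

```python
import json
import boto3

# Assumption: AWS credentials and Region are configured for Amazon Bedrock access.
bedrock = boto3.client("bedrock-runtime")

# Illustrative model ID; substitute the Claude model version that you use.
MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"

prompt = (
    "Generate a fictional healthcare talent profile as JSON with the keys "
    "'name', 'specialty', 'years_of_experience', and 'skills' (a list of "
    "objects with 'skill' and 'proficiency'). Return only the JSON."
)

response = bedrock.converse(
    modelId=MODEL_ID,
    messages=[{"role": "user", "content": [{"text": prompt}]}],
    inferenceConfig={"maxTokens": 512, "temperature": 0.9},
)

# Parse the generated profile so that it can be stored as a ground truth record.
profile = json.loads(response["output"]["message"]["content"][0]["text"])
print(profile)
```

Repeating this call with varied prompts (different specialties, experience levels, and record types) gives you a test dataset that you can review and correct before you use it as ground truth.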
The following are key metrics that you can use to evaluate information extraction models:
- Accuracy and completeness – These metrics evaluate the extent to which the output captures the correct and complete information present in the ground truth data. This involves checking both the correctness of the extracted information and the presence of all relevant details.
- Similarity and relevance – These metrics assess the semantic, structural, and contextual similarity between the output and the ground truth data (similarity) and the degree to which the output aligns with and addresses the content, context, and intent of the ground truth data (relevance).
- Adjusted recall or capture rate – This rate empirically determines how many of the values present in the ground truth data the model identified correctly. The rate should include a penalty for any false values that the model extracts. For an example calculation, see the sketch that follows this list.
- Precision score – The precision score indicates what proportion of the model's predictions are true positives rather than false positives. For example, you can use precision metrics to measure the correctness of the extracted skill proficiency.
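The following is a minimal sketch of how you might compute the precision score and an adjusted recall that penalizes spurious extractions. It assumes that both the ground truth record and the model output can be flattened into sets of (field, value) pairs; the penalty scheme shown is one illustrative choice, not a prescribed formula.

```python
def flatten(record: dict) -> set[tuple[str, str]]:
    """Flatten a record into (field, value) pairs for comparison."""
    pairs = set()
    for field, value in record.items():
        values = value if isinstance(value, list) else [value]
        for v in values:
            pairs.add((field, str(v).strip().lower()))
    return pairs

def extraction_scores(ground_truth: dict, prediction: dict) -> dict:
    truth = flatten(ground_truth)
    predicted = flatten(prediction)

    true_positives = len(truth & predicted)
    false_positives = len(predicted - truth)

    # Precision: share of predicted values that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0

    # Adjusted recall: correct captures minus a penalty for false values,
    # normalized by the number of ground truth values (floored at zero).
    adjusted_recall = (
        max(0.0, (true_positives - false_positives) / len(truth)) if truth else 0.0
    )

    return {"precision": precision, "adjusted_recall": adjusted_recall}

# Example usage with a hypothetical talent profile.
truth = {"specialty": "cardiology", "skills": ["echocardiography", "triage"]}
pred = {"specialty": "cardiology", "skills": ["echocardiography", "billing"]}
print(extraction_scores(truth, pred))  # {'precision': 0.67, 'adjusted_recall': 0.33}
```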
Evaluating RAG solutions with multiple retrievers
To assess how well the system retrieves relevant information and how effectively it uses that information to generate accurate and contextually appropriate responses, you can use the following metrics:
- Response relevancy – This metric measures how relevant the generated response, which uses the retrieved context, is to the original query.
- Context precision – This metric evaluates the proportion of retrieved documents or snippets that are relevant to the query, out of the total retrieved results. A higher context precision indicates that the retrieval mechanism is effective at selecting relevant information. For a sample calculation, see the sketch that follows this list.
- Faithfulness – This metric assesses how accurately the generated response reflects the information in the retrieved context. In other words, it measures whether the response remains true to the source information.
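The following is a minimal sketch of a context precision calculation. It assumes that you already have a relevance label for each retrieved snippet, for example from human annotators or from an LLM evaluator; the data structures are illustrative. Libraries such as Ragas provide ready-made implementations of these RAG metrics.

```python
from dataclasses import dataclass

@dataclass
class RetrievedSnippet:
    text: str
    relevant: bool  # Relevance label from a human annotator or an LLM evaluator.

def context_precision(snippets: list[RetrievedSnippet]) -> float:
    """Proportion of retrieved snippets that are relevant to the query."""
    if not snippets:
        return 0.0
    relevant_count = sum(1 for s in snippets if s.relevant)
    return relevant_count / len(snippets)

# Example: three snippets retrieved for a clinical question, two judged relevant.
retrieved = [
    RetrievedSnippet("Beta blockers reduce heart rate ...", relevant=True),
    RetrievedSnippet("Hospital parking information ...", relevant=False),
    RetrievedSnippet("Contraindications for beta blockers ...", relevant=True),
]
print(f"Context precision: {context_precision(retrieved):.2f}")  # 0.67
```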
Evaluating a solution by using an LLM
You can use a technique called LLM-as-a-judge to evaluate the text responses from your generative AI solution. It involves using LLMs to evaluate and assess the performance of model outputs. This technique uses the capabilities of Amazon Bedrock to provide judgments on various attributes, such as response quality, coherence, accuracy, and completeness, and on adherence to human preferences or ground truth data. You can use chain-of-thought (CoT) prompting so that the evaluator model explains its reasoning before it assigns a judgment. The following are common LLM-as-a-judge techniques:
- Pairwise comparison – Give the LLM evaluator a medical question and multiple responses that were generated by different, iterative versions of the RAG systems that you created. Prompt the LLM evaluator to determine the best response based on response quality, coherence, and adherence to the original question.
- Single-answer grading – This technique is well suited for use cases where you need to evaluate the accuracy of categorization, such as patient outcome classification, patient behavior categorization, patient re-admission likelihood, and risk categorization. Use the LLM evaluator to analyze each categorization or classification in isolation, and evaluate the reasoning that it provides against ground truth data.
- Reference-guided grading – Provide the LLM evaluator with a series of medical questions that require descriptive answers. Create sample responses to these questions, such as reference answers or ideal responses. Prompt the LLM evaluator to compare the generated response against the reference or ideal responses, and to grade the generated response for accuracy, completeness, similarity, relevance, or other attributes. This technique helps you evaluate whether the generated responses align with a well-defined standard or exemplary answer. For a sample evaluation prompt, see the sketch that follows this list.
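The following is a minimal sketch of reference-guided grading that uses the Amazon Bedrock Converse API. The evaluator model ID, the rubric, and the 1-5 scoring scale are illustrative assumptions, not a prescribed evaluation prompt.

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

# Illustrative evaluator model ID; substitute the model that you use as the judge.
JUDGE_MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"

def grade_against_reference(question: str, reference_answer: str, candidate_answer: str) -> dict:
    """Ask the evaluator model to grade a candidate answer against a reference answer."""
    prompt = f"""You are evaluating a medical question answering system.

Question: {question}
Reference answer: {reference_answer}
Candidate answer: {candidate_answer}

Reason step by step about the accuracy, completeness, and relevance of the
candidate answer compared with the reference answer. Then return only a JSON
object with the keys "accuracy", "completeness", and "relevance", each scored
from 1 to 5, and a short "rationale"."""

    response = bedrock.converse(
        modelId=JUDGE_MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 512, "temperature": 0.0},
    )
    # Note: in practice you might need to extract the JSON from surrounding text.
    return json.loads(response["output"]["message"]["content"][0]["text"])

# Example usage with a hypothetical question and answers.
scores = grade_against_reference(
    question="What lifestyle changes help manage type 2 diabetes?",
    reference_answer="Regular exercise, a balanced diet, weight management, and smoking cessation.",
    candidate_answer="Patients should exercise regularly and follow a balanced diet.",
)
print(scores)
```

You can run the same grading function across your full set of medical questions and aggregate the scores to compare iterations of your solution.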