Use the fmeval library to run an automatic evaluation
Using the fmeval library in your own code gives you the most flexibility to customize your workflow. You can use the fmeval library to evaluate any LLM, and it also gives you more flexibility with your custom input datasets. The following steps show you how to set up your environment and how to run both a starting and a customized workflow using the fmeval library.
Get started using the fmeval library
You can configure your foundation model evaluation and customize it for your use case in a Studio notebook. Your configuration depends both on the kind of task that your foundation model is built to perform, and on how you want to evaluate it. FMEval supports open-ended generation, text summarization, question answering, and classification tasks. The steps in this section show you how to set up a starting workflow. This starting workflow includes setting up your environment and running an evaluation algorithm using either a JumpStart or an Amazon Bedrock foundation model with built-in datasets. If you must use a custom input dataset and workflow for a more specific use case, see Customize your workflow using the fmeval library.
If you don’t want to run a model evaluation in a Studio notebook, skip to step 11 in the following Get started using Studio section.
Prerequisites
- To run a model evaluation in a Studio UI, your AWS Identity and Access Management (IAM) role and any input datasets must have the correct permissions. If you do not have a SageMaker AI Domain or IAM role, follow the steps in Guide to getting set up with Amazon SageMaker AI.
To set permissions for your Amazon S3 bucket
After your domain and role are created, use the following steps to add the permissions needed to evaluate your model.
1. Open the Amazon SageMaker AI console at https://console.aws.amazon.com/sagemaker/.
2. In the navigation pane, enter S3 into the search bar at the top of the page.
3. Choose S3 under Services.
4. Choose Buckets from the navigation pane.
5. In the General purpose buckets section, under Name, choose the name of the S3 bucket that you want to use to store your model input and output in the console. If you do not have an S3 bucket, do the following:
   - Select Create bucket to open a new Create bucket page.
   - In the General configuration section, under AWS Region, select the AWS Region where your foundation model is located.
   - Name your S3 bucket in the input box under Bucket name.
   - Accept all of the default choices.
   - Select Create bucket.
   - In the General purpose buckets section, under Name, select the name of the S3 bucket that you created.
6. Choose the Permissions tab.
7. Scroll to the Cross-origin resource sharing (CORS) section at the bottom of the window. Choose Edit.
8. To add permissions to your bucket for foundation model evaluations, ensure that the following code appears in the input box. You can also copy and paste the following into the input box.

   [
       {
           "AllowedHeaders": ["*"],
           "AllowedMethods": ["GET", "PUT", "POST", "DELETE"],
           "AllowedOrigins": ["*"],
           "ExposeHeaders": ["Access-Control-Allow-Origin"]
       }
   ]

9. Choose Save changes.
To add permissions to your IAM policy
1. In the search bar at the top of the page, enter IAM.
2. Under Services, select Identity and Access Management (IAM).
3. Choose Policies from the navigation pane.
4. Input AmazonSageMakerFullAccess into the search bar. Select the radio button next to the policy that appears. The Actions button can now be selected.
5. Choose the down arrow next to Actions. Two options appear.
6. Choose Attach.
7. In the IAM listing that appears, search for the name of the role you created. Select the check box next to the name.
8. Choose Attach policy.
Get started using Studio
1. In the search bar at the top of the page, enter SageMaker AI.
2. Under Services, select Amazon SageMaker AI.
3. Choose Studio from the navigation pane.
4. Choose your domain from the Get Started section, after expanding the down arrow under Select Domain.
5. Choose your user profile from the Get Started section, after expanding the down arrow under Select user profile.
6. Choose Open Studio to open the landing page for Studio.
7. Select the file browser from the navigation pane and navigate to the root directory.
8. Select Create notebook.
9. In the notebook environment dialog box that opens, select the Data Science 3.0 image.
10. Choose Select.
11. Install the fmeval package in your development environment, as shown in the following code example:

    !pip install fmeval
Note
Install the fmeval library into an environment that uses Python 3.10. For more information about the requirements needed to run fmeval, see fmeval dependencies.
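If you are not sure which Python version your notebook kernel uses, a quick standard-library check such as the following can confirm it before you install the package:

import sys

# fmeval expects a Python 3.10 environment; confirm the kernel version before installing.
print(sys.version)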
FMEval uses a high-level wrapper called ModelRunner to compose input, invoke your model, and extract output from it. The fmeval package can evaluate any LLM; however, the procedure to configure ModelRunner depends on the kind of model that you want to evaluate. This section explains how to configure ModelRunner for a JumpStart or Amazon Bedrock model. If you want to use a custom input dataset and a custom ModelRunner, see Customize your workflow using the fmeval library.
Use a JumpStart model
To use ModelRunner to evaluate a JumpStart model, create or provide an endpoint, define the model and the built-in dataset, and then configure and test ModelRunner.
Define a JumpStart model and configure a ModelRunner
1. Provide an endpoint by doing either of the following:
   - Specify the EndpointName of an existing JumpStart endpoint, the model_id, and the model_version.
   - Specify the model_id and model_version for your model, and create a JumpStart endpoint.

   The following code example shows how to create an endpoint for a Llama 2 foundation model that's available through JumpStart.

   import sagemaker
   from sagemaker.jumpstart.model import JumpStartModel

   # JumpStart model and version
   model_id, model_version = "meta-textgeneration-llama-2-7b-f", "*"

   my_model = JumpStartModel(model_id=model_id)
   predictor = my_model.deploy()
   endpoint_name = predictor.endpoint_name

   # Accept the EULA, and test the endpoint to make sure it can predict.
   predictor.predict({"inputs": [[{"role":"user", "content": "Hello how are you?"}]]}, custom_attributes='accept_eula=true')

   The previous code example refers to EULA, which stands for end-user license agreement (EULA). The EULA can be found in the model card description of the model that you are using. To use some JumpStart models, you must specify accept_eula=true, as shown in the previous call to predict. For more information about EULAs, see the Licenses and model sources section in Model sources and license agreements. You can find a list of available JumpStart models in the Built-in Algorithms with pre-trained Model Table.

2. Configure ModelRunner by using the JumpStartModelRunner, as shown in the following configuration example:

   from fmeval.model_runners.sm_jumpstart_model_runner import JumpStartModelRunner

   js_model_runner = JumpStartModelRunner(
       endpoint_name=endpoint_name,
       model_id=model_id,
       model_version=model_version
   )

   In the previous configuration example, use the same values for endpoint_name, model_id, and model_version that you used to create the endpoint.

3. Test your ModelRunner. Send a sample request to your model as shown in the following code example:

   js_model_runner.predict("What is the capital of London")
Use an Amazon Bedrock model
To evaluate an Amazon Bedrock model, you must define the model and built-in dataset, and configure ModelRunner.
Define an Amazon Bedrock model and configure a ModelRunner
1. To define and print model details, use the following code example for a Titan model that is available through Amazon Bedrock:

   import boto3
   import json

   bedrock = boto3.client(service_name='bedrock')
   bedrock_runtime = boto3.client(service_name='bedrock-runtime')

   model_id = "amazon.titan-tg1-large"
   accept = "application/json"
   content_type = "application/json"

   print(bedrock.get_foundation_model(modelIdentifier=model_id).get('modelDetails'))

   In the previous code example, the accept parameter specifies the format of the data that you want to use to evaluate your LLM. The contentType specifies the format of the input data in the request. Only MIME_TYPE_JSON is supported for accept and contentType for Amazon Bedrock models. For more information about these parameters, see InvokeModelWithResponseStream. A sketch of a raw model invocation that uses these parameters appears after this procedure.
2. To configure ModelRunner, use the BedrockModelRunner, as shown in the following configuration example:

   from fmeval.model_runners.bedrock_model_runner import BedrockModelRunner

   bedrock_model_runner = BedrockModelRunner(
       model_id=model_id,
       output='results[0].outputText',
       content_template='{"inputText": $prompt, "textGenerationConfig": \
           {"maxTokenCount": 4096, "stopSequences": [], "temperature": 1.0, "topP": 1.0}}',
   )

   Parametrize the ModelRunner configuration as follows.
   - Use the same value for model_id that you used to deploy the model.
   - Use output to specify the format of the generated json response. As an example, if your LLM provided the response [{"results": "this is the output"}], then output='results[0].outputText' returns this is the output.
   - Use content_template to specify how your LLM interacts with requests. The following configuration template is detailed solely to explain the previous configuration example, and it's not required.
     - In the previous configuration example, the variable inputText specifies the prompt, which captures the request made by the user.
     - The variable textGenerationConfig specifies how the LLM generates responses, as follows:
       - The parameter maxTokenCount is used to limit the length of the response by limiting the number of tokens returned by the LLM.
       - The parameter stopSequences is used to specify a list of character sequences that tell your LLM to stop generating a response. The model output is stopped the first time any of the listed strings are encountered in the output. As an example, you can use a carriage return sequence to limit the model response to a single line.
       - The parameter topP controls the randomness by limiting the set of tokens to consider when generating the next token. This parameter accepts values between 0.0 and 1.0. Higher values of topP allow for a set containing a broader vocabulary, and lower values restrict the set of tokens to more probable words.
       - The parameter temperature controls the randomness of the generated text, and accepts positive values. Higher values of temperature instruct the model to generate more random and diverse responses. Lower values generate responses that are more predictable. Typical ranges for temperature lie between 0.2 and 2.0.

     For more information about the parameters for a specific Amazon Bedrock foundation model, see Inference parameters for foundation models.

     The format of the content_template parameter depends on the inputs and parameters supported by your LLM. For example, Anthropic’s Claude 2 model can support the following content_template:

     "content_template": "{\"prompt\": $prompt, \"max_tokens_to_sample\": 500}"

     As another example, the Falcon 7b model can support the following content_template:

     "content_template": "{\"inputs\": $prompt, \"parameters\":{\"max_new_tokens\": 10, \"top_p\": 0.9, \"temperature\": 0.8}}"

3. Lastly, test your ModelRunner. Send a sample request to your model as shown in the following code example:

   bedrock_model_runner.predict("What is the capital of London?")
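The output JMESPath ('results[0].outputText') and the content_template correspond to the raw request and response shapes of the Bedrock InvokeModel API. As a sketch of the raw invocation referenced in the previous procedure (the response field names follow the Titan text model's documented shape and may differ for other models), you can invoke the model directly to inspect the JSON that ModelRunner parses:

# Assumes bedrock_runtime, model_id, accept, and content_type from the previous steps.
body = json.dumps({
    "inputText": "What is the capital of France?",
    "textGenerationConfig": {"maxTokenCount": 256, "stopSequences": [], "temperature": 1.0, "topP": 1.0}
})

response = bedrock_runtime.invoke_model(
    modelId=model_id, body=body, accept=accept, contentType=content_type
)
response_body = json.loads(response['body'].read())

# For Titan text models, the generated text is expected under results[0].outputText,
# which is what the output JMESPath in BedrockModelRunner extracts.
print(response_body['results'][0]['outputText'])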
After you configure your data and ModelRunner, you can run an evaluation algorithm on the responses generated by your LLM. To see a list of all of the available evaluation algorithms, run the following code:

from fmeval.eval_algo_mapping import EVAL_ALGORITHMS

print(EVAL_ALGORITHMS.keys())
Each algorithm has both an evaluate and an evaluate_sample method. The evaluate method computes a score for the entire dataset. The evaluate_sample method evaluates the score for a single instance.
The evaluate_sample method returns EvalScore objects. EvalScore objects contain aggregated scores of how well your model performed during evaluation. The evaluate_sample method has the following optional parameters:
- model_output – The model response for a single request.
- model_input – A prompt containing the request to your model.
- target_output – The expected response from the prompt contained in model_input.
The following code example shows how to use evaluate_sample:

# Evaluate your custom sample
model_output = model_runner.predict("London is the capital of?")[0]
eval_algo.evaluate_sample(target_output="UK<OR>England<OR>United Kingdom", model_output=model_output)
The evaluate method has the following optional parameters:
- model – An instance of ModelRunner using the model that you want to evaluate.
- dataset_config – The dataset configuration. If dataset_config is not provided, the model is evaluated using all of the built-in datasets that are configured for this task.
- prompt_template – A template used to generate prompts. If prompt_template is not provided, your model is evaluated using a default prompt template.
- save – If set to True, record-wise prompt responses and scores are saved to the file EvalAlgorithmInterface.EVAL_RESULTS_PATH. Defaults to False.
- num_records – The number of records that are sampled randomly from the input dataset for evaluation. Defaults to 300.
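The evaluation examples later in this section pass a dataset_config object without showing how it is built. As a minimal sketch, assuming a JSON Lines dataset stored in Amazon S3 with hypothetical column names (for full details, see Customize your workflow using the fmeval library), a custom DataConfig might look like the following:

from fmeval.data_loaders.data_config import DataConfig
from fmeval.constants import MIME_TYPE_JSONLINES

# Hypothetical dataset location and column names; replace them with your own.
config = DataConfig(
    dataset_name="my_custom_dataset",
    dataset_uri="s3://amzn-s3-demo-bucket/my_dataset.jsonl",
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="question",
    target_output_location="answer",
)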
The evaluate method returns a list of EvalOutput objects that can include the following:
- eval_name – The name of the evaluation algorithm.
- dataset_name – The name of the dataset used by the evaluation algorithm.
- prompt_template – A template used to compose prompts that is consumed if the parameter model_output is not provided in the dataset. For more information, see prompt_template in the Configure a JumpStart ModelRunner section.
- dataset_scores – An aggregated score computed across the whole dataset.
- category_scores – A list of CategoryScore objects that contain the scores for each category in the dataset.
- output_path – The local path to the evaluation output. This output contains prompt responses with record-wise evaluation scores.
- error – A string error message for a failed evaluation job.
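For example, you can inspect these fields after any of the evaluate calls shown later in this section. The following is a sketch that assumes the returned list was assigned to eval_output:

# eval_output is the list of EvalOutput objects returned by evaluate()
for result in eval_output:
    print(result.eval_name, result.dataset_name)
    print(result.dataset_scores)     # aggregated scores across the dataset
    print(result.category_scores)    # per-category scores, if the dataset has categories
    print(result.output_path)        # local path to record-wise prompt responses and scores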
The following dimensions are available for model evaluation:
- Accuracy
- Factual knowledge
- Prompt stereotyping
- Semantic robustness
- Toxicity
Accuracy
You can run an accuracy algorithm for a question answering, text summarization, or classification task. The algorithms are different for each task in order to accommodate the different data input types and problems, as follows:
- For question answering tasks, run the QAAccuracy algorithm with a QAAccuracyConfig.
- For text summarization tasks, run the SummarizationAccuracy algorithm with a SummarizationAccuracyConfig.
- For classification tasks, run the ClassificationAccuracy algorithm with a ClassificationAccuracyConfig.
The QAAccuracy algorithm returns a list of EvalOutput objects that contains one accuracy score for each sample. To run the question answering accuracy algorithm, instantiate a QAAccuracyConfig and pass in either <OR> or None as the target_output_delimiter. The question answering accuracy algorithm compares the response that your model generates with a known response. If you pass in <OR> as the target delimiter, then the algorithm scores the response as correct if it generates any of the content separated by <OR> in the answer. If you pass None or an empty string as the target_output_delimiter, the code throws an error.
Call the evaluate method and pass in your desired parameters as shown in the following code example:

from fmeval.eval import get_eval_algorithm
from fmeval.eval_algorithms.qa_accuracy import QAAccuracy, QAAccuracyConfig

eval_algo = QAAccuracy(QAAccuracyConfig(target_output_delimiter="<OR>"))
eval_output = eval_algo.evaluate(model=model_runner, dataset_config=config, prompt_template="$feature", save=True)
The SummarizationAccuracy algorithm returns a list of EvalOutput objects that contain scores for ROUGE-N, METEOR, and BERTScore. To run the text summarization accuracy algorithm, instantiate a SummarizationAccuracyConfig and pass in the following:
- Specify the type of ROUGE metric you want to use in your evaluation in rouge_type. You can choose rouge1, rouge2, or rougeL. These metrics compare generated summaries to reference summaries. ROUGE-1 compares the generated summaries and reference summaries using overlapping unigrams (sequences of one item such as “the”, “is”). ROUGE-2 compares the generated and reference summaries using bigrams (groups of two sequences such as “the large”, “is home”). ROUGE-L compares the longest matching sequence of words. For more information about ROUGE, see ROUGE: A Package for Automatic Evaluation of Summaries.
- Set use_stemmer_for_rouge to True or False. A stemmer removes affixes from words before comparing them. For example, a stemmer removes the affixes from “swimming” and “swam” so that they are both “swim” after stemming.
- Set model_type_for_bertscore to the model that you want to use to calculate a BERTScore. You can choose ROBERTA_MODEL or the more advanced MICROSOFT_DEBERTA_MODEL.
Lastly, call the evaluate method and pass in your desired parameters as shown in the following code example:

from fmeval.eval import get_eval_algorithm
from fmeval.eval_algorithms.summarization_accuracy import SummarizationAccuracy, SummarizationAccuracyConfig

eval_algo = SummarizationAccuracy(SummarizationAccuracyConfig(rouge_type="rouge1", model_type_for_bertscore="MICROSOFT_DEBERTA_MODEL"))
eval_output = eval_algo.evaluate(model=model_runner, dataset_config=config, prompt_template="$feature", save=True)
The ClassificationAccuracy algorithm returns a list of EvalOutput objects that contain the classification accuracy, precision, recall, and balanced accuracy scores for each sample. For more information about these scores, see the Classification section in Using prompt datasets and available evaluation dimensions in model evaluation jobs. To run the classification accuracy algorithm, instantiate a ClassificationAccuracyConfig and pass in an averaging strategy to multiclass_average_strategy. You can choose micro, macro, samples, weighted, or binary. The default value is micro. Then, pass in a list containing the names of the columns that contain the true labels for your classification categories to valid_labels. Lastly, call the evaluate method and pass in your desired parameters as shown in the following code example:

from fmeval.eval import get_eval_algorithm
from fmeval.eval_algorithms.classification_accuracy import ClassificationAccuracy, ClassificationAccuracyConfig

eval_algo = ClassificationAccuracy(ClassificationAccuracyConfig(multiclass_average_strategy="samples", valid_labels=["animal_type", "plant_type", "fungi_type"]))
eval_output = eval_algo.evaluate(model=model_runner, dataset_config=config, prompt_template="$feature", save=True)
Factual knowledge
You can run the factual knowledge algorithm for open-ended generation. To run the factual knowledge algorithm, instantiate a FactualKnowledgeConfig and optionally pass a delimiter string (by default, this is <OR>). The factual knowledge algorithm compares the response that your model generates with a known response. The algorithm scores the response as correct if it generates any of the content separated by the delimiter in the answer. If you pass None as the target_output_delimiter, then the model must generate the same response as the answer to be scored as correct. Lastly, call the evaluate method and pass in your desired parameters.
Factual knowledge returns a list of EvalScore objects. These contain aggregated scores on how well your model is able to encode factual knowledge, as described in the Foundation model evaluation overview section. The scores range between 0 and 1, with the lowest score corresponding to a lower knowledge of real-world facts.
The following code example shows how to evaluate your LLM using the factual knowledge algorithm:

from fmeval.eval import get_eval_algorithm
from fmeval.eval_algorithms.factual_knowledge import FactualKnowledge, FactualKnowledgeConfig

eval_algo = FactualKnowledge(FactualKnowledgeConfig())
eval_output = eval_algo.evaluate(model=model_runner, dataset_config=config, prompt_template="$feature", save=True)
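To check the delimiter behavior described above on a single record, you can also call evaluate_sample directly. The following sketch uses a hypothetical target and model output:

# The target lists acceptable answers separated by the default <OR> delimiter.
# The record is scored as correct if the model output contains any of them.
single_scores = eval_algo.evaluate_sample(
    target_output="Paris<OR>paris",
    model_output="The capital of France is Paris.",
)
print(single_scores)  # a list of EvalScore objects for this one record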
Prompt stereotyping
You can run the prompt stereotyping algorithm for open-ended generation. To run the prompt stereotyping algorithm, your DataConfig must identify the columns in your input dataset that contain a less stereotypical sentence in sent_less_input_location and a more stereotypical sentence in sent_more_input_location. For more information about DataConfig, see the previous section 2. Configure ModelRunner. Next, call the evaluate method and pass in your desired parameters.
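Building on the DataConfig sketch shown earlier, a prompt stereotyping configuration adds the two sentence-pair columns. The dataset URI and column names here are hypothetical:

from fmeval.data_loaders.data_config import DataConfig
from fmeval.constants import MIME_TYPE_JSONLINES

# Hypothetical sentence-pair dataset; replace the URI and column names with your own.
config = DataConfig(
    dataset_name="my_stereotyping_dataset",
    dataset_uri="s3://amzn-s3-demo-bucket/my_stereotyping_dataset.jsonl",
    dataset_mime_type=MIME_TYPE_JSONLINES,
    sent_more_input_location="sent_more",
    sent_less_input_location="sent_less",
)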
Prompt stereotyping returns a list of EvalOutput objects that contain a score for each input record and overall scores for each type of bias. The scores are calculated by comparing the probability of the more and less stereotypical sentences. The overall score reports how often the model preferred the stereotypical sentence, in that the model assigns a higher probability to the more stereotypical sentence compared to the less stereotypical sentence. A score of 0.5 indicates that your model is unbiased, or that it prefers more and less stereotypical sentences at equal rates. A score of greater than 0.5 indicates that your model is likely to generate a response that is more stereotypical. Scores less than 0.5 indicate that your model is likely to generate a response that is less stereotypical.
The following code example shows how to evaluate your LLM using the prompt stereotyping algorithm:

from fmeval.eval import get_eval_algorithm
from fmeval.eval_algorithms.prompt_stereotyping import PromptStereotyping

eval_algo = PromptStereotyping()
eval_output = eval_algo.evaluate(model=model_runner, dataset_config=config, prompt_template="$feature", save=True)
Semantic robustness
You can run a semantic robustness algorithm for any FMEval task; however, your model should be deterministic. A deterministic model is one that always generates the same output for the same input. You can typically achieve determinism by setting a random seed in the decoding process. The algorithms are different for each task in order to accommodate the different data input types and problems, as follows:
- For open-ended generation, question answering, or classification tasks, run the GeneralSemanticRobustness algorithm with a GeneralSemanticRobustnessConfig.
- For text summarization, run the SummarizationAccuracySemanticRobustness algorithm with a SummarizationAccuracySemanticRobustnessConfig.
The GeneralSemanticRobustness algorithm returns a list of EvalScore objects that contain accuracy values between 0 and 1 quantifying the difference between the perturbed and unperturbed model outputs. To run the general semantic robustness algorithm, instantiate a GeneralSemanticRobustnessConfig and pass in a perturbation_type. You can choose one of the following for perturbation_type:
- Butterfinger – A perturbation that mimics spelling mistakes using character swaps based on keyboard distance. Input a probability that a given character is perturbed. Butterfinger is the default value for perturbation_type.
- RandomUpperCase – A perturbation that changes a fraction of characters to uppercase. Input a decimal from 0 to 1.
- WhitespaceAddRemove – A perturbation that adds a white space character in front of a non-white space character, or removes an existing white space character. Input the corresponding probabilities.
You can also specify the following parameters:
- num_perturbations – The number of perturbations for each sample to introduce into the generated text. The default is 5.
- butter_finger_perturbation_prob – The probability that a character is perturbed. Used only when perturbation_type is Butterfinger. The default is 0.1.
- random_uppercase_corrupt_proportion – The fraction of characters to be changed to uppercase. Used only when perturbation_type is RandomUpperCase. The default is 0.1.
- whitespace_add_prob – Given a non-white space character, the probability of adding a white space in front of it. Used only when perturbation_type is WhitespaceAddRemove. The default is 0.05.
- whitespace_remove_prob – Given a white space character, the probability of removing it from a sample. Used only when perturbation_type is WhitespaceAddRemove. The default is 0.1.
Lastly, call the evaluate method and pass in your desired parameters as shown in the following code example:

from fmeval.eval import get_eval_algorithm
from fmeval.eval_algorithms.general_semantic_robustness import GeneralSemanticRobustness, GeneralSemanticRobustnessConfig

eval_algo = GeneralSemanticRobustness(GeneralSemanticRobustnessConfig(perturbation_type="RandomUpperCase", num_perturbations=2, random_uppercase_corrupt_proportion=0.3))
eval_output = eval_algo.evaluate(model=model_runner, dataset_config=config, prompt_template="$feature", save=True)
The SummarizationAccuracySemanticRobustness algorithm returns a list of EvalScore objects that contain the difference (or delta) between the ROUGE-N, METEOR, and BERTScore values of the perturbed and unperturbed model outputs. To run the summarization accuracy semantic robustness algorithm, instantiate a SummarizationAccuracySemanticRobustnessConfig and pass in a perturbation_type. You can choose one of the following for perturbation_type:
- Butterfinger – A perturbation that mimics spelling mistakes using character swaps based on keyboard distance. Input a probability that a given character is perturbed. Butterfinger is the default value for perturbation_type.
- RandomUpperCase – A perturbation that changes a fraction of characters to uppercase. Input a decimal from 0 to 1.
- WhitespaceAddRemove – A perturbation that adds a white space character in front of a non-white space character, or removes an existing white space character. Input the corresponding probabilities.
You can also specify the following parameters:
- num_perturbations – The number of perturbations for each sample to introduce into the generated text. The default is 5.
- butter_finger_perturbation_prob – The probability that a character is perturbed. Used only when perturbation_type is Butterfinger. The default is 0.1.
- random_uppercase_corrupt_proportion – The fraction of characters to be changed to uppercase. Used only when perturbation_type is RandomUpperCase. The default is 0.1.
- whitespace_add_prob – Given a non-white space character, the probability of adding a white space in front of it. Used only when perturbation_type is WhitespaceAddRemove. The default is 0.05.
- whitespace_remove_prob – Given a white space character, the probability of removing it from a sample. Used only when perturbation_type is WhitespaceAddRemove. The default is 0.1.
rouge_type
– Metrics that compare generated summaries to reference summaries. Specify the type of ROUGEmetric you want to use in your evaluation to rouge_type
. You can chooserouge1
,rouge2
, orrougeL
. ROUGE-1 compares the generated summaries and reference summaries using overlapping unigrams (sequences of one item such as “the”, “is”). ROUGE-2 compares the generated and reference summaries using bigrams (groups of two sequences such as “the large”, “is home”). ROUGE-L compares the longest matching sequence of words. For more information about ROUGE, see ROUGE: A Package for Automatic Evaluation of Summaries. -
Set
user_stemmer_for_rouge
toTrue
orFalse
. A stemmer removes affixes from words before comparing them. For example, a stemmer removes the affixes from “swimming” and “swam” so that they are both “swim” after stemming. -
Set
model_type_for_bertscore
to the model that you want to use to calculate a BERTScore. You can choose ROBERTA_MODEL or the more advanced MICROSOFT_DEBERTA_MODEL . Call the
evaluate
method and pass in your desired parameters as shown in the following code example:from fmeval.eval import get_eval_algorithm from fmeval.eval_algorithms.summarization_accuracy_semantic_robustness import SummarizationAccuracySemanticRobustness, SummarizationAccuracySemanticRobustnessConfig eval_algo = SummarizationAccuracySemanticRobustness(SummarizationAccuracySemanticRobustnessConfig(perturbation_type="Butterfinger",num_perturbations=3,butter_finger_perturbation_prob=0.2))) eval_output = eval_algo.evaluate(model=model_runner, dataset_config=config, prompt_template="$feature", save=True)
Toxicity
You can run a toxicity algorithm for open-ended generation, text summarization, or question answering. There are three distinct classes depending on the task.
- For open-ended generation, run the Toxicity algorithm with a ToxicityConfig.
For summarization, use the class
Summarization_Toxicity
. -
For question answering, use the class
QAToxicity
.
The toxicity algorithm returns a list of one or more EvalScore objects (depending on the toxicity detector) that contain scores between 0 and 1. To run the toxicity algorithm, instantiate a ToxicityConfig and pass in a toxicity model to use to evaluate your model in model_type. You can choose the following for model_type:
- `detoxify` for UnitaryAI Detoxify-unbiased, a multilabel text classifier trained on the Toxic Comment Classification Challenge and Jigsaw Unintended Bias in Toxicity Classification. The model provides 7 scores for the following classes: toxicity, severe toxicity, obscenity, threat, insult, sexual explicitness, and identity attack.

  The following is example output from the detoxify model:

  EvalScore(name='toxicity', value=0.01936926692724228),
  EvalScore(name='severe_toxicity', value=3.3755677577573806e-06),
  EvalScore(name='obscene', value=0.00022437423467636108),
  EvalScore(name='identity_attack', value=0.0006707844440825284),
  EvalScore(name='insult', value=0.005559926386922598),
  EvalScore(name='threat', value=0.00016682750720065087),
  EvalScore(name='sexual_explicit', value=4.828436431125738e-05)
- `toxigen` for Toxigen-roberta, a binary RoBERTa-based text classifier fine-tuned on the ToxiGen dataset, which contains sentences with subtle and implicit toxicity pertaining to 13 minority groups.
Lastly, call the evaluate method and pass in your desired parameters as shown in the following code example.

from fmeval.eval import get_eval_algorithm
from fmeval.eval_algorithms.toxicity import Toxicity, ToxicityConfig

eval_algo = Toxicity(ToxicityConfig(model_type="detoxify"))
eval_output = eval_algo.evaluate(model=model_runner, dataset_config=config, prompt_template="$feature", save=True)
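Per-record scores like the detoxify example output shown earlier can also be produced directly with evaluate_sample. The following sketch passes a hypothetical model output string:

# Score a single, hypothetical model output; returns a list of EvalScore objects,
# one per class reported by the chosen toxicity detector.
print(eval_algo.evaluate_sample(model_output="That is a rude and hurtful remark."))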