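Understand the results of a human evaluation job

When you create a model evaluation job that uses human workers, you select one or more metric types. When members of the work team evaluate a response in the worker portal, their responses are saved in the humanAnswers JSON object. How those responses are stored depends on the metric types selected when the job was created.

The following sections explain these differences and provide examples.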

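JSON output reference

When a model evaluation job completes, the results are saved in Amazon S3 as a JSON file. The JSON object contains three high-level nodes: humanEvaluationResult, inputRecord, and modelResponses. The humanEvaluationResult key is a high-level node that contains the responses from the work team assigned to the model evaluation job. The inputRecord key is a high-level node that contains the prompts provided to the models when the model evaluation job was created. The modelResponses key is a high-level node that contains the models' responses to the prompts.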
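
As a minimal sketch, assuming the output file has already been downloaded from Amazon S3 and read into a string, the three high-level nodes can be pulled apart with Python's standard json module. The abbreviated record, the model name "model-a", and the helper name split_top_level_nodes are illustrative assumptions, not part of the service.

```python
import json

# Abbreviated, hypothetical record; a real output file contains many more fields.
raw_record = """
{
  "humanEvaluationResult": {"humanAnswers": []},
  "inputRecord": {"prompt": {"text": "Who invented the airplane?"}},
  "modelResponses": [
    {"modelIdentifier": "model-a", "text": "the-models-response-to-the-prompt"}
  ]
}
"""

EXPECTED_NODES = ("humanEvaluationResult", "inputRecord", "modelResponses")


def split_top_level_nodes(text):
    """Parse one result record and return its three high-level nodes."""
    record = json.loads(text)
    missing = [key for key in EXPECTED_NODES if key not in record]
    if missing:
        raise KeyError(f"missing high-level nodes: {missing}")
    return tuple(record[key] for key in EXPECTED_NODES)


human_result, input_record, model_responses = split_top_level_nodes(raw_record)
print(input_record["prompt"]["text"])
print(len(model_responses))
```

Checking for the three nodes up front makes a truncated or unrelated file fail early instead of partway through later processing.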

The following table summarizes the key-value pairs found in the JSON output from a model evaluation job.

The sections that follow describe each key-value pair in more detail.

Parameter	Example	Description
`flowDefinitionArn`	`arn:aws:sagemaker:us-west-2:111122223333:flow-definition/flow-definition-name`	The ARN of the human review workflow (flow definition) that created the human loop.
`humanAnswers`	A list of JSON objects specific to the evaluation metrics selected. To learn more, see Key-value pairs found under `humanAnswers`.	A list of JSON objects that contain the worker responses.
`humanLoopName`	`system-generated-hash`	A system-generated 40-character hexadecimal string.
`inputRecord`	`"inputRecord": { "prompt": { "text": "Who invented the airplane?" }, "category": "Airplanes", "referenceResponse": { "text": "Orville and Wilbur Wright" }, "responses": [{ "modelIdentifier": "meta-textgeneration-llama-codellama-7b", "text": "The Wright brothers, Orville and Wilbur Wright are widely credited with inventing and manufacturing the world's first successful airplane." }] }`	A JSON object that contains an entry prompt from the input dataset.
`modelResponses`	`"modelResponses": [{ "modelIdentifier": "arn:aws:bedrock:us-west-2::foundation-model/model-id", "text": "the-models-response-to-the-prompt" }]`	The individual responses from the models.
`inputContent`	`{ "additionalDataS3Uri":"s3://user-specified-S3-URI-path/datasets/dataset-name/records/record-number/human-loop-additional-data.json", "evaluationMetrics":[ { "description":"brief-name", "metricName":"metric-name", "metricType":"IndividualLikertScale" } ], "instructions":"example instructions" }`	The human loop input content required to start a human loop in your Amazon S3 bucket.
`modelResponseIdMap`	`{ "0": "sm-margaret-meta-textgeneration-llama-2-7b-1711485008-0612", "1": "jumpstart-dft-hf-llm-mistral-7b-ins-20240327-043352" }`	Describes how each model is represented in `answerContent`.

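Key-value pairs found under `humanEvaluationResult`

The following key-value pairs are found under humanEvaluationResult in the output of a model evaluation job.

For the key-value pairs associated with humanAnswers, see Key-value pairs found under `humanAnswers`.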

flowDefinitionArn

The ARN of the flow definition used to run the model evaluation job.
Example: arn:aws:sagemaker:us-west-2:111122223333:flow-definition/flow-definition-name

humanLoopName

A system-generated 40-character hexadecimal string.

inputContent

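This key value describes the metric types and the instructions that you provided to workers in the worker portal.
- additionalDataS3Uri: The location in Amazon S3 where the instructions for workers are saved.
- instructions: The instructions that you provided to workers in the worker portal.
- evaluationMetrics: The name of each metric and its description. The key value metricType is the tool provided to workers to evaluate the models' responses.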

modelResponseIdMap

This key-value pair identifies the full names of the selected models and how worker choices are mapped to those models in the humanAnswers key-value pairs.
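
The lookup described above can be sketched as follows. The map values are copied from the example output in this topic; the helper name resolve_model_name is hypothetical.

```python
# Values copied from the example output in this topic.
model_response_id_map = {
    "0": "sm-margaret-meta-textgeneration-llama-2-7b-1711485008-0612",
    "1": "jumpstart-dft-hf-llm-mistral-7b-ins-20240327-043352",
}


def resolve_model_name(model_response_id, id_map):
    """Return the full model name for a modelResponseId such as "0" or 1."""
    # Keys in the map are strings, so normalize the ID before the lookup.
    return id_map[str(model_response_id)]


# A comparisonChoice entry in the shape shown in the example output.
choice = {"metricName": "Fluency", "result": {"modelResponseId": "1"}}
chosen_model = resolve_model_name(
    choice["result"]["modelResponseId"], model_response_id_map
)
print(chosen_model)
```

Normalizing to a string is a defensive choice: some result fields carry the ID as a string, and the sketch assumes others might carry it as a number.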

Key-value pairs found under `inputRecord`

The following entries describe the inputRecord key-value pairs.

prompt

The text of the prompt that was sent to the model.

category

An optional category that classifies the prompt. It is shown to workers in the worker portal during the model evaluation.
Example: "American cities"

referenceResponse

An optional field in the input JSON that specifies the ground truth that workers reference during the evaluation.

responses

An optional field in the input JSON that contains responses from other models.

Example JSON input record

In the responses array, the same modelIdentifier must be specified for all responses.

{
  "prompt": {
    "text": "Who invented the airplane?"
  },
  "category": "Airplanes",
  "referenceResponse": {
    "text": "Orville and Wilbur Wright"
  },
  "responses": [{
    "modelIdentifier": "meta-textgeneration-llama-codellama-7b",
    "text": "The Wright brothers, Orville and Wilbur Wright are widely credited with inventing and manufacturing the world's first successful airplane."
  }]
}
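
Before a dataset is submitted, a record like the one above can be checked with a small script. This is only a sketch: treating prompt as required is an assumption inferred from the other three fields being described as optional, and check_input_record is a hypothetical helper, not a service API.

```python
# Assumption: "prompt" is required because the other fields are documented as optional.
REQUIRED_KEYS = {"prompt"}
OPTIONAL_KEYS = {"category", "referenceResponse", "responses"}


def check_input_record(record):
    """Return a list of problems found in one input record (empty if none)."""
    keys = set(record)
    problems = [f"missing required key: {k}" for k in sorted(REQUIRED_KEYS - keys)]
    problems += [
        f"unexpected key: {k}" for k in sorted(keys - REQUIRED_KEYS - OPTIONAL_KEYS)
    ]
    # All entries in "responses" must share one modelIdentifier.
    identifiers = {r.get("modelIdentifier") for r in record.get("responses", [])}
    if len(identifiers) > 1:
        problems.append("responses use more than one modelIdentifier")
    return problems


example_record = {
    "prompt": {"text": "Who invented the airplane?"},
    "category": "Airplanes",
    "referenceResponse": {"text": "Orville and Wilbur Wright"},
    "responses": [
        {
            "modelIdentifier": "meta-textgeneration-llama-codellama-7b",
            "text": "The Wright brothers, Orville and Wilbur Wright ...",
        }
    ],
}

# Hypothetical bad record: no prompt, and two different model identifiers.
bad_record = {
    "responses": [
        {"modelIdentifier": "model-a", "text": "first"},
        {"modelIdentifier": "model-b", "text": "second"},
    ]
}

print(check_input_record(example_record))
print(check_input_record(bad_record))
```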

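Key-value pairs found under `modelResponses`

An array of key-value pairs that contains the responses from the models and identifies which model provided each response.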

text

The model's response to the prompt.

modelIdentifier

The name of the model.

Key-value pairs found under `humanAnswers`

An array of key-value pairs that contains the responses from the models and how the workers evaluated the models.

acceptanceTime

The time when the worker accepted the task in the worker portal.

submissionTime

The time when the worker submitted their response.

timeSpentInSeconds

The time, in seconds, that the worker spent completing the task.

workerId

The ID of the worker who completed the task.

workerMetadata

Metadata about the work team that was assigned to this model evaluation job.
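
The timing fields can be cross-checked with a short script. In the example output in this topic, the gap between acceptanceTime and submissionTime equals timeSpentInSeconds; whether that holds for every job is an assumption, not something this reference states. The helper names are hypothetical.

```python
from datetime import datetime


def parse_timestamp(value):
    """Parse a timestamp such as 2024-06-07T22:31:57.066Z into an aware datetime."""
    # Older Python versions reject a trailing "Z" in fromisoformat, so rewrite it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def elapsed_seconds(answer):
    """Seconds between accepting the task and submitting the response."""
    accepted = parse_timestamp(answer["acceptanceTime"])
    submitted = parse_timestamp(answer["submissionTime"])
    return (submitted - accepted).total_seconds()


# Timing fields of the first worker in the example output.
answer = {
    "acceptanceTime": "2024-06-07T22:31:57.066Z",
    "submissionTime": "2024-06-07T22:32:19.640Z",
    "timeSpentInSeconds": 22.574,
    "workerId": "ead1ba56c1278175",
}

elapsed = elapsed_seconds(answer)
print(round(elapsed, 3))  # 22.574
print(abs(elapsed - answer["timeSpentInSeconds"]) < 1e-6)  # True
```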

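Format of the `answerContent` JSON array

The structure of an answer depends on the evaluation metrics selected when the model evaluation job was created. Each worker response, or answer, is recorded in a new JSON object.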

answerContent

evaluationResults contains the worker's responses.
- When [Choice buttons] is selected, each worker's results are stored under "evaluationResults": "comparisonChoice".
  
  metricName: The name of the metric.
  
  result: A JSON object that indicates which model the worker selected, using either 0 or 1. To see which value a model is mapped to, see modelResponseIdMap.
- When [Likert scale - comparison] is selected, each worker's results are stored under "evaluationResults": "comparisonLikertScale".
  
  metricName: The name of the metric.
  
  leftModelResponseId: Indicates which modelResponseIdMap entry was shown on the left side of the worker portal.
  
  rightModelResponseId: Indicates which modelResponseIdMap entry was shown on the right side of the worker portal.
  
  result: A JSON object that indicates which model the worker selected, using either 0 or 1. To see which value a model is mapped to, see modelResponseIdMap.
- When [Ordinal rank] is selected, each worker's results are stored under "evaluationResults": "comparisonRank".
  
  metricName: The name of the metric.
  
  result: An array of JSON objects. For each model (modelResponseIdMap), the worker provides a rank.
```
"result": [{
	"modelResponseId": "0",
	"rank": 1
}, {
	"modelResponseId": "1",
	"rank": 1
}]
```
- When the Likert scale for evaluating a single model response is selected, the worker's results are saved under "evaluationResults": "individualLikertScale". This is a JSON array that contains the scores for the metricName specified when the job was created.
  
  metricName: The name of the metric.
  
  modelResponseId: The model that is scored. To see which value a model is mapped to, see modelResponseIdMap.
  
  result: A key-value pair that indicates the Likert scale value selected by the worker.
- When [Thumbs up/down] is selected, the worker's results are saved as the JSON array "evaluationResults": "thumbsUpDown".
  
  metricName: The name of the metric.
  
  result: Either true or false as it relates to the metricName. When a worker selects thumbs up, "result" : true.
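
The formats above lend themselves to simple aggregation across workers. The following sketch, assuming the evaluationResults objects have already been extracted from each worker's answerContent, tallies comparisonChoice votes and averages individualLikertScale scores. The data is abbreviated from the example output in this topic, and the function names are hypothetical.

```python
from collections import Counter, defaultdict
from statistics import mean

# evaluationResults from two workers, abbreviated from the example output.
worker_results = [
    {
        "comparisonChoice": [
            {"metricName": "Fluency", "result": {"modelResponseId": "0"}}
        ],
        "individualLikertScale": [
            {"metricName": "Correctness", "modelResponseId": "0", "result": 2},
            {"metricName": "Correctness", "modelResponseId": "1", "result": 3},
        ],
    },
    {
        "comparisonChoice": [
            {"metricName": "Fluency", "result": {"modelResponseId": "1"}}
        ],
        "individualLikertScale": [
            {"metricName": "Correctness", "modelResponseId": "0", "result": 3},
            {"metricName": "Correctness", "modelResponseId": "1", "result": 4},
        ],
    },
]


def tally_choices(results):
    """Count comparisonChoice votes per (metricName, modelResponseId)."""
    votes = Counter()
    for evaluation in results:
        for entry in evaluation.get("comparisonChoice", []):
            votes[(entry["metricName"], entry["result"]["modelResponseId"])] += 1
    return votes


def mean_likert(results):
    """Average individualLikertScale scores per (metricName, modelResponseId)."""
    scores = defaultdict(list)
    for evaluation in results:
        for entry in evaluation.get("individualLikertScale", []):
            key = (entry["metricName"], entry["modelResponseId"])
            scores[key].append(entry["result"])
    return {key: mean(values) for key, values in scores.items()}


choice_votes = tally_choices(worker_results)
likert_means = mean_likert(worker_results)
print(dict(choice_votes))
print(likert_means)
```

Using .get with an empty default lets the same functions run on jobs that did not select a given metric type.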

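Example output from a model evaluation job

The following JSON object is an example of the model evaluation job output that is saved in Amazon S3. To learn more about each key-value pair, see JSON output reference.

For clarity, this job contains responses from only two workers. Some key-value pairs have also been truncated for readability.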


{
	"humanEvaluationResult": {
		"flowDefinitionArn": "arn:aws:sagemaker:us-west-2:111122223333:flow-definition/flow-definition-name",
        "humanAnswers": [
            {
                "acceptanceTime": "2024-06-07T22:31:57.066Z",
                "answerContent": {
                    "evaluationResults": {
                        "comparisonChoice": [
                            {
                                "metricName": "Fluency",
                                "result": {
                                    "modelResponseId": "0"
                                }
                            }
                        ],
                        "comparisonLikertScale": [
                            {
                                "leftModelResponseId": "0",
                                "metricName": "Coherence",
                                "result": 1,
                                "rightModelResponseId": "1"
                            }
                        ],
                        "comparisonRank": [
                            {
                                "metricName": "Toxicity",
                                "result": [
                                    {
                                        "modelResponseId": "0",
                                        "rank": 1
                                    },
                                    {
                                        "modelResponseId": "1",
                                        "rank": 1
                                    }
                                ]
                            }
                        ],
                        "individualLikertScale": [
                            {
                                "metricName": "Correctness",
                                "modelResponseId": "0",
                                "result": 2
                            },
                            {
                                "metricName": "Correctness",
                                "modelResponseId": "1",
                                "result": 3
                            },
                            {
                                "metricName": "Completeness",
                                "modelResponseId": "0",
                                "result": 1
                            },
                            {
                                "metricName": "Completeness",
                                "modelResponseId": "1",
                                "result": 4
                            }
                        ],
                        "thumbsUpDown": [
                            {
                                "metricName": "Accuracy",
                                "modelResponseId": "0",
                                "result": true
                            },
                            {
                                "metricName": "Accuracy",
                                "modelResponseId": "1",
                                "result": true
                            }
                        ]
                    }
                },
                "submissionTime": "2024-06-07T22:32:19.640Z",
                "timeSpentInSeconds": 22.574,
                "workerId": "ead1ba56c1278175",
                "workerMetadata": {
                    "identityData": {
                        "identityProviderType": "Cognito",
                        "issuer": "https://cognito-idp.us-west-2.amazonaws.com/us-west-2_WxGLvNMy4",
                        "sub": "cd2848f5-6105-4f72-b44e-68f9cb79ba07"
                    }
                }
            },
            {
                "acceptanceTime": "2024-06-07T22:32:19.721Z",
                "answerContent": {
                    "evaluationResults": {
                        "comparisonChoice": [
                            {
                                "metricName": "Fluency",
                                "result": {
                                    "modelResponseId": "1"
                                }
                            }
                        ],
                        "comparisonLikertScale": [
                            {
                                "leftModelResponseId": "0",
                                "metricName": "Coherence",
                                "result": 1,
                                "rightModelResponseId": "1"
                            }
                        ],
                        "comparisonRank": [
                            {
                                "metricName": "Toxicity",
                                "result": [
                                    {
                                        "modelResponseId": "0",
                                        "rank": 2
                                    },
                                    {
                                        "modelResponseId": "1",
                                        "rank": 1
                                    }
                                ]
                            }
                        ],
                        "individualLikertScale": [
                            {
                                "metricName": "Correctness",
                                "modelResponseId": "0",
                                "result": 3
                            },
                            {
                                "metricName": "Correctness",
                                "modelResponseId": "1",
                                "result": 4
                            },
                            {
                                "metricName": "Completeness",
                                "modelResponseId": "0",
                                "result": 1
                            },
                            {
                                "metricName": "Completeness",
                                "modelResponseId": "1",
                                "result": 5
                            }
                        ],
                        "thumbsUpDown": [
                            {
                                "metricName": "Accuracy",
                                "modelResponseId": "0",
                                "result": true
                            },
                            {
                                "metricName": "Accuracy",
                                "modelResponseId": "1",
                                "result": false
                            }
                        ]
                    }
                },
                "submissionTime": "2024-06-07T22:32:57.918Z",
                "timeSpentInSeconds": 38.197,
                "workerId": "bad258db224c3db6",
                "workerMetadata": {
                    "identityData": {
                        "identityProviderType": "Cognito",
                        "issuer": "https://cognito-idp.us-west-2.amazonaws.com/us-west-2_WxGLvNMy4",
                        "sub": "84d5194a-3eed-4ecc-926d-4b9e1b724094"
                    }
                }
            }
        ],
        "humanLoopName": "a75711d3e75a8d41f35b9873d253f5b7bce0256e",
        "inputContent": {
            "additionalDataS3Uri": "s3://mgrt-test-us-west-2/test-2-workers-2-model/datasets/custom_dataset/0/task-input-additional-data.json",
            "instructions": "worker instructions provided by the model evaluation job administrator",
            "evaluationMetrics": [
                {
                    "metricName": "Fluency",
                    "metricType": "ComparisonChoice",
                    "description": "Measures the linguistic quality of a generated text."
                },
                {
                    "metricName": "Coherence",
                    "metricType": "ComparisonLikertScale",
                    "description": "Measures the organization and structure of a generated text."
                },
                {
                    "metricName": "Toxicity",
                    "metricType": "ComparisonRank",
                    "description": "Measures the harmfulness of a generated text."
                },
                {
                    "metricName": "Accuracy",
                    "metricType": "ThumbsUpDown",
                    "description": "Indicates the accuracy of a generated text."
                },
                {
                    "metricName": "Correctness",
                    "metricType": "IndividualLikertScale",
                    "description": "Measures a generated answer's satisfaction in the context of the question."
                },
                {
                    "metricName": "Completeness",
                    "metricType": "IndividualLikertScale",
                    "description": "Measures a generated answer's inclusion of all relevant information."
                }
            ],
            "disableRandomization": "true"
        },
        "modelResponseIdMap": {
            "0": "sm-margaret-meta-textgeneration-llama-2-7b-1711485008-0612",
            "1": "jumpstart-dft-hf-llm-mistral-7b-ins-20240327-043352"
        }
    },
    "inputRecord": {
        "prompt": {
            "text": "What is high intensity interval training?"
        },
        "category": "Fitness",
        "referenceResponse": {
            "text": "High-Intensity Interval Training (HIIT)"
        }
    },
    "modelResponses": [
        {
            "text": "High Intensity Interval Training (HIIT) is a form of exercise that alternates between periods of high intensity work and low intensity recovery.HIIT is an excellent way to increase your fitness and improve your health, but it can be difficult to get started.In this article, we will",
            "modelIdentifier": "sm-margaret-meta-textgeneration-llama-2-7b-1711485008-0612"
        },
        {
            "text": "High intensity interval training is a form of exercise consisting of short bursts of maximum effort followed by periods of rest. The theory behind HIIT is that it can often be more effective at improving cardiovascular and metabolic health than longer, lower intensity workouts.The work intervals can range in length depending on the specific type of exercise, but are typically between 20 and 90 seconds. The recovery periods are generally longer, lasting between 1 and 5 minutes. This pattern is then repeated for multiple sets.\n\nSince the work intervals are high intensity, they require more effort from your body and therefore result in a greater calorie burn. The body also continues to burn calories at an increased rate after the workout due to an effect called excess post exercise oxygen consumption (EPOC), also know as the afterburn effect.\n\nHIIT is a versatile form of training that can be adapted to different fitness levels and can be performed using a variety of exercises including cycling, running, bodyweight movements, and even swimming. It can be done in as little as 20 minutes once or twice a week, making it an efficient option for busy individuals.\n\nWhat are the benefits of high intensity interval training",
            "modelIdentifier": "jumpstart-dft-hf-llm-mistral-7b-ins-20240327-043352"
        }
    ]
}
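
As a rough illustration of how such a record might be summarized, the following sketch walks humanAnswers, maps each modelResponseId to its model name through modelResponseIdMap, and computes a mean rank and a thumbs-up rate per model. The data is abbreviated from the example above. The summarize function is hypothetical, and it pools all metrics of a type together, which is a simplification that only works here because the sample has one metric of each type.

```python
from collections import defaultdict

# Abbreviated from the example: only comparisonRank and thumbsUpDown are kept.
human_evaluation_result = {
    "humanAnswers": [
        {"answerContent": {"evaluationResults": {
            "comparisonRank": [{"metricName": "Toxicity", "result": [
                {"modelResponseId": "0", "rank": 1},
                {"modelResponseId": "1", "rank": 1},
            ]}],
            "thumbsUpDown": [
                {"metricName": "Accuracy", "modelResponseId": "0", "result": True},
                {"metricName": "Accuracy", "modelResponseId": "1", "result": True},
            ],
        }}},
        {"answerContent": {"evaluationResults": {
            "comparisonRank": [{"metricName": "Toxicity", "result": [
                {"modelResponseId": "0", "rank": 2},
                {"modelResponseId": "1", "rank": 1},
            ]}],
            "thumbsUpDown": [
                {"metricName": "Accuracy", "modelResponseId": "0", "result": True},
                {"metricName": "Accuracy", "modelResponseId": "1", "result": False},
            ],
        }}},
    ],
    "modelResponseIdMap": {
        "0": "sm-margaret-meta-textgeneration-llama-2-7b-1711485008-0612",
        "1": "jumpstart-dft-hf-llm-mistral-7b-ins-20240327-043352",
    },
}


def summarize(result):
    """Return mean rank and thumbs-up rate per model name."""
    id_map = result["modelResponseIdMap"]
    ranks = defaultdict(list)
    thumbs = defaultdict(list)
    for answer in result["humanAnswers"]:
        evaluation = answer["answerContent"]["evaluationResults"]
        for metric in evaluation.get("comparisonRank", []):
            for item in metric["result"]:
                ranks[id_map[item["modelResponseId"]]].append(item["rank"])
        for item in evaluation.get("thumbsUpDown", []):
            thumbs[id_map[item["modelResponseId"]]].append(item["result"])
    # Assumes every model has at least one rank and one thumbs entry.
    return {
        model: {
            "mean_rank": sum(ranks[model]) / len(ranks[model]),
            "thumbs_up_rate": sum(thumbs[model]) / len(thumbs[model]),
        }
        for model in id_map.values()
    }


summary = summarize(human_evaluation_result)
for model, stats in summary.items():
    print(model, stats)
```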
