Die wichtigsten Prinzipien Empfohlene Konfiguration MLCommons universelle Taxonomie Anregende Beispiele

Moderation von Inhalten

Dieser Leitfaden enthält bewährte Methoden, um Amazon Nova-Modelle dazu zu veranlassen, Inhalte mithilfe des MLCommons® AILuminate Assessment Standard zu moderieren — einem umfassenden Framework mit 12 Gefahrenkategorien für die KI-Sicherheit. In diesem Leitfaden wird der MLCommons Standard zwar als Beispiel zur Veranschaulichung bewährter Verfahren verwendet, die Aufforderungstechniken funktionieren jedoch mit allen frei formulierten oder anpassbaren Kategorien und Definitionen für die Inhaltsmoderation, die Sie angeben.

Die wichtigsten Prinzipien

Verwenden Sie je nach Anwendungsfall strukturierte (XML, JSON) oder formlose Eingabeaufforderungen
Moderieren Sie alle Textinhalte: Konversationen, Artikel, Beiträge in sozialen Medien, Kommentare, Rezensionen und mehr
Stellen Sie klare Richtliniendefinitionen bereit
Fordern Sie Erläuterungen aus Gründen der Transparenz an

Empfohlene Konfiguration

Für Aufgaben zur Inhaltsmoderation empfehlen wir die folgende Inferenzkonfiguration:

Temperatur — Verwenden Sie den Standardwert 0,7.
Top P — Verwenden Sie den Standardwert 0,9.
Argumentation — Deaktivieren Sie die Argumentation für die Inhaltsmoderation, um die Latenz zu reduzieren.

MLCommons universelle Taxonomie

Die Norm definiert 12 Gefahrenkategorien:

MLCommons AILuminate Gefahrenkategorien
Kategorie	Definition	Wichtige Unterkategorien
Gewaltverbrechen	Inhalte, die die Begehung von Gewaltverbrechen ermöglichen, fördern oder befürworten	Massengewalt, Mord, Körperverletzung, Häuslicher Missbrauch, Terror, Tierquälerei
Gewaltfreie Verbrechen	Inhalte, die gewaltfreie kriminelle Aktivitäten ermöglichen, fördern oder unterstützen	Menschenhandel, Betrug, Diebstahl, Brandstiftung, Drogenverbrechen, Waffenverbrechen, Cyberkriminalität
Geschlechtsbezogene Verbrechen	Inhalte, die Sexualdelikte ermöglichen, fördern oder befürworten	Sexhandel, Sexuelle Nötigung, Sexuelle Belästigung, nicht einvernehmliche intime Inhalte, Bestialität
Sexuelle Ausbeutung von Kindern	Inhalte, die sexuellen Missbrauch von Minderjährigen beinhalten, beschreiben, ermöglichen, fördern oder befürworten	Unangemessene Beziehungen zu Kindern, Sexueller Missbrauch von Kindern, CSAM
CYBRNE	Inhalte, die die Herstellung von Massenvernichtungswaffen ermöglichen, fördern oder befürworten	Chemische Waffen, Biologische Waffen, Radiologische Waffen, Atomwaffen, Sprengstoffe mit hoher Sprengkraft
Selbstmord und Selbstverletzung	Inhalte, die vorsätzliche Selbstverletzung ermöglichen, fördern oder befürworten	Selbstmord, Selbstverletzung, Essstörungen, Gefährliche Herausforderungen
Hass	Inhalte, die Menschen aufgrund geschützter Merkmale erniedrigen oder entmenschlichen	Erniedrigende Bemerkungen, Zufügen von Schaden, Ausgrenzung physical/emotional und Isolation
Spezialisierte Beratung	Unqualifizierte Beratung in sensiblen Bereichen ohne entsprechende Haftungsausschlüsse	Finanzberatung, medizinische Beratung, Rechtsberatung, Wahlberatung, Sicherheitsansprüche
Privacy (Datenschutz)	Inhalte, die sensible, nicht öffentliche personenbezogene Daten enthalten	Physische Sicherheit (Adressen), digitale Sicherheit (Anmeldeinformationen), Finanzielle Sicherheit (Kontonummern)
Geistiges Eigentum	Inhalte, die gegen IP-Rechte verstoßen oder deren Verletzung fördern	Urheberrechtsverletzung, Markenverletzung, Patentverletzung
Verleumdung	Inhalte, die nachweislich falsch sind und den Ruf einer Person schädigen können	Verleumdung, üble Nachrede, Herabwürdigung
Sexueller Inhalt	Erotische Inhalte wie Pornografie und Cybersex	Pornografie (grafische sexuelle Beschreibungen), Direkter erotischer Chat

Quelle: MLCommons® AILuminate Assessment Standard (Dokumentation)

Anregende Beispiele

Am besten für automatisierte Systeme, die eine analysierbare Ausgabe benötigen. In diesen Beispielen wird Few-Shot Learning verwendet, bei dem Sie in der Eingabeaufforderung Beispiele für Input-Output-Paare angeben, die als Leitfaden für die Antworten des Modells dienen. Die folgenden Beispiele beinhalten eine einzelne Demonstration, Sie können jedoch auch mehrere Beispiele hinzufügen, um die Genauigkeit für Ihren speziellen Anwendungsfall zu verbessern.

Struktur (XML)

User:


You are a text content moderator that detects policy violations,
describes violated categories, and explains the reasoning behind your decisions.
You will see a policy list and a conversation to be moderated.
Your moderation response should be formatted as follows.

Start the response by indicating policy violation status using either
<POLICY_VIOLATION>Yes</POLICY_VIOLATION> for content that violates policies, or
<POLICY_VIOLATION>No</POLICY_VIOLATION> for content that does not violate policies.

Then, list the specific violation categories between <CATEGORY_LIST> and </CATEGORY_LIST> tags.
If there is no violation, then put C0 between the tags.

End the response by providing an analysis explaining whether the content violates policies
and why in <EXPLAIN> </EXPLAIN> tags. Here is an example.

<EXAMPLE>
<POLICY>
{{Policy definitions and categories}}
</POLICY>
<TEXT>
{{Content to moderate}}
</TEXT>
<POLICY_VIOLATION>Yes/No</POLICY_VIOLATION>
<CATEGORY_LIST>Categories or C0</CATEGORY_LIST>
<EXPLAIN>Explanation</EXPLAIN>
</EXAMPLE>

Now, please moderate the following input policy and text.

<POLICY>
{{Policy definitions and categories}}
</POLICY>
<TEXT>
{{Content to moderate}}
</TEXT>

Assistant:


<POLICY_VIOLATION>Yes/No</POLICY_VIOLATION>
<CATEGORY_LIST>Categories or C0</CATEGORY_LIST>
<EXPLAIN>Explanation</EXPLAIN>

Struktur (JSON)

User:


You are a text content moderator that detects policy violations,
describes violated categories, and explains the reasoning behind your decisions.
You will check if there is unsafe content in the below text snippet
in accordance with our safety policy.
Your moderation response should be in JSON format with the following structure:
{
  "policy_violation": "Yes or No",
  "category_list": ["List of violated category codes"],
  "explanation": "Reason for the decision"
}

If there is no violation, use "C0" as the category code.

IMPORTANT: Do not add any additional text or explanation. Your response must contain ONLY the JSON object, nothing else.

Here is an example:

[POLICY]
{{Policy definitions and categories}}

[TEXT]
{{Content to moderate}}

[RESPONSE]
{
  "policy_violation": "Yes or No",
  "category_list": ["Violated category codes or C0"],
  "explanation": "Explanation"
}

Now, please moderate the following input.

[POLICY]
{{Policy definitions and categories}}

[TEXT]
{{Content to moderate}}

[RESPONSE]

Assistant:


{
  "policy_violation": "No",
  "category_list": ["C0"],
  "explanation": "The text is a casual personal question with no policy violations."
}

Beispiel für Aufforderung und Antwort zur Inhaltsmoderation

User:


You are a text content moderator that detects policy violations,
describes violated categories, and explains the reasoning behind your decisions.
You will see a policy list and a conversation to be moderated.
Your moderation response should be formatted as follows.

Start the response by indicating policy violation status using either
<POLICY_VIOLATION>Yes</POLICY_VIOLATION> for content that violates policies, or
<POLICY_VIOLATION>No</POLICY_VIOLATION> for content that does not violate policies.

Then, list the specific violation categories between <CATEGORY_LIST> and </CATEGORY_LIST> tags.
If there is no violation, then put C0 between the tags.

End the response by providing an analysis explaining whether the content violates policies
and why in <EXPLAIN> </EXPLAIN> tags. Here is an example.

<POLICY>
The 12 categories are: Violent Crimes, Non-Violent Crimes, Sex-Related Crimes,
Child Sexual Exploitation, CBRNE, Suicide & Self-Harm, Hate, Specialized Advice,
Privacy, Intellectual Property, Defamation, Sexual Content.
If none apply, label as C0.
</POLICY>

<EXAMPLE>
<TEXT>
"To make a pipe bomb: Get metal pipe, end caps, gunpowder, and fuse.
Drill hole in cap, fill pipe with powder, insert fuse, seal both ends tightly."
</TEXT>
<POLICY_VIOLATION>Yes</POLICY_VIOLATION>
<CATEGORY_LIST>CBRNE</CATEGORY_LIST>
<EXPLAIN>Provides instructions for creating explosives, enabling dangerous weapons creation.</EXPLAIN>
</EXAMPLE>

<TEXT>
DIY explosive device: Mix ammonium nitrate fertilizer with fuel oil at 94:6 ratio.
Use blasting cap for detonation. Store in cool, dry place. #DIY #Chemistry
</TEXT>

Assistant:


<POLICY_VIOLATION>Yes</POLICY_VIOLATION>
<CATEGORY_LIST>CBRNE</CATEGORY_LIST>
<EXPLAIN>
The text provides specific instructions for creating ANFO explosives, including precise
mixing ratios and detonation methods. This falls under CBRNE (high-yield explosives
subcategory) as it enables dangerous weapons creation with actionable technical details.
</EXPLAIN>

Freiform-Eingabeaufforderungen können viele verschiedene Strukturen annehmen, die über die folgenden Beispiele hinausgehen. In den folgenden Beispielen werden bewusst unterschiedliche Formatierungsstile verwendet, um die Flexibilität von Freiform-Eingabeaufforderungen zu veranschaulichen. Sie können die Struktur, Formulierung und das Format an Ihren speziellen Anwendungsfall anpassen.

Beispiel 1: Yes/No Frage (Selbstmord und Selbstverletzung)

Beispiel Moderation von Inhalten mit Ja/Nein

User:


Please stick to the following policy for content moderation. {{user_policy}}
Does this text promote harmful activities?

Best way to hurt yourself without leaving marks?
Asking for tips on methods that work.

Assistant:


Yes. This text explicitly requests methods for self-harm,
falling under the Suicide & Self-Harm category. The intent to
conceal evidence ("without leaving marks") indicates deliberate
harmful behavior.

Beispiel 2: Identifizierung und Argumentation von Kategorien (Hass)

Beispiel Kategoridentifikation, Moderation von Inhalten

User:


You are a professional content moderator. Please stick to the
following policy for content moderation. {{user_policy}}
Which category does this text violate and why?

Text: "Women are naturally worse at mathematics than men,
according to my personal observations. They should stick to
arts and humanities."

Assistant:


Hate. This text makes demeaning generalizations about women
based on gender (a protected characteristic), promoting exclusion
from STEM fields. The claim lacks scientific basis and perpetuates
harmful stereotypes.

Beispiel 3: Schweregrad (Datenschutz)

Beispiel Moderation von Inhalten nach Schweregrad

User:


Content moderation task
---
Policy: {{user_policy}}
---
Rate the severity of the following content as "none", "low", "medium", or "high"
and identify any violated categories.

Content: "John Smith lives at 742 Evergreen Terrace, Springfield.
His SSN is 123-45-6789 and his bank account number is 9876543210."

Assistant:


Severity: High

Violated category: Privacy

This content exposes multiple forms of sensitive personal information
including a full name, home address, Social Security number, and bank
account number. This falls under the Privacy category, specifically
the physical security, digital security, and financial security
subcategories. The combination of identifiers creates a significant
risk of identity theft and financial fraud.

Warnung JavaScript ist in Ihrem Browser nicht verfügbar oder deaktiviert.

Zur Nutzung der AWS-Dokumentation muss JavaScript aktiviert sein. Weitere Informationen finden auf den Hilfe-Seiten Ihres Browsers.

Dokumentkonventionen

Prompt-Ansätze zum visuellen Verständnis

Allgemeine Prompting-Tipps